* Equal contribution † Corresponding authors
Recent multimodal large language models (MLLMs) have strong potential as embodied agents, but their ability to collaborate in visually grounded environments remains underexplored. To address this gap, we introduce MECoBench, a multimodal embodied cooperation benchmark with an evaluation platform spanning diverse real-world tasks, two cooperation structures, and three collaboration modes. Through extensive experiments across various MLLMs, we summarize three key findings: (i) Collaboration generally improves embodied task completion, but its benefits depend on balancing collaborative gains against coordination complexity. (ii) Communication is essential to collaboration gains, while the best collaboration mode depends on team size and model capability. (iii) Collaboration improves robustness under noisy priors and exploration conditions. Overall, MECoBench provides a systematic testbed for understanding the mechanisms and limits of multimodal embodied collaboration.
Multi-agent collaboration significantly improves efficiency compared to single-agent execution in household tasks.
MECoBench is built on VirtualHome, a realistic household simulator. It covers 8 everyday task types, two cooperation structures—parallel (agents freely access all rooms) and sequential (agents operate in disjoint zones and must transfer objects across boundaries)—and three collaboration modes: isolated, decentralized, and centralized.
Each task is grounded through a two-stage pipeline: scene instantiation randomly places objects at legal locations, then the collaboration structure is assigned. In the sequential setting, rooms are partitioned into disjoint zones, requiring inter-agent handovers.
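The two grounding stages can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: the function names, the room and object names, and the round-robin zone partition are all assumptions made for the example, since the text does not say how rooms are split.

```python
import random


def instantiate_scene(legal_locations, rng):
    """Stage 1 (assumed form): place each object at one of its legal locations."""
    return {obj: rng.choice(locs) for obj, locs in legal_locations.items()}


def assign_structure(rooms, n_agents, structure):
    """Stage 2 (assumed form): map each agent index to the rooms it may enter."""
    if structure == "parallel":
        # Every agent can access every room.
        return {i: list(rooms) for i in range(n_agents)}
    if structure == "sequential":
        # Round-robin split: zones are disjoint and together cover all rooms.
        return {i: rooms[i::n_agents] for i in range(n_agents)}
    raise ValueError(f"unknown structure: {structure}")


def needs_handover(src_room, dst_room, zones):
    """True if no single agent's zone contains both rooms."""
    return not any(src_room in z and dst_room in z for z in zones.values())


# Hypothetical scene description, for illustration only.
rooms = ["kitchen", "bedroom", "bathroom", "livingroom"]
legal = {"apple": ["kitchen", "livingroom"], "towel": ["bathroom", "bedroom"]}

placement = instantiate_scene(legal, random.Random(0))
parallel_zones = assign_structure(rooms, 2, "parallel")
sequential_zones = assign_structure(rooms, 2, "sequential")
```

With this partition, moving an object from the kitchen to the bedroom crosses a zone boundary in the sequential setting, so `needs_handover` returns `True` there and `False` in the parallel setting.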
Agents follow an iterative observe–communicate–act loop. Each agent receives a panoramic visual observation, generates a communication message based on its history and memory, then selects an action conditioned on the communication context.
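One step of this loop might look like the toy sketch below. Everything in it is hypothetical: `ToyAgent`, `Message`, and `run_step` are illustrative names, observations are plain strings rather than panoramic images, messages are broadcast to all teammates, and the rule-based methods stand in for MLLM calls.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: str
    content: str  # free text in a real system; here, the object the sender sees


@dataclass
class ToyAgent:
    """Hypothetical agent; a real one would query an MLLM in both methods."""
    name: str
    memory: list = field(default_factory=list)

    def communicate(self, observation):
        # Record the observation in memory, then report it to teammates.
        self.memory.append(observation)
        return Message(sender=self.name, content=observation)

    def act(self, observation, inbox):
        # Action is conditioned on teammates' messages: yield the object if a
        # teammate with an alphabetically earlier name reports seeing it too.
        claimed = any(
            m.content == observation and m.sender < self.name for m in inbox
        )
        return "explore" if claimed else f"grab {observation}"


def run_step(agents, observations):
    """One observe -> communicate -> act round for the whole team."""
    # Communicate: every agent speaks before anyone acts.
    messages = [a.communicate(observations[a.name]) for a in agents]
    # Act: each agent sees all messages except its own.
    return {
        a.name: a.act(
            observations[a.name], [m for m in messages if m.sender != a.name]
        )
        for a in agents
    }


team = [ToyAgent("alice"), ToyAgent("bob")]
actions = run_step(team, {"alice": "apple", "bob": "apple"})
```

Because both agents report the same object, only one of them grabs it: `actions` is `{"alice": "grab apple", "bob": "explore"}`. Placing the communication phase before the action phase is what lets the second agent avoid a duplicate grab.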
| Model | W/ Noise | W/o Noise | W/o Loc. |
|---|---|---|---|
| GPT-5-mini | +8.22% (76.04→82.29) | +1.16% (90.62→91.67) | +6.47% (32.29→34.38) |
| Qwen3-32B | +28.83% (54.17→69.79) | +28.57% (58.33→75.00) | +18.88% (21.88→26.04) |
| Qwen3.5-9B | +4.01% (78.12→81.25) | +3.52% (85.42→88.54) | +0.00% (29.17→29.17) |
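Relative success rate (SR) gain when moving from one to two agents under noisy priors (W/ Noise), clean priors (W/o Noise), and no location priors (W/o Loc.). Each cell gives the relative gain followed by the absolute SR change. For all three models, gains are highest under noisy conditions.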
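As a worked check, the snippet below recomputes the GPT-5-mini row, assuming the relative gain is defined as (SR with two agents − SR with one agent) / SR with one agent. That definition is an assumption; the text does not state the formula. Only this row is checked here, and other cells may not reproduce to the last digit from the rounded SR values shown.

```python
def relative_gain(sr_single, sr_team):
    """Relative SR gain in percent (assumed definition, see lead-in)."""
    if sr_single <= 0:
        raise ValueError("single-agent SR must be positive")
    return (sr_team - sr_single) / sr_single * 100.0


# (one-agent SR, two-agent SR) pairs copied from the GPT-5-mini row.
gpt5_mini = {
    "with_noise": (76.04, 82.29),
    "without_noise": (90.62, 91.67),
    "without_location": (32.29, 34.38),
}

gains = {
    setting: round(relative_gain(single, team), 2)
    for setting, (single, team) in gpt5_mini.items()
}
```

Under this definition the three values round to 8.22, 1.16, and 6.47, which match the reported gains for that row.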
Multi-agent conflicts, such as duplicate grabs and redundant exploration, worsen with larger teams. For Qwen3-32B-VL, scaling to five agents drives the duplicate-grab rate to 57.4%, which partly explains the performance drop on the far side of the inverted-U trend.
Hallucinated beliefs propagate across agents. When one agent falsely concludes the task is complete and broadcasts it, the team stops prematurely. Though rare (1–1.5% of cases), such events cause SR drops of up to 73.7%, making them disproportionately destructive.
@misc{liu2026mecobenchsystematicstudymultimodal,
title={MECoBench: A Systematic Study of Multimodal Agent Collaboration in Embodied Environments},
author={Qingyun Liu and Jiwen Zhang and Jingyi Hu and Siyuan Wang and Zhongyu Wei},
year={2026},
eprint={2606.31966},
archivePrefix={arXiv},
primaryClass={cs.MA},
url={https://arxiv.org/abs/2606.31966},
}