MECoBench

Recent multimodal large language models (MLLMs) have strong potential as embodied agents, but their ability to collaborate in visually grounded environments remains underexplored. To address this gap, we introduce MECoBench, a multimodal embodied cooperation benchmark with an evaluation platform spanning diverse real-world tasks, two cooperation structures, and three collaboration modes. Extensive experiments across various MLLMs yield three key findings: (i) Collaboration generally improves embodied task completion, but its benefits depend on balancing collaborative gains against coordination complexity. (ii) Communication is essential to collaboration gains, while the best collaboration mode depends on team size and model capability. (iii) Collaboration improves robustness when object-location priors are noisy or missing and agents must explore. Overall, MECoBench provides a systematic testbed for understanding the mechanisms and limits of multimodal embodied collaboration.

Multi-agent collaboration vs. single-agent execution in an embodied environment

Multi-agent collaboration significantly improves efficiency compared to single-agent execution in household tasks.

MECoBench

MECoBench is built on VirtualHome, a realistic household simulator. It covers eight everyday task types; two cooperation structures, parallel (agents freely access all rooms) and sequential (agents operate in disjoint zones and must transfer objects across boundaries); and three collaboration modes: isolated, decentralized, and centralized.

192 Test Cases

Unique Tasks

8 Task Types

2 Cooperation Structures

3 Collaboration Modes

14+ Models Evaluated

Each task is grounded through a two-stage pipeline: scene instantiation randomly places objects at legal locations, then the collaboration structure is assigned. In the sequential setting, rooms are partitioned into disjoint zones, requiring inter-agent handovers.

Data construction pipeline. Tasks are grounded from high-level templates into concrete scenes, then configured for parallel or sequential collaboration.
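The two-stage grounding described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual pipeline: the function names, the `legal_locations` mapping, and the round-robin zone split are all assumptions made for the example.

```python
import random


def instantiate_scene(objects, legal_locations, rng):
    """Stage 1 (assumed form): place each object at a random legal location."""
    return {obj: rng.choice(legal_locations[obj]) for obj in objects}


def assign_structure(rooms, num_agents, structure, rng):
    """Stage 2 (assumed form): decide which rooms each agent may enter."""
    if structure == "parallel":
        # Every agent can freely access all rooms.
        return {agent: set(rooms) for agent in range(num_agents)}
    if structure == "sequential":
        shuffled = list(rooms)
        rng.shuffle(shuffled)
        # Deal rooms round-robin so zones are disjoint and cover every room.
        return {agent: set(shuffled[agent::num_agents]) for agent in range(num_agents)}
    raise ValueError(f"unknown cooperation structure: {structure}")


# Hypothetical scene data, for illustration only.
rng = random.Random(0)
rooms = ["kitchen", "livingroom", "bedroom", "bathroom"]
legal_locations = {
    "apple": [("kitchen", "fridge"), ("kitchen", "table")],
    "book": [("bedroom", "desk"), ("livingroom", "sofa")],
}
scene = instantiate_scene(["apple", "book"], legal_locations, rng)
zones = assign_structure(rooms, 2, "sequential", rng)
```

In the sequential case, an object placed in one agent's zone but needed in another's forces a handover, which is the property the benchmark relies on. The sketch assumes there are at least as many rooms as agents, so no zone is empty.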

Agents follow an iterative observe–communicate–act loop. Each agent receives a panoramic visual observation, generates a communication message based on its history and memory, then selects an action conditioned on the communication context.

Evaluation workflow. Agents perceive, communicate, reason over history and memory, and execute actions with feedback.
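As a rough sketch of the observe-communicate-act loop, the code below uses a stub agent in place of an MLLM. `StubAgent`, `run_episode`, and the `observe`/`execute` callbacks are hypothetical names introduced for this example; the real platform's interfaces are not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class StubAgent:
    """Hypothetical stand-in for an MLLM-backed agent."""
    name: str
    memory: list = field(default_factory=list)

    def communicate(self, observation, history):
        # A real agent would prompt a model with observation, history, and memory.
        return f"{self.name}: I see {observation}"

    def act(self, observation, messages):
        # Placeholder policy: grab once any teammate reports the goal.
        return "grab" if any("goal" in m for m in messages) else "explore"


def run_episode(agents, observe, execute, max_steps=10):
    history = []  # per-step record of (step, messages, actions, feedback)
    for step in range(max_steps):
        # 1. Observe: each agent gets its own view of the scene.
        observations = {a.name: observe(a.name, step) for a in agents}
        # 2. Communicate: messages are generated before any action is chosen.
        messages = [a.communicate(observations[a.name], history) for a in agents]
        # 3. Act: actions are conditioned on the communication context.
        actions = {a.name: a.act(observations[a.name], messages) for a in agents}
        # 4. Feedback: execution results are written into each agent's memory.
        feedback, done = execute(actions)
        for a in agents:
            a.memory.append(feedback[a.name])
        history.append((step, messages, actions, feedback))
        if done:
            break
    return history
```

The key ordering is that all messages for a step are produced before any agent picks an action, so every action can depend on what teammates just reported.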

Experiments

Will multi-agent collaboration benefit embodied task completion?

Do models know how to collaborate?

Most models demonstrate basic collaborative capability: stronger individual performance generally translates into better collaboration, and adding a second agent improves performance for most models.

Parallel single-agent vs. sequential two-agent SR.

SR and CR change from 1→2 agents in parallel tasks.

Do more agents always help?

Performance follows an inverted-U curve: adding agents helps up to a point, then coordination overhead degrades results. Moderate team sizes strike the best balance. Larger teams remain more robust on high-complexity tasks even as absolute success rates decline.

Team size scaling. (a) Inverted-U performance curve across 1–5 agents. (b) Larger teams hold up better as task complexity grows.

What makes multi-agent collaboration effective?

Which collaboration mode works best?

Leader-based (centralized) coordination achieves the best performance in two-agent settings, helping teams allocate tasks and avoid spatial conflicts. As team size grows, decentralized coordination becomes more competitive by avoiding the leader bottleneck. The best mode depends on both team size and model capability.

Collaboration modes in 2-agent setup (a) and scaling to larger teams (b).
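One way to picture the three modes is as message-routing rules. The sketch below assumes that isolated agents exchange no messages, decentralized agents message every teammate, and centralized agents talk only through a leader; these semantics and the `message_routes` helper are illustrative assumptions, not the benchmark's definitions.

```python
def message_routes(mode, agents, leader=None):
    """Return, for each sender, the list of agents that receive its messages."""
    if mode == "isolated":
        # Assumed: no inter-agent communication at all.
        return {a: [] for a in agents}
    if mode == "decentralized":
        # Assumed: every agent broadcasts to every other agent.
        return {a: [b for b in agents if b != a] for a in agents}
    if mode == "centralized":
        if leader not in agents:
            raise ValueError("centralized mode needs a leader from the team")
        # Assumed: followers report to the leader; the leader instructs everyone.
        return {
            a: [b for b in agents if b != leader] if a == leader else [leader]
            for a in agents
        }
    raise ValueError(f"unknown collaboration mode: {mode}")


team = ["a0", "a1", "a2"]
central = message_routes("centralized", team, leader="a0")
# central == {"a0": ["a1", "a2"], "a1": ["a0"], "a2": ["a0"]}
```

Under these assumptions, the leader's outgoing fan-out grows with team size while each follower's stays at one, which is one way to read the leader bottleneck mentioned above.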

Is communication necessary?

Removing communication consistently degrades performance, with much larger drops on sequential tasks. The gap widens with more agents. Communication is essential, especially under tight coordination requirements or larger teams.

Ablation on communication. Removing it hurts more on sequential tasks and larger teams.

Is textual communication enough?

Compared with broadcast, shared memory improves AUC, reduces token cost, and yields faster early-stage progress, which makes it effective for loosely coupled parallel tasks. Vision-augmented leadership improves efficiency for capable leaders, but weaker models may see a lower final SR.

Shared memory vs. broadcast: relative gains in effectiveness and efficiency (a), AUC progress over steps (b).

Leader-based collaboration with vs. without visual augmentation.
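To make the link between AUC and early-stage progress concrete, the sketch below assumes AUC means the normalized area under a per-step task-progress curve with values in [0, 1]. That definition and the `progress_auc` helper are assumptions for illustration; the curves are made-up examples, not benchmark data.

```python
def progress_auc(progress):
    """Normalized trapezoidal area under a per-step progress curve."""
    if len(progress) < 2:
        raise ValueError("need at least two progress points")
    area = sum((a + b) / 2 for a, b in zip(progress, progress[1:]))
    return area / (len(progress) - 1)


# Two hypothetical runs that both finish the task by the last step.
fast_start = [0.0, 0.5, 1.0, 1.0, 1.0]
slow_start = [0.0, 0.0, 0.0, 0.5, 1.0]

print(progress_auc(fast_start))  # 0.75
print(progress_auc(slow_start))  # 0.25
```

Both runs end with the same final progress, yet the one that advances earlier scores a much higher AUC, which is why an AUC gain can signal efficiency even when final SR is unchanged.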

Is multi-agent collaboration robust?

Does collaboration remain effective under imperfect information?

When object locations are missing or corrupted by noise, agents must rely more heavily on visual search, communication, and distributed exploration. Multi-agent collaboration remains effective under imperfect information: adding a second agent generally improves success rates, while larger teams exhibit a similar inverted-U scaling trend. Notably, collaboration yields the largest relative gains under noisy priors, suggesting that agents can compensate for misleading information through communication and complementary exploration. Overall, multi-agent collaboration is robust to both insufficient and noisy task information.

SR change from 1→2 agents without location priors. Collaboration still helps across models.

Model	W/ Noise	W/o Noise	W/o Loc.
GPT-5-mini	+8.22% (76.04→82.29)	+1.16% (90.62→91.67)	+6.47% (32.29→34.38)
Qwen3-32B	+28.83% (54.17→69.79)	+28.57% (58.33→75.00)	+18.88% (21.88→26.04)
Qwen3.5-9B	+4.01% (78.12→81.25)	+3.52% (85.42→88.54)	+0.00% (29.17→29.17)

Relative SR gain (1→2 agents) under noisy priors, clean priors, and no location priors. Gains are highest under noisy conditions.
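The relative gains in the table are consistent with the usual formula (after − before) / before. The short check below recomputes the GPT-5-mini row from its displayed SR values; the `relative_gain` helper is just an illustrative name.

```python
def relative_gain(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# SR for 1 agent -> 2 agents, copied from the GPT-5-mini row.
gpt5_mini = {
    "W/ Noise": (76.04, 82.29),
    "W/o Noise": (90.62, 91.67),
    "W/o Loc.": (32.29, 34.38),
}
gains = {cond: round(relative_gain(*sr), 2) for cond, sr in gpt5_mini.items()}
print(gains)  # {'W/ Noise': 8.22, 'W/o Noise': 1.16, 'W/o Loc.': 6.47}
```

Note that a relative gain can look large when the single-agent SR is low, so it is worth reading it alongside the absolute SR values. Recomputing other rows from the rounded SRs shown in the table may differ slightly from the reported gain.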

When does collaboration break down?

Multi-agent conflicts, such as duplicate grabs and redundant exploration, worsen with larger teams. For Qwen3-32B-VL, scaling to five agents drives the duplicate-grab rate to 57.4%, partly explaining the inverted-U performance drop.
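A duplicate-grab rate could be computed from an action log along the following lines. The exact metric definition is not given here, so this sketch assumes a grab counts as a duplicate when it targets an object that an earlier grab in the log already targeted; the log format and the `duplicate_grab_rate` helper are hypothetical.

```python
def duplicate_grab_rate(grab_log):
    """Fraction of grab attempts aimed at an already-targeted object.

    `grab_log` is a chronological list of (step, agent, object) tuples.
    """
    claimed = set()
    duplicates = 0
    for _step, _agent, obj in grab_log:
        if obj in claimed:
            duplicates += 1  # someone already went for this object
        else:
            claimed.add(obj)
    return duplicates / len(grab_log) if grab_log else 0.0


# Made-up log: two agents race for the apple, then a0 repeats a2's cup grab.
log = [
    (1, "a0", "apple"),
    (1, "a1", "apple"),
    (2, "a2", "cup"),
    (3, "a0", "cup"),
]
print(duplicate_grab_rate(log))  # 0.5
```

With more agents and a fixed set of target objects, there are more chances for two agents to pursue the same object, so a rate defined this way would be expected to rise with team size.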

Hallucinated beliefs propagate across agents. When one agent falsely concludes the task is complete and broadcasts it, the team stops prematurely. Though rare (1–1.5% of cases), such events cause SR drops of up to 73.7%, making them disproportionately destructive.