New Benchmark Tests Multi-Agent Egocentric Video Understanding
- KAIST introduces MA-EgoQA, a benchmark for parallel egocentric video understanding across multiple embodied agents.
- The benchmark contains over 1,700 questions spanning five categories, including social interaction, task coordination, and temporal reasoning.
- The EgoMAS baseline uses shared memory and agent-wise dynamic retrieval to outperform Gemini-2.5-Flash and GPT-5.
As AI moves from static digital environments into the physical world, the future of collaboration will involve humans working alongside teams of embodied agents—robots that can perceive and act in real time. To navigate these complex environments, such systems must process multiple streams of first-person (egocentric) video simultaneously. Yet current models struggle to aggregate information from different perspectives into a cohesive system-level memory.
Researchers at KAIST AI have addressed this gap by introducing MA-EgoQA, a rigorous benchmark designed to evaluate how well AI can answer questions based on multiple video feeds from different agents. The dataset includes over 1,700 questions categorized into five critical domains, including theory-of-mind—the ability to understand the mental states of others—and task coordination. This requires the AI to track what each individual agent sees and does over long periods, then synthesize that data to solve complex queries.
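To make the task concrete, a benchmark record of this kind can be imagined as a question paired with one egocentric clip per agent and a category label. The field names below are illustrative assumptions for a minimal sketch, not MA-EgoQA's published schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of a multi-agent egocentric QA record; field names
# are illustrative assumptions, not the benchmark's actual schema.
@dataclass
class MultiAgentQuestion:
    question_id: str
    category: str      # e.g. "theory-of-mind" or "task coordination"
    question: str
    video_clips: dict  # agent id -> path to that agent's egocentric clip
    answer: str

sample = MultiAgentQuestion(
    question_id="q0001",
    category="theory-of-mind",
    question="Which object does agent B believe agent A is looking for?",
    video_clips={"agent_a": "clips/a_0001.mp4", "agent_b": "clips/b_0001.mp4"},
    answer="the screwdriver",
)
```

Answering such a question requires evidence from both agents' clips, which is exactly the cross-perspective synthesis the benchmark is designed to probe.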
To set a standard for this new challenge, the team developed EgoMAS. This model operates using a shared memory architecture, allowing all agents to contribute to a central pool of information. By employing agent-wise dynamic retrieval—a method of selectively pulling the most relevant data from specific agents based on the question—EgoMAS significantly outperformed frontier models. This research highlights that while single-agent vision is maturing, the next frontier lies in the collective intelligence of multi-agent systems.
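The idea of agent-wise dynamic retrieval over a shared memory pool can be sketched in a toy form: every agent writes events into one central pool, and at question time the most relevant entries are selected per agent. The keyword-overlap scoring below is a stand-in assumption for illustration only, not EgoMAS's actual retrieval method.

```python
from collections import defaultdict

def keyword_score(query: str, text: str) -> int:
    # Toy relevance score: number of shared lowercase words.
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve_per_agent(question: str, shared_memory: list, k: int = 2) -> dict:
    """Select the top-k most question-relevant memory entries per agent.

    shared_memory: list of {"agent": str, "t": float, "event": str} entries
    that all agents have contributed to one central pool.
    """
    by_agent = defaultdict(list)
    for entry in shared_memory:
        by_agent[entry["agent"]].append(entry)
    retrieved = {}
    for agent, entries in by_agent.items():
        ranked = sorted(entries,
                        key=lambda e: keyword_score(question, e["event"]),
                        reverse=True)
        retrieved[agent] = ranked[:k]
    return retrieved

memory = [
    {"agent": "A", "t": 3.0, "event": "A picks up the red mug"},
    {"agent": "A", "t": 9.0, "event": "A opens the cabinet"},
    {"agent": "B", "t": 5.0, "event": "B watches A pick up the red mug"},
    {"agent": "B", "t": 7.0, "event": "B carries a box to the table"},
]
hits = retrieve_per_agent("who picked up the red mug", memory, k=1)
# Each agent contributes its single most relevant observation.
```

The per-agent grouping is the key design point: rather than ranking one undifferentiated pool, the retriever can pull question-relevant evidence from each perspective, so no agent's viewpoint is crowded out by another's.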