Building Cooperative Embodied Agents Modularly with Large Language Models

1 University of Massachusetts Amherst 2 Tsinghua University 3 Shanghai Jiao Tong University 4 MIT 5 MIT-IBM Watson AI Lab
ICLR 2024


In this work, we address challenging multi-agent cooperation problems with decentralized control, raw sensory observations, costly communication, and multi-objective tasks instantiated in various embodied environments. While previous research either presupposes a cost-free communication channel or relies on a centralized controller with shared observations, we harness the commonsense knowledge, reasoning ability, language comprehension, and text generation prowess of LLMs and seamlessly incorporate them into a cognitive-inspired modular framework that integrates with perception, memory, and execution, thus building a Cooperative Embodied Language Agent, CoELA, which can plan, communicate, and cooperate with others to accomplish long-horizon tasks efficiently. Our experiments on C-WAH and TDW-MAT demonstrate that CoELA driven by GPT-4 can surpass strong planning-based methods and exhibit emergent effective communication. Though current Open LMs like LLAMA-2 still underperform, we fine-tune a CoELA with data collected with our agents and show how it can achieve promising performance. We also conducted a user study for human-agent interaction and discovered that CoELA communicating in natural language can earn more trust and cooperate more effectively with humans. Our research underscores the potential of LLMs for future research in multi-agent cooperation.


Here are several videos demonstrating our cooperative embodied agents, built with Large Language Models, that can think and communicate in the ThreeDWorld Multi-Agent Transport and the Communicative Watch-And-Help environments.


Inspired by cognitive architectures, we build CoELA, a Cooperative Embodied Language Agent with a novel modular framework that integrates the strong reasoning ability and language generation capability of LLMs. As shown in the following figure, CoELA consists of five key modules: (a) Perception, (b) Memory, (c) Communication, (d) Planning, and (e) Execution.

At each interaction step, CoELA first uses (a) the Perception Module to perceive the raw sensory observation received from the environment. It then updates (b) the Memory Module, which stores its knowledge and experience of the world and others, with the newly extracted information.

CoELA tackles the challenge of efficient communication with a two-step method: first decide what to send, then decide whether to send this message or choose another plan. For the first step, it uses (c) the Communication Module to retrieve related information from (b) and has an LLM generate the best message to send "in mind" beforehand. For the second step, it leverages (d) the Planning Module, driven by an LLM with strong reasoning ability, to decide which plan to take given the related information retrieved from (b) and the available actions proposed for the current state. The generated plan is then used to update (b2) the Episodic Memory.
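The two-step decision described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable interface, the prompt wording, the reply parsing, and the names `draft_message`, `choose_plan`, and `stub_llm` are all hypothetical, and a stub stands in for a real model so the sketch runs offline.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


def draft_message(llm: LLM, memory_summary: str) -> str:
    """Step 1: compose the candidate message 'in mind' without sending it."""
    prompt = f"Known so far:\n{memory_summary}\nWrite a short message to your teammate."
    return llm(prompt).strip()


def choose_plan(llm: LLM, memory_summary: str, actions: List[str], message: str) -> str:
    """Step 2: the planner picks one option; sending the draft is just one of them."""
    options = actions + [f'send message: "{message}"']
    listing = "\n".join(f"{i}. {opt}" for i, opt in enumerate(options))
    prompt = f"Known so far:\n{memory_summary}\nOptions:\n{listing}\nAnswer with one number."
    reply = llm(prompt)
    digits = "".join(ch for ch in reply if ch.isdigit())
    index = int(digits) if digits else 0
    if not 0 <= index < len(options):
        index = 0  # assumed fallback: take the first option on an unusable reply
    return options[index]


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model (assumption): fixed replies for each prompt type."""
    if "Options:" in prompt:
        return "1"
    return "I found an apple in the kitchen."


if __name__ == "__main__":
    memory = "apple seen in kitchen; teammate is in the bedroom"
    actions = ["go to kitchen", "explore living room"]
    message = draft_message(stub_llm, memory)
    print(choose_plan(stub_llm, memory, actions, message))
```

Treating "send the drafted message" as one more option in the planner's list is what lets the agent weigh the cost of communicating against acting directly, which is the point of drafting the message before deciding.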

Finally, (e) the Execution Module retrieves procedural knowledge stored in (b3) the Procedural Memory to turn the high-level plan into primitive actions executable in the environment.
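One interaction step of this perceive, remember, plan, execute flow might look like the sketch below. It is only an illustration under simplifying assumptions: observations are already symbolic dictionaries, a plan is a string of the form "verb target", and the `Memory` class, `perceive`, `step`, and the `goto` skill are hypothetical names that do not come from the original work.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Memory:
    """(b) Memory Module, split into three stores (layout is an assumption)."""
    semantic: Dict[str, str] = field(default_factory=dict)  # facts about the world and others
    episodic: List[str] = field(default_factory=list)       # history of generated plans
    procedural: Dict[str, Callable[[str], List[str]]] = field(default_factory=dict)  # verb -> skill


def perceive(raw_observation: Dict[str, str]) -> Dict[str, str]:
    """(a) Perception: a real module would process raw sensory input; here we just copy."""
    return dict(raw_observation)


def step(memory: Memory, raw_observation: Dict[str, str],
         planner: Callable[[Memory], str]) -> List[str]:
    """Run one interaction step and return primitive actions."""
    memory.semantic.update(perceive(raw_observation))  # update memory with new information
    plan = planner(memory)                             # (d) Planning, stubbed by the caller
    memory.episodic.append(plan)                       # record the generated plan
    verb, _, target = plan.partition(" ")
    skill = memory.procedural[verb]                    # (e) Execution looks up a stored skill
    return skill(target)


if __name__ == "__main__":
    mem = Memory(procedural={"goto": lambda t: [f"turn_towards {t}", "move_forward"]})
    planner = lambda m: "goto " + m.semantic["apple"]  # stand-in for the LLM-driven planner
    print(step(mem, {"apple": "kitchen"}, planner))
```

Keeping the planner as an injected callable mirrors the modular split in the text: the LLM-driven parts can be swapped without touching perception, memory, or execution.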

An overview of CoELA. There are five key modules in our framework: (c) the Communication Module and (d) the Planning Module leverage LLMs to generate messages and make plans; (b) the Memory Module stores the agent's knowledge and experience about the world and others in semantic, episodic, and procedural memory; (a) the Perception Module and (e) the Execution Module interact directly with the external environment by perceiving raw observations and generating primitive actions.


To better understand the essential factors for effective cooperation, we conducted a qualitative analysis of the agents’ behaviors exhibited in our experiments and identified several cooperative behaviors.

Example cooperative behaviors demonstrating that CoELA agents can communicate effectively and are good cooperators.