Constrained Human-AI Cooperation:
An Inclusive Embodied Social Intelligence Challenge

1 Carnegie Mellon University, 2 Peking University, 3 University of California, Santa Barbara, 4 Harvard University,
5 University of Massachusetts Amherst, 6 MIT, 7 Johns Hopkins University, 8 Honda Research Institute USA
NeurIPS Datasets and Benchmarks Track 2024

Abstract

We introduce Constrained Human-AI Cooperation (CHAIC), an inclusive embodied social intelligence challenge designed to test social perception and cooperation in embodied agents. In CHAIC, the goal is for an embodied agent equipped with egocentric observations to assist a human who may be operating under physical constraints—e.g., unable to reach high places or confined to a wheelchair—in performing common household or outdoor tasks as efficiently as possible. To achieve this, a successful helper must: (1) infer the human's intents and constraints by following the human and observing their behaviors (social perception), and (2) make a cooperative plan tailored to the human user to solve the task as quickly as possible, working together as a team (cooperative planning).

To benchmark this challenge, we create four new agents with real physical constraints and eight long-horizon tasks featuring both indoor and outdoor scenes with various constraints, emergency events, and potential risks. We benchmark planning- and learning-based baselines on the challenge and introduce a new method that leverages Large Language Models and behavior modeling. Empirical evaluations demonstrate the effectiveness of our benchmark in enabling systematic assessment of key aspects of machine social intelligence.


Dataset Description

Figure 1. Overview of CHAIC Benchmark.

The Constrained Human-AI Cooperation (CHAIC) Challenge studies how embodied agents perform social perception of human users with diverse physical constraints. We design and implement four new agents with real physical constraints and eight tasks featuring both indoor and outdoor scenes, including emergency events.

Each task pairs a constrained agent, which mimics a human user with capability constraints and tries to find and transport target objects to a specific goal location, with a helper agent that must infer the constrained agent's goal and capability constraints through active perception of its behaviors.
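For concreteness, the following is a minimal Python sketch of one episode from the helper's perspective. The environment wrapper, observation fields, and the metric name are illustrative assumptions for exposition, not the benchmark's actual API.

```python
# Hypothetical sketch of one CHAIC episode from the helper's point of view.
# Environment, observation, and metric names are placeholders, not the released API.

def run_episode(env, helper_policy, max_steps=1500):
    """The helper only receives egocentric observations; the constrained
    agent's goal and constraints must be inferred from its behavior."""
    obs = env.reset()  # helper's egocentric observation (e.g. RGB-D, pose)
    info = {}
    for _ in range(max_steps):
        action = helper_policy.act(obs)    # perceive partner, plan, act
        obs, done, info = env.step(action)
        if done:                           # all targets delivered or time is up
            break
    return info.get("transport_rate")      # e.g. fraction of targets delivered
```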

The four new agents are the child agent, the wheelchair agent, the bicycle agent, and the frail agent, each with specific constraints. Example video clips are shown below:


Child Agent

The child agent has limited height and may fail to reach high locations. It may also break fragile objects.

Successful pick-up.

Failed pick-up.

Breaking fragile objects.


Wheelchair Agent

The wheelchair agent cannot pass through obstacles, and the helper agent needs to remove them so the wheelchair agent can get by. The wheelchair agent may also fail to reach objects that are too low or too high.

Waiting for the helper to remove the obstacle.

Failed pick-up.

Successful pick-up.


Bicycle Agent with Child

The bicycle agent is slow to move and act. Sometimes its child may run away, and the helper needs to catch it.

The bicycle agent moves more slowly than the helper.

The bicycle agent has a basket on the bicycle that serves as a container.

The helper helps the bicycle agent catch the runaway child agent.


Frail Agent

The frail agent may fail to pick up heavy objects: the heavier the object, the more likely the pick-up action is to fail. The helper agent can pick up objects together with the frail agent. A simplified sketch of this failure model follows the clips below.

Two agents picking up together.

Failed pick-up.

Successful pick-up.
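As a rough illustration of how the four constraint types could be encoded, here is a small Python sketch. The field names, default values, and the weight-dependent success probability are assumptions made for exposition, not the benchmark's actual parameters.

```python
import random
from dataclasses import dataclass
from typing import Optional

# Illustrative encoding of the four constraint types (child, wheelchair,
# bicycle, frail). All field names and numbers are hypothetical.

@dataclass
class Constraints:
    max_reach_height: float = 2.0        # child / wheelchair: limited upward reach
    min_reach_height: float = 0.0        # wheelchair: very low objects are unreachable
    speed_scale: float = 1.0             # bicycle: moves and acts more slowly (< 1.0)
    blocked_by_obstacles: bool = False   # wheelchair: must wait for helper to clear path
    breaks_fragile: bool = False         # child: picking up fragile objects may break them
    strength: Optional[float] = None     # frail: lower strength -> more pick-up failures

def pickup_succeeds(c: Constraints, object_weight: float) -> bool:
    """Frail-agent style pick-up: success probability decays as the object
    gets heavier relative to the agent's strength."""
    if c.strength is None:
        return True
    p_success = min(1.0, c.strength / max(object_weight, 1e-6))
    return random.random() < p_success
```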


Tasks

The eight tasks span both indoor and outdoor scenes. The task names are: No constraint, Low target, Obstacle, High target, High goal location, High container, Shopping, and Moving house. A brief description of each is given below:

Table 1. Tasks with constrained agents, including both indoor and outdoor scenes and rich features.


LLM+BM Helper Baseline

Figure 2. Our proposed baseline helper.

We test six types of helpers: Random Helper, Rule-based Hierarchical Plan Helper (RHP), LLM+BM Helper, VLM Helper, RL Helper, and SmartHelp Helper. The LLM+BM Helper achieves the best performance in our benchmark. Figure 2 gives an overview of the LLM+BM Helper, which is equipped with modules for perception, memory, behavior modeling, decision, and execution. (1) The perception module detects objects from raw RGB images; (2) the memory module builds a semantic map of the environment and records behaviors; (3) the behavior modeling module recognizes the partner's action and localizes the object corresponding to that action; (4) the decision module plans the next steps using foundation models; and (5) the execution module generates low-level actions.
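The loop below sketches how these five modules could fit together in code. Class and method names are illustrative placeholders and do not correspond to the released implementation.

```python
# Skeleton of the five-module LLM+BM helper loop (all names are placeholders).

class LLMBMHelper:
    def __init__(self, detector, memory, behavior_model, llm_planner, controller):
        self.detector = detector              # (1) perception: detect objects in RGB
        self.memory = memory                  # (2) semantic map + behavior history
        self.behavior_model = behavior_model  # (3) recognize partner action, localize object
        self.llm_planner = llm_planner        # (4) decide next plan via a foundation model
        self.controller = controller          # (5) turn the plan into low-level actions

    def step(self, rgb_frame, agent_state):
        detections = self.detector(rgb_frame)                             # (1)
        self.memory.update(detections, agent_state)                       # (2)
        partner_action, target_obj = self.behavior_model(self.memory)     # (3)
        plan = self.llm_planner(self.memory, partner_action, target_obj)  # (4)
        return self.controller(plan, agent_state)                         # (5)
```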

The following video is a demonstration of the LLM+BM Helper's mechanism:

Qualitative Example

Combining the broad knowledge of foundation models with the precise perception of fine-tuned detection models, the LLM+BM Helper can infer the constrained agent's goal and constraints accurately and efficiently. Here is an example of the LLM+BM Helper's behavior:

Figure 3. LLM+BM Helper's thought and behavior.