MultiPLY

A Multisensory Object-Centric Embodied Large Language Model in 3D World

1 UCLA       2 SCUT       3 UMass Amherst       4 MIT-IBM Watson AI Lab      



Overview

In this work, we empower LLMs with multisensory perception and active observation abilities. We propose MultiPLY, a multisensory embodied large language model that incorporates multisensory interactive data, including visual, audio, tactile, and thermal information, into large language models, thereby establishing the correlation among words, actions, and perceptions. MultiPLY can perform a diverse set of multisensory embodied tasks, including multisensory question answering, embodied question answering, task decomposition, object retrieval, and tool use.


How to empower LLMs with embodied multisensory perception ability?

(Pipeline figure: MultiPLY method overview)
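
To make the idea concrete, below is a minimal, hypothetical sketch (in PyTorch) of how observations from different sensors could be projected into the token space of a language model and prepended to the text prompt. This is not the actual MultiPLY implementation; all module names, feature dimensions, and the fusion strategy are illustrative assumptions.

import torch
import torch.nn as nn

class SensorTokenizer(nn.Module):
    """Projects a raw sensor feature vector into the LLM embedding space."""
    def __init__(self, feature_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(feature_dim, embed_dim)

    def forward(self, feature: torch.Tensor) -> torch.Tensor:
        # (batch, feature_dim) -> (batch, 1, embed_dim): one token per observation
        return self.proj(feature).unsqueeze(1)

class MultisensoryPrompt(nn.Module):
    """Interleaves per-modality sensor tokens with text token embeddings."""
    def __init__(self, embed_dim: int, feature_dims: dict):
        super().__init__()
        self.tokenizers = nn.ModuleDict(
            {name: SensorTokenizer(dim, embed_dim) for name, dim in feature_dims.items()}
        )

    def forward(self, text_embeds: torch.Tensor, observations: dict) -> torch.Tensor:
        # Prepend one token per available sensory observation to the text prompt.
        sensor_tokens = [self.tokenizers[name](feat) for name, feat in observations.items()]
        return torch.cat(sensor_tokens + [text_embeds], dim=1)

# Example: fuse visual, audio, tactile, and thermal features with a text prompt.
embed_dim = 4096  # assumed LLM embedding size
prompt_builder = MultisensoryPrompt(
    embed_dim,
    feature_dims={"visual": 1024, "audio": 512, "tactile": 64, "thermal": 16},
)
text_embeds = torch.randn(1, 12, embed_dim)  # stand-in for tokenized instruction
obs = {
    "visual": torch.randn(1, 1024),
    "audio": torch.randn(1, 512),
    "tactile": torch.randn(1, 64),
    "thermal": torch.randn(1, 16),
}
fused = prompt_builder(text_embeds, obs)  # (1, 16, 4096), ready for the LLM

In MultiPLY itself, the sensory inputs are object-centric and gathered through interaction with the 3D environment; the sketch above only shows the generic pattern of fusing multisensory tokens with a text prompt.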


Demos

(Enable audio in the control bar to hear the sound.)

Impact Sound

Ambient Sound

Visual Perception

Tactile Sense

Thermal Sense

Navigation

Tool Use

Multisensory Captioning

Question Answering

Object Retrieval

Task Decomposition

Rearrangement



Citation

If you use this work or find it helpful, please consider citing: (BibTeX)

@misc{hong2024multiply,
      title={MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World}, 
      author={Yining Hong and Zishuo Zheng and Peihao Chen and Yian Wang and Junyan Li and Chuang Gan},
      year={2024},
      eprint={2401.08577},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
} 


Thanks to Justin Kerr for the website template.