1 UCLA
2 SCUT
3 UMass Amherst
4 MIT-IBM Watson AI Lab
In this work, we empower LLMs with multisensory perception and the ability to actively observe their environment. We propose MultiPLY, a multisensory embodied large language model that incorporates multisensory interactive data, including visual, audio, tactile, and thermal information, into large language models, thereby establishing the correlation among words, actions, and perceptions. MultiPLY can perform a diverse set of multisensory embodied tasks, including multisensory question answering, embodied question answering, task decomposition, object retrieval, and tool use.
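To give a concrete sense of what "incorporating multisensory data into an LLM" could look like, here is a minimal, hedged sketch: per-modality encoders project visual, audio, tactile, and thermal features into the LLM's token-embedding space, and the resulting sensor tokens are concatenated with the text tokens. The module names, feature dimensions, and projector design below are illustrative assumptions, not the exact architecture described in the paper.

import torch
import torch.nn as nn

class SensorProjector(nn.Module):
    """Projects a raw sensory feature vector into the LLM token-embedding space."""
    def __init__(self, feat_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (batch, feat_dim) -> (batch, 1, llm_dim), i.e. one sensor token per observation
        return self.proj(feat).unsqueeze(1)

llm_dim = 4096                       # assumed LLM hidden size
projectors = nn.ModuleDict({         # one projector per modality (feature dims are assumptions)
    "visual":  SensorProjector(1024, llm_dim),
    "audio":   SensorProjector(512,  llm_dim),
    "tactile": SensorProjector(64,   llm_dim),
    "thermal": SensorProjector(16,   llm_dim),
})

# Placeholder per-modality features gathered from one interaction step (batch size 1).
obs = {
    "visual":  torch.randn(1, 1024),
    "audio":   torch.randn(1, 512),
    "tactile": torch.randn(1, 64),
    "thermal": torch.randn(1, 16),
}

# Encode each observation and concatenate with the embedded text prompt,
# so the language model attends jointly over words and sensory readings.
sensor_tokens = torch.cat([projectors[m](x) for m, x in obs.items()], dim=1)
text_tokens = torch.randn(1, 32, llm_dim)    # stand-in for embedded prompt tokens
llm_input = torch.cat([sensor_tokens, text_tokens], dim=1)
print(llm_input.shape)  # torch.Size([1, 36, 4096])

In such a design, the correlation among words, actions, and perceptions is learned because sensory readings enter the model as ordinary tokens that the language model can attend to alongside the instruction text.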
(enable audio in the control bar to hear the sound)
Impact Sound | Ambient Sound | Visual Perception | Tactile Sense | Thermal Sense | Navigation | Tool Use | Multisensory Captioning | Question Answering | Object Retrieval | Task Decomposition | Rearrangement
If you use this work or find it helpful, please consider citing: (bibtex)
@misc{hong2024multiply,
  title={MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World},
  author={Yining Hong and Zishuo Zheng and Peihao Chen and Yian Wang and Junyan Li and Chuang Gan},
  year={2024},
  eprint={2401.08577},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
Thanks to Justin Kerr for the website template.