Composing Visual Entities and Relationships in
Large Language Models Via Communicative Decoding

1UMass Amherst    2Wuhan University    3University of California, Los Angeles
4MIT-IBM Watson AI Lab    5South China University of Technology


We introduce CoVLM, a novel vision-language model that composes visual entities and relationships via communicative decoding.

CoVLM is specifically designed to guide the LLM to explicitly compose visual entities and relationships within the text, and to dynamically communicate with the vision encoder and detection network to achieve vision-language communicative decoding. Specifically, we first devise a set of novel communication tokens for the LLM, enabling dynamic communication between the visual detection system and the language system. A communication token is generated by the LLM to inform the detection network to propose relevant regions that the LLM should attend to. The proposed regions of interest (ROIs) are then fed back into the LLM for better language generation conditioned on the relevant regions.
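The decoding loop described above alternates between language generation and visual grounding. The following is a minimal sketch of that control flow, using hypothetical stub functions (`llm_step`, `detect_rois`, `roi_embedding`) and an assumed communication-token name; the real CoVLM interfaces and token vocabulary may differ.

```python
# Sketch of vision-language communicative decoding. All components here
# are simplified stand-ins, not the actual CoVLM implementation.

VISUAL_TOKEN = "<visual>"  # communication token emitted by the LLM (assumed name)

def llm_step(text, conditioning):
    """Stub LLM: returns the next token given prior text and visual features.
    A real model would condition on the ROI features in `conditioning`."""
    script = ["a", "dog", VISUAL_TOKEN, "chases", "a", "ball", VISUAL_TOKEN, "<eos>"]
    return script[len(text)]

def detect_rois(image, text):
    """Stub detection network: proposes regions relevant to the text so far."""
    return [(10, 20, 50, 60)]  # one bounding box (x1, y1, x2, y2)

def roi_embedding(image, roi):
    """Stub ROI feature extractor; a real system would pool visual features."""
    return f"feat{roi}"

def communicative_decode(image, max_steps=20):
    text = []          # language tokens generated so far
    conditioning = []  # ROI features fed back into the LLM
    while len(text) < max_steps:
        token = llm_step(text, conditioning)
        if token == "<eos>":
            break
        text.append(token)
        if token == VISUAL_TOKEN:
            # Top-down: the communication token asks the detector for regions.
            rois = detect_rois(image, text)
            # Bottom-up: ROI features are fed back for grounded generation.
            conditioning.extend(roi_embedding(image, r) for r in rois)
    return text, conditioning
```

The key design point is the two-way channel: the LLM decides *when* to consult the detector (by emitting a communication token), and the detector's ROI features in turn shape the next language tokens.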

We hope CoVLM inspires further advances in compositional reasoning for LLMs, making machines more intelligent across broader applications.

Qualitative Results