CoVLM is specifically designed to guide the LLM to explicitly compose visual entities and relationships within the text, and to dynamically communicate with the vision encoder and detection network to achieve vision-language communicative decoding. Specifically, we first devise a set of novel communication tokens for the LLM that enable dynamic communication between the visual detection system and the language system. A communication token is generated by the LLM to inform the detection network which relevant regions the LLM should attend to. The proposed regions of interest (ROIs) are then fed back into the LLM for better language generation conditioned on the relevant regions.
We hope CoVLM inspires further advances in the compositional reasoning ability of LLMs, making machines more intelligent across a broader range of applications.