FlexAttention for Efficient High-Resolution Vision-Language Models

1UMass Amherst     2Princeton University     3South China University of Technology
4University of California, Los Angeles    5MIT-IBM Watson AI Lab

Overview

Current vision-language models often struggle to perceive details in high-resolution images. Recently, some models have improved detail perception by encoding high-resolution images and using all of the resulting tokens to compute attention. However, this approach significantly increases computational cost. To address this issue, we propose FlexAttention, a flexible attention mechanism designed for efficient high-resolution vision-language models.

[Figure: overview of FlexAttention]

The idea behind FlexAttention is simple: instead of using all tokens of a high-resolution image to compute attention, we use only a subset of important high-resolution tokens, dynamically selected through the attention map. Specifically, the high-resolution input image is downsampled to produce a low-resolution version. The tokens of this low-resolution image, similar to those used in other VLMs, are concatenated with the text tokens and fed into the LLM. The attention map produced by the LLM's attention module is then used to select a subset of important high-resolution tokens, as in the sketch below.
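A minimal PyTorch sketch of this attention-guided selection follows. The function name select_hires_tokens, the tensor shapes, and the assumption that each low-resolution position maps to a contiguous block of high-resolution tokens are illustrative simplifications, not the exact implementation from the paper.

import torch

def select_hires_tokens(attn_map, hires_features, num_regions=16):
    # attn_map:       (B, N_low) attention each low-res image token receives,
    #                 e.g. averaged over heads and text queries (an assumption).
    # hires_features: (B, N_high, D) high-resolution feature map; we assume each
    #                 low-res position corresponds to a contiguous block of
    #                 N_high // N_low high-res tokens.
    B, N_low = attn_map.shape
    _, N_high, D = hires_features.shape
    ratio = N_high // N_low  # high-res tokens per low-res position

    # Pick the low-res positions that receive the most attention.
    topk = torch.topk(attn_map, k=num_regions, dim=-1).indices  # (B, k)

    # Expand each selected position to its block of high-res token indices.
    offsets = torch.arange(ratio, device=attn_map.device)
    idx = (topk.unsqueeze(-1) * ratio + offsets).flatten(1)  # (B, k * ratio)

    # Gather the selected high-resolution features.
    return hires_features.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))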

[Figure: high-resolution feature selection]

This selection takes place in the Feature Selection Module: the regions with the highest attention values are selected, cropped from the high-resolution feature map, and passed to the Hierarchical Attention Module in the next layer. The Hierarchical Attention Module, which replaces the original self-attention module, computes attention among the selected high-resolution tokens, the low-resolution tokens, and the text tokens. This lets the model focus on important regions of the high-resolution image and obtain detailed information efficiently.
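To make the hierarchy concrete, the following simplified single-head PyTorch module lets the low-resolution and text tokens attend both to themselves and to the selected high-resolution tokens. The class name HierarchicalAttention, the projection layout, and the omission of multi-head splitting and causal masking are assumptions for illustration rather than the paper's exact module.

import torch
import torch.nn as nn

class HierarchicalAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.kv_proj = nn.Linear(dim, 2 * dim)
        self.out_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, hidden, hires_selected):
        # hidden:         (B, N, D) low-res image tokens + text tokens
        # hires_selected: (B, M, D) tokens picked by the feature selection step
        q = self.q_proj(hidden)
        # Keys/values cover both the regular sequence and the selected
        # high-resolution tokens, so fine detail is attended to only
        # where the model has flagged it as important.
        kv = self.kv_proj(torch.cat([hidden, hires_selected], dim=1))
        k, v = kv.chunk(2, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.out_proj(attn @ v)

Because only the M selected high-resolution tokens enter the key/value set, the added cost scales with M rather than with the full high-resolution token count, which is where the efficiency gain comes from.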


Attention Map Visualization

[Attention map visualization]

Question: What is the brand of this camera?
Answer: Dakota digital


[Attention map visualization]

Question: What is the number on the runner in the middle?
Answer: 57859


[Attention map visualization]

Question: Who wrote this book?
Answer: Ray Kurzweil


Citation

@misc{li2024flexattention,
      title={FlexAttention for Efficient High-Resolution Vision-Language Models}, 
      author={Junyan Li and Delin Chen and Tianle Cai and Peihao Chen and Yining Hong and Zhenfang Chen and Yikang Shen and Chuang Gan},
      year={2024},
      eprint={2407.20228},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.20228}, 
}