IMAGE SUMMARIZATION USING FASTVLM

Authors

  • Puneeth N Ail, Renuka S, Dr. Hemalatha N, Dr. Rakesh Kumar

DOI:

https://doi.org/10.25215/8194288797.23

Abstract

Scaling the resolution of input images plays a crucial role in improving the accuracy of Vision‑Language Models (VLMs), especially when dealing with complex, text‑dense visual scenes. Nevertheless, conventional visual encoders—such as Vision Transformers (ViTs)—struggle to maintain computational efficiency at high resolutions, because the number of visual tokens grows quadratically with image resolution and deep self‑attention stacks add further latency. To address these challenges, this study explores resolution‑adaptive optimization in VLMs across two key dimensions: reducing vision‑side encoding latency and limiting the number of visual tokens transmitted to the language model, thereby improving overall system responsiveness. Through a detailed empirical analysis of the interaction among image resolution, encoding speed, token density, and LLM size, we employ FastVLM, an efficiency‑driven model that balances inference cost and predictive performance. FastVLM integrates FastViTHD, a hybrid vision encoder engineered to generate a compact token representation while ensuring high perceptual fidelity for large‑scale images. Unlike prior strategies that require explicit token pruning or complex architectural modifications, FastVLM achieves its efficiency solely through scaling of the input resolution, offering a simpler yet highly effective design for real‑time visual reasoning.
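The token-growth problem the abstract describes can be illustrated with a minimal sketch. The function below assumes a generic ViT-style encoder that splits a square image into fixed-size patches (a patch size of 14, as in ViT-L/14, is an assumption for illustration; it does not reproduce FastViTHD's actual downsampling schedule):

```python
# Illustrative sketch, not from the paper: visual token count for a
# ViT-style encoder that tiles a square image into fixed-size patches.
# Patch size 14 is an assumed value (as in ViT-L/14).

def vit_token_count(resolution: int, patch_size: int = 14) -> int:
    """Number of visual tokens produced for a square input image."""
    side = resolution // patch_size  # patches per image side
    return side * side               # tokens grow quadratically with resolution

for res in (224, 448, 896):
    print(f"{res}px -> {vit_token_count(res)} tokens")
# 224px -> 256 tokens, 448px -> 1024 tokens, 896px -> 4096 tokens
```

Doubling the input resolution quadruples the token count, which in turn inflates both encoder latency and the prefill cost of the language model — the bottleneck that FastViTHD's compact token representation is designed to mitigate.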

Published

2026-03-13