VIDEO CAPTIONING USING TRANSFORMERS AND BERT

Authors

  • K. Annapoorneshwari Shetty, Shravya, Ishika Amin

DOI:

https://doi.org/10.25215/8194288797.11

Abstract

Video captioning lies at the intersection of visual computing and language understanding, aiming to produce natural language descriptions of video content. Traditional methods often rely on convolutional neural networks (CNNs) and recurrent neural networks (RNNs), which struggle to model long-range dependencies and temporal dynamics. This paper introduces a method that applies Transformer architectures to video feature extraction and caption generation, with BERT used to refine the semantics of the output. Our approach uses CLIP to extract spatial features from video frames, which are then processed by a Transformer-based encoder-decoder. BERT refines the generated captions, ensuring they are grammatically correct and contextually appropriate. Experiments on established benchmark datasets demonstrate the effectiveness of our approach, which attains top results on the BLEU, ROUGE, and CIDEr metrics.
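To make the pipeline concrete, the sketch below outlines the architecture the abstract describes: a frozen CLIP vision encoder extracts per-frame spatial features, a Transformer encoder models temporal structure, and a Transformer decoder generates caption tokens. This is a minimal PyTorch sketch under stated assumptions, not the authors' implementation; the checkpoint name openai/clip-vit-base-patch32 and all dimensions are illustrative choices, and the BERT refinement stage is noted only in a comment.

    import torch.nn as nn
    from transformers import CLIPVisionModel

    class VideoCaptioner(nn.Module):
        def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=4):
            super().__init__()
            # Frozen CLIP vision tower extracts spatial features per frame
            # (checkpoint name is an illustrative assumption).
            self.clip = CLIPVisionModel.from_pretrained(
                "openai/clip-vit-base-patch32")
            for p in self.clip.parameters():
                p.requires_grad = False
            self.proj = nn.Linear(self.clip.config.hidden_size, d_model)
            # Encoder models temporal relations across frames; decoder
            # generates caption tokens autoregressively.
            self.encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
                num_layers)
            self.decoder = nn.TransformerDecoder(
                nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
                num_layers)
            self.tok_emb = nn.Embedding(vocab_size, d_model)
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, frames, caption_ids):
            # frames: (batch, num_frames, 3, 224, 224) CLIP-preprocessed pixels
            # caption_ids: (batch, seq_len) token ids of the partial caption
            b, t = frames.shape[:2]
            feats = self.clip(pixel_values=frames.flatten(0, 1)).pooler_output
            feats = self.proj(feats).view(b, t, -1)   # (b, t, d_model)
            memory = self.encoder(feats)              # temporal encoding
            tgt = self.tok_emb(caption_ids)
            mask = nn.Transformer.generate_square_subsequent_mask(
                tgt.size(1)).to(tgt.device)
            out = self.decoder(tgt, memory, tgt_mask=mask)
            # A BERT-based refinement pass over decoded captions (e.g.
            # rescoring candidate sentences) would follow at inference time.
            return self.lm_head(out)                  # next-token logits

In a setup like this, the decoder's cross-attention over the encoded frame sequence is what lets caption tokens attend to video content at any temporal position, which is the advantage over RNN-based pipelines that the abstract highlights.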

Published

2026-03-13