DEEP LEARNING-BASED VIDEO CAPTIONING
DOI:
https://doi.org/10.25215/8194288797.48
Abstract
Generating detailed text descriptions from videos is a difficult task that combines computer vision with natural language understanding. This study introduces an end-to-end system that converts video content into readable natural language descriptions using a combination of deep learning techniques. The system uses a pre-trained ResNet-50 model to extract spatial features from selected video frames and a Transformer-based decoder to generate fluent captions. To demonstrate practical use, we built a web-based interface with Streamlit that supports real-time video processing and caption generation. The design works with both real video data and synthetically generated content, which helps mitigate the data scarcity that arises when training deep learning models. Our evaluation confirms that this approach strikes a good balance between description accuracy and computational cost, producing relevant captions while running efficiently on standard hardware.
Published
2026-03-13
Issue
Section
Articles
