FUSION-BASED FRAMEWORK FOR ROBUST AUDIO DEEPFAKE DETECTION
DOI: https://doi.org/10.25215/8194288797.35

Abstract
Recent advances in deep learning have made it possible to generate highly convincing synthetic audio, commonly known as deepfakes[1][2]. Such forgeries pose growing risks to data integrity and public trust. Although many detection techniques exist, most fail to adapt to novel and sophisticated generative models[1][10]. To address this gap, our study introduces a robust dual-stream deep learning framework for detecting audio deepfakes. We leverage feature fusion by analyzing multiple audio representations simultaneously[5][12]: log-mel spectrograms are processed by a Convolutional Neural Network (CNN) to capture frequency textures, while Mel-Frequency Cepstral Coefficients (MFCCs) are fed into a Long Short-Term Memory (LSTM) network to model temporal variations[13][15]. By combining these complementary features, the model uncovers subtle artifacts that single-feature approaches miss[5][8]. Our framework achieves 97.30% accuracy on a public benchmark dataset[3][4], demonstrating high performance and improved robustness to complex audio forgeries.

Published: 2026-03-13
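Below is a minimal PyTorch sketch of the dual-stream design the abstract describes: a CNN branch over log-mel spectrograms, an LSTM branch over MFCC sequences, and concatenation-based feature fusion feeding a binary real/fake classifier. All layer sizes, input shapes, and the `DualStreamDetector` name are illustrative assumptions, not the authors' published configuration.

```python
# Minimal sketch of a dual-stream audio-deepfake detector (assumed shapes
# and layer sizes; not the paper's exact architecture).
import torch
import torch.nn as nn


class DualStreamDetector(nn.Module):
    """CNN branch for log-mel spectrograms + LSTM branch for MFCCs,
    fused by concatenation for binary real/fake classification."""

    def __init__(self, n_mfcc: int = 40, lstm_hidden: int = 128):
        super().__init__()
        # CNN stream: captures local time-frequency textures in the
        # log-mel spectrogram, shaped (batch, 1, n_mels, frames).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),  # fixed-size output for any clip length
        )
        self.cnn_out = 32 * 4 * 4
        # LSTM stream: models frame-to-frame temporal variation in the
        # MFCC sequence, shaped (batch, frames, n_mfcc).
        self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=lstm_hidden,
                            num_layers=1, batch_first=True)
        # Fusion head: concatenated embeddings -> single real/fake logit.
        self.classifier = nn.Sequential(
            nn.Linear(self.cnn_out + lstm_hidden, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 1),  # train with BCEWithLogitsLoss
        )

    def forward(self, log_mel: torch.Tensor, mfcc: torch.Tensor) -> torch.Tensor:
        spec_emb = self.cnn(log_mel).flatten(1)          # (batch, cnn_out)
        _, (h_n, _) = self.lstm(mfcc)                    # h_n: (1, batch, hidden)
        temp_emb = h_n[-1]                               # last hidden state
        fused = torch.cat([spec_emb, temp_emb], dim=1)   # feature fusion
        return self.classifier(fused)


if __name__ == "__main__":
    model = DualStreamDetector()
    log_mel = torch.randn(8, 1, 80, 200)   # 8 clips, 80 mel bins, 200 frames
    mfcc = torch.randn(8, 200, 40)         # same clips, 40 MFCCs per frame
    logits = model(log_mel, mfcc)
    print(logits.shape)                    # torch.Size([8, 1])
```

Concatenating the two embeddings is the simplest fusion choice; the abstract does not state the exact fusion mechanism, so alternatives such as attention-weighted or gated fusion would fit the same two-branch layout.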
