FUSION-BASED FRAMEWORK FOR ROBUST AUDIO DEEPFAKE DETECTION

Authors

  • Sunny Dessai, Nebisha Muneesha, Gana K.V

DOI:

https://doi.org/10.25215/8194288797.35

Abstract

Recent advancements in deep learning have facilitated the generation of highly convincing synthetic audio, commonly known as deepfakes[1][2]. Such forgeries pose growing risks to data integrity and public trust. Although multiple detection techniques exist, many fail to adapt to novel and sophisticated generative models[1][10]. Addressing this gap, our study introduces a robust dual-stream deep learning framework for detecting audio deepfakes. We leverage feature fusion by simultaneously analyzing multiple audio representations[5][12]. Specifically, log-mel spectrograms are processed by a Convolutional Neural Network (CNN) to capture frequency textures, while Mel-Frequency Cepstral Coefficients (MFCCs) are fed into a Long Short-Term Memory (LSTM) network to model temporal variations[13][15]. By combining these features, the model uncovers subtle artifacts missed by single-feature approaches[5][8]. Our framework achieves 97.30% accuracy on a public benchmark dataset[3][4], demonstrating high performance and superior adaptability to complex audio forgeries.
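The dual-stream fusion design described in the abstract can be sketched as follows. This is a minimal illustrative implementation, not the authors' published configuration: the class name `DualStreamDetector`, the layer sizes, 80 mel bands, and 40 MFCC coefficients are all assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class DualStreamDetector(nn.Module):
    """Hypothetical sketch: CNN over log-mel spectrograms + LSTM over
    MFCC sequences, fused by concatenation for real/fake classification.
    All layer sizes are illustrative assumptions."""

    def __init__(self, n_mfcc: int = 40, lstm_hidden: int = 64):
        super().__init__()
        # Stream 1: CNN on log-mel spectrograms (treated as 1-channel images)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (B, 32, 1, 1)
            nn.Flatten(),             # -> (B, 32)
        )
        # Stream 2: LSTM on per-frame MFCC vectors
        self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=lstm_hidden,
                            batch_first=True)
        # Fusion head: concatenate both embeddings, output 2-class logits
        self.head = nn.Sequential(
            nn.Linear(32 + lstm_hidden, 32), nn.ReLU(),
            nn.Linear(32, 2),
        )

    def forward(self, log_mel: torch.Tensor, mfcc: torch.Tensor) -> torch.Tensor:
        # log_mel: (B, 1, n_mels, T);  mfcc: (B, T, n_mfcc)
        spec_emb = self.cnn(log_mel)
        _, (h_n, _) = self.lstm(mfcc)          # final hidden state summarizes time
        fused = torch.cat([spec_emb, h_n[-1]], dim=1)
        return self.head(fused)                # logits: (B, 2)

# Shape check with random inputs: batch of 4, 80 mel bands, 120 frames, 40 MFCCs
model = DualStreamDetector()
logits = model(torch.randn(4, 1, 80, 120), torch.randn(4, 120, 40))
print(tuple(logits.shape))
```

Concatenation is the simplest fusion strategy; the abstract does not specify whether the authors fuse by concatenation, weighted averaging, or a learned gate, so this sketch should be read only as a shape-compatible illustration of the two-stream idea.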

Published

2026-03-13