EXPLAINABLE TOXICITY DETECTION IN TEXT USING TRANSFORMER MODEL AND TOKEN-LEVEL ATTRIBUTIONS

Authors

  • Deeksha MG, Grishel, Aravinda Prabhu S

DOI:

https://doi.org/10.25215/8194288770.18

Abstract

The proliferation of hate speech on online platforms calls for robust and transparent moderation tools. This paper introduces an explainable transformer-based framework for detecting and interpreting linguistic toxicity. We fine-tune a BERT architecture for multi-class classification to distinguish among toxic, offensive, and neutral content. Crucially, the model integrates explainability through the transformers-interpret library, generating token-level attributions that visually highlight the specific terms driving each classification. The system is implemented as an interactive Streamlit web application in which users can submit text or upload PDF documents for real-time analysis. The interface pairs these visual explanations with probabilistic classifications, giving users insight into the results. A keyword-based reasoning component complements the deep-learning attributions and further enhances user trust and interpretability. Experimental evaluations on benchmark datasets such as the Hate Speech and Offensive Language Dataset demonstrate the model's high accuracy, strong generalization, and balanced sensitivity across categories. This work presents a practical, user-centric tool for content moderation and contributes to the field of explainable NLP for the critical task of understanding digital toxicity.
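To illustrate the keyword-based reasoning component described above, the following is a minimal Python sketch. The lexicon, category labels, and function name are illustrative assumptions, not the authors' implementation; in the described system such rationales would complement the transformers-interpret token attributions.

```python
# Illustrative sketch (not the authors' code) of a keyword-based
# reasoning component: flag lexicon terms in the input and emit a
# plain-language rationale to accompany model-based attributions.
import re

# Tiny assumed lexicon; a real deployment would use a curated list.
TOXIC_LEXICON = {"idiot": "insult", "hate": "hostility", "stupid": "insult"}

def keyword_rationale(text: str) -> list[str]:
    """Return a human-readable reason for each lexicon hit in `text`."""
    reasons = []
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in TOXIC_LEXICON:
            reasons.append(f"'{token}' matches the {TOXIC_LEXICON[token]} lexicon")
    return reasons

print(keyword_rationale("I hate this, you idiot"))
```

Because the rule-based rationale is independent of the neural model, it remains interpretable even when attribution scores are diffuse.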

Published

2026-03-11