A Novel Bengali Spam Comment Dataset with Transformer Based Classification for Social Media Content Moderation

Accepted at: 28th International Conference on Computer and Information Technology (ICCIT 2025).

Abstract: Although more than 230 million people speak Bengali worldwide, research on automatic spam detection in Bengali has been largely neglected, leaving Bengali-speaking communities exposed to fraudulent relationships and scam attempts on social media platforms. This paper presents the first large-scale Bangla spam comment dataset, filling a major gap in natural language processing for low-resource languages. The dataset contains 9,000 balanced Bengali comments systematically collected from a range of social media sources, including popular Facebook pages and YouTube channels across the news, entertainment, sports, technology, and lifestyle domains. Our work includes a specialized preprocessing pipeline with culturally aware emoji-to-text conversion, BNLTK-based normalization, and semantic noise filtering, all of which are crucial for spam classification. The dataset was labeled under a strict three-stage annotation protocol and reaches an inter-annotator agreement of 0.932, indicating high annotation quality. We compare against four contemporary transformer models and show that BanglaBERT obtains a significant improvement, with 94% accuracy and an F1-score of 0.95. Statistical validation using McNemar’s test confirms that the benefit of language-specific pretraining is significant. To the best of our knowledge, this is the first standardized benchmark for Bengali spam classification. We provide the resources needed to support further research, with the aim of strengthening security for regional-language users and enabling better content moderation systems for the Bengali social media ecosystem.
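To make the statistical validation step concrete, the following is a minimal sketch of how McNemar’s test can compare two classifiers evaluated on the same test comments. The abstract does not state which variant of the test was used, so the exact binomial form here is an assumption, and the function name `mcnemar_exact` and the toy labels are hypothetical illustrations, not the paper's data or code.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions (illustrative sketch).

    Returns (b, c, p_value), where
      b = items model A classifies correctly and model B does not,
      c = items model B classifies correctly and model A does not.
    """
    b = c = 0
    for truth, a, p in zip(y_true, pred_a, pred_b):
        a_ok = a == truth
        b_ok = p == truth
        if a_ok and not b_ok:
            b += 1
        elif b_ok and not a_ok:
            c += 1

    n = b + c  # only discordant pairs carry information
    if n == 0:
        return b, c, 1.0

    # Under the null hypothesis, discordant pairs split 50/50,
    # so min(b, c) follows a Binomial(n, 0.5) distribution.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical toy labels: 1 = spam, 0 = not spam.
    y_true = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    model_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    model_b = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    b, c, p = mcnemar_exact(y_true, model_a, model_b)
    print(f"b={b}, c={c}, p={p:.6f}")
```

A small p-value suggests that the two models disagree in a systematically one-sided way, which is the kind of evidence the abstract refers to when it attributes the gain to language-specific pretraining.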
