Roman Urdu Multi-Class Offensive Text Detection using Hybrid Features and SVM

EasyChair Preprint 4810

5 pages • Date: December 25, 2020

Tauqeer Sajid, Mehdi Hassan, Mohsan Ali and Rabia Gillani

Abstract

Hate content has become a significant issue worldwide with the growth of social networking sites. Detecting hate content in languages other than English is challenging. We propose a new technique that automatically classifies Roman Urdu comments from YouTube videos into five classes: Religious Hate, Violence Promotion, Extremist (Racist), Threat/Fear, and Neutral. We generated a dataset by scraping Roman Urdu comments from YouTube videos and had it labeled by language experts. We use N-grams and TF-IDF values for feature extraction, followed by SVM classification. Some classes have relatively few instances, so we employed SMOTE for class balancing. The developed model achieves a classification performance of 77.45% under 10-fold cross-validation. The proposed approach offers superior classification results compared to other approaches.
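
The feature-extraction step described in the abstract (word N-grams weighted by TF-IDF) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the uni- to tri-gram range, the whitespace tokenizer, the weighting formula (term frequency times ln(N/df) + 1), and the toy comments are all assumptions made for illustration.

```python
import math
from collections import Counter


def word_ngrams(text, n_min=1, n_max=3):
    """Return all word n-grams of length n_min..n_max (hypothetical helper)."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf_vectors(comments, n_min=1, n_max=3):
    """Build one TF-IDF vector per comment over a shared n-gram vocabulary.

    Assumed weighting: tf = count / total n-grams in the comment,
    idf = ln(N / df) + 1. The preprint does not state its exact formula.
    """
    docs = [Counter(word_ngrams(c, n_min, n_max)) for c in comments]
    vocab = sorted({gram for doc in docs for gram in doc})
    n_docs = len(docs)
    # Document frequency: number of comments containing each n-gram.
    df = Counter(gram for doc in docs for gram in doc)
    idf = {gram: math.log(n_docs / df[gram]) + 1.0 for gram in vocab}
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append(
            [(doc[gram] / total) * idf[gram] if total else 0.0 for gram in vocab]
        )
    return vocab, vectors


# Invented toy comments, not taken from the authors' dataset.
toy_comments = ["yeh video achi hai", "yeh video buri hai"]
vocab, vectors = tfidf_vectors(toy_comments)
for gram, weight in zip(vocab, vectors[0]):
    print(f"{gram!r}: {weight:.3f}")
```

In this sketch, n-grams that occur in only one comment receive a higher weight than n-grams shared by every comment. Vectors of this kind would then be passed to the class-balancing (SMOTE) and SVM classification steps, which are not shown here.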

Keyphrases: Roman Urdu, TF-IDF, Tri-gram, Uni-gram, YouTube, deep learning, forensic lab air, hate speech, machine learning, n-gram, religious hate, roman urdu data, uni gram bi, violence promotion

Links:

https://easychair.org/publications/preprint/6xvf

BibTeX entry

BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:

@booklet{EasyChair:4810,
  author    = {Tauqeer Sajid and Mehdi Hassan and Mohsan Ali and Rabia Gillani},
  title     = {Roman Urdu Multi-Class Offensive Text Detection using Hybrid Features and SVM},
  howpublished = {EasyChair Preprint 4810},
  year      = {EasyChair, 2020}}
