HASOC Marathi Offensive Language Dataset

MH Specific

Shared-task datasets for Hate Speech and Offensive Content identification (HASOC) in Marathi, from the HASOC 2021 and 2022 editions. Uses the OLID taxonomy, with ~4,970 annotated tweets. Complements L3Cube MahaHate by using a different annotation scheme and different data sources; competitive baselines have been published by multiple research teams.

Build an automated content moderation system for Marathi social media platforms to flag offensive language in real-time.

Quick Start

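# Download the Marathi data from https://hasocfire.github.io/hasoc/
# 'hasoc_marathi.csv' is a placeholder; use the path of the file you downloaded.
import pandas as pd

df = pd.read_csv('hasoc_marathi.csv')
print(f"Total samples: {len(df)}")
# Class distribution; assumes a 'label' column as listed under Schema
print(df['label'].value_counts())
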
Modality: text
Size: ~4,970 annotated tweets (1,874 from 2021 + 3,096 from 2022)
License: (not specified)
Format: CSV
Language: mr
Update Frequency: static
Organization: HASOC / FIRE

Schema

Field   Type     Description
text    string   Social media text in Marathi
label   string   Classification label (hate, offensive, profane, none)
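
A quick sanity check against the schema above can catch a renamed column or an unexpected label before any training run. The following is a minimal sketch using only the Python standard library. The column names `text` and `label` and the four label values come from the schema; the function name `validate_rows` and the sample rows are hypothetical placeholders, and the released HASOC files may use different column names or label codes.

```python
import csv
import io
from collections import Counter

# Label set as listed in the schema; adjust if the released files differ.
EXPECTED_LABELS = {"hate", "offensive", "profane", "none"}
REQUIRED_COLUMNS = {"text", "label"}


def validate_rows(handle):
    """Return (label_counts, problems) for a CSV file-like object."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return Counter(), [f"missing columns: {sorted(missing)}"]

    counts = Counter()
    problems = []
    # Data rows are numbered from 1; the header row is not counted.
    for row_no, row in enumerate(reader, start=1):
        label = (row["label"] or "").strip().lower()
        text = (row["text"] or "").strip()
        if not text:
            problems.append(f"data row {row_no}: empty text")
        if label not in EXPECTED_LABELS:
            problems.append(f"data row {row_no}: unexpected label {row['label']!r}")
        else:
            counts[label] += 1
    return counts, problems


# Hypothetical placeholder rows, not taken from the dataset.
sample = io.StringIO(
    "text,label\n"
    "placeholder tweet one,none\n"
    "placeholder tweet two,offensive\n"
    "placeholder tweet three,spam\n"
)
counts, problems = validate_rows(sample)
print(dict(counts))
print(problems)
```

To check a downloaded file, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to `validate_rows`. An empty `problems` list means every row has non-empty text and one of the four expected labels.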

Build With This

Create a toxicity severity scorer for Marathi text that ranks offensive content by intensity for prioritized human review
Develop a counter-speech generator that suggests constructive Marathi responses to hateful messages
Build an election-period hate speech monitor for Maharashtra that tracks spikes in offensive language across social media
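
As a starting point for the monitoring idea in the list above, a spike can be treated as a day whose count of flagged posts sits well above the recent average. The sketch below shows one simple way to do that, assuming daily counts have already been produced by some classifier trained on this dataset. The function name `find_spikes`, the 7-day window, the threshold of 3 standard deviations, and the counts themselves are illustrative assumptions, not part of the dataset or of any published baseline.

```python
from statistics import mean, pstdev


def find_spikes(daily_counts, window=7, threshold=3.0):
    """Return indices of days whose count exceeds the trailing-window
    mean by more than `threshold` population standard deviations."""
    spikes = []
    for i in range(window, len(daily_counts)):
        history = daily_counts[i - window:i]
        mu = mean(history)
        # Floor sigma at 1.0 so a perfectly flat history does not
        # turn every small fluctuation into a spike.
        sigma = max(pstdev(history), 1.0)
        if (daily_counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


# Made-up daily counts of posts flagged as offensive (illustrative only).
counts = [12, 15, 11, 14, 13, 12, 16, 14, 41, 15]
print(find_spikes(counts))
```

The trailing window means the first `window` days are never flagged, and a large spike inflates the baseline for the days that follow it. A production monitor would likely need a more robust baseline (for example a median-based one) and a classifier whose error rate on current data is known.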

AI Use Cases

Marathi offensive content moderation
Hate speech detection with OLID taxonomy
Social media content filtering
Online safety monitoring
Last verified: 2026-03-09