HASOC Marathi Offensive Language Dataset

MH Specific

Shared-task datasets for Hate Speech and Offensive Content (HASOC) identification in Marathi, released for the HASOC 2021 and 2022 tracks. Together they contain ~4,970 annotated tweets labeled under the OLID taxonomy. The dataset complements L3Cube MahaHate: it uses a different annotation scheme and different data sources, and multiple research teams have published competitive baselines on it.

Build an automated content moderation system for Marathi social media platforms to flag offensive language in real-time.


Quick Start

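# Download the Marathi files from https://hasocfire.github.io/hasoc/
# and adjust the filename below to match your local copy.
import pandas as pd

df = pd.read_csv('hasoc_marathi.csv')
print(f"Total samples: {len(df)}")
# Class balance across the labels
print(df['label'].value_counts())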
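
Once the data loads, a quick sanity-check classifier helps confirm that the labels are learnable before investing in a larger model. The block below is a minimal sketch, not a reproduction of any published HASOC system: it is a dependency-free multinomial Naive Bayes over character trigrams, which avoids needing a Marathi tokenizer. The names `char_ngrams` and `NaiveBayesBaseline` are hypothetical, and the trigram size and add-one smoothing are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams (space-padded)."""
    padded = f" {text.strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesBaseline:
    """Multinomial Naive Bayes over character trigrams, add-one smoothing.

    Hypothetical helper for illustration; not part of the HASOC release.
    """

    def fit(self, texts, labels):
        labels = list(labels)
        self.label_counts = Counter(labels)
        self.ngram_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.ngram_counts[label].update(char_ngrams(text))
        # Shared vocabulary size is used in the smoothing denominator
        self.vocab = {g for c in self.ngram_counts.values() for g in c}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus summed log likelihoods of each trigram
            score = math.log(doc_count / total_docs)
            for gram in char_ngrams(text):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

With the `df` from the Quick Start, and assuming its `text` and `label` columns are populated, usage would look like `model = NaiveBayesBaseline().fit(df['text'], df['label'])` followed by `model.predict(some_text)`. Hold out part of the data for evaluation rather than scoring on the training rows.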

Modality

text

Size

~4,970 annotated tweets (1,874 from 2021 + 3,096 from 2022)

License

Research

Format

CSV

Language

Marathi

Update Frequency

static

Organization

HASOC / FIRE

Schema

Field	Type	Description
text	string	Social media text in Marathi
label	string	Classification label (hate, offensive, profane, none)
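
The schema above can be checked before any training run. The following is a minimal sketch, assuming the CSV has exactly the `text` and `label` columns listed and that the label strings match the four values in the table; the released HASOC files may use different column names or label codes, so treat `EXPECTED_LABELS` and the hypothetical `validate_rows` helper as starting points to adapt.

```python
import csv
import io
from collections import Counter

# Assumption: these strings mirror the Schema table; real files may differ.
EXPECTED_LABELS = {"hate", "offensive", "profane", "none"}


def validate_rows(csv_file):
    """Return (label_counts, problems) for a file-like CSV with text/label columns."""
    reader = csv.DictReader(csv_file)
    missing = {"text", "label"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")

    counts = Counter()
    problems = []
    # Record numbers start at 2 because record 1 is the header row
    for record_no, row in enumerate(reader, start=2):
        text = (row["text"] or "").strip()
        label = (row["label"] or "").strip().lower()
        if not text:
            problems.append((record_no, "empty text"))
        elif label not in EXPECTED_LABELS:
            problems.append((record_no, f"unexpected label: {label!r}"))
        else:
            counts[label] += 1
    return counts, problems


if __name__ == "__main__":
    # Tiny in-memory sample standing in for the downloaded file
    sample = io.StringIO("text,label\nनमस्कार,none\n,hate\nउदाहरण,spam\n")
    counts, problems = validate_rows(sample)
    print(counts)
    print(problems)
```

For a file on disk, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to `validate_rows`.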

Build With This

Create a toxicity severity scorer for Marathi text that ranks offensive content by intensity for prioritized human review

Develop a counter-speech generator that suggests constructive Marathi responses to hateful messages

Build an election-period hate speech monitor for Maharashtra that tracks spikes in offensive language across social media
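
The monitoring idea in the last item above reduces, at its simplest, to flagging days on which the number of posts classified as offensive rises far above a trailing baseline. The block below is a minimal sketch under stated assumptions: daily counts are already produced by some upstream classifier, and the function name `find_spikes`, the 7-day window, and the threshold of 3 standard deviations are illustrative choices, not anything specified by the dataset.

```python
from statistics import mean, pstdev


def find_spikes(daily_counts, window=7, threshold=3.0):
    """Return indices of days whose count exceeds the trailing mean by
    more than `threshold` trailing standard deviations."""
    spikes = []
    for i in range(window, len(daily_counts)):
        history = daily_counts[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        # Floor sigma at 1.0 so a perfectly flat history cannot divide by zero
        z = (daily_counts[i] - mu) / max(sigma, 1.0)
        if z > threshold:
            spikes.append(i)
    return spikes


if __name__ == "__main__":
    # Hypothetical daily counts of posts flagged as offensive
    counts = [10, 12, 11, 9, 10, 11, 10, 40, 11]
    print(find_spikes(counts))
```

A trailing window is used so that each day is compared only with the days before it, which keeps the check usable on a live stream. Note that a spike stays in the window for the following days and raises the baseline, so consecutive spikes are harder to flag.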

AI Use Cases

Marathi offensive content moderation

Hate speech detection with OLID taxonomy

Social media content filtering

Online safety monitoring

Related Datasets

AI4Bharat BPCC (mr)

parallel-text

AI4Bharat IndicCorp v1 (mr)

text

AI4Bharat IndicCorp v2 (Marathi)

text

AI4Bharat IndicGLUE (mr)

text

Last verified: 2026-03-09