Shared task datasets for Hate Speech and Offensive Content (HASOC) identification in Marathi, released for the HASOC 2021 and 2022 tracks. The data follows the OLID taxonomy and comprises roughly 4,970 annotated tweets. It complements L3Cube MahaHate, which uses a different annotation scheme and different data sources, and several research teams have published competitive baselines on it.

```python
# Download from https://hasocfire.github.io/hasoc/
import pandas as pd

df = pd.read_csv('hasoc_marathi.csv')
print(f"Total samples: {len(df)}")
print(df['label'].value_counts())
```

| Field | Type | Description |
|---|---|---|
| text | string | Social media text in Marathi |
| label | string | Classification label (hate, offensive, profane, none) |
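
Before training on the data, it can help to confirm that a loaded frame matches the schema in the table above. The following is a minimal sketch, not part of the official HASOC tooling: `validate_hasoc_frame`, `EXPECTED_LABELS`, and `REQUIRED_COLUMNS` are hypothetical names introduced here, and the lowercase label spellings are an assumption based on the table, so the released files may use different codes.

```python
import pandas as pd

# Label set taken from the field table; the exact spelling and casing in the
# released files is an assumption and may differ.
EXPECTED_LABELS = {"hate", "offensive", "profane", "none"}
REQUIRED_COLUMNS = ("text", "label")


def validate_hasoc_frame(df: pd.DataFrame) -> pd.Series:
    """Check the schema and return per-label counts (hypothetical helper)."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    # Normalise casing and whitespace before comparing against the label set.
    labels = df["label"].astype(str).str.strip().str.lower()
    unknown = sorted(set(labels) - EXPECTED_LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {unknown}")

    return labels.value_counts()


# Tiny in-memory stand-in for the real CSV; the rows are placeholders.
sample = pd.DataFrame(
    {
        "text": ["placeholder 1", "placeholder 2", "placeholder 3"],
        "label": ["none", "Offensive ", "none"],
    }
)
counts = validate_hasoc_frame(sample)
print(counts)
```

With the real data, the same helper could be called on the frame returned by `pd.read_csv`. If it raises on unexpected labels, adjust `EXPECTED_LABELS` to the codes actually used in the download rather than editing the data.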