The first comprehensive NLP ecosystem for Marathi-English code-mixed text. MeCorpus is a 10M-sentence corpus for unsupervised pre-training. The supervised benchmarks are MeSent (~12k tweets for sentiment analysis), MeHate (~12k for hate speech detection), and MeLID (~12k for language identification). The data covers code-mixed text in both Devanagari and Roman script.
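
    # Download from https://github.com/l3cube-pune/MarathiNLP
    import pandas as pd

    # Adjust the path to wherever the downloaded CSV lives.
    df = pd.read_csv('MECorpus.csv')
    print(f"Code-mixed samples: {len(df)}")

    # Preview the first five rows: the label, then the first 80 characters of text.
    for _, row in df.head(5).iterrows():
        print(f"[{row['label']}] {row['text'][:80]}...")

| Field | Type | Description |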
|---|---|---|
| text | string | Code-mixed Marathi-English text |
| label | string | Sentiment or task-specific label |
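
Because the text may appear in Devanagari, Roman script, or a mix of both, it can help to check the script breakdown before tokenizing. The following is a minimal sketch, not part of the dataset tooling: `script_profile` and `profile_rows` are hypothetical helper names, the sample rows and their label values are made up for illustration, and the only assumption about the data is the `text` field from the table above. Detection relies on the Unicode Devanagari block (U+0900 to U+097F).

```python
from collections import Counter

# Devanagari Unicode block: U+0900 through U+097F.
DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F


def script_profile(text: str) -> str:
    """Classify a string by the scripts of its characters.

    Returns 'devanagari', 'roman', 'mixed', or 'other'
    (no Devanagari characters and no ASCII letters).
    """
    has_devanagari = any(
        DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END for ch in text
    )
    has_roman = any(ch.isascii() and ch.isalpha() for ch in text)
    if has_devanagari and has_roman:
        return "mixed"
    if has_devanagari:
        return "devanagari"
    if has_roman:
        return "roman"
    return "other"


def profile_rows(rows):
    """Count script profiles over an iterable of dicts with a 'text' key."""
    return Counter(script_profile(row["text"]) for row in rows)


# Made-up rows for illustration only; they are not taken from the dataset,
# and the label values are placeholders.
sample_rows = [
    {"text": "mala he movie khup avadla", "label": "positive"},
    {"text": "हा movie छान आहे", "label": "positive"},
    {"text": "हे चांगले नाही", "label": "negative"},
]

print(profile_rows(sample_rows))
```

With a pandas DataFrame, the same function could be applied as `df['text'].map(script_profile).value_counts()`. Note that this check is per string: a Roman-only sentence can still be code-mixed at the word level, which a script test alone cannot detect.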