L3Cube MeCorpus (Marathi-English Code-Mixed)

MH Specific

The first comprehensive Marathi-English code-mixed NLP ecosystem. MeCorpus provides a 10M-sentence unlabeled corpus for unsupervised pre-training. It is accompanied by three supervised benchmarks: MeSent (~12k tweets for sentiment analysis), MeHate (~12k for hate speech detection), and MeLID (~12k for language identification). The text covers both Devanagari and Roman script, including sentences that mix the two.

Build a code-mixed Marathi-English sentiment analyzer for social media where users frequently mix both languages.
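
As a rough starting point for such an analyzer, the sketch below trains a small multinomial Naive Bayes classifier over character trigrams, which tolerate the spelling variation common in Romanized Marathi. This is an illustrative baseline, not the method used by the dataset authors. The class name, the label names ("positive", "negative"), and the four training sentences are invented for the example and are not drawn from MeSent.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-alpha smoothing."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n
        self.alpha = alpha
        self.label_counts = Counter()
        self.ngram_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.label_counts[label] += 1
            self.ngram_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            # Log prior plus the sum of smoothed log likelihoods
            score = math.log(doc_count / total_docs)
            for gram in char_ngrams(text, self.n):
                score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for illustration (not taken from the dataset)
train_texts = [
    "movie khup chaan hota",
    "ha phone mast ahe, love it",
    "service ekdam bekar ahe",
    "battery life khup kharab ahe",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NgramNaiveBayes().fit(train_texts, train_labels)
print(model.predict("song chaan ahe"))
print(model.predict("delivery bekar ahe"))
```

With the real benchmark, the toy lists would be replaced by the `text` and `label` columns of the sentiment file, and a held-out split would be needed to measure accuracy.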


Quick Start

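# Download the data from https://github.com/l3cube-pune/MarathiNLP
# 'MECorpus.csv' and the 'text'/'label' columns are placeholders: match them to the file you downloaded
import pandas as pd

df = pd.read_csv('MECorpus.csv')
print(f"Code-mixed samples: {len(df)}")

# Preview the first five rows, truncating long texts to 80 characters
for _, row in df.head(5).iterrows():
    text = str(row['text'])
    print(f"[{row['label']}] {text[:80]}{'...' if len(text) > 80 else ''}")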

Modality

text

Size

10M unsupervised + ~36K supervised sentences

License

CC-BY-NC-SA-4.0

Format

CSV / JSON

Language

mr, en

Update Frequency

static

Organization

L3Cube, Pune

Schema

Field	Type	Description
text	string	Code-mixed Marathi-English text
label	string	Sentiment or task-specific label
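
A minimal sketch of how a file following this two-field schema might be checked before use, using only the standard library. The helper name `load_rows` and the in-memory sample rows are hypothetical; the actual files may use different column names or delimiters, so treat the field names as an assumption to verify against the download.

```python
import csv
import io
from collections import Counter

# Field names taken from the schema table; verify them against the real file
REQUIRED_FIELDS = ("text", "label")


def load_rows(handle):
    """Read CSV rows, check the expected columns exist, and drop incomplete rows."""
    reader = csv.DictReader(handle)
    missing = [f for f in REQUIRED_FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        text = (row["text"] or "").strip()
        label = (row["label"] or "").strip()
        if text and label:
            rows.append({"text": text, "label": label})
    return rows


# Invented sample standing in for a downloaded file; the last row has no text
sample = io.StringIO(
    "text,label\n"
    "movie khup chaan hota,positive\n"
    "service ekdam bekar ahe,negative\n"
    ",positive\n"
)

rows = load_rows(sample)
label_counts = Counter(row["label"] for row in rows)
print(len(rows), dict(label_counts))
```

For a file on disk, the same function can be called with `open(path, newline="", encoding="utf-8")` in place of the `StringIO` object.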

Build With This

Create a language identification system that tags each word in code-mixed text as Marathi, English, or named entity

Develop a code-mixed text normalizer that converts informal mixed-language social media text into standard Marathi

Build a sentiment-aware chatbot for Maharashtra e-commerce platforms that understands code-mixed customer queries
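
For the word-level language identification idea above, a script check is a cheap first pass. The sketch below tags each whitespace-separated token by the script of its characters, using the Unicode Devanagari block (U+0900 to U+097F). It cannot separate Romanized Marathi from English, since both use Latin letters, and it does not detect named entities; a model trained on MeLID-style labels would be needed for that. The function names and tag strings are illustrative choices.

```python
import re

# Unicode Devanagari block and basic Latin letters
DEVANAGARI = re.compile(r"[\u0900-\u097F]")
LATIN = re.compile(r"[A-Za-z]")


def tag_script(token):
    """Classify a token by the script(s) of the characters it contains."""
    has_devanagari = bool(DEVANAGARI.search(token))
    has_latin = bool(LATIN.search(token))
    if has_devanagari and has_latin:
        return "mixed"
    if has_devanagari:
        return "devanagari"
    if has_latin:
        return "latin"
    return "other"  # digits, punctuation, emoji, and so on


def tag_sentence(sentence):
    """Split on whitespace and pair each token with its script tag."""
    return [(token, tag_script(token)) for token in sentence.split()]


for token, tag in tag_sentence("हा movie खूप छान आहे 123"):
    print(f"{token}\t{tag}")
```

Tokens tagged "latin" are the ambiguous ones: they are the natural input for a second-stage classifier that decides between English and Romanized Marathi.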

AI Use Cases

Code-mixed Marathi-English sentiment analysis

Bilingual social media monitoring

Script-mixed language identification

Code-mixed hate speech detection

Related Datasets

AI4Bharat BPCC (mr)

parallel-text

AI4Bharat IndicCorp v1 (mr)

text

AI4Bharat IndicCorp v2 (Marathi)

text

AI4Bharat IndicGLUE (mr)

text

Last verified: 2026-03-09