L3Cube MeCorpus (Marathi-English Code-Mixed)

MH Specific

The first comprehensive Marathi-English code-mixed NLP ecosystem. MeCorpus provides a 10M-sentence unsupervised pre-training corpus. It is accompanied by three supervised benchmarks: MeSent (~12k tweets for sentiment analysis), MeHate (~12k for hate speech detection), and MeLID (~12k for language identification). The data covers code-mixed text in both Devanagari and Roman script.

Build a code-mixed Marathi-English sentiment analyzer for social media where users frequently mix both languages.

Quick Start

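# Download the data from https://github.com/l3cube-pune/MarathiNLP
import pandas as pd

# Load the CSV; adjust the path to wherever the downloaded file lives
df = pd.read_csv('MECorpus.csv')
print(f"Code-mixed samples: {len(df)}")

# Preview the first five rows: label, then the first 80 characters of the text
for _, row in df.head(5).iterrows():
    print(f"[{row['label']}] {str(row['text'])[:80]}...")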
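
Before training on one of the supervised sets, it can help to check how the labels are distributed. The following is a minimal sketch using only the standard library. The in-memory sample stands in for a downloaded file, and both the sentences and the label values (`positive`, `negative`, `neutral`) are made up for illustration; they are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Made-up stand-in for a downloaded CSV with `text` and `label` columns.
# The sentences and label names are illustrative assumptions.
SAMPLE_CSV = """text,label
"movie khup chan hota, loved it",positive
service ekdam bekar ahe,negative
udya office la meeting ahe,neutral
ha phone mast ahe,positive
"""


def label_distribution(csv_file):
    """Return a Counter of label values from a CSV with a `label` column."""
    reader = csv.DictReader(csv_file)
    return Counter(row["label"] for row in reader)


counts = label_distribution(io.StringIO(SAMPLE_CSV))
for label, n in counts.most_common():
    print(f"{label}: {n}")
```

To run this on a real file, pass an open file handle instead, for example `open(path, newline="", encoding="utf-8")`.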

Modality: text
Size: 10M unsupervised + ~36K supervised sentences
License:
Format: CSV / JSON
Language: mr, en
Update Frequency: static
Organization: L3Cube, Pune

Schema

Field   Type     Description
text    string   Code-mixed Marathi-English text
label   string   Sentiment or task-specific label
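
A quick way to confirm that a file matches this schema is to check that both columns exist and that no row leaves either one empty. The sketch below assumes a CSV with a header row. The helper name `find_invalid_rows` and the sample rows are hypothetical and not part of the dataset's tooling.

```python
import csv
import io

REQUIRED_FIELDS = ("text", "label")  # the two fields listed in the schema


def find_invalid_rows(csv_file):
    """Return 1-based data-row numbers where a required field is empty.

    Raises ValueError if a required column is missing from the header.
    """
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_FIELDS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    bad = []
    for row_no, row in enumerate(reader, start=1):
        # `row[name]` is None when a row has fewer cells than the header
        if any(not (row[name] or "").strip() for name in REQUIRED_FIELDS):
            bad.append(row_no)
    return bad


# Made-up rows: the second has no text, the third has no label.
sample = "text,label\nha phone mast ahe,positive\n,negative\nudya meeting ahe,\n"
invalid = find_invalid_rows(io.StringIO(sample))
print(invalid)
```

The unsupervised corpus may not carry a `label` column at all, in which case the `ValueError` is the expected result rather than a data problem.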

Build With This

Create a language identification system that tags each word in code-mixed text as Marathi, English, or named entity
Develop a code-mixed text normalizer that converts informal mixed-language social media text into standard Marathi
Build a sentiment-aware chatbot for Maharashtra e-commerce platforms that understands code-mixed customer queries
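
As a starting point for the language identification idea above, tokens can first be tagged by script. This is only a rough sketch and not the MeLID annotation scheme: it checks each token against the Unicode Devanagari block (U+0900 to U+097F) and ASCII letters. Script is not the same as language, since a Roman-script token may be English or romanized Marathi, so a trained classifier would still be needed for word-level language tags. The function names, tag names, and example sentence are made up for illustration.

```python
def script_of(token):
    """Tag a token by the script of its characters."""
    has_devanagari = any("\u0900" <= ch <= "\u097F" for ch in token)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in token)
    if has_devanagari and has_latin:
        return "mixed"
    if has_devanagari:
        return "devanagari"
    if has_latin:
        return "latin"
    return "other"  # digits, punctuation, emoji, other scripts


def tag_tokens(text):
    """Split on whitespace and pair each token with its script tag."""
    return [(token, script_of(token)) for token in text.split()]


# Illustrative sentence; "mast" is romanized Marathi but is tagged "latin",
# which shows why script alone cannot separate Marathi from English.
example = "हा phone खूप mast आहे 123"
tags = tag_tokens(example)
for token, tag in tags:
    print(f"{token}\t{tag}")
```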

AI Use Cases

Code-mixed Marathi-English sentiment analysis
Bilingual social media monitoring
Script-mixed language identification
Code-mixed hate speech detection
Last verified: 2026-03-09