L3Cube-MahaCorpus - Awesome Marathi Datasets

L3Cube-MahaCorpus

MH Specific

Large-scale Marathi monolingual text corpus with 24.8 million sentences and 289 million tokens, curated for language model pretraining.

Fine-tune a Marathi language model for your domain

Quick Start

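# Download the corpus from GitHub first, then load the text file.
# Marathi is written in Devanagari, so set the encoding explicitly
# instead of relying on the platform default.
with open("MahaCorpus.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()  # one sentence per line; the full corpus needs a lot of RAM
print(f"{len(lines):,} sentences loaded")
print(lines[0][:100])  # preview the first 100 characters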
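
With 24.8 million sentences, reading every line into a list at once can use a lot of memory. As a minimal sketch, assuming the same one-sentence-per-line layout as the snippet above, the file can be streamed instead. The helper names `iter_sentences` and `corpus_stats` are illustrative and not part of the dataset, and the whitespace token count is only a rough approximation that will not necessarily match the published 289M figure.

```python
from typing import Iterator, Tuple


def iter_sentences(path: str) -> Iterator[str]:
    """Yield non-empty lines one at a time instead of loading the whole file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line


def corpus_stats(path: str) -> Tuple[int, int]:
    """Return (sentence_count, whitespace_token_count) for the file at `path`."""
    sentences = 0
    tokens = 0
    for sentence in iter_sentences(path):
        sentences += 1
        # Assumption: splitting on whitespace is a good-enough token proxy.
        tokens += len(sentence.split())
    return sentences, tokens
```

Calling `corpus_stats("MahaCorpus.txt")` walks the file once and keeps only one line in memory at a time.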

Modality

text

Size

24.8M sentences, 289M tokens

License

CC-BY-NC-SA-4.0

Format

text

Language

Marathi

Update Frequency

static

Organization

L3Cube, Pune

Schema

Field	Type	Description
text	string	Marathi text sentence from the corpus
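
The single-field schema above can be made explicit by wrapping each line in a record with a `text` key, for example to write JSON Lines for downstream tooling. This is a sketch under assumptions: the corpus itself is plain text, and the helper names `to_records` and `write_jsonl` are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator


def to_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Map raw lines to {"text": ...} records, skipping blank lines."""
    for line in lines:
        text = line.strip()
        if text:
            yield {"text": text}


def write_jsonl(records: Iterable[Dict[str, str]], out_path: str) -> int:
    """Write one JSON object per line and return how many were written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for record in records:
            # ensure_ascii=False keeps Devanagari readable in the output file.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Because `to_records` is a generator, it can be fed an open file handle directly, so the conversion does not need the whole corpus in memory.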

Build With This

Domain-specific Marathi LLM for legal or medical text understanding

Marathi text autocomplete for mobile keyboards

Content generation engine for Marathi marketing copy
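
To give a feel for the autocomplete idea, here is a toy next-word suggester built from bigram counts. It is a simplified stand-in for a trained language model, not a method described by the dataset authors. It assumes whitespace tokenization, and the function names `build_bigram_counts` and `suggest` are illustrative.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List


def build_bigram_counts(sentences: Iterable[str]) -> Dict[str, Counter]:
    """Count which word follows which, using whitespace tokenization."""
    following: Dict[str, Counter] = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following


def suggest(following: Dict[str, Counter], word: str, k: int = 3) -> List[str]:
    """Return up to k of the most frequent next words seen after `word`."""
    if word not in following:
        return []
    return [w for w, _ in following[word].most_common(k)]
```

A production keyboard would need proper tokenization, smoothing, and a far more compact model, but the same counting step works on any iterable of corpus sentences.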

AI Use Cases

Language model pretraining

Word embedding training

Marathi text generation

Transfer learning for downstream NLP tasks

Related Datasets

AI4Bharat BPCC (mr)

parallel-text

AI4Bharat IndicCorp v1 (mr)

text

AI4Bharat IndicCorp v2 (Marathi)

text

AI4Bharat IndicGLUE (mr)

text

Last verified: 2026-03-07