L3Cube-MahaCorpus

MH Specific

Large-scale Marathi monolingual text corpus with 24.8 million sentences and 289 million tokens, curated for language model pretraining.

Fine-tune a Marathi language model for your domain

Quick Start

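# Download MahaCorpus.txt from the GitHub repository, then load it as UTF-8 text
with open("MahaCorpus.txt", "r", encoding="utf-8") as f:
    # One sentence per line; drop the trailing newline from each
    lines = [line.rstrip("\n") for line in f]
print(f"{len(lines):,} sentences loaded")
print(lines[0][:100])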
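
Reading all 24.8M sentences into one list can use a lot of memory. A minimal sketch of a streaming alternative follows, assuming the file is UTF-8 with one sentence per line. The helper names `iter_sentences` and `count_corpus` are hypothetical, and whitespace splitting is only a rough stand-in for whatever tokenization produced the published 289M token figure.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_sentences(path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield non-empty lines one at a time without loading the whole file."""
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            sentence = line.strip()
            if sentence:
                yield sentence


def count_corpus(path: str) -> Tuple[int, int]:
    """Return (sentence_count, whitespace_token_count) for the file."""
    n_sentences = 0
    n_tokens = 0
    for sentence in iter_sentences(path):
        n_sentences += 1
        # Whitespace tokens are an approximation, not the corpus's official count
        n_tokens += len(sentence.split())
    return n_sentences, n_tokens


if __name__ == "__main__":
    corpus_path = "MahaCorpus.txt"  # assumed local file name
    if Path(corpus_path).exists():
        sentences, tokens = count_corpus(corpus_path)
        print(f"{sentences:,} sentences, {tokens:,} whitespace tokens")
```

Because `iter_sentences` is a generator, it can feed a tokenizer or training loop directly without holding the whole corpus in memory.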

Modality: text
Size: 24.8M sentences, 289M tokens
License:
Format: text
Language: mr
Update Frequency: static
Organization: L3Cube, Pune

Schema

Field   Type     Description
text    string   Marathi text sentence from the corpus
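
Many training pipelines expect one record per sentence rather than raw lines. The sketch below maps each line to the single `text` field of the schema and writes JSON Lines. The function names `to_records` and `write_jsonl` and the choice of JSON Lines as an output format are assumptions for illustration, not something the corpus provides.

```python
import json
from typing import Dict, Iterable, Iterator


def to_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Map raw lines to schema records of the form {"text": <sentence>}."""
    for line in lines:
        text = line.strip()
        if text:  # skip blank lines
            yield {"text": text}


def write_jsonl(records: Iterable[Dict[str, str]], out_path: str) -> int:
    """Write records as JSON Lines and return how many were written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for record in records:
            # ensure_ascii=False keeps Devanagari readable instead of \u escapes
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

A typical call would open the corpus file with `encoding="utf-8"` and pass the file object straight to `to_records`, so records are produced lazily.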

Build With This

Domain-specific Marathi LLM for legal or medical text understanding
Marathi text autocomplete for mobile keyboards
Content generation engine for Marathi marketing copy
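
As a rough illustration of the autocomplete idea, the sketch below counts word bigrams over corpus sentences and suggests the most frequent next words. This is a toy baseline under the assumption of whitespace tokenization, not a production keyboard model; `build_bigram_counts` and `suggest_next` are hypothetical helper names.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List


def build_bigram_counts(sentences: Iterable[str]) -> Dict[str, Counter]:
    """Count how often each word is followed by each other word."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()  # assumption: whitespace tokenization
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    return counts


def suggest_next(counts: Dict[str, Counter], word: str, k: int = 3) -> List[str]:
    """Return up to k of the most frequent followers of `word`."""
    followers = counts.get(word)  # .get avoids creating empty entries
    if not followers:
        return []
    return [candidate for candidate, _ in followers.most_common(k)]


if __name__ == "__main__":
    sample = ["मी घरी जातो", "मी घरी आहे", "मी शाळेत जातो"]
    model = build_bigram_counts(sample)
    print(suggest_next(model, "मी", k=2))
```

A neural language model pretrained on the corpus would replace the bigram table in practice; the counting version is mainly useful as a quick sanity check on the data.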

AI Use Cases

Language model pretraining
Word embedding training
Marathi text generation
Transfer learning for downstream NLP tasks
Last verified: 2026-03-07