Large-scale Marathi monolingual text corpus with 24.8 million sentences and 289 million tokens, curated for language model pretraining.

    # Download from GitHub and load as text.
    # The corpus is Devanagari text, so set the encoding explicitly to UTF-8.
    with open("MahaCorpus.txt", "r", encoding="utf-8") as f:
        lines = f.readlines()

    print(f"{len(lines):,} sentences loaded")
    print(lines[0][:100])

| Field | Type | Description |
|---|---|---|
| text | string | Marathi text sentence from the corpus |
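
With 24.8 million sentences, holding the whole file in memory may be more than some machines can handle comfortably. The following is a minimal sketch of streaming the corpus instead, assuming the file stores one sentence per line and that whitespace splitting is an acceptable token approximation. Both are assumptions: the description above does not say how the 289 million tokens were counted, so the token count from this sketch may differ. The helper names `iter_sentences` and `corpus_stats` are hypothetical and not part of any published loader.

```python
from typing import Iterator, Tuple


def iter_sentences(path: str) -> Iterator[str]:
    """Yield non-empty sentences one at a time (assumes one sentence per line)."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            sentence = line.strip()
            if sentence:  # skip blank lines
                yield sentence


def corpus_stats(path: str) -> Tuple[int, int]:
    """Return (sentence_count, token_count), tokenizing on whitespace (an assumption)."""
    n_sentences = 0
    n_tokens = 0
    for sentence in iter_sentences(path):
        n_sentences += 1
        n_tokens += len(sentence.split())
    return n_sentences, n_tokens
```

Calling `corpus_stats("MahaCorpus.txt")` on the downloaded file would walk the corpus once without building a list of all lines, and `iter_sentences` can feed a tokenizer or a pretraining data pipeline in the same way.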