Marathi (`mr`) subset of IndicCorpV2, a 20.9-billion-token multilingual corpus covering 24 Indic languages, collected from diverse web sources for language model pretraining.
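
    from datasets import load_dataset

    # Stream the Marathi ('mr') configuration instead of downloading it in full
    ds = load_dataset('ai4bharat/IndicCorpV2', 'mr', streaming=True)

    # Print the first 150 characters of the first three documents
    for i, ex in enumerate(ds['train']):
        print(f"Text: {ex['text'][:150]}...")
        if i >= 2:
            break

| Field | Type | Description |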
|---|---|---|
| text | string | Marathi text document from the web-crawled corpus |
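
Since each record exposes only a `text` field, a small helper can collect a bounded sample from the stream without reading the whole corpus. The following is a minimal sketch, not part of the dataset's own tooling: the function name `sample_texts` and its `max_docs` and `min_chars` parameters are hypothetical, chosen for illustration, and the helper assumes only that each example is a dict with a `text` string.

```python
from itertools import islice
from typing import Iterable, List


def sample_texts(examples: Iterable[dict], max_docs: int, min_chars: int = 1) -> List[str]:
    """Return up to `max_docs` stripped texts with at least `min_chars` characters.

    Hypothetical helper: works on any iterable of dicts that have a 'text' key,
    including a streamed split, and stops reading once enough texts are kept.
    """
    # Strip surrounding whitespace lazily, one example at a time
    texts = (ex["text"].strip() for ex in examples)
    # Drop documents shorter than the threshold (empty ones by default)
    kept = (t for t in texts if len(t) >= min_chars)
    # islice stops consuming the stream after max_docs kept documents
    return list(islice(kept, max_docs))
```

With the streamed dataset loaded as above, `sample_texts(ds['train'], max_docs=100, min_chars=50)` would gather the first 100 documents of at least 50 characters. Because the pipeline is built from generators, only as many records are pulled from the stream as are needed to fill the sample.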