Sangraha is a large-scale Indic web corpus from AI4Bharat. Its Marathi subset is curated from Common Crawl using language identification, quality filtering, and deduplication. The corpus is released in verified, unverified, and synthetic (back-translated) splits for training Marathi language models.
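
    from datasets import load_dataset

    # Stream the verified Marathi split instead of downloading it in full.
    ds = load_dataset('ai4bharat/sangraha', 'verified.mar', split='train', streaming=True)

    # Print the first 100 characters of the first five examples.
    for i, ex in enumerate(ds):
        print(f"Text: {ex['text'][:100]}...")
        if i >= 4:
            break

| Field | Type | Description |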
|---|---|---|
| text | string | Curated Marathi web text |
| source | string | Source of the text |
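
The two fields above can be inspected together with a small helper. This is a minimal sketch, assuming each record is a dict with the `text` and `source` keys listed in the table. The function name `preview_records` and the sample rows are hypothetical and are not part of the dataset or the `datasets` library.

```python
from collections import Counter
from itertools import islice
from typing import Dict, Iterable, List, Tuple


def preview_records(
    records: Iterable[Dict[str, str]], limit: int = 5, width: int = 100
) -> Tuple[List[str], Counter]:
    """Return truncated text previews and per-source counts for the first `limit` records."""
    previews: List[str] = []
    source_counts: Counter = Counter()
    # islice stops after `limit` records, so a streamed dataset is never fully consumed.
    for record in islice(records, limit):
        text = record["text"]
        # Append an ellipsis only when the text was actually cut off.
        previews.append(text if len(text) <= width else text[:width] + "...")
        # Assumption: `source` may be missing; such records are counted as "unknown".
        source_counts[record.get("source", "unknown")] += 1
    return previews, source_counts


# Stand-in records for illustration only; these are not real corpus rows.
sample = [
    {"text": "नमस्कार", "source": "example-a"},
    {"text": "म" * 150, "source": "example-b"},
    {"text": "पुणे", "source": "example-a"},
]
previews, source_counts = preview_records(sample, limit=2)
```

With the streaming dataset, the same helper could be called as `preview_records(ds, limit=5)`, since it only needs an iterable of dict-like records.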