MADLAD-400 is Google's manually audited, 3-trillion-token monolingual dataset built from CommonCrawl, covering 419 languages. It is organized at the document level with language identification and quality auditing; the manual audit yields higher-quality Marathi data than fully automated pipelines.
```python
from datasets import load_dataset

# Stream the Marathi ("mr") clean split without downloading the full dataset
ds = load_dataset('allenai/MADLAD-400', 'mr', split='clean', streaming=True)
for i, ex in enumerate(ds):
    print(f"Text: {ex['text'][:100]}...")
    if i >= 4:
        break
```

| Field | Type | Description |
|---|---|---|
| text | string | Marathi web-crawled text document |
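
Since each record carries only a raw `text` field, a consumer may want a client-side sanity check that a streamed document is actually Marathi-script text. The sketch below is illustrative and not part of MADLAD-400's own auditing pipeline; the `devanagari_ratio` helper and the `0.5` threshold are assumptions chosen for demonstration.

```python
def devanagari_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Devanagari block (U+0900-U+097F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    dev = sum(1 for c in chars if '\u0900' <= c <= '\u097f')
    return dev / len(chars)

def looks_marathi(text: str, threshold: float = 0.5) -> bool:
    """Heuristic: treat a document as Marathi-script if most characters are Devanagari."""
    return devanagari_ratio(text) >= threshold

print(looks_marathi("मराठी ही एक इंडो-आर्यन भाषा आहे."))  # True: mostly Devanagari
print(looks_marathi("This is English text."))             # False: no Devanagari
```

Such a filter could be applied inside the streaming loop above to skip documents dominated by non-Devanagari content.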