The Marathi subset of the multilingual Colossal Clean Crawled Corpus (mC4) contains approximately 7.8 million documents and 14 billion raw tokens, and is used for large-scale language model pretraining.
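
    from datasets import load_dataset

    # Stream the Marathi ('mr') train split instead of downloading it up front.
    ds = load_dataset('mc4', 'mr', split='train', streaming=True)

    # Print the URL and the first 100 characters of the first five documents.
    for i, ex in enumerate(ds):
        print(f"URL: {ex['url'][:60]}")
        print(f"Text: {ex['text'][:100]}...\n")
        if i >= 4:
            break

| Field | Type | Description |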
|---|---|---|
| text | string | Web-crawled Marathi text document |
| timestamp | string | Crawl timestamp of the document |
| url | string | Source URL of the crawled page |
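
The three fields above can be turned into simple per-document metadata. The following is a minimal sketch, not part of the dataset's own tooling: `summarize_record` and the `sample` record are hypothetical, and the code assumes timestamps are ISO 8601 strings ending in `Z`, which should be checked against real records before relying on it.

```python
from datetime import datetime
from urllib.parse import urlparse


def summarize_record(record):
    """Return the host, parsed crawl time, and text length of one record."""
    # Assumption: timestamps look like "2020-01-22T12:34:56Z" (ISO 8601, UTC).
    # Replacing "Z" keeps fromisoformat working on Python versions before 3.11.
    crawled_at = datetime.fromisoformat(record["timestamp"].replace("Z", "+00:00"))
    return {
        "host": urlparse(record["url"]).netloc,
        "crawled_at": crawled_at,
        "num_chars": len(record["text"]),
    }


# Hypothetical record with the three documented fields (not real corpus data).
sample = {
    "text": "नमस्कार जग",
    "timestamp": "2020-01-22T12:34:56Z",
    "url": "https://example.com/marathi/page",
}

summary = summarize_record(sample)
print(summary["host"], summary["crawled_at"].isoformat(), summary["num_chars"])
```

The same function can be applied to each `ex` yielded by the streaming loop, for example to count documents per host or to filter by crawl date.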