CulturaX: large-scale multilingual web corpus with a substantial Marathi subset, built by combining mC4 and OSCAR data and applying extensive cleaning and deduplication. Provides pre-training data for Marathi language models that is better filtered than raw web crawls.

    from datasets import load_dataset

    # Stream the Marathi ('mr') subset instead of downloading it up front
    ds = load_dataset('uonlp/CulturaX', 'mr', split='train', streaming=True)

    # Preview the first five records
    for i, ex in enumerate(ds):
        print(f"URL: {ex['url'][:60]}")
        print(f"Text: {ex['text'][:80]}...\n")
        if i >= 4:
            break

| Field | Type | Description |
|---|---|---|
| text | string | Cleaned Marathi web text |
| url | string | Source URL |
| timestamp | string | Crawl timestamp |
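
As a rough illustration of working with these fields, the sketch below counts which hostnames contribute the most documents among the first few records. The helper name `top_domains` and the sample records are hypothetical (the sample values are made up and are not taken from the dataset); the only assumption about the data is that each record is a dict with a `url` string, as in the table above.

```python
from collections import Counter
from itertools import islice
from urllib.parse import urlparse


def top_domains(records, limit=1000, k=5):
    """Count source hostnames over the first `limit` records (hypothetical helper)."""
    counts = Counter()
    # islice stops after `limit` items, so this also works on a streamed dataset
    for rec in islice(records, limit):
        host = urlparse(rec["url"]).netloc.lower()
        if host:  # skip records whose URL has no hostname
            counts[host] += 1
    return counts.most_common(k)


# Tiny in-memory stand-in with the same three fields (made-up values)
sample = [
    {"text": "नमस्कार", "url": "https://example.org/a", "timestamp": "2020-01-01T00:00:00Z"},
    {"text": "नमस्कार", "url": "https://example.org/b", "timestamp": "2020-01-02T00:00:00Z"},
    {"text": "नमस्कार", "url": "https://example.com/c", "timestamp": "2020-01-03T00:00:00Z"},
]

print(top_domains(sample))
```

The streamed `ds` object from `load_dataset` should be usable in place of `sample`, for example `top_domains(ds, limit=500)`, since it is iterated lazily and only the first `limit` records are read.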