The Marathi (`mr`) subset of OSCAR 23.01, a web-crawled text corpus for NLP.
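
```python
from datasets import load_dataset

# Stream the Marathi ("mr") subset so the full corpus is not downloaded up front.
ds = load_dataset('oscar-corpus/OSCAR-2301', 'mr', split='train', streaming=True)

# Print the first 100 characters of the first five examples.
for i, ex in enumerate(ds):
    print(f"Text: {ex['text'][:100]}...")
    if i >= 4:
        break
```

| Field | Type | Description |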
|---|---|---|
| text | string | Web-crawled Marathi text content |
| meta | object | Metadata including quality scores and word count |
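
To get a quick look at both fields of a record, a small helper along these lines may be useful. This is a minimal sketch, not part of the dataset's tooling: `summarize_example` is a hypothetical name, the word count is a plain whitespace split rather than the value stored in `meta`, and the sample record below is made up because the exact keys inside `meta` are not listed here.

```python
def summarize_example(ex, preview_chars=100):
    """Summarize one record that has 'text' and 'meta' fields."""
    text = ex.get("text") or ""
    meta = ex.get("meta") or {}
    return {
        "preview": text[:preview_chars],
        "n_chars": len(text),
        # Whitespace-based count; an approximation, not the corpus's own figure.
        "n_words": len(text.split()),
        # List whatever keys are present instead of assuming specific names.
        "meta_keys": sorted(meta) if isinstance(meta, dict) else [],
    }


# Hypothetical record for illustration only; real meta keys may differ.
sample = {"text": "नमस्कार जग", "meta": {"example_key": 1}}
print(summarize_example(sample))
```

The same function can be called on each `ex` yielded by the streaming loop above, which makes it easy to see which metadata keys are actually present before relying on any of them.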