OSCAR 23.01 (Marathi)
MH Specific
Marathi subset of the OSCAR 23.01 web-crawled corpus containing 729,578 documents and 252 million words (4.5 GB), derived from Common Crawl with language filtering.
Build a Marathi text corpus quality pipeline that deduplicates and filters OSCAR data to create a clean pre-training dataset.
Quick Start
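from datasets import load_dataset

# Stream the Marathi ('mr') configuration so the full 4.5 GB is not downloaded up front.
# Note: access may require accepting the dataset terms and logging in to the Hugging Face Hub.
ds = load_dataset('oscar-corpus/OSCAR-2301', 'mr', split='train', streaming=True)

# Print the first five documents with their metadata.
for i, ex in enumerate(ds):
    print(f"Text: {ex['text'][:100]}...")
    print(f"Meta: {ex['meta']}\n")
    if i >= 4:
        break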
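The deduplicate-and-filter pipeline suggested at the top of this card could start from something like the sketch below. It is a minimal illustration, not the OSCAR project's own method: the helper names (`normalize`, `devanagari_ratio`, `clean_stream`) and the thresholds `min_chars=200` and `min_ratio=0.5` are assumptions chosen for the example. It removes exact duplicates only; near-duplicate detection (for example MinHash) would be a separate step.

```python
import hashlib
import re
import unicodedata

# Matches a single character in the Devanagari Unicode block.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")


def normalize(text):
    """Apply NFC and collapse whitespace so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", unicodedata.normalize("NFC", text)).strip()


def devanagari_ratio(text):
    """Fraction of non-space characters that fall in the Devanagari block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if DEVANAGARI.match(c)) / len(chars)


def clean_stream(examples, min_chars=200, min_ratio=0.5):
    """Yield examples passing the length and script filters, skipping exact duplicates.

    min_chars and min_ratio are illustrative thresholds (assumptions), not tuned values.
    """
    seen = set()  # SHA-1 digests of normalized texts already emitted
    for ex in examples:
        text = normalize(ex["text"])
        if len(text) < min_chars or devanagari_ratio(text) < min_ratio:
            continue
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield {**ex, "text": text}
```

Because `clean_stream` is a generator over any iterable of dictionaries with a `text` key, it can wrap the streaming dataset from the Quick Start directly, e.g. `for ex in clean_stream(ds): ...`. The `seen` set grows with the number of unique documents, which should be manageable at this corpus size but is worth watching on larger subsets.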
Size
729,578 documents, 252M words, 4.5 GB
Organization
OSCAR Project (Inria / DFKI)
Schema
| Field | Type | Description |
|---|---|---|
| text | string | Web-crawled Marathi text content |
| meta | object | Metadata including quality scores, sentence count, and word count |
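Since the table describes `meta` only in general terms, it can help to inspect a record before relying on specific keys. The helper below is a small sketch that makes no assumption about which keys exist; `describe_meta` is a hypothetical name introduced for this example.

```python
def describe_meta(example):
    """Map each top-level key of example['meta'] to the name of its Python type."""
    meta = example.get("meta") or {}
    return {key: type(value).__name__ for key, value in sorted(meta.items())}
```

Calling `describe_meta(next(iter(ds)))` on the streamed dataset should list the metadata keys actually present in the first record.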
Build With This
Create a Marathi Wikipedia content gap detector that compares OSCAR web content topics against existing Wikipedia coverage
Develop a Marathi writing style analyzer trained on web text to identify formal, informal, and news writing styles
Build a Marathi corpus for training word embeddings and evaluate them on analogy and similarity benchmarks
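For the word-embedding idea, the raw text first has to be turned into tokenized sentences. The sketch below shows one simple way to do that with regular expressions. The function name `to_token_lists` and the splitting rules are assumptions for illustration; a real project would likely use a dedicated Marathi tokenizer.

```python
import re

# Sentence terminators: Latin punctuation plus the Devanagari danda and double danda.
SENTENCE_END = re.compile(r"[.!?\u0964\u0965]+")
# Runs of Devanagari characters, excluding the danda marks (U+0964, U+0965).
# Limitation: words joined with zero-width joiners will be split at the joiner.
WORD = re.compile(r"[\u0900-\u0963\u0966-\u097F]+")


def to_token_lists(text):
    """Split text into sentences, then into Devanagari tokens; drop empty sentences."""
    sentences = SENTENCE_END.split(text)
    token_lists = [WORD.findall(sentence) for sentence in sentences]
    return [tokens for tokens in token_lists if tokens]
```

The result is a list of token lists, one per sentence, which is the input shape that embedding trainers such as gensim's `Word2Vec` accept for their `sentences` argument.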
AI Use Cases
Language model pretraining
Web text analysis
Text classification
Corpus linguistics research
Last verified: 2026-03-07