OSCAR 23.01 (mr)

MH Specific

OSCAR 23.01 (mr) is the Marathi subset of the OSCAR 23.01 web-crawled text corpus, intended for NLP work.

Build a quality-filtered Marathi pre-training corpus from OSCAR with document-level quality scoring.

Quick Start

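from datasets import load_dataset

# Stream the Marathi ('mr') subset so nothing is downloaded up front.
# The dataset may be gated on the Hugging Face Hub and require a logged-in account.
ds = load_dataset('oscar-corpus/OSCAR-2301', 'mr', split='train', streaming=True)
# Preview the first 100 characters of the first five documents.
for i, ex in enumerate(ds):
    print(f"Text: {ex['text'][:100]}...")
    if i >= 4: break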

Modality: text
Size: 729,578 documents, 252M words, 4.5 GB
License: not specified
Format: CSV/JSON
Language: mr
Update Frequency: static
Organization: OSCAR Project / Inria

Schema

Field | Type | Description
text | string | Web-crawled Marathi text content
meta | object | Metadata including quality scores and word count
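
The meta field can drive a simple document-level filter. The following is a minimal sketch, not the OSCAR project's own method. It assumes that meta carries a quality_warnings list, a field name that this page does not confirm, so inspect one record before relying on it. The helper name keep_document and the min_words threshold are hypothetical, and the word count is recomputed from the text rather than read from meta.

```python
def keep_document(example, min_words=50):
    """Return True if a record passes two basic quality checks.

    The 'quality_warnings' key under 'meta' is an assumed field name;
    records that lack it are treated as having no warnings.
    """
    meta = example.get("meta") or {}

    # Reject documents that carry any quality warning (assumed field).
    if meta.get("quality_warnings"):
        return False

    # Reject very short documents, using a whitespace word count.
    text = example.get("text") or ""
    if len(text.split()) < min_words:
        return False

    return True
```

With the streaming dataset from Quick Start, ds.filter(keep_document) should apply the check lazily as records are read.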

Build With This

Create a Marathi perplexity-based data filter that selects the most linguistically coherent OSCAR documents
Develop a Marathi web content timeline analyzer tracking how online Marathi content evolved over time
Build an n-gram language model from OSCAR Marathi as a baseline for evaluating neural language models
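
The perplexity filter and the n-gram baseline above can share one small model. The sketch below is a bigram language model with add-k smoothing that scores a document by perplexity; lower values mean the text looks more like the training data. Whitespace tokenization, the smoothing constant k, and the class name BigramLM are illustrative assumptions, and a real pipeline would likely need a stronger tokenizer and a higher-order model.

```python
import math
from collections import Counter


class BigramLM:
    """Bigram language model with add-k smoothing (illustrative sketch)."""

    def __init__(self, k=0.1):
        self.k = k
        self.context_counts = Counter()  # how often a token starts a bigram
        self.bigram_counts = Counter()   # counts of (previous, current) pairs
        self.vocab = set()

    @staticmethod
    def tokenize(text):
        # Whitespace tokenization with sentence boundary markers.
        return ["<s>"] + text.split() + ["</s>"]

    def fit(self, texts):
        for text in texts:
            tokens = self.tokenize(text)
            self.vocab.update(tokens)
            self.context_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens, tokens[1:]))
        return self

    def perplexity(self, text):
        tokens = self.tokenize(text)
        # One extra vocabulary slot reserves probability mass for unseen words.
        vocab_size = len(self.vocab) + 1
        log_prob = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            numerator = self.bigram_counts[(prev, cur)] + self.k
            denominator = self.context_counts[prev] + self.k * vocab_size
            log_prob += math.log(numerator / denominator)
        # Normalize by the number of predicted tokens.
        return math.exp(-log_prob / (len(tokens) - 1))
```

One way to use it as a filter: fit the model on a small trusted Marathi sample, compute perplexity for each OSCAR document, and keep those below a threshold chosen by inspecting the score distribution.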

AI Use Cases

Language model pretraining
Last verified: 2026-03-07