OSCAR 23.01 (Marathi)

MH Specific

Marathi subset of the OSCAR 23.01 web-crawled corpus containing 729,578 documents and 252 million words (4.5 GB), derived from Common Crawl with language filtering.

Build a Marathi text corpus quality pipeline that deduplicates and filters OSCAR data to create a clean pre-training dataset.

Quick Start

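# Stream the Marathi ('mr') subset so the full 4.5 GB is not downloaded up front.
# Note: the dataset may be gated on the Hugging Face Hub; if loading fails,
# accept its terms there and log in first.
from datasets import load_dataset

ds = load_dataset('oscar-corpus/OSCAR-2301', 'mr', split='train', streaming=True)

# Print the first five documents: a 100-character text preview and the metadata.
for i, ex in enumerate(ds):
    print(f"Text: {ex['text'][:100]}...")
    print(f"Meta: {ex['meta']}\n")
    if i >= 4:
        break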
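
The summary above describes a corpus quality pipeline that deduplicates and filters OSCAR data. The following is a minimal sketch of one way such a pipeline could start, not a reference implementation. It assumes each example is a dict with a 'text' field, as in the Quick Start. The function names (normalize, devanagari_ratio, clean_stream) and the thresholds (min_words, min_ratio) are hypothetical choices made for illustration and should be tuned on real data. It removes exact duplicates only, after whitespace and Unicode normalization.

```python
import hashlib
import re
import unicodedata

# Matches a single character in the Unicode Devanagari block (used by Marathi).
DEVANAGARI = re.compile(r"[\u0900-\u097F]")


def normalize(text):
    """Apply NFC normalization and collapse whitespace so that
    trivially different copies of a document hash to the same value."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def devanagari_ratio(text):
    """Return the fraction of non-whitespace characters that are Devanagari."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if DEVANAGARI.match(c)) / len(chars)


def clean_stream(examples, min_words=20, min_ratio=0.6):
    """Yield examples that pass both filters and were not seen before.

    min_words and min_ratio are illustrative defaults, not tuned values.
    """
    seen = set()  # SHA-1 digests of documents already emitted
    for ex in examples:
        text = normalize(ex["text"])
        if len(text.split()) < min_words:
            continue  # too short to be useful for pre-training
        if devanagari_ratio(text) < min_ratio:
            continue  # mostly non-Devanagari, likely misclassified or mixed
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        yield {**ex, "text": text}
```

The generator can wrap the streaming dataset directly, for example `for ex in clean_stream(ds): ...`. Near-duplicate detection (for instance MinHash over shingles) is a common next step but is outside this sketch.
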

Modality: text
Size: 729,578 documents, 252M words, 4.5 GB
License: (not listed)
Format: Parquet
Language: mr
Update Frequency: static
Organization: OSCAR Project (Inria / DFKI)

Schema

Field   Type     Description
text    string   Web-crawled Marathi text content
meta    object   Metadata including quality scores, sentence count, and word count

Build With This

Create a Marathi Wikipedia content gap detector that compares OSCAR web content topics against existing Wikipedia coverage
Develop a Marathi writing style analyzer trained on web text to identify formal, informal, and news writing styles
Build a Marathi corpus for training word embeddings and evaluate them on analogy and similarity benchmarks

AI Use Cases

Language model pretraining
Web text analysis
Text classification
Corpus linguistics research
Last verified: 2026-03-07