CulturaX Marathi

CulturaX is a large-scale multilingual web corpus that combines mC4 and OSCAR data with extensive cleaning and deduplication. Its substantial Marathi subset provides pre-training data for Marathi language models that is better filtered than raw web crawls.

Use it to build a high-quality Marathi pre-training corpus for LLM training by applying additional quality filtering to the CulturaX Marathi subset.

Quick Start

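from datasets import load_dataset

# Stream the Marathi ('mr') subset so the full split is not downloaded up front.
ds = load_dataset('uonlp/CulturaX', 'mr', split='train', streaming=True)

# Print the URL and a short text preview for the first five examples.
for i, ex in enumerate(ds):
    print(f"URL: {ex['url'][:60]}")
    print(f"Text: {ex['text'][:80]}...\n")
    if i >= 4:
        break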
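
The quality filtering mentioned above is not specified in detail, so the following is only a minimal sketch of one possible filter. It keeps examples that are reasonably long and mostly written in Devanagari script. The function names and the thresholds (`min_chars`, `min_ratio`) are assumptions chosen for illustration, not part of CulturaX.

```python
import re

# Devanagari Unicode block (U+0900 to U+097F), the script used for Marathi.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")


def devanagari_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are Devanagari."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if DEVANAGARI.match(c)) / len(chars)


def keep_example(ex: dict, min_chars: int = 200, min_ratio: float = 0.7) -> bool:
    """Hypothetical quality gate: long enough and mostly Devanagari.

    The default thresholds are illustrative assumptions and should be tuned.
    """
    text = ex.get("text", "")
    return len(text) >= min_chars and devanagari_ratio(text) >= min_ratio
```

With a streaming dataset, `ds.filter(keep_example)` applies the check lazily as examples are read. Devanagari is shared with Hindi and other languages, so a script ratio alone does not confirm that a document is Marathi; a language-identification step would be needed for that.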

Modality: text
Size: Billions of tokens across 167 languages; significant Marathi subset
License:
Format: Parquet / JSONL
Language: mr
Update Frequency: static
Organization: UoNLP (University of Oregon NLP)

Schema

Field      Type    Description
text       string  Cleaned Marathi web text
url        string  Source URL
timestamp  string  Crawl timestamp
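
Because `timestamp` is stored as a string, any date-based processing needs a parsing step first. The schema does not state the string layout, so the sketch below tries a few common layouts and returns `None` when none match. The candidate formats and the `parse_timestamp` helper are assumptions for illustration.

```python
from datetime import datetime
from typing import Optional

# Candidate layouts are assumptions; the schema only says the field is a string.
CANDIDATE_FORMATS = (
    "%Y/%m/%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%SZ",
    "%Y-%m-%d %H:%M:%S",
)


def parse_timestamp(raw: Optional[str]) -> Optional[datetime]:
    """Try each candidate layout in turn; return None if nothing matches."""
    if not raw:
        return None
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    return None
```

Returning `None` instead of raising lets a pipeline count and inspect unparseable values rather than stopping on the first one.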

Build With This

Create an automated Marathi content curation pipeline using CulturaX quality signals to build domain-specific corpora
Develop a temporal analysis of Marathi web content to track language evolution and new vocabulary emergence
Build a Marathi web content classifier that identifies educational, news, government, and commercial domains
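
As a starting point for the domain classifier idea, the `url` field can be bucketed by hostname suffix. This is a rough sketch only: the suffix rules, labels, and function names below are hypothetical, and suffixes cannot separate news from commercial sites, which would need a curated host list or a text classifier.

```python
from collections import Counter
from typing import Iterable
from urllib.parse import urlparse

# Hypothetical suffix rules; real coverage would need a curated list.
RULES = (
    ("government", (".gov.in", ".nic.in")),
    ("educational", (".ac.in", ".edu.in", ".edu")),
)


def classify_url(url: str) -> str:
    """Map a URL to a coarse domain label based on its hostname suffix."""
    host = (urlparse(url).hostname or "").lower()
    for label, suffixes in RULES:
        if host.endswith(suffixes):
            return label
    return "other"


def domain_counts(examples: Iterable[dict]) -> Counter:
    """Count coarse domain labels over records that carry a 'url' field."""
    return Counter(classify_url(ex.get("url", "")) for ex in examples)
```

Running `domain_counts` over a few thousand streamed examples gives a quick view of how much of a sample falls under each rule before investing in a trained classifier.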

AI Use Cases

Marathi language model pre-training
Text quality filtering research
Web content analysis
Domain-specific corpus extraction

Last verified: 2026-03-09