MADLAD-400 Marathi

MADLAD-400 is Google's manually audited monolingual dataset built from CommonCrawl, with about 3 trillion tokens across 419 languages. The data is document-level and has passed language identification and quality auditing. The manual audit is intended to yield cleaner Marathi data than fully automated pipelines typically produce.

Use MADLAD-400 Marathi data for large-scale language model pre-training with quality-filtered web text.

Quick Start

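from datasets import load_dataset

# Stream the Marathi ('mr') subset so the full dataset is not downloaded first.
ds = load_dataset('allenai/MADLAD-400', 'mr', split='clean', streaming=True)

# Print the first 100 characters of the first five documents.
for i, ex in enumerate(ds):
    print(f"Text: {ex['text'][:100]}...")
    if i >= 4:
        break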
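
The loop above prints the first five documents. For a bounded sample that also skips very short documents, a minimal sketch might look like the following. The helper names `iter_long_documents` and `sample_documents` and the `min_chars` threshold are hypothetical and not part of the dataset or the `datasets` library. The functions accept any iterable of dicts with a `text` key, so they can be tried on in-memory records before being pointed at the streaming dataset.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def iter_long_documents(records: Iterable[Dict[str, str]], min_chars: int) -> Iterator[str]:
    """Yield the stripped text of each record that has at least min_chars characters."""
    for record in records:
        text = record.get("text", "").strip()
        if len(text) >= min_chars:
            yield text


def sample_documents(records: Iterable[Dict[str, str]], n: int = 5, min_chars: int = 200) -> List[str]:
    """Return up to n qualifying texts, reading no further once n are found."""
    return list(islice(iter_long_documents(records, min_chars), n))


# Toy in-memory records standing in for the streaming dataset (illustrative only).
toy = [{"text": "नमस्कार"}, {"text": "हा एक मोठा नमुना दस्तऐवज आहे."}, {"text": "  "}]
sample = sample_documents(toy, n=2, min_chars=10)
```

With the streaming dataset from the Quick Start, a call such as `sample_documents(ds, n=5)` would read only as many documents as it needs to fill the sample.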

Modality: text
Size: 3 trillion tokens total across 419 languages; significant Marathi subset
License: not listed
Format: Parquet
Language: mr
Update Frequency: static
Organization: Google / Allen AI

Schema

Field   Type     Description
text    string   Marathi web-crawled text document

Build With This

Create a Marathi corpus deduplication pipeline comparing MADLAD, CC-100, and OSCAR to build the cleanest combined pre-training dataset
Develop a Marathi text quality classifier trained to distinguish high-quality from noisy web text for corpus curation
Build a domain distribution analyzer that maps the topical coverage of MADLAD Marathi data compared to other web corpora
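
As a starting point for the deduplication idea above, the following sketch removes exact duplicates across several corpora after light normalization. It assumes each corpus is an iterable of strings. The function names, the normalization steps (Unicode NFC and whitespace collapsing), and the choice of SHA-1 hashing are illustrative assumptions. Near-duplicate detection, for example with MinHash, would need additional work.

```python
import hashlib
import unicodedata
from typing import Dict, Iterable, Iterator, Tuple


def normalize(text: str) -> str:
    """NFC-normalize and collapse whitespace runs so trivial variants hash equally."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def dedup_corpora(corpora: Dict[str, Iterable[str]]) -> Iterator[Tuple[str, str]]:
    """Yield (source_name, text) for the first occurrence of each normalized document."""
    seen = set()
    for name, docs in corpora.items():
        for doc in docs:
            key = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                yield name, doc


# Toy corpora (illustrative only); the second MADLAD entry has a doubled space.
corpora = {
    "madlad": ["पुणे हे शहर आहे.", "मुंबई  मोठी आहे."],
    "cc100": ["मुंबई मोठी आहे.", "नागपूर संत्र्यांसाठी प्रसिद्ध आहे."],
}
unique = list(dedup_corpora(corpora))
```

Because corpora are visited in dictionary order, the earliest corpus keeps its copy of any duplicate, so listing the preferred source first retains its version.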

AI Use Cases

Marathi language model pre-training
Web content quality analysis
Monolingual text generation
Document-level language modeling
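
For web content quality analysis, one simple signal is the share of letters written in Devanagari, the script used for Marathi. The sketch below computes that ratio with the standard library, counting only letters and combining marks. The function name `devanagari_ratio` and any threshold applied to its result are hypothetical, and this heuristic alone cannot distinguish Marathi from other Devanagari-script languages such as Hindi.

```python
import unicodedata


def devanagari_ratio(text: str) -> float:
    """Return the fraction of letters and combining marks in the Devanagari block."""
    total = 0
    devanagari = 0
    for ch in text:
        # Unicode categories starting with L (letters) or M (marks) are counted;
        # digits, punctuation, and whitespace are ignored.
        if unicodedata.category(ch)[0] in ("L", "M"):
            total += 1
            if 0x0900 <= ord(ch) <= 0x097F:  # Devanagari Unicode block
                devanagari += 1
    return devanagari / total if total else 0.0


ratio_marathi = devanagari_ratio("मराठी")
ratio_mixed = devanagari_ratio("मन ab")
```

A document with a low ratio is likely to be mostly non-Devanagari text and could be flagged for review before it enters a Marathi corpus.
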
Last verified: 2026-03-09