Sangraha Marathi Web Corpus

Sangraha is a large-scale Indic web corpus from AI4Bharat. Its curated Marathi subset is drawn from Common Crawl and processed with language identification, quality filtering, and deduplication. The corpus provides verified, unverified, and synthetic (back-translated) splits for Marathi language model training.

Use Sangraha's curated Marathi subset to train a Marathi language model on cleaner data than a raw web crawl provides.

Quick Start

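from datasets import load_dataset

# Stream the verified Marathi subset so the full corpus is not downloaded up front.
ds = load_dataset('ai4bharat/sangraha', 'verified.mar', split='train', streaming=True)

# Print the first 100 characters of the first five documents.
for i, ex in enumerate(ds):
    print(f"Text: {ex['text'][:100]}...")
    if i >= 4:
        break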

Modality: text
Size: Large-scale web corpus; Marathi subset with millions of documents
License:
Format: JSONL / Parquet
Language: mr
Update Frequency: static
Organization: AI4Bharat
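
Because JSONL is one of the listed formats, a local export can be read one record per line with the standard library alone. The following is a minimal sketch under that assumption: `read_jsonl` is a hypothetical helper name, and the two sample records are made up for illustration rather than taken from the corpus.

```python
import io
import json
from typing import Iterator, TextIO


def read_jsonl(handle: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL stream."""
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)


# Made-up sample standing in for a real export file; not corpus data.
sample = io.StringIO(
    '{"text": "पुणे हे महाराष्ट्रातील शहर आहे.", "source": "example.org"}\n'
    '\n'
    '{"text": "मराठी ही एक भाषा आहे.", "source": "example.net"}\n'
)

records = list(read_jsonl(sample))
print(len(records), records[0]["source"])
```

For a file on disk, the same helper would be called with `open(path, encoding="utf-8")` in place of the in-memory sample.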

Schema

Field   Type    Description
text    string  Curated Marathi web text
source  string  Source of the text
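
Before feeding records into a training pipeline, it can help to check that each one carries the two string fields listed in the schema. The sketch below is one way to do that in plain Python; `schema_problems` is a hypothetical helper, and the example records are invented for illustration.

```python
from typing import Any, Mapping

# Field names and types as listed in the schema for this dataset.
EXPECTED_FIELDS = {"text": str, "source": str}


def schema_problems(record: Mapping[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the record conforms."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected_type.__name__}, got {actual}")
    return problems


# Invented records for illustration only, not taken from the corpus.
good = {"text": "नमस्कार, हे एक उदाहरण आहे.", "source": "example.org"}
bad = {"text": 123}

print(schema_problems(good))
print(schema_problems(bad))
```

Returning a list of problems instead of raising on the first one makes it easier to log every defect in a malformed record in a single pass.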

Build With This

Create a comparative study of Marathi LLM pre-training quality using Sangraha vs raw web crawl data
Develop a Marathi writing quality benchmark using Sangraha's curated text as reference for text generation evaluation
Build a Marathi content filtering pipeline that replicates Sangraha's curation approach for new web crawl data
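
As a starting point for the filtering pipeline idea, the sketch below chains three simplified stages: a script check, a length filter, and exact deduplication. It is not Sangraha's actual curation code. The Devanagari-ratio check is a stand-in for real language identification (it cannot tell Marathi from Hindi, which shares the script), and the function names and thresholds (`min_chars`, `min_ratio`) are assumptions chosen for illustration.

```python
import hashlib
import re
import unicodedata
from typing import Iterable, Iterator

# Devanagari Unicode block (U+0900 to U+097F), the script used to write Marathi.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")


def devanagari_ratio(text: str) -> float:
    """Fraction of letter and combining-mark characters that are Devanagari."""
    letters = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if DEVANAGARI.match(ch)) / len(letters)


def normalize(text: str) -> str:
    """NFC-normalize and collapse whitespace so trivial variants hash alike."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def curate(docs: Iterable[str], min_chars: int = 20, min_ratio: float = 0.7) -> Iterator[str]:
    """Yield documents that pass the length, script, and exact-duplicate checks."""
    seen = set()
    for doc in docs:
        clean = normalize(doc)
        if len(clean) < min_chars:  # crude quality filter: drop very short text
            continue
        if devanagari_ratio(clean) < min_ratio:  # crude script-based filter
            continue
        digest = hashlib.sha256(clean.encode("utf-8")).hexdigest()
        if digest in seen:  # exact duplicate after normalization
            continue
        seen.add(digest)
        yield clean


# Invented sample documents for illustration only.
docs = [
    "महाराष्ट्र हे भारतातील एक मोठे राज्य आहे.",
    "महाराष्ट्र हे  भारतातील एक मोठे राज्य आहे.",  # same text, extra whitespace
    "This page is written entirely in English text.",  # wrong script
    "नमस्कार",  # too short
]
kept = list(curate(docs))
print(len(kept))
```

A fuller pipeline would likely swap the script check for a trained language identifier and add near-duplicate detection, since exact hashing misses documents that differ by only a few characters.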

AI Use Cases

Marathi LLM pre-training and continued pre-training
Domain-specific fine-tuning data extraction
Web content quality analysis
Marathi text generation

Last verified: 2026-03-09