Sangraha Marathi Web Corpus

Sangraha is a large-scale Indic web corpus from AI4Bharat. Its curated Marathi subset is drawn from Common Crawl and passed through language identification, quality filtering, and deduplication. The data ships in three splits, verified, unverified, and synthetic (back-translated), for Marathi language model training.

Use Sangraha's curated Marathi subset to train Marathi language models on cleaner data than a raw crawl provides.

Quick Start

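from datasets import load_dataset
# Stream the verified Marathi data so nothing is downloaded up front.
# data_dir selects the verified/mar folder of the Hugging Face repository.
ds = load_dataset('ai4bharat/sangraha', data_dir='verified/mar', split='train', streaming=True)
# Print the first 100 characters of the first five documents.
for i, ex in enumerate(ds):
    print(f"Text: {ex['text'][:100]}...")
    if i >= 4:
        break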
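
Because the Quick Start streams records, a fixed-size sample can be pulled for inspection without reading the whole split. The sketch below is one way to do that; `take_sample` and `summarize` are hypothetical helper names, and the code assumes only that each record is a dict with a `text` field, as listed under Schema.

```python
from itertools import islice
from typing import Dict, Iterable, List


def take_sample(records: Iterable[Dict[str, str]], n: int = 5) -> List[str]:
    """Return the text of the first n records without consuming the rest."""
    return [rec["text"] for rec in islice(records, n)]


def summarize(texts: List[str]) -> Dict[str, float]:
    """Basic size statistics (in characters) for a sample of documents."""
    lengths = [len(t) for t in texts]
    return {
        "documents": len(texts),
        "mean_chars": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_chars": max(lengths, default=0),
    }
```

With the streaming dataset from the Quick Start, `summarize(take_sample(ds, 100))` would give a rough idea of document length before committing to a full pass.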

Modality

text

Size

Large-scale web corpus; Marathi subset with millions of documents

License

CC-BY-4.0

Format

JSONL / Parquet

Language

Marathi

Update Frequency

static

Organization

AI4Bharat

Schema

Field	Type	Description
text	string	Curated Marathi web text
source	string	Source of the text
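
The hosted files may expose different or additional columns than the two listed here, so it can be worth checking the first few records before building on them. A minimal sketch, assuming the field names and types from the table above; `schema_problems` is a hypothetical helper, not part of any library.

```python
from typing import Any, Dict, List

# Field names and types copied from the Schema table; adjust if the
# actual records differ.
EXPECTED_FIELDS = {"text": str, "source": str}


def schema_problems(record: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable mismatches; empty means the record fits."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected_type.__name__}, got {actual}")
    return problems
```

Calling `schema_problems(next(iter(ds)))` on the streamed dataset returns an empty list when the first record matches the table.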

Build With This

Create a comparative study of Marathi LLM pre-training quality using Sangraha vs raw web crawl data

Develop a Marathi writing quality benchmark using Sangraha's curated text as reference for text generation evaluation

Build a Marathi content filtering pipeline that replicates Sangraha's curation approach for new web crawl data
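
As a starting point for the filtering pipeline idea, the three curation stages named in the description (language identification, quality filtering, deduplication) can be approximated with very simple stand-ins. This is a sketch under stated assumptions, not Sangraha's actual pipeline: the script-ratio check only detects Devanagari, which Marathi shares with Hindi and other languages, the length threshold is arbitrary, and deduplication is exact-match only. All function names and thresholds are hypothetical.

```python
import hashlib
from typing import Iterable, Iterator


def devanagari_ratio(text: str) -> float:
    """Fraction of alphabetic characters in the Devanagari block (U+0900 to U+097F)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0900" <= ch <= "\u097f" for ch in letters) / len(letters)


def filter_documents(
    texts: Iterable[str],
    min_chars: int = 20,            # assumed quality threshold
    min_script_ratio: float = 0.8,  # assumed script threshold
) -> Iterator[str]:
    """Yield whitespace-normalized documents that pass all three checks."""
    seen = set()
    for text in texts:
        normalized = " ".join(text.split())
        # Quality stand-in: drop very short documents.
        if len(normalized) < min_chars:
            continue
        # Language-ID stand-in: require mostly Devanagari letters.
        if devanagari_ratio(normalized) < min_script_ratio:
            continue
        # Exact deduplication on the normalized text.
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield normalized
```

A real replication would likely swap in a trained language identifier and near-duplicate detection, but this structure keeps each stage separately replaceable.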

AI Use Cases

Marathi LLM pre-training and continued pre-training

Domain-specific fine-tuning data extraction

Web content quality analysis

Marathi text generation

Related Datasets

AI4Bharat BPCC (mr): parallel-text

AI4Bharat IndicCorp v1 (mr): text

AI4Bharat IndicCorp v2 (Marathi): text

AI4Bharat IndicGLUE (mr): text

Last verified: 2026-03-09