AI4Bharat IndicCorp v1 (mr)

MH Specific

The Marathi (mr) subset of AI4Bharat IndicCorp v1, a text dataset for natural language processing.

Pre-train Marathi word embeddings on IndicCorp v1 to create static and contextual embeddings for downstream NLP tasks.

Quick Start

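from datasets import load_dataset

# Stream the Marathi ('mr') config so the full corpus is not downloaded up front.
ds = load_dataset('ai4bharat/IndicCorp-v1', 'mr', split='train', streaming=True)
# Print the first 100 characters of the first five documents.
for i, ex in enumerate(ds):
    print(f"Text: {ex['text'][:100]}...\n")
    if i >= 4:
        break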
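
The streaming loop above can be extended to gather rough statistics before committing to a full pre-training run. The following is a minimal sketch, assuming each example is a mapping with a `text` field as in the schema; `sample_token_stats` and its `limit` parameter are hypothetical names introduced here, and whitespace splitting is only a rough proxy, not the tokenization behind the published token count.

```python
from itertools import islice
from typing import Iterable, Mapping


def sample_token_stats(examples: Iterable[Mapping[str, str]], limit: int = 1000) -> dict:
    """Count documents and whitespace-separated tokens in the first `limit` examples."""
    docs = 0
    tokens = 0
    for ex in islice(examples, limit):
        docs += 1
        # Assumption: splitting on whitespace approximates word tokens.
        tokens += len(ex["text"].split())
    avg = tokens / docs if docs else 0.0
    return {"docs": docs, "tokens": tokens, "avg_tokens": avg}


# Stand-in records with the same shape as the streamed examples.
sample = [{"text": "मी शाळेत जातो"}, {"text": "आज हवामान छान आहे"}]
stats = sample_token_stats(sample)
print(stats)
```

Because the helper accepts any iterable, the streamed `ds` object from the Quick Start could be passed in place of `sample`.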

Modality: text
Size: 551M tokens
License: (not listed)
Format: CSV/JSON
Language: mr
Update Frequency: static
Organization: AI4Bharat, IIT Madras

Schema

Field   Type     Description
text    string   Marathi text document from web crawl

Build With This

Create a Marathi language model pre-training pipeline that compares how the v1 and v2 corpora affect downstream task performance
Develop a Marathi domain classifier trained on web text to automatically categorize documents by topic
Build a Marathi text deduplication tool to identify and remove near-duplicate content across corpora
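
As one possible starting point for the deduplication idea above, here is a minimal sketch that compares documents by the Jaccard similarity of their character 5-gram sets. The function names, the shingle size, the 0.8 threshold, and the NFC and whitespace normalization are all assumptions chosen for illustration, not settings taken from the dataset.

```python
import unicodedata
from typing import List, Set


def shingles(text: str, n: int = 5) -> Set[str]:
    """Return the set of character n-grams of a normalized text."""
    # Assumption: NFC normalization plus whitespace collapsing is enough
    # to make trivially different copies of a document compare equal.
    norm = " ".join(unicodedata.normalize("NFC", text).split())
    if len(norm) <= n:
        return {norm}
    return {norm[i:i + n] for i in range(len(norm) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(docs: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a document unless it is a near-duplicate of one already kept."""
    kept: List[str] = []
    kept_shingles: List[Set[str]] = []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


docs = [
    "आज पुण्यात जोरदार पाऊस पडला.",
    "आज  पुण्यात जोरदार पाऊस पडला.",  # same text with a doubled space
    "उद्या मुंबईत क्रिकेट सामना आहे.",
]
unique_docs = dedupe(docs)
print(len(unique_docs))
```

This pairwise comparison is quadratic in the number of documents, so it only suits small samples; at corpus scale an approximate method such as MinHash with locality-sensitive hashing would likely be needed instead.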

AI Use Cases

Language model pretraining
Last verified: 2026-03-07