AI4Bharat IndicCorp v1 (mr)

MH Specific

The Marathi (mr) subset of AI4Bharat IndicCorp v1, a monolingual web-crawled text corpus for natural language processing.

Use IndicCorp v1 to pre-train static and contextual Marathi embeddings for downstream NLP tasks.

Quick Start

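from datasets import load_dataset

# Stream the Marathi (mr) configuration so the full corpus is not downloaded up front.
ds = load_dataset('ai4bharat/IndicCorp-v1', 'mr', split='train', streaming=True)

# Print the first 100 characters of the first five documents.
for i, ex in enumerate(ds):
    if i >= 5:
        break
    print(f"Text: {ex['text'][:100]}...\n")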

Modality

text

Size

551M tokens

License

CC-BY-NC-4.0

Format

CSV/JSON

Language

Marathi (mr)

Update Frequency

static

Organization

AI4Bharat, IIT Madras

Schema

Field	Type	Description
text	string	Marathi text document from web crawl
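Because each record carries a single `text` field, basic corpus statistics can be gathered by iterating over the records. The following is a minimal sketch, not part of the dataset's tooling: `token_stats` is a hypothetical helper, and plain whitespace splitting is an assumption that will not necessarily reproduce the token count reported under Size.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Mapping


def token_stats(examples: Iterable[Mapping[str, str]], limit: int = 1000) -> dict:
    """Compute whitespace-token statistics over the first `limit` records.

    Each record is expected to follow the schema above: a mapping with a
    string-valued 'text' field. (Hypothetical helper for illustration.)
    """
    vocab = Counter()
    n_docs = 0
    for ex in islice(examples, limit):
        # Assumption: split on whitespace; a real pipeline may use an
        # Indic-aware tokenizer instead.
        vocab.update(ex["text"].split())
        n_docs += 1
    return {
        "documents": n_docs,
        "tokens": sum(vocab.values()),
        "types": len(vocab),
        "top": vocab.most_common(5),
    }
```

The function accepts any iterable of records, so it should work with the streaming dataset from Quick Start as well as with an in-memory list of dictionaries.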

Build With This

Create a Marathi language-model pre-training pipeline that compares IndicCorp v1 and v2 corpus quality by downstream task performance

Develop a Marathi domain classifier trained on web text to automatically categorize documents by topic

Build a Marathi text deduplication tool to identify and remove near-duplicate content across corpora
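As a starting point for the deduplication idea above, near-duplicates can be flagged by comparing word-shingle sets with Jaccard similarity. This is a minimal sketch under stated assumptions: the function names, the 3-word shingle size, and the 0.8 threshold are all illustrative choices, and the pairwise comparison is quadratic in the number of documents, so a corpus of this size would need an approximate method such as MinHash with locality-sensitive hashing.

```python
from __future__ import annotations

from itertools import combinations


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles of a whitespace-tokenized text."""
    tokens = text.split()
    if not tokens:
        return set()
    if len(tokens) < k:
        # Short documents become a single shingle of all their tokens.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def near_duplicates(docs: list[str], threshold: float = 0.8, k: int = 3):
    """Return (i, j, score) for every document pair with similarity >= threshold.

    Hypothetical helper: exhaustive O(n^2) comparison, suitable only for
    small samples.
    """
    sets = [shingles(d, k) for d in docs]
    pairs = []
    for i, j in combinations(range(len(docs)), 2):
        score = jaccard(sets[i], sets[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs
```

Keeping only the lower-indexed document from each returned pair is one simple removal policy; which copy to keep is a design choice left open here.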

AI Use Cases

Language model pretraining

Related Datasets

AI4Bharat BPCC (mr)

parallel-text

AI4Bharat IndicCorp v2 (Marathi)

text

AI4Bharat IndicGLUE (mr)

text

AI4Bharat IndicHeadlineGeneration (mr)

text

Last verified: 2026-03-07