AI4Bharat IndicCorp v2 (Marathi)

MH Specific

Marathi subset of IndicCorp v2, a 20.9-billion-token multilingual corpus covering 24 Indic languages, collected from diverse web sources for language model pretraining.

Train a large Marathi language model on this corpus and fine-tune it for downstream NLP tasks such as summarization and question answering.

Quick Start

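from datasets import load_dataset

# Stream the Marathi ('mr') subset so the full corpus is not downloaded up front.
ds = load_dataset('ai4bharat/IndicCorpV2', 'mr', streaming=True)

# Print the first 150 characters of the first three documents.
for i, ex in enumerate(ds['train']):
    print(f"Text: {ex['text'][:150]}...")
    if i >= 2:
        break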
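The preview loop above can also be written with itertools.islice, which stops after a fixed number of records without a manual counter. The following is a minimal sketch that works on any iterable of records with a `text` field. The `preview` helper and the in-memory sample records are hypothetical; they stand in for the streamed split so the example runs offline.

```python
from itertools import islice
from typing import Dict, Iterable, List


def preview(records: Iterable[Dict[str, str]], n: int = 3, width: int = 150) -> List[str]:
    """Return the first `width` characters of the text of the first `n` records."""
    return [rec["text"][:width] for rec in islice(records, n)]


# Hypothetical in-memory records standing in for ds['train'].
sample = [
    {"text": "मराठी ही महाराष्ट्राची राजभाषा आहे."},
    {"text": "पुणे हे महाराष्ट्रातील एक शहर आहे."},
    {"text": "आज हवामान छान आहे."},
    {"text": "हे वाक्य दाखवले जाणार नाही."},
]

for snippet in preview(sample):
    print(f"Text: {snippet}...")
```

Because `islice` consumes only as many items as it returns, the same helper should be safe to pass a streaming split to, since it never iterates past the requested count.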
Modality: text
Size: Part of a 20.9B-token multilingual corpus
License: (not specified)
Format: Parquet
Language: mr
Update Frequency: static
Organization: AI4Bharat

Schema

Field | Type | Description
text | string | Marathi text document from the web-crawled corpus
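Since the schema has a single `text` field of type string, a basic sanity check before training is straightforward. The sketch below is one possible way to do it, not part of the dataset tooling: `is_valid_record` and `devanagari_ratio` are hypothetical helper names. The ratio only checks the script, not the language, because Devanagari is shared with Hindi and other languages.

```python
from typing import Any, Dict


def is_valid_record(rec: Dict[str, Any]) -> bool:
    """True if the record has a non-empty string in the `text` field."""
    text = rec.get("text")
    return isinstance(text, str) and bool(text.strip())


def devanagari_ratio(text: str) -> float:
    """Share of non-whitespace characters in the Devanagari block (U+0900 to U+097F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum("\u0900" <= c <= "\u097f" for c in chars) / len(chars)


# Hypothetical records: one well-formed, one wrong type, one blank.
records = [{"text": "मराठी"}, {"text": 5}, {"text": "   "}]
valid = [r for r in records if is_valid_record(r)]
print(len(valid), devanagari_ratio(valid[0]["text"]))
```

A threshold on the ratio (the exact value is a judgment call) could be used to drop documents that are mostly Latin script or markup.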

Build With This

Fine-tune a GPT-style model on this data to create a Marathi text generation service for content creation
Use this corpus to pre-train domain-specific embeddings for Marathi semantic search applications
Build a Marathi autocomplete and writing assistant powered by a language model trained on this corpus
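To make the autocomplete idea concrete, here is a toy sketch that suggests next words from bigram counts. It is a deliberately simple stand-in for a trained language model, uses whitespace tokenization, and runs on three hypothetical sentences rather than the corpus; `train_bigrams` and `suggest` are illustrative names only.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List


def train_bigrams(texts: Iterable[str]) -> Dict[str, Counter]:
    """Count, for each word, which words follow it (whitespace tokenization)."""
    following: Dict[str, Counter] = defaultdict(Counter)
    for text in texts:
        words = text.split()
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following


def suggest(model: Dict[str, Counter], word: str, k: int = 3) -> List[str]:
    """Return up to `k` of the most frequent words seen after `word`."""
    counts = model.get(word)  # .get avoids creating entries for unseen words
    if counts is None:
        return []
    return [w for w, _ in counts.most_common(k)]


# Hypothetical sample sentences standing in for corpus documents.
texts = ["मी शाळेत जातो", "मी घरी जातो", "मी शाळेत शिकतो"]
model = train_bigrams(texts)
print(suggest(model, "मी"))
```

In a real system the counts would be replaced by a model trained on the full corpus, but the interface (a prefix in, ranked continuations out) would likely stay similar.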

AI Use Cases

Language model pretraining
Word embedding training
Text classification
Transfer learning for Marathi NLP
Last verified: 2026-03-07