mC4 (Marathi) - Awesome Marathi Datasets

mC4 (Marathi)

MH Specific

Marathi subset of the multilingual Colossal Clean Crawled Corpus (mC4) with approximately 7.8 million documents and 14 billion raw tokens, used for large-scale language model pretraining.

Pre-train a Marathi-specific language model on the mC4 corpus to create a foundation model for downstream NLP tasks like classification and QA.

Homepage HuggingFace

Quick Start

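from datasets import load_dataset
# Stream the Marathi ('mr') config so the full corpus is never downloaded to disk.
# Note: recent `datasets` releases may no longer ship the legacy 'mc4' loader;
# in that case 'allenai/c4' with the 'mr' config may be needed instead (verify on the Hub).
ds = load_dataset('mc4', 'mr', split='train', streaming=True)
# Preview the URL and the first 100 characters of the first five documents.
for i, ex in enumerate(ds):
    print(f"URL: {ex['url'][:60]}")
    print(f"Text: {ex['text'][:100]}...\n")
    if i >= 4:
        break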

Modality

text

Size

~7.8M documents, ~14B tokens (raw)

License

ODC-BY-1.0

Format

JSON

Language

Marathi (mr)

Update Frequency

static

Organization

Google / Allen AI

Schema

Field	Type	Description
text	string	Web-crawled Marathi text document
timestamp	string	Crawl timestamp of the document
url	string	Source URL of the crawled page
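
The three fields above can be unpacked with the standard library alone. The following is a minimal sketch, not part of the dataset's tooling: `summarize_record` and the `sample` record are hypothetical, and the timestamp format (ISO 8601 in UTC with a trailing `Z`) is an assumption that should be checked against real records before relying on it.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse


def summarize_record(record):
    """Return (domain, crawl_datetime, char_count) for one mC4-style record.

    Hypothetical helper for illustration only.
    """
    # The host part of the source URL is a cheap proxy for the publisher.
    domain = urlparse(record["url"]).netloc.lower()
    # Assumption: timestamps look like "2020-02-22T22:24:31Z" (ISO 8601, UTC).
    crawled = datetime.strptime(
        record["timestamp"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return domain, crawled, len(record["text"])


# Made-up record that mirrors the schema; not taken from the corpus.
sample = {
    "text": "नमस्कार जग",
    "timestamp": "2020-02-22T22:24:31Z",
    "url": "https://example.com/news/1",
}

domain, crawled, n_chars = summarize_record(sample)
print(domain, crawled.isoformat(), n_chars)
```

Grouping records by domain or crawl year in this way is one simple starting point for inspecting what the Marathi subset actually contains.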

Build With This

Create a Marathi domain classifier that categorizes web content by topic (news, government, education, commerce) for targeted corpus curation

Develop a Marathi text quality scorer that filters high-quality documents from mC4 for cleaner language model training

Build a Marathi web content analyzer that identifies emerging topics and trends from the crawled corpus for media monitoring
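
As a starting point for the text quality scorer idea above, one simple heuristic is to measure how much of a document is written in Devanagari script. This is a rough sketch under stated assumptions: `devanagari_ratio` and `looks_high_quality` are hypothetical helpers, and the thresholds (`min_chars=200`, `min_ratio=0.8`) are arbitrary illustrative values, not tuned on mC4. Devanagari is also used for Hindi and other languages, so a script ratio alone does not confirm that a document is Marathi.

```python
def devanagari_ratio(text):
    """Fraction of non-whitespace characters in the Devanagari block (U+0900 to U+097F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return hits / len(chars)


def looks_high_quality(record, min_chars=200, min_ratio=0.8):
    """Keep documents that are long enough and mostly Devanagari.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    text = record["text"]
    return len(text) >= min_chars and devanagari_ratio(text) >= min_ratio


# Usage with a streamed dataset `ds` whose records have a 'text' field:
# filtered = (ex for ex in ds if looks_high_quality(ex))
```

A script-ratio filter mainly removes boilerplate-heavy or mixed-language pages; a fuller scorer would likely add language identification and deduplication on top of it.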

AI Use Cases

Language model pretraining

Large-scale text mining

Unsupervised representation learning

Marathi LLM training

Related Datasets

AI4Bharat BPCC (mr)

parallel-text

AI4Bharat IndicCorp v1 (mr)

text

AI4Bharat IndicCorp v2 (Marathi)

text

AI4Bharat IndicGLUE (mr)

text

Last verified: 2026-03-07