mC4 (Marathi)

MH Specific

Marathi subset of the multilingual Colossal Clean Crawled Corpus (mC4) with approximately 7.8 million documents and 14 billion raw tokens, used for large-scale language model pretraining.

Pre-train a Marathi-specific language model on the mC4 corpus to create a foundation model for downstream NLP tasks like classification and QA.
Homepage | HuggingFace

Quick Start

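from datasets import load_dataset

# Stream the Marathi ('mr') subset instead of downloading the whole corpus.
# Recent `datasets` releases may expect the 'allenai/c4' repository ID rather than 'mc4'.
ds = load_dataset('mc4', 'mr', split='train', streaming=True)
# Preview the first five documents.
for i, ex in enumerate(ds):
    print(f"URL: {ex['url'][:60]}")
    print(f"Text: {ex['text'][:100]}...\n")
    if i >= 4:
        break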
Modality: text
Size: ~7.8M documents, ~14B tokens (raw)
License:
Format: JSON
Language: mr
Update Frequency: static
Organization: Google / Allen AI

Schema

Field      Type    Description
text       string  Web-crawled Marathi text document
timestamp  string  Crawl timestamp of the document
url        string  Source URL of the crawled page
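
As a minimal sketch of how the three schema fields might be handled, the snippet below checks a record against the schema and derives a crawl date and host name. The helper name `summarize_record`, the sample record, and the assumption that `timestamp` is an ISO 8601 string ending in `Z` are illustrative and are not confirmed by this page.

```python
from datetime import datetime
from urllib.parse import urlparse

EXPECTED_FIELDS = ("text", "timestamp", "url")


def summarize_record(record):
    """Validate one record against the schema and derive a few convenience values."""
    for field in EXPECTED_FIELDS:
        if not isinstance(record.get(field), str):
            raise ValueError(f"missing or non-string field: {field}")
    # Assumption: timestamps look like '2020-01-01T00:00:00Z' (ISO 8601, UTC).
    crawled_at = datetime.fromisoformat(record["timestamp"].replace("Z", "+00:00"))
    return {
        "host": urlparse(record["url"]).netloc,
        "crawl_date": crawled_at.date().isoformat(),
        "num_chars": len(record["text"]),
    }


# Hypothetical record, made up for illustration only.
sample = {
    "text": "नमस्कार जग",
    "timestamp": "2020-01-01T00:00:00Z",
    "url": "https://example.com/page",
}
summary = summarize_record(sample)
print(summary)
```

If the real timestamps turn out to use a different layout, only the parsing line needs to change. The field check and host extraction do not depend on it.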

Build With This

Create a Marathi domain classifier that categorizes web content by topic (news, government, education, commerce) for targeted corpus curation
Develop a Marathi text quality scorer that filters high-quality documents from mC4 for cleaner language model training
Build a Marathi web content analyzer that identifies emerging topics and trends from the crawled corpus for media monitoring
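
The quality-scorer idea above could start from a simple script heuristic. The sketch below scores a document by the share of its letters that fall in the Devanagari Unicode block (U+0900 to U+097F) and applies a length floor. The function names and the thresholds (`min_chars=200`, `min_ratio=0.8`) are hypothetical choices, not values taken from mC4. Because Devanagari is shared with Hindi and other languages, this check alone does not confirm that a text is Marathi.

```python
def is_devanagari(ch):
    """True if the character lies in the Devanagari Unicode block."""
    return "\u0900" <= ch <= "\u097f"


def devanagari_ratio(text):
    """Fraction of letters and Devanagari signs that are Devanagari (0.0 if none)."""
    # Combining vowel signs are not 'alpha', so Devanagari is tested explicitly.
    relevant = [ch for ch in text if ch.isalpha() or is_devanagari(ch)]
    if not relevant:
        return 0.0
    return sum(1 for ch in relevant if is_devanagari(ch)) / len(relevant)


def keep_document(text, min_chars=200, min_ratio=0.8):
    """Hypothetical filter: long enough and mostly Devanagari script."""
    return len(text) >= min_chars and devanagari_ratio(text) >= min_ratio


# Tiny made-up examples for illustration.
print(devanagari_ratio("मराठी"))
print(devanagari_ratio("hello"))
print(keep_document("मराठी " * 100))
```

A script-ratio filter is cheap enough to run over a streamed corpus. A fuller scorer would likely add language identification and duplicate detection on top of it.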

AI Use Cases

Language model pretraining
Large-scale text mining
Unsupervised representation learning
Marathi LLM training
Last verified: 2026-03-07