CC-100 (mr) - Awesome Marathi Datasets

MH Specific

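CC-100 (mr) is the Marathi portion of CC-100, a monolingual text corpus built from Common Crawl web data for natural language processing.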

Use CC-100 Marathi data to train sentence-level embeddings for building a Marathi semantic search system.

Quick Start

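from datasets import load_dataset

# Stream the Marathi ('mr') configuration so the full corpus is not downloaded up front.
ds = load_dataset('cc100', 'mr', split='train', streaming=True)

# Print the first 100 characters of the first five examples.
for i, ex in enumerate(ds):
    print(f"Text: {ex['text'][:100]}...")
    if i >= 4:
        break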

Modality

text

Size

334M tokens

License

MIT

Format

CSV/JSON

Language

Marathi (mr)

Update Frequency

static

Organization

Meta AI / Johns Hopkins University

Schema

Field	Type	Description
text	string	Marathi text from web crawl (Common Crawl)
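
Because the schema exposes a single text field, downstream code often has to rebuild document boundaries itself. The sketch below assumes, as described for the original CC-100 release, that each record holds one paragraph and that an empty record marks the end of a document; check this against the data you actually load. The helper name group_documents is hypothetical.

```python
from typing import Iterable, Iterator


def group_documents(records: Iterable[dict]) -> Iterator[str]:
    """Yield one string per document from a stream of {'text': ...} records.

    Assumption: an empty (whitespace-only) 'text' value separates documents.
    """
    paragraphs = []
    for record in records:
        line = record["text"].strip()
        if line:
            paragraphs.append(line)
        elif paragraphs:
            # Blank record: close the current document.
            yield "\n".join(paragraphs)
            paragraphs = []
    if paragraphs:
        # Flush the last document if the stream did not end with a blank record.
        yield "\n".join(paragraphs)
```

Passing the streaming dataset from Quick Start, as in group_documents(ds), would then iterate over whole documents rather than individual paragraphs.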

Build With This

Create a data quality scoring pipeline that filters CC-100 Marathi text by linguistic quality for cleaner pre-training

Develop a Marathi topic modeling system to discover latent themes in web-crawled content

Build a Marathi language identification model trained to distinguish Marathi from closely related Konkani and Hindi text
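
As a starting point for the data quality scoring idea above, here is a minimal sketch that scores text by the share of its letters that fall in the Devanagari Unicode block (U+0900 to U+097F). This is a simple script-based heuristic, not a linguistic quality model, and it cannot separate Marathi from Hindi or Konkani, which are also written in Devanagari. The function names and thresholds are illustrative assumptions.

```python
def devanagari_ratio(text: str) -> float:
    """Return the fraction of letters that belong to the Devanagari block.

    Only characters for which str.isalpha() is true are counted, so digits,
    punctuation, and combining vowel signs are ignored.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    devanagari = sum(1 for ch in letters if "\u0900" <= ch <= "\u097f")
    return devanagari / len(letters)


def keep_text(text: str, min_ratio: float = 0.8, min_chars: int = 20) -> bool:
    """Keep text that is long enough and mostly written in Devanagari."""
    return len(text.strip()) >= min_chars and devanagari_ratio(text) >= min_ratio
```

A filter such as (ex for ex in ds if keep_text(ex['text'])) could then be applied to the streamed examples; the default thresholds of 0.8 and 20 are untuned placeholders.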

AI Use Cases

Language model pretraining

Related Datasets

AI4Bharat BPCC (mr)

parallel-text

AI4Bharat IndicCorp v1 (mr)

text

AI4Bharat IndicCorp v2 (Marathi)

text

AI4Bharat IndicGLUE (mr)

text

Last verified: 2026-03-07