CC-100 (mr)

MH Specific

CC-100 (mr) is the Marathi subset of CC-100, a monolingual text corpus built from Common Crawl web data for NLP.

Use CC-100 Marathi data to train sentence-level embeddings for building a Marathi semantic search system.

Quick Start

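from datasets import load_dataset

# Stream the Marathi ('mr') config so the corpus is not downloaded up front.
ds = load_dataset('cc100', 'mr', split='train', streaming=True)
# Print the first 100 characters of the first five examples, then stop.
for i, ex in enumerate(ds):
    print(f"Text: {ex['text'][:100]}...")
    if i >= 4:
        break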

Modality: text
Size: 334M tokens
License:
Format: CSV/JSON
Language: mr
Update Frequency: static
Organization: Meta AI / Johns Hopkins University

Schema

Field | Type | Description
text | string | Marathi text from web crawl (Common Crawl)

Build With This

Create a data quality scoring pipeline that filters CC-100 Marathi text by linguistic quality for cleaner pre-training
Develop a Marathi topic modeling system to discover latent themes in web-crawled content
Build a Marathi language identification model trained to distinguish Marathi from closely related Konkani and Hindi text
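
As a starting point for the first idea above, here is a minimal sketch of a heuristic quality score. It is an illustration under stated assumptions, not a method documented for CC-100: the function names, the 20-character minimum, and the 0.8 threshold are all hypothetical choices. The score is simply the share of letters that fall in the Unicode Devanagari block, the script Marathi is written in.

```python
import unicodedata

# Unicode Devanagari block (U+0900 to U+097F).
DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F


def devanagari_ratio(text: str) -> float:
    """Fraction of letter and combining-mark characters inside the Devanagari block."""
    letters = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not letters:
        return 0.0
    in_block = sum(DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END for ch in letters)
    return in_block / len(letters)


def quality_score(text: str, min_chars: int = 20) -> float:
    """Hypothetical score: 0.0 for very short lines, else the Devanagari ratio."""
    stripped = text.strip()
    if len(stripped) < min_chars:  # assumed cutoff, tune on real data
        return 0.0
    return devanagari_ratio(stripped)


def keep(text: str, threshold: float = 0.8) -> bool:
    """Keep a line when its score reaches the (assumed) threshold."""
    return quality_score(text) >= threshold


if __name__ == "__main__":
    samples = [
        "मराठी ही महाराष्ट्र राज्याची अधिकृत भाषा आहे.",
        "This is an English sentence from the web.",
        "मराठी",
    ]
    for s in samples:
        print(f"{quality_score(s):.2f}  keep={keep(s)}  {s}")
```

To apply it to the stream from Quick Start, filter with something like `(ex for ex in ds if keep(ex['text']))`. Note that a script-based score cannot separate Marathi from Hindi or Konkani, which are also written in Devanagari; that distinction needs a trained language identification model, as the third idea above suggests.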

AI Use Cases

Language model pretraining
Last verified: 2026-03-07