AI4Bharat IndicParaphrase (mr)

MH Specific

The Marathi (mr) subset of the AI4Bharat IndicParaphrase dataset, intended for paraphrase-related NLP tasks.

Build a Marathi duplicate content detector for news agencies to identify when multiple outlets cover the same story differently.
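As a starting point for the detector described above, here is a minimal baseline sketch. It scores headline pairs by the Jaccard overlap of their character trigrams. This is a swapped-in lexical heuristic, not a model trained on this dataset. The function names and the 0.5 threshold are illustrative assumptions, and a real detector would likely tune the threshold, or train a classifier, on paraphrase pairs from the dataset.

```python
from __future__ import annotations


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of a whitespace-normalised string."""
    normalised = " ".join(text.split())
    if len(normalised) < n:
        return {normalised} if normalised else set()
    return {normalised[i:i + n] for i in range(len(normalised) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def find_near_duplicates(
    headlines: list[str], threshold: float = 0.5
) -> list[tuple[int, int, float]]:
    """Return (i, j, score) for every pair scoring at or above the threshold.

    The threshold is a hypothetical default, not a value taken from the dataset.
    """
    pairs = []
    for i in range(len(headlines)):
        for j in range(i + 1, len(headlines)):
            score = jaccard(headlines[i], headlines[j])
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs
```

Because Python strings are Unicode, the same functions accept Devanagari text unchanged. The pairwise loop is quadratic in the number of headlines, so it suits small batches only. Purely lexical overlap will also miss paraphrases that share little surface wording.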

Homepage HuggingFace

Quick Start

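from datasets import load_dataset

# Stream the Marathi ('mr') config so the full split is not downloaded up front.
ds = load_dataset('ai4bharat/IndicParaphrase', 'mr', split='train', streaming=True)
for i, ex in enumerate(ds):
    # Field names ('input', 'target') are assumed from the Hugging Face dataset card;
    # inspect ex.keys() if the schema differs.
    print(f"Input: {ex['input'][:60]}...")
    print(f"Paraphrase: {ex['target'][:60]}...\n")
    if i >= 4:
        break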

Modality

text

Size

Not specified (this subset is part of a multilingual paraphrase dataset covering 10 languages)

License

CC0-1.0

Format

CSV/JSON

Language

Marathi (mr)

Update Frequency

static

Organization

AI4Bharat