AI4Bharat IndicParaphrase (mr)

MH Specific

The Marathi (mr) subset of the AI4Bharat IndicParaphrase dataset, intended for NLP tasks such as paraphrase detection.

Build a Marathi duplicate content detector for news agencies to identify when multiple outlets cover the same story differently.
Homepage | HuggingFace

Quick Start

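from datasets import load_dataset

# Stream the Marathi ('mr') training split so the full dataset is not downloaded.
ds = load_dataset('ai4bharat/IndicParaphrase', 'mr', split='train', streaming=True)

# Print the first five sentence pairs with their paraphrase labels.
for i, ex in enumerate(ds):
    print(f"S1: {ex['sentence1'][:60]}...")
    print(f"S2: {ex['sentence2'][:60]}...")
    print(f"Paraphrase: {bool(ex['label'])}\n")
    if i >= 4:
        break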

Modality: text
Size: Multilingual paraphrase dataset for 10 languages
License:
Format: CSV/JSON
Language: mr
Update Frequency: static
Organization: AI4Bharat, IIT Madras

Schema

Field      Type    Description
sentence1  string  First Marathi sentence
sentence2  string  Second Marathi sentence (paraphrase or non-paraphrase)
label      int     1 if paraphrase, 0 if not
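
The schema above can be mirrored in a small typed record, which makes it easier to catch malformed rows early. The sketch below is illustrative only: the names ParaphrasePair and parse_record are hypothetical, the field names are assumed to match the schema as listed here, and the sample sentences are made up rather than taken from the dataset.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class ParaphrasePair:
    """One record, mirroring the three fields listed in the schema."""
    sentence1: str
    sentence2: str
    is_paraphrase: bool


def parse_record(record: Mapping[str, Any]) -> ParaphrasePair:
    """Validate a raw record and convert it to a ParaphrasePair.

    Raises ValueError if a field is missing or has an unexpected value.
    """
    for key in ("sentence1", "sentence2", "label"):
        if key not in record:
            raise ValueError(f"missing field: {key}")
    if not isinstance(record["sentence1"], str) or not isinstance(record["sentence2"], str):
        raise ValueError("sentence1 and sentence2 must be strings")
    label = record["label"]
    if label not in (0, 1):
        raise ValueError(f"label must be 0 or 1, got {label!r}")
    return ParaphrasePair(record["sentence1"], record["sentence2"], bool(label))


# Made-up example record (assumption: not an actual row from the dataset).
example = {
    "sentence1": "मुंबईत आज जोरदार पाऊस झाला.",
    "sentence2": "आज मुंबईमध्ये मुसळधार पाऊस पडला.",
    "label": 1,
}
pair = parse_record(example)
print(pair.is_paraphrase)
```

Converting the integer label to a boolean once, at parse time, keeps later filtering code free of repeated 0/1 comparisons.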

Build With This

Create a Marathi question deduplication system for educational Q&A platforms to merge equivalent student questions
Develop a plagiarism checker for Marathi academic papers that identifies paraphrased content across submissions
Build a Marathi semantic search engine that retrieves relevant documents even when queries use different wording
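
As a starting point for the deduplication and plagiarism ideas above, a lexical baseline can be sketched with character n-gram Jaccard similarity. This is not a method documented for this dataset, just a simple stand-in: the function names and the 0.5 threshold are assumptions, and a purely lexical measure will miss paraphrases that share little wording. The labeled pairs could be used to pick a better threshold or to train a learned model instead.

```python
from __future__ import annotations


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of a whitespace-normalized string."""
    normalized = " ".join(text.split())
    if len(normalized) < n:
        # Short strings contribute themselves as a single gram.
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of character n-gram sets, in the range [0.0, 1.0]."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def is_near_duplicate(a: str, b: str, threshold: float = 0.5) -> bool:
    """Flag a pair as a likely duplicate (threshold is an untuned assumption)."""
    return jaccard_similarity(a, b) >= threshold


# Made-up headlines (assumption: not taken from the dataset).
headline_a = "मुंबईत आज जोरदार पाऊस झाला."
headline_b = "आज मुंबईमध्ये मुसळधार पाऊस पडला."
print(f"similarity: {jaccard_similarity(headline_a, headline_b):.2f}")
print(f"near duplicate: {is_near_duplicate(headline_a, headline_b)}")
```

Character n-grams are used here rather than word tokens because they need no Marathi tokenizer and tolerate small inflectional differences.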

AI Use Cases

Paraphrase generation
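
For paraphrase generation, only the positive pairs are useful as training examples. A minimal sketch of that filtering step follows, assuming the sentence1, sentence2, and label fields described in the schema; the helper name generation_pairs and the sample records are hypothetical.

```python
from typing import Any, Iterable, Iterator, Mapping, Tuple


def generation_pairs(
    records: Iterable[Mapping[str, Any]],
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs from records labeled as paraphrases."""
    for record in records:
        if record["label"] == 1:
            yield record["sentence1"], record["sentence2"]


# Made-up records (assumption: not actual rows from the dataset).
records = [
    {"sentence1": "मुंबईत आज जोरदार पाऊस झाला.",
     "sentence2": "आज मुंबईमध्ये मुसळधार पाऊस पडला.",
     "label": 1},
    {"sentence1": "मुंबईत आज जोरदार पाऊस झाला.",
     "sentence2": "पुण्यात उद्या शाळा बंद राहतील.",
     "label": 0},
]
pairs = list(generation_pairs(records))
print(len(pairs))
```

Because the function is a generator, it can also wrap a streamed dataset without loading everything into memory.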
Last verified: 2026-03-07