L3Cube-MahaParaphrase

MH Specific

L3Cube-MahaParaphrase is a Marathi paraphrase detection dataset of sentence pairs, each labeled as paraphrase or non-paraphrase.

Build a semantic search engine for Marathi educational content that finds similar questions and answers across textbooks.


Quick Start

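from datasets import load_dataset

# Load the dataset from the Hugging Face Hub
ds = load_dataset('l3cube-pune/MahaParaphrase')
# Slicing a split (ds['train'][:5]) returns a dict of columns, so select rows before iterating
for ex in ds['train'].select(range(5)):
    print(f"S1: {ex['sentence1'][:60]}...")
    print(f"S2: {ex['sentence2'][:60]}...")
    print(f"Paraphrase: {bool(ex['label'])}\n")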

Modality

text

Size

8,000 sentence pairs

License

CC-BY-NC-SA-4.0

Format

CSV

Language

Marathi

Update Frequency

static

Organization

L3Cube, Pune

Schema

Field	Type	Description
sentence1	string	First Marathi sentence
sentence2	string	Second Marathi sentence
label	int	Whether the sentences are paraphrases (1) or not (0)
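The schema above can be read with the standard library alone. The following is a minimal sketch, assuming the CSV has a header row whose column names match the schema; the two sample rows are made-up placeholders and are not taken from the dataset, and `read_pairs` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# Made-up placeholder rows for illustration only; NOT taken from the dataset.
# Assumption: the CSV has a header row with the column names from the schema.
SAMPLE_CSV = """sentence1,sentence2,label
मी शाळेत जातो.,मी शाळेला जातो.,1
आज पाऊस पडत आहे.,तो पुस्तक वाचत आहे.,0
"""

def read_pairs(csv_text):
    """Parse CSV text into (sentence1, sentence2, label) tuples, checking that label is 0 or 1."""
    pairs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        label = int(row["label"])
        if label not in (0, 1):
            raise ValueError(f"unexpected label: {label}")
        pairs.append((row["sentence1"], row["sentence2"], label))
    return pairs

pairs = read_pairs(SAMPLE_CSV)
# Count how many pairs fall into each class
label_counts = Counter(label for _, _, label in pairs)
print(label_counts)
```

Checking the label distribution this way before training is a cheap guard against a skewed or malformed file.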

Build With This

Create a question deduplication system for Marathi Q&A platforms that merges duplicate questions and consolidates answers

Develop a FAQ matching service that instantly finds whether a customer's Marathi query already has a documented answer

Build a plagiarism detection tool for Marathi academic submissions that identifies paraphrased content
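As a starting point for the deduplication and FAQ-matching ideas above, here is a minimal lexical baseline using character trigram Jaccard similarity. It is a sketch, not the dataset authors' method: the function names, the sample questions, and the 0.5 threshold are all hypothetical, and a model trained on the labeled pairs would likely be needed to catch paraphrases that share few surface characters.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a whitespace-normalized string."""
    normalized = " ".join(text.split())
    if len(normalized) < n:
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of the character trigram sets of two strings."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

def find_duplicate(query, known_questions, threshold=0.5):
    """Return (best_match, score); best_match is None when the score is below threshold.

    The threshold is a hypothetical default and should be tuned on labeled pairs.
    """
    best, best_score = None, 0.0
    for question in known_questions:
        score = jaccard(query, question)
        if score > best_score:
            best, best_score = question, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score

# Made-up placeholder questions for illustration only
known = ["भारताची राजधानी कोणती आहे?", "पाण्याचे रासायनिक सूत्र काय आहे?"]
match, score = find_duplicate("भारताची राजधानी कोणती?", known)
print(match, round(score, 2))
```

Python strings are indexed by code point, so trigrams can split a Devanagari consonant from its vowel sign; this is acceptable for a rough baseline but is one reason to treat the scores as approximate.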

AI Use Cases

Paraphrase detection

Related Datasets

AI4Bharat BPCC (mr)

parallel-text

AI4Bharat IndicCorp v1 (mr)

text

AI4Bharat IndicCorp v2 (Marathi)

text

AI4Bharat IndicGLUE (mr)

text

Last verified: 2026-03-07