PMIndia Marathi Parallel Corpus

English-Marathi parallel corpus extracted from the Prime Minister of India website (pmindia.gov.in), containing about 56,000 aligned sentence pairs. It covers speeches, press releases, and official communications, making it a government-domain resource for machine translation.

Use it to build a government-domain English-Marathi translator trained on official PM India communications.


Quick Start

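from datasets import load_dataset

# Stream the English-Marathi split so the full corpus is not downloaded up front.
ds = load_dataset('pmindia', 'en-mr', split='train', streaming=True)

# Print the first five sentence pairs, truncated to 60 characters each.
for i, ex in enumerate(ds):
    print(f"EN: {ex['translation']['en'][:60]}...")
    print(f"MR: {ex['translation']['mr'][:60]}...\n")
    if i >= 4:
        break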

Modality

text

Size

~56,000 sentence pairs (English-Marathi)

License

Open

Format

TSV / plain text

Language

en, mr

Update Frequency

static

Organization

University of Edinburgh

Schema

Field	Type	Description
en	string	English sentence from PM India website
mr	string	Marathi translation of the sentence
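
As a rough illustration of the schema, the sketch below reads a two-column tab-separated stream into records with en and mr keys. It assumes the English sentence comes first and the Marathi sentence second on each line, with no header row; that column order, the read_pairs helper, and the in-memory sample are assumptions made for illustration, not details confirmed by the dataset itself.

```python
import csv
import io
from typing import Iterator, TextIO


def read_pairs(handle: TextIO) -> Iterator[dict]:
    """Yield {'en': ..., 'mr': ...} records from a two-column TSV stream.

    Assumes English in the first column and Marathi in the second.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip malformed rows that do not have exactly two columns.
        if len(row) != 2:
            continue
        en, mr = (cell.strip() for cell in row)
        # Skip rows where either side is empty.
        if en and mr:
            yield {"en": en, "mr": mr}


# Toy in-memory sample standing in for a downloaded file (not corpus data).
sample = io.StringIO("Thank you\tधन्यवाद\nbroken line without a tab\n")
pairs = list(read_pairs(sample))
print(pairs)
```

For a file on disk, the same function can be given a handle opened with encoding="utf-8" and newline="" so that Devanagari text and line endings are read unchanged.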

Build With This

Create a government scheme explainer that translates and simplifies central government announcements into accessible Marathi

Develop a bilingual document alignment tool for Maharashtra government offices processing central directives

Build a translation memory system for government translators working on English-Marathi document pairs
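
The translation memory idea can be sketched in a few lines. The example below is a minimal, hypothetical version: it stores English-Marathi pairs and returns the closest stored English sentence using character-level similarity from Python's standard difflib module. The TranslationMemory class, the 0.75 threshold, and the two toy entries are illustrative assumptions, not part of the corpus or of any existing tool.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """Tiny in-memory translation memory with fuzzy English lookup."""

    def __init__(self, pairs):
        # pairs: iterable of (english, marathi) tuples.
        self.pairs = list(pairs)

    def lookup(self, query, threshold=0.75):
        """Return (english, marathi, score) for the best match, or None."""
        best_score, best_pair = 0.0, None
        for en, mr in self.pairs:
            # Case-insensitive similarity ratio in the range [0, 1].
            score = SequenceMatcher(None, query.lower(), en.lower()).ratio()
            if score > best_score:
                best_score, best_pair = score, (en, mr)
        if best_pair is not None and best_score >= threshold:
            return best_pair[0], best_pair[1], best_score
        return None


# Toy entries for illustration only; a real memory would be filled from the corpus.
tm = TranslationMemory([("Thank you", "धन्यवाद"), ("Welcome", "स्वागत आहे")])
match = tm.lookup("Thank you!")
print(match)
```

This version scans every stored pair on each query, which is fine for a demonstration but would likely be too slow across tens of thousands of pairs; a practical system would add an index or a candidate-retrieval step before scoring.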

AI Use Cases

English-Marathi machine translation training

Government domain language model fine-tuning

Cross-lingual transfer learning

Formal Marathi text generation

Related Datasets

AI4Bharat BPCC (mr)

parallel-text

AI4Bharat IndicCorp v1 (mr)

text

AI4Bharat IndicCorp v2 (Marathi)

text

AI4Bharat IndicGLUE (mr)

text

Last verified: 2026-03-09