PMIndia Marathi Parallel Corpus

An English-Marathi parallel corpus extracted from the Prime Minister of India website (pmindia.gov.in), containing up to 56,000 aligned sentence pairs. It covers speeches, press releases, and official communications, making it a domain-specific resource for machine translation.

Build a government-domain English-Marathi translator trained on official PM India communications.

Quick Start

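from datasets import load_dataset

# Stream the English-Marathi pairs instead of downloading the full dataset first
ds = load_dataset('pmindia', 'en-mr', split='train', streaming=True)

# Print the first five sentence pairs, truncated to 60 characters each
for i, ex in enumerate(ds):
    print(f"EN: {ex['translation']['en'][:60]}...")
    print(f"MR: {ex['translation']['mr'][:60]}...\n")
    if i >= 4:
        break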

Modality: text
Size: ~56,000 sentence pairs (English-Marathi)
License: (not specified)
Format: TSV / plain text
Language: en, mr
Update Frequency: static
Organization: University of Edinburgh

Schema

Field   Type     Description
en      string   English sentence from PM India website
mr      string   Marathi translation of the sentence
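
Because the corpus is listed as TSV / plain text, each row can be mapped onto the two schema fields with Python's csv module. The following is a minimal sketch that assumes one pair per line, English in the first column and Marathi in the second, and no header row; the actual column order and file names in the release are not stated here, so verify them before relying on this. The function name read_pairs and the sample rows are illustrative and not taken from the corpus.

```python
import csv
import io


def read_pairs(handle):
    """Yield {'en': ..., 'mr': ...} dicts from a tab-separated text stream.

    Assumption: column 0 is English and column 1 is Marathi.
    Rows that do not have exactly two non-empty columns are skipped.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal text
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue
        en, mr = (cell.strip() for cell in row)
        if en and mr:
            yield {"en": en, "mr": mr}


# Illustrative sample rows, including one malformed line
sample = (
    "Thank you.\tधन्यवाद.\n"
    "malformed line without a tab\n"
    "Good morning.\tशुभ प्रभात.\n"
)
pairs = list(read_pairs(io.StringIO(sample)))
for pair in pairs:
    print(pair["en"], "->", pair["mr"])
```

For a file on disk, the same function can be given a handle opened with open(path, encoding="utf-8", newline="").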

Build With This

Create a government scheme explainer that translates and simplifies central government announcements into accessible Marathi
Develop a bilingual document alignment tool for Maharashtra government offices processing central directives
Build a translation memory system for government translators working on English-Marathi document pairs
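
As a starting point for the translation memory idea above, the sketch below stores English-Marathi pairs and returns the closest stored source sentence for a query. It uses difflib.SequenceMatcher from the standard library as a simple stand-in for fuzzy matching; production translation memories typically use more specialised similarity measures and indexing. The class name TranslationMemory, the 0.8 threshold, and the example sentences are all illustrative assumptions, not part of the corpus.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """Toy translation memory: English source -> Marathi target."""

    def __init__(self):
        self._entries = {}

    def add(self, en, mr):
        self._entries[en] = mr

    def lookup(self, query, threshold=0.8):
        """Return (score, en, mr) for the closest stored source, or None."""
        best = None
        for en, mr in self._entries.items():
            # Character-level similarity in [0, 1], case-insensitive
            score = SequenceMatcher(None, query.lower(), en.lower()).ratio()
            if best is None or score > best[0]:
                best = (score, en, mr)
        if best is not None and best[0] >= threshold:
            return best
        return None


tm = TranslationMemory()
# Illustrative pairs, not drawn from the corpus
tm.add("The scheme will benefit farmers.", "या योजनेचा शेतकऱ्यांना फायदा होईल.")
tm.add("The Prime Minister addressed the nation.", "पंतप्रधानांनी राष्ट्राला संबोधित केले.")

exact = tm.lookup("The scheme will benefit farmers.")
fuzzy = tm.lookup("The scheme will benefit all farmers.")
miss = tm.lookup("Completely unrelated text")
```

The linear scan is fine for a few thousand entries; for the full corpus, an index over the source sentences would be the more practical choice.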

AI Use Cases

English-Marathi machine translation training
Government domain language model fine-tuning
Cross-lingual transfer learning
Formal Marathi text generation

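Before using the pairs for machine translation training, it is common to drop empty, overlong, or badly mismatched pairs. The sketch below shows one such filter based on whitespace token counts. The function name keep_pair and the thresholds (a length ratio of 3.0 and a cap of 200 tokens) are assumptions chosen for illustration, not values recommended by the corpus authors, and the example pairs are made up.

```python
def keep_pair(en, mr, max_ratio=3.0, max_tokens=200):
    """Return True if the pair passes simple length-based sanity checks."""
    en_len = len(en.split())
    mr_len = len(mr.split())
    # Drop pairs where either side is empty
    if en_len == 0 or mr_len == 0:
        return False
    # Drop pairs where either side is very long
    if en_len > max_tokens or mr_len > max_tokens:
        return False
    # Drop pairs whose token counts differ by more than max_ratio
    ratio = max(en_len, mr_len) / min(en_len, mr_len)
    return ratio <= max_ratio


# Illustrative examples in the same shape as the schema fields
examples = [
    {"en": "Thank you.", "mr": "धन्यवाद."},  # 2 vs 1 tokens: kept
    {"en": "", "mr": "धन्यवाद."},  # empty English side: dropped
    {"en": "one two three four five six seven eight", "mr": "धन्यवाद."},  # 8 vs 1: dropped
]
kept = [ex for ex in examples if keep_pair(ex["en"], ex["mr"])]
print(len(kept), "of", len(examples), "pairs kept")
```

Thresholds like these are worth tuning on a sample of the data, since formal government text can contain legitimately long sentences.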
Last verified: 2026-03-09