English-Marathi parallel corpus extracted from the Prime Minister of India website (pmindia.gov.in), containing up to 56,000 aligned sentence pairs. It covers speeches, press releases, and official communications, making it a domain-specific resource for machine translation.
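
```python
from datasets import load_dataset

# Stream the English-Marathi pairs rather than downloading the whole split
ds = load_dataset('pmindia', 'en-mr', split='train', streaming=True)

# Preview the first five pairs, truncated to 60 characters each
for i, ex in enumerate(ds):
    print(f"EN: {ex['translation']['en'][:60]}...")
    print(f"MR: {ex['translation']['mr'][:60]}...\n")
    if i >= 4:
        break
```

| Field | Type | Description |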
|---|---|---|
| en | string | English sentence from the PM India website (read from the `translation` dictionary) |
| mr | string | Marathi translation of the English sentence (read from the `translation` dictionary) |
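
Since the loading example reads both fields from a nested `translation` dictionary, it can be convenient to flatten each record into a plain `(en, mr)` tuple before feeding it to a translation pipeline. The following is a minimal sketch under that assumption. The helper name `iter_pairs` is hypothetical (not part of the dataset or the `datasets` library), and the two sample records are made up for illustration, not taken from the corpus.

```python
from typing import Dict, Iterable, Iterator, Tuple


def iter_pairs(
    records: Iterable[Dict],
    src: str = "en",
    tgt: str = "mr",
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs, skipping incomplete records.

    Assumes each record looks like {"translation": {"en": ..., "mr": ...}},
    matching how the fields are accessed in the loading example.
    """
    for rec in records:
        translation = rec.get("translation") or {}
        source = (translation.get(src) or "").strip()
        target = (translation.get(tgt) or "").strip()
        # Drop pairs where either side is missing or whitespace-only
        if source and target:
            yield source, target


# Illustrative records only; real records come from the streamed dataset
sample = [
    {"translation": {"en": "Hello.", "mr": "नमस्कार."}},
    {"translation": {"en": "   ", "mr": "नमस्कार."}},
]
pairs = list(iter_pairs(sample))
print(pairs)
```

The same helper should accept the streamed dataset object directly, for example `iter_pairs(ds)`, because it only requires an iterable of dictionaries.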