A collection of English-Marathi parallel corpora from the OPUS project, with aligned text drawn from multiple sources, including JW300, GNOME, KDE4, Ubuntu, Tanzil (Quran), Bible, WikiMatrix, and CCAligned. The mix of religious text, software localization strings, and web and Wikipedia content gives broad domain coverage for machine translation training.
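
```python
from datasets import load_dataset

# Stream the English-Marathi training split instead of downloading it in full.
ds = load_dataset('Helsinki-NLP/opus-100', 'en-mr', split='train', streaming=True)

# Print the first five sentence pairs, truncated to 60 characters each.
for i, ex in enumerate(ds):
    print(f"EN: {ex['translation']['en'][:60]}...")
    print(f"MR: {ex['translation']['mr'][:60]}...\n")
    if i >= 4:
        break
```

| Field | Type | Description |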
|---|---|---|
| src | string | Source sentence (English or other language) |
| tgt | string | Target sentence in Marathi |
| corpus | string | Source corpus name (OpenSubtitles, WikiMatrix, etc.) |
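
Rows that follow the `src` / `tgt` / `corpus` schema in the table above can be cleaned and summarized per source corpus before training. The following is a minimal sketch, not part of the dataset's own tooling: the helper names `clean_rows` and `count_by_corpus` are hypothetical, and the sample rows are made up for illustration rather than taken from the corpus.

```python
from collections import Counter
from typing import Iterable, TypedDict


class Row(TypedDict):
    """One aligned pair, using the field names from the table above."""
    src: str     # source sentence
    tgt: str     # target sentence in Marathi
    corpus: str  # name of the originating corpus


def clean_rows(rows: Iterable[Row]) -> list[Row]:
    """Strip surrounding whitespace and drop rows where either side is empty."""
    cleaned: list[Row] = []
    for row in rows:
        src, tgt = row["src"].strip(), row["tgt"].strip()
        if src and tgt:
            cleaned.append({"src": src, "tgt": tgt, "corpus": row["corpus"]})
    return cleaned


def count_by_corpus(rows: Iterable[Row]) -> Counter:
    """Count how many rows each source corpus contributes."""
    return Counter(row["corpus"] for row in rows)


# Hypothetical sample rows, invented for this example only.
sample_rows: list[Row] = [
    {"src": "Open file", "tgt": "फाइल उघडा", "corpus": "GNOME"},
    {"src": "Save", "tgt": "  ", "corpus": "KDE4"},  # empty target, dropped
    {"src": " Thank you. ", "tgt": "धन्यवाद.", "corpus": "WikiMatrix"},
    {"src": "Close window", "tgt": "खिडकी बंद करा", "corpus": "GNOME"},
]

cleaned = clean_rows(sample_rows)
counts = count_by_corpus(cleaned)
print(counts.most_common())
```

Per-corpus counts like these make it easier to spot when a single domain, such as software localization strings, dominates the training mix.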