OPUS Marathi Parallel Corpora

A collection of parallel corpora from the OPUS project containing English-Marathi aligned text from multiple sources, including JW300, GNOME, KDE4, Ubuntu, Tanzil (Quran), Bible, WikiMatrix, and CCAligned. The sources span religious text, software localization strings, Wikipedia-mined sentences, and web-crawled pages, giving diverse domain coverage for machine translation training.

Build a multi-domain English-Marathi translation model by combining parallel corpora from different OPUS sources.

Quick Start

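from datasets import load_dataset

# Stream the English-Marathi config of OPUS-100 so the full dataset is not downloaded first
ds = load_dataset('Helsinki-NLP/opus-100', 'en-mr', split='train', streaming=True)

# Print the first five sentence pairs, truncated to 60 characters per side
for i, ex in enumerate(ds):
    print(f"EN: {ex['translation']['en'][:60]}...")
    print(f"MR: {ex['translation']['mr'][:60]}...\n")
    if i >= 4:
        break
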
Modality: text
Size: Multiple corpora; thousands to millions of sentence pairs per source
License:
Format: TMX / plain text / Moses format
Language: mr, en
Update Frequency: periodic
Organization: OPUS / NLPL
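
The Moses format listed above stores a corpus as two plain-text files in which line N of one file is the translation of line N of the other. A minimal sketch of reading such a pair of files follows; the function name read_moses_pairs and the file names in the usage note are hypothetical, and actual OPUS downloads may use different file naming.

```python
from itertools import zip_longest
from typing import Iterator, Tuple


def read_moses_pairs(src_path: str, tgt_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs from two line-aligned UTF-8 text files."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        # zip_longest pads the shorter file with None, which exposes misalignment
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            s, t = s.strip(), t.strip()
            if s and t:  # skip pairs where either side is empty
                yield s, t
```

Assuming two aligned files named corpus.en and corpus.mr (hypothetical names), list(read_moses_pairs("corpus.en", "corpus.mr")) would return the sentence pairs in file order. Raising on a length mismatch is a deliberate choice here: a silently truncated file would shift every later alignment.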

Schema

Field    Type    Description
src      string  Source sentence (English or other language)
tgt      string  Target sentence in Marathi
corpus   string  Source corpus name (OpenSubtitles, WikiMatrix, etc.)
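
The Quick Start example yields records shaped as a translation dictionary keyed by language code, while the schema above uses flat src, tgt, and corpus fields. The sketch below shows one way to map between the two and to combine several corpora while dropping exact duplicate pairs. PairRecord, to_record, and merge_corpora are hypothetical names introduced for illustration, not part of OPUS or the datasets library.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator


@dataclass(frozen=True)
class PairRecord:
    src: str     # source sentence
    tgt: str     # target sentence in Marathi
    corpus: str  # name of the corpus the pair came from


def to_record(example: Dict, corpus: str,
              src_lang: str = "en", tgt_lang: str = "mr") -> PairRecord:
    """Convert a {'translation': {'en': ..., 'mr': ...}} example to the flat schema."""
    pair = example["translation"]
    return PairRecord(src=pair[src_lang].strip(),
                      tgt=pair[tgt_lang].strip(),
                      corpus=corpus)


def merge_corpora(records: Iterable[PairRecord]) -> Iterator[PairRecord]:
    """Yield records in input order, keeping only the first copy of each (src, tgt) pair."""
    seen = set()
    for rec in records:
        key = (rec.src, rec.tgt)
        if key not in seen:
            seen.add(key)
            yield rec
```

Keeping the corpus field on every record means a later training step can still weight or select by domain after the corpora have been merged.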

Build With This

Create a domain-adaptive Marathi translation system that selects training data from the most relevant OPUS corpus for each input
Develop a parallel corpus quality filter that scores and ranks translation pairs from OPUS for cleaner model training
Build a Marathi subtitle generator for English video content using OPUS OpenSubtitles-trained translation models
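
As a starting point for the quality filter idea above, the sketch below scores a pair with two simple heuristics: a character-length ratio check and the share of target characters that fall in the Devanagari Unicode block (U+0900 to U+097F). Both heuristics, the max_ratio threshold of 3.0, and the function names are assumptions chosen for illustration; a production filter would likely add language identification or embedding-based similarity.

```python
from typing import Iterable, List, Tuple


def is_devanagari(ch: str) -> bool:
    """True if the character lies in the Devanagari Unicode block."""
    return "\u0900" <= ch <= "\u097f"


def devanagari_fraction(text: str) -> float:
    """Share of letter-like characters in text that are Devanagari."""
    relevant = [ch for ch in text if ch.isalpha() or is_devanagari(ch)]
    if not relevant:
        return 0.0
    return sum(1 for ch in relevant if is_devanagari(ch)) / len(relevant)


def pair_score(src: str, tgt: str, max_ratio: float = 3.0) -> float:
    """Score a pair in [0, 1]; higher means more plausible as an aligned pair."""
    src, tgt = src.strip(), tgt.strip()
    if not src or not tgt:
        return 0.0
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    # Full credit within the allowed ratio, decaying credit beyond it
    length_score = 1.0 if ratio <= max_ratio else max_ratio / ratio
    return length_score * devanagari_fraction(tgt)


def rank_pairs(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[float, str, str]]:
    """Return (score, src, tgt) tuples sorted from best to worst."""
    scored = [(pair_score(s, t), s, t) for s, t in pairs]
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

A pair whose target side is untranslated English scores 0.0 under this scheme, which catches a common failure in software localization corpora where strings are copied through unchanged.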

AI Use Cases

Domain-specific machine translation
Technical documentation translation (GNOME, KDE, Ubuntu)
Religious text translation and analysis
Cross-domain translation model training
Last verified: 2026-03-09