OPUS Marathi Parallel Corpora

Collection of parallel corpora from the OPUS project containing English-Marathi aligned text from multiple sources, including JW300, GNOME, KDE4, Ubuntu, Tanzil (Quran), Bible, WikiMatrix, and CCAligned. Together these cover religious text, software localization, Wikipedia, and web-crawled content, giving diverse domain coverage for machine translation training.

Build a multi-domain English-Marathi translation model by combining parallel corpora from different OPUS sources.
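
Combining sources usually means tagging each pair with its origin and dropping pairs that appear in more than one corpus. The following is a minimal sketch under those assumptions; `combine_corpora` is a hypothetical helper, and the in-memory pairs in the usage note stand in for data you have already loaded.

```python
from typing import Dict, Iterable, Iterator, Tuple


def combine_corpora(
    sources: Dict[str, Iterable[Tuple[str, str]]]
) -> Iterator[Dict[str, str]]:
    """Merge (english, marathi) pairs from several corpora, skipping duplicates.

    Duplicates are detected on whitespace-normalized text, with the English
    side lowercased. The first corpus that contains a pair keeps it.
    """
    seen = set()
    for corpus, pairs in sources.items():
        for en, mr in pairs:
            key = (" ".join(en.split()).lower(), " ".join(mr.split()))
            if key in seen:
                continue
            seen.add(key)
            yield {"src": en, "tgt": mr, "corpus": corpus}
```

For example, `list(combine_corpora({"GNOME": [("Open", "उघडा")], "KDE4": [("open", "उघडा")]}))` keeps only the GNOME record, because the KDE4 pair normalizes to the same key.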


Quick Start

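from datasets import load_dataset

# OPUS-100 is a sampled, English-centric subset of OPUS; 'en-mr' selects the English-Marathi pairs.
# streaming=True iterates over the data without downloading the full split first.
ds = load_dataset('Helsinki-NLP/opus-100', 'en-mr', split='train', streaming=True)

# Print the first five sentence pairs, truncated to 60 characters each.
for i, ex in enumerate(ds):
    print(f"EN: {ex['translation']['en'][:60]}...")
    print(f"MR: {ex['translation']['mr'][:60]}...\n")
    if i >= 4:
        break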

Modality

text

Size

Multiple corpora; thousands to millions of sentence pairs per source

License

Various (per sub-corpus)

Format

TMX / plain text / Moses format
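
In the Moses format, a corpus comes as two plain-text files with one sentence per line, where line n of the English file is aligned with line n of the Marathi file. A minimal sketch of reading such a pair of files follows; `read_moses_pairs` is a hypothetical helper, and the paths you pass in depend on which OPUS download you use.

```python
from itertools import zip_longest
from pathlib import Path
from typing import Iterator, Tuple


def read_moses_pairs(en_path: str, mr_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (english, marathi) pairs from two line-aligned text files.

    Raises ValueError if the files have different line counts, since that
    means the alignment is broken. Pairs with an empty side are skipped.
    """
    with Path(en_path).open(encoding="utf-8") as en_f, \
         Path(mr_path).open(encoding="utf-8") as mr_f:
        for en_line, mr_line in zip_longest(en_f, mr_f):
            if en_line is None or mr_line is None:
                raise ValueError("files have different numbers of lines")
            en, mr = en_line.strip(), mr_line.strip()
            if en and mr:
                yield en, mr
```

Reading lazily with a generator keeps memory use flat, which matters for the larger sources.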

Language

mr, en

Update Frequency

periodic

Organization

OPUS / NLPL

Schema

Field	Type	Description
src	string	Source sentence (English or other language)
tgt	string	Target sentence in Marathi
corpus	string	Source corpus name (OpenSubtitles, WikiMatrix, etc.)
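
Records loaded as in the Quick Start snippet store each pair in a nested `translation` dict keyed by language code, not in the flat `src` / `tgt` / `corpus` layout above. A minimal sketch of mapping one form to the other follows; `to_schema_record` is a hypothetical helper, and the corpus label has to be supplied by the caller because it is assumed not to be part of the loaded record.

```python
from typing import Dict


def to_schema_record(example: Dict, corpus: str,
                     src_lang: str = "en", tgt_lang: str = "mr") -> Dict[str, str]:
    """Flatten a {'translation': {lang: text}} record into src/tgt/corpus fields."""
    translation = example["translation"]
    return {
        "src": translation[src_lang].strip(),
        "tgt": translation[tgt_lang].strip(),
        "corpus": corpus,  # caller-supplied label, e.g. the name of the source corpus
    }
```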

Build With This

Create a domain-adaptive Marathi translation system that selects training data from the most relevant OPUS corpus for each input

Develop a parallel corpus quality filter that scores and ranks translation pairs from OPUS for cleaner model training

Build a Marathi subtitle generator for English video content using OPUS OpenSubtitles-trained translation models
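
As a starting point for the quality filter idea, the sketch below scores each pair with two simple heuristics: how close the two sides are in length, and whether the target is actually written in Devanagari while the source is not. Both heuristics and the way they are multiplied together are illustrative assumptions, not an established OPUS filtering method, and all function names are hypothetical.

```python
from typing import List, Tuple


def devanagari_fraction(text: str) -> float:
    """Fraction of letters that fall in the Devanagari block (U+0900 to U+097F)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0900" <= ch <= "\u097f" for ch in letters) / len(letters)


def score_pair(src: str, tgt: str) -> float:
    """Heuristic quality score in [0, 1]; higher means more plausible."""
    if not src or not tgt:
        return 0.0
    # Character-length ratio: badly mismatched lengths suggest misalignment.
    length_ratio = min(len(src), len(tgt)) / max(len(src), len(tgt))
    # Script check: English source should have no Devanagari, Marathi target mostly Devanagari.
    script_score = (1.0 - devanagari_fraction(src)) * devanagari_fraction(tgt)
    return length_ratio * script_score


def rank_pairs(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return pairs sorted from highest to lowest score."""
    return sorted(pairs, key=lambda p: score_pair(*p), reverse=True)
```

A pair whose target was left untranslated, such as `("Hello", "Hello")`, scores zero under the script check and sorts to the end, so a threshold on the score can drop it before training.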

AI Use Cases

Domain-specific machine translation

Technical documentation translation (GNOME, KDE, Ubuntu)

Religious text translation and analysis

Cross-domain translation model training

Related Datasets

AI4Bharat BPCC (mr)

parallel-text

AI4Bharat IndicCorp v1 (mr)

text

AI4Bharat IndicCorp v2 (Marathi)

text

AI4Bharat IndicGLUE (mr)

text

Last verified: 2026-03-09