The largest publicly available English-Marathi parallel corpus, with 3.32 million sentence pairs for machine translation.

```python
from datasets import load_dataset

ds = load_dataset("ai4bharat/samanantar", "mr")
print(ds["train"][0])
# {'src': 'English sentence', 'tgt': 'मराठी वाक्य', ...}
```

| Field | Type | Description |
|---|---|---|
| src | string | Source sentence in English |
| tgt | string | Parallel translation in Marathi |
| src_lang | string | Source language code (en) |
| tgt_lang | string | Target language code (mr) |
| data_source | string | Origin corpus the sentence pair was mined from |
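
The fields above can be turned into training pairs with a few lines of plain Python. The sketch below is illustrative only: it assumes records carry exactly the fields listed in the table, and the sample sentences, the `data_source` values (`source_a`, `source_b`), and the helper names `is_usable` and `to_translation_pair` are hypothetical, not part of the dataset or of any library.

```python
from collections import Counter

# Hypothetical records shaped like the field table above. The sentences and
# data_source values are invented for illustration, not taken from the corpus.
records = [
    {"src": "Good morning.", "tgt": "सुप्रभात.",
     "src_lang": "en", "tgt_lang": "mr", "data_source": "source_a"},
    {"src": "Thank you.", "tgt": "धन्यवाद.",
     "src_lang": "en", "tgt_lang": "mr", "data_source": "source_b"},
    {"src": "   ", "tgt": "रिकामे",
     "src_lang": "en", "tgt_lang": "mr", "data_source": "source_a"},
]


def is_usable(record):
    """Keep only pairs where both sides contain non-whitespace text."""
    return bool(record["src"].strip()) and bool(record["tgt"].strip())


def to_translation_pair(record):
    """Map a record to {"translation": {src_lang: src, tgt_lang: tgt}}."""
    return {
        "translation": {
            record["src_lang"]: record["src"].strip(),
            record["tgt_lang"]: record["tgt"].strip(),
        }
    }


usable = [r for r in records if is_usable(r)]
pairs = [to_translation_pair(r) for r in usable]

# Count how many usable pairs came from each origin corpus.
source_counts = Counter(r["data_source"] for r in usable)

print(pairs[0])
print(source_counts)
```

If the loaded dataset exposes the same fields, the two helpers could likely be passed to `ds["train"].filter(is_usable)` and `ds["train"].map(to_translation_pair)`; check the actual column names first, since this sketch relies only on the table above.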