Large-scale synthetic dataset for LLM post-training in 13 Indic languages, including Marathi. Combines translated reasoning data with synthesized open-domain generative content, and applies multi-stage quality checks (language identification, word-repetition filtering). Designed for LLM alignment and instruction following.
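
The two quality checks named above can be illustrated with a minimal sketch. This is not the dataset's actual pipeline: the function names, the thresholds (`min_script`, `max_repeat`), and the use of a Devanagari script ratio as a stand-in for language identification are all assumptions made for illustration. Devanagari is shared with Hindi and other languages, so a script check is only a rough proxy for identifying Marathi.

```python
import re
from collections import Counter

# Devanagari Unicode block (U+0900 to U+097F), the script Marathi is written in.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")


def devanagari_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Devanagari block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if DEVANAGARI.match(c)) / len(chars)


def repetition_ratio(text: str) -> float:
    """Share of all words taken up by the single most frequent word."""
    words = text.split()
    if not words:
        return 0.0
    return max(Counter(words).values()) / len(words)


def passes_checks(text: str, min_script: float = 0.5, max_repeat: float = 0.3) -> bool:
    """Hypothetical filter: enough Devanagari text and no dominant repeated word."""
    return devanagari_ratio(text) >= min_script and repetition_ratio(text) <= max_repeat


samples = [
    "मी आज शाळेत जातो आणि पुस्तक वाचतो",   # varied Marathi sentence
    "This is an English sentence",          # wrong script
    "नमस्कार नमस्कार नमस्कार नमस्कार",        # one word repeated
]
kept = [s for s in samples if passes_checks(s)]
```

With these illustrative thresholds only the first sample is kept. Very short texts (three words or fewer) always exceed `max_repeat`, so a real filter would likely need a minimum-length guard and a proper language-identification model.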
# Microsoft Updesh Marathi corpus
from datasets import load_dataset
print("Access Microsoft Updesh from HuggingFace or Microsoft Research")
print("Filter for Marathi language subset")

| Field | Type | Description |
|---|---|---|
| audio | audio | Conversational Marathi speech recording |
| text | string | Transcription of the speech |