Microsoft Updesh (Marathi LLM Post-Training)

Microsoft Updesh (Marathi LLM Post-Training)

Large-scale synthetic dataset for LLM post-training in 13 Indic languages including Marathi. Integrates translated reasoning data and synthesized open-domain generative content with multi-stage quality checks (language ID, word repetition filtering). Designed for LLM alignment and instruction-following.

Build a conversational Marathi ASR system for voice assistants that handles natural, unscripted speech patterns.
HomepageHuggingFace

Quick Start

# Microsoft Updesh Marathi corpus
from datasets import load_dataset
print("Access Microsoft Updesh from HuggingFace or Microsoft Research")
print("Filter for Marathi language subset")
Modality
text
Size
~8.9M data points across 13 languages; significant Marathi subset
License
Format
JSON / Parquet
Language
mr
Update Frequency
static
Organization
Microsoft Research

Schema

FieldTypeDescription
audioaudioConversational Marathi speech recording
textstringTranscription of the speech

Build With This

Create a Marathi conversation summarizer that transcribes and summarizes informal phone conversations
Develop a Marathi speaker turn detection system for multi-party conversational speech
Build a conversational AI training pipeline using transcribed Marathi dialogue data for chatbot development

AI Use Cases

Marathi LLM fine-tuning and alignmentInstruction-following capability trainingReasoning task training in MarathiOpen-domain Marathi content generation
Last verified: 2026-03-09