Microsoft Updesh (Marathi LLM Post-Training)

Large-scale synthetic dataset for LLM post-training in 13 Indic languages including Marathi. Integrates translated reasoning data and synthesized open-domain generative content with multi-stage quality checks (language ID, word repetition filtering). Designed for LLM alignment and instruction-following.
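
The two quality checks named above (language ID and word repetition filtering) can be approximated with a short sketch. This is not Microsoft's pipeline: the function names, the thresholds, and the use of the Devanagari Unicode block as a stand-in for language ID are all assumptions made for illustration. Devanagari is shared with Hindi and other languages, so a real pipeline would use a trained language identifier.

```python
from collections import Counter


def devanagari_ratio(text: str) -> float:
    """Share of letter-like characters that fall in the Devanagari block (U+0900-U+097F)."""
    deva = sum(1 for ch in text if "\u0900" <= ch <= "\u097F")
    other = sum(1 for ch in text if ch.isalpha() and not ("\u0900" <= ch <= "\u097F"))
    total = deva + other
    return deva / total if total else 0.0


def max_word_share(text: str) -> float:
    """Share of the text taken up by its single most frequent whitespace-separated word."""
    words = text.split()
    if not words:
        return 1.0  # treat empty text as fully repetitive so it gets rejected
    return Counter(words).most_common(1)[0][1] / len(words)


def passes_checks(text: str, min_deva: float = 0.8, max_share: float = 0.3) -> bool:
    """Hypothetical filter: mostly Devanagari and not dominated by one repeated word."""
    return devanagari_ratio(text) >= min_deva and max_word_share(text) <= max_share


samples = [
    "मी आज पुस्तक वाचत आहे",             # Marathi sentence, no repetition
    "This is an English sentence here",  # wrong script
    "छान छान छान छान छान",               # one word repeated
]
kept = [s for s in samples if passes_checks(s)]
```

Both thresholds (0.8 and 0.3) are arbitrary starting points and would need tuning on real samples.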

Fine-tune a Marathi instruction-following assistant that handles reasoning tasks and open-domain generation.

Homepage HuggingFace

Quick Start

# Microsoft Updesh Marathi corpus
from datasets import load_dataset
print("Access Microsoft Updesh from HuggingFace or Microsoft Research")
print("Filter for Marathi language subset")
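
The "filter for Marathi" step can be sketched on plain Python records, without assuming how the dataset is hosted. The `language` field name and the sample rows are hypothetical, since this page does not confirm the text schema. The codes `mr`, `mar`, and `mar_Deva` are the ISO 639-1, ISO 639-3, and FLORES-style identifiers for Marathi, and which one the dataset uses is also an assumption.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Common identifiers for Marathi; the dataset may use a different convention.
MARATHI_CODES: Tuple[str, ...] = ("mr", "mar", "mar_Deva")


def marathi_only(
    records: Iterable[Dict[str, str]],
    lang_field: str = "language",  # hypothetical field name
    lang_codes: Tuple[str, ...] = MARATHI_CODES,
) -> Iterator[Dict[str, str]]:
    """Yield only the records whose language field matches a Marathi code."""
    for record in records:
        if record.get(lang_field) in lang_codes:
            yield record


# Illustrative rows only; not taken from the dataset.
sample_records = [
    {"language": "mr", "text": "नमस्कार"},
    {"language": "hi", "text": "नमस्ते"},
    {"language": "mar_Deva", "text": "धन्यवाद"},
    {"text": "no language tag"},
]
marathi_rows = list(marathi_only(sample_records))
```

With the Hugging Face `datasets` library, the same predicate could be passed to `Dataset.filter` once the real field name is known.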

Modality

text

Size

~8.9M data points across 13 languages; significant Marathi subset

License

Microsoft Research License

Format

JSON / Parquet

Language

Marathi (one of the 13 Indic languages covered)

Update Frequency

static

Organization

Microsoft Research

Schema

Field	Type	Description
audio	audio	Conversational Marathi speech recording
text	string	Transcription of the speech

Build With This

Fine-tune a Marathi instruction-following assistant on the Marathi subset

Train a model on reasoning tasks in Marathi using the translated reasoning data

Build a conversational AI training pipeline using the synthesized Marathi generative data for chatbot development

AI Use Cases

Marathi LLM fine-tuning and alignment

Instruction-following capability training

Reasoning task training in Marathi

Open-domain Marathi content generation

Related Datasets

AI4Bharat IndicQA

Text (Marathi)

Government Scheme Documents for RAG

Text (PDF, web)

Maharashtra Government Resolutions (mahGRs)

Text (Marathi + English)

Marathi Alpaca Instruction Dataset

Text (Marathi)

Last verified: 2026-03-09