Hindi-Marathi Code-Switching ASR Dataset

MH Specific

A 450-hour annotated dataset of Hindi-Marathi code-switched speech, covering tag-switching, intra-sentential, and inter-sentential switching patterns. It is designed for automatic speech recognition (ASR) in the multilingual settings common in Maharashtra, where Hindi-Marathi mixing is prevalent.
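
To make the switching patterns named above concrete, the following is a minimal sketch that separates inter-sentential from intra-sentential switching, given per-token language codes for each sentence. The function name `classify_switching` and the input layout (one list of `hi`/`mr` codes per sentence) are assumptions made for illustration and are not part of the dataset. Tag-switching is not detected here, since recognizing short inserted tags would need a lexicon or extra annotation.

```python
from typing import List


def classify_switching(sentences: List[List[str]]) -> str:
    """Classify an utterance from per-token language codes.

    `sentences` holds one list of language codes (e.g. "hi", "mr") per
    sentence. This input layout is hypothetical, not the dataset's format.
    """
    # Any sentence containing more than one language is an intra-sentential switch.
    if any(len(set(sentence)) > 1 for sentence in sentences):
        return "intra-sentential"
    # Otherwise every sentence is monolingual; compare languages across sentences.
    languages = {sentence[0] for sentence in sentences if sentence}
    if len(languages) > 1:
        return "inter-sentential"
    return "monolingual"


print(classify_switching([["mr", "mr"], ["hi", "hi"]]))
print(classify_switching([["mr", "hi", "mr"]]))
```

Under these assumptions, the first call describes a Marathi sentence followed by a Hindi sentence, and the second a single sentence that mixes both languages.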

Build an ASR system that handles Hindi-Marathi code-switching for real-world conversational settings in Maharashtra.

Quick Start

# Hindi-Marathi code-switching ASR dataset
# Access from respective research paper/repository
print("Hindi-Marathi code-switching ASR dataset")
print("Check paper references for download instructions")

Modality

audio

Size

450 hours of annotated speech, balanced across code-switching patterns

License

Research

Format

Audio + transcriptions

Language

mr, hi

Update Frequency

static

Organization

Research

Schema

Field	Type	Description
audio	audio	Code-switched Hindi-Marathi speech audio
transcription	string	Transcription with language tags
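
The schema says transcriptions carry language tags but does not specify the tag syntax. The sketch below assumes inline tags of the form `<hi>...</hi>` and `<mr>...</mr>`; that format, the helper names, and the romanized example string are all hypothetical and should be adapted to the actual annotation convention described in the dataset's paper.

```python
import re
from typing import List, Tuple

# ASSUMPTION: inline tags look like <hi>...</hi> and <mr>...</mr>.
# The real tag format is not documented here; adjust the pattern as needed.
TAG_PATTERN = re.compile(r"<(hi|mr)>(.*?)</\1>", re.DOTALL)


def parse_tagged_transcription(text: str) -> List[Tuple[str, str]]:
    """Return (language, segment) pairs in order of appearance."""
    return [(lang, segment.strip()) for lang, segment in TAG_PATTERN.findall(text)]


def tokens_with_language(text: str) -> List[Tuple[str, str]]:
    """Split each segment on whitespace and label every token with its language."""
    return [
        (token, lang)
        for lang, segment in parse_tagged_transcription(text)
        for token in segment.split()
    ]


# Invented, romanized example for illustration only (not taken from the dataset).
example = "<mr>mi udya</mr> <hi>office jaunga</hi>"
print(parse_tagged_transcription(example))
print(tokens_with_language(example))
```

Token-level language labels of this kind are a convenient intermediate form for language identification or for locating switch points.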

Build With This

Create a bilingual meeting transcription system for Maharashtra offices where Hindi-Marathi switching is common

Develop a code-switching language model that predicts switching points between Hindi and Marathi in speech

Build a customer service bot that understands mixed Hindi-Marathi queries from Mumbai and Pune callers
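
For the switching-point idea above, one possible way to derive training targets is sketched here, assuming a per-token sequence of language codes is already available. The function names `switch_points` and `switch_labels` are hypothetical, and the labeling scheme (mark a token with 1 when the next token is in the other language) is one design choice among several.

```python
from typing import List


def switch_points(langs: List[str]) -> List[int]:
    """Indices i where token i is in a different language from token i - 1."""
    return [i for i in range(1, len(langs)) if langs[i] != langs[i - 1]]


def switch_labels(langs: List[str]) -> List[int]:
    """Binary target per token: 1 if the next token switches language, else 0."""
    if not langs:
        return []
    # The final token has no successor, so its label is always 0.
    return [int(langs[i] != langs[i + 1]) for i in range(len(langs) - 1)] + [0]


# Hypothetical per-token language codes for one utterance.
langs = ["mr", "mr", "hi", "hi", "mr"]
print(switch_points(langs))
print(switch_labels(langs))
```

Labeling the token before the switch frames the task as next-step prediction, which fits a language model that must anticipate a switch rather than detect it after the fact.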

AI Use Cases

Code-switched speech recognition for Maharashtra

Bilingual virtual assistant development

Language identification in mixed speech

Multilingual contact center transcription

Related Datasets

AI4Bharat BhasaAnuvaad (Marathi)

Speech + Text (Translation)

AI4Bharat IndicVoices

Speech + Text

AI4Bharat IndicVoices-R

Speech + Text (TTS-ready)

AI4Bharat Kathbath

Speech + Text

Last verified: 2026-03-09