Hindi-Marathi Code-Switching ASR Dataset

MH Specific

A 450-hour annotated dataset of Hindi-Marathi code-switched speech, covering tag-switching, intra-sentential, and inter-sentential code-mixing patterns. It is designed for automatic speech recognition in the multilingual settings common in Maharashtra, where speakers frequently mix Hindi and Marathi.

Build an ASR system that handles Hindi-Marathi code-switching for real-world conversational settings in Maharashtra.

Quick Start

# Hindi-Marathi code-switching ASR dataset
# Access from respective research paper/repository
print("Hindi-Marathi code-switching ASR dataset")
print("Check paper references for download instructions")

Modality: audio
Size: 450 hours of annotated speech; balanced code-switching patterns
License: not specified
Format: Audio + transcriptions
Language: mr, hi
Update Frequency: static
Organization: Research

Schema

Field          Type    Description
audio          audio   Code-switched Hindi-Marathi speech audio
transcription  string  Transcription with language tags
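
The schema says the transcription carries language tags but does not say how they are written. As a minimal sketch, assume the tags are inline markers such as [hi] and [mr] placed before each span; that marker format, the function name split_by_language, and the example sentence are all assumptions made for illustration, not the dataset's documented format. Under that assumption, a transcription can be split into per-language segments like this:

```python
import re
from typing import List, Tuple

# ASSUMPTION: language tags appear as inline markers "[hi]" / "[mr]".
# The real dataset may use a different tagging convention.
TAG_PATTERN = re.compile(r"\[(hi|mr)\]")


def split_by_language(transcription: str) -> List[Tuple[str, str]]:
    """Return (language, text) segments from a tagged transcription."""
    # re.split with one capture group yields:
    # [prefix, tag, text, tag, text, ...]
    parts = TAG_PATTERN.split(transcription)
    segments = []
    for lang, text in zip(parts[1::2], parts[2::2]):
        text = text.strip()
        if text:  # skip empty spans between adjacent tags
            segments.append((lang, text))
    return segments


# Invented, romanized example sentence (not taken from the dataset).
example = "[hi] mujhe kal [mr] office la jaycha aahe"
segments = split_by_language(example)
print(segments)
```

If the actual tags turn out to be per-word or XML-style, only TAG_PATTERN and the pairing logic would need to change; the (language, text) output shape can stay the same.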

Build With This

Create a bilingual meeting transcription system for Maharashtra offices where Hindi-Marathi switching is common
Develop a code-switching language model that predicts switching points between Hindi and Marathi in speech
Build a customer service bot that understands mixed Hindi-Marathi queries from Mumbai and Pune callers
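
For the switching-point idea above, one possible way to frame the training targets is as a binary label per token boundary: 1 where the language changes, 0 where it stays the same. The sketch below assumes you already have one language label per token (how to obtain those from this dataset is not specified on this page), and the function name switch_point_labels is hypothetical.

```python
from typing import List


def switch_point_labels(token_langs: List[str]) -> List[int]:
    """Label each boundary between adjacent tokens.

    The result has len(token_langs) - 1 entries; entry i is 1 when
    token i and token i + 1 are in different languages, else 0.
    """
    return [int(a != b) for a, b in zip(token_langs, token_langs[1:])]


# Illustrative per-token language labels (not taken from the dataset).
langs = ["hi", "hi", "mr", "mr", "mr", "hi"]
labels = switch_point_labels(langs)
print(labels)  # [0, 1, 0, 0, 1]
```

A model trained on such labels would predict, at each token, whether the speaker switches language before the next one; other formulations (for example, predicting the next token's language directly) are equally reasonable.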

AI Use Cases

Code-switched speech recognition for Maharashtra
Bilingual virtual assistant development
Language identification in mixed speech
Multilingual contact center transcription
Last verified: 2026-03-09