Speech recognition, text-to-speech, and audio datasets covering Marathi and other Indian languages.
18 datasets
Largest Indic speech translation dataset with curated, web-mined, and synthetic speech-text pairs for 13 Indian languages
Large-scale multilingual speech corpus with read, extempore, and conversational audio across 22 Indian languages including Marathi, totaling 7,348 hours.
ASR-enhanced high-quality TTS corpus for 22 Indian languages; subset of IndicVoices optimized for speech synthesis
Crowdsourced human-labeled ASR dataset for 12 Indian languages including Marathi, collected via the Karya platform
Crowdsourced human-labeled ASR dataset for 12 Indian languages including Marathi, collected via the Karya platform with approximately 140 hours per language.
Expressive multilingual TTS dataset with neutral and emotional speech (6 Ekman emotions) for 22 Indian languages
Pseudo-labeled ASR corpus mined from All India Radio news bulletins for 12 Indian languages
Pseudo-labeled ASR corpus mined from All India Radio news bulletins for 12 Indian languages, with approximately 1,020 hours of Marathi speech data.
Few-shot speech benchmark derived from the FLoRes machine translation benchmark; read speech in 102 languages including Marathi (mr_in)
450-hour annotated dataset of Hindi-Marathi code-switching speech, including tag-switching, intra-sentential, and inter-sentential code-mixing patterns. Designed for automatic speech recognition in multilingual contexts common in Maharashtra where Hindi-Marathi mixing is prevalent.
Studio-quality single-speaker TTS corpus with male and female voices, recorded in Marathi and English by native speakers
5-second MP3 audio samples across 10 Indian languages including Marathi, sourced from YouTube regional videos. Designed for spoken language identification and audio classification tasks rather than ASR transcription.
Crowdsourced conversational Marathi speech from three user demographics (rural, urban, student)
Crowdsourced read-speech recordings with validated transcriptions for Marathi
Crowdsourced read-speech recordings with validated transcriptions for Marathi, with approximately 30 hours in total, of which 21 hours are validated; part of Mozilla's open voice dataset initiative.
Multilingual and code-switching ASR challenge dataset with Marathi speech from diverse speaker groups (college students, rural/urban workers)
Crowdsourced high-quality multi-speaker Marathi speech corpus for TTS; female speakers only
Part of the largest publicly available dialect-rich read-speech corpus for Indian languages, comprising 10,000+ hours of validated audio across 9 languages. The Marathi subset covers the agriculture and finance domains and includes dialect-aware phonetic lexicons and speaker metadata. Captures rural speech patterns that urban-centric datasets miss.