Speech & Audio

Speech recognition, text-to-speech, and audio datasets covering Marathi and other Indian languages.

18 datasets

Largest Indic speech translation dataset with curated, web-mined, and synthetic speech-text pairs for 13 Indian languages.

Build a Marathi speech-to-speech translation system for real-time interpretation in multilingual Maharashtra settings.
Speech + Text (Translation)
AI4Bharat, IIT Madras

Large-scale multilingual speech corpus with read, extempore, and conversational audio across 22 Indian languages including Marathi, totaling 7,348 hours.

Build a Marathi voice assistant that handles read, extempore, and conversational speech styles for agricultural advisory services.
Speech + Text
AI4Bharat

ASR-enhanced, high-quality TTS corpus for 22 Indian languages; a subset of IndicVoices optimized for speech synthesis.

Build a Marathi read-speech ASR model optimized for formal reading scenarios like news broadcasting and audiobook narration.
Speech + Text (TTS-ready)
AI4Bharat, IIT Madras

Crowdsourced, human-labeled ASR dataset for 12 Indian languages including Marathi, collected via the Karya platform.

Build a Marathi ASR system with human-verified labels for high-accuracy speech recognition in formal settings.
Speech + Text
AI4Bharat, IIT Madras

Crowdsourced human-labeled ASR dataset for 12 Indian languages including Marathi, collected via the Karya platform with approximately 140 hours per language.

Build a production-ready Marathi speech-to-text API for customer service automation in Maharashtra banks and telecom companies.
Speech + Text
AI4Bharat

Expressive multilingual TTS dataset with neutral and emotional speech (6 Ekman emotions) for 22 Indian languages.

Build a Marathi speech emotion recognition system for call center analytics to detect customer sentiment from voice.
Speech + Text (Expressive TTS)
AI4Bharat, IIT Madras

Pseudo-labeled ASR corpus mined from All India Radio news bulletins for 12 Indian languages.

Build a news-domain Marathi ASR model trained on radio broadcast speech for automated news transcription.
Speech + Text
AI4Bharat, IIT Madras

Pseudo-labeled ASR corpus mined from All India Radio news bulletins for 12 Indian languages, with approximately 1,020 hours of Marathi speech data.

Build a Marathi news transcription service that automatically generates text from radio broadcasts for rural information dissemination.
Speech + Text
AI4Bharat

Few-shot speech benchmark derived from the FLoRes machine translation benchmark, with read speech in 102 languages including Marathi (mr_in).

Benchmark Marathi ASR models against published multilingual results using the FLEURS evaluation set.
Speech + Text
Google Research
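
Benchmarking an ASR model on an evaluation set usually comes down to a word error rate (WER) over reference and hypothesis transcripts. The sketch below is a minimal, pure-Python illustration of corpus-level WER using word-level edit distance. The function names `word_edit_distance` and `corpus_wer` are hypothetical and not part of any FLEURS tooling, and the code assumes transcripts are already normalized (punctuation, numerals, and Unicode form) and split on whitespace.

```python
from typing import Iterable, Sequence, Tuple


def word_edit_distance(ref_words: Sequence[str], hyp_words: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences (sub/ins/del cost 1)."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    total_errors = 0
    total_ref_words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        total_errors += word_edit_distance(ref_words, hypothesis.split())
        total_ref_words += len(ref_words)
    if total_ref_words == 0:
        raise ValueError("references contain no words")
    return total_errors / total_ref_words


if __name__ == "__main__":
    # Illustrative (reference, hypothesis) pairs, not taken from any dataset.
    example_pairs = [("मी घरी जातो", "मी घरी जाते")]
    print(f"WER: {corpus_wer(example_pairs):.3f}")
```

Summing errors over the whole corpus before dividing, rather than averaging per-utterance WER, keeps short utterances from dominating the score.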

450-hour annotated dataset of Hindi-Marathi code-switching speech, including tag-switching, intra-sentential, and inter-sentential code-mixing patterns. Designed for automatic speech recognition in multilingual contexts common in Maharashtra where Hindi-Marathi mixing is prevalent.

Build an ASR system that handles Hindi-Marathi code-switching for real-world conversational settings in Maharashtra.
Audio
Research

Studio-quality single-speaker TTS corpus with male and female recordings in Marathi and English by native speakers.

Build a natural-sounding Marathi text-to-speech system using IndicTTS studio recordings for voice assistant applications.
Speech + Text (TTS)
IIT Madras

5-second MP3 audio samples across 10 Indian languages including Marathi, sourced from YouTube regional videos. Designed for spoken language identification and audio classification tasks rather than ASR transcription.

Build a multilingual Indian language identification system from audio that includes Marathi detection.
Audio
Independent researcher (Kaggle)
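
A language identification model trained on fixed 5-second clips generally expects inference audio cut to the same length. The sketch below shows one way to split a longer recording into fixed-length clips, zero-padding the final partial clip. The helper name `split_into_clips` and its parameters are hypothetical, and the code assumes the audio has already been decoded into a flat sequence of mono samples at a known sample rate.

```python
from typing import List, Sequence


def split_into_clips(
    samples: Sequence[float],
    sample_rate: int,
    clip_seconds: float = 5.0,
    pad_value: float = 0.0,
) -> List[List[float]]:
    """Split mono audio samples into fixed-length clips, padding the last one."""
    if sample_rate <= 0 or clip_seconds <= 0:
        raise ValueError("sample_rate and clip_seconds must be positive")
    clip_len = int(round(sample_rate * clip_seconds))
    if clip_len == 0:
        raise ValueError("clip length rounds to zero samples")
    clips: List[List[float]] = []
    for start in range(0, len(samples), clip_len):
        clip = list(samples[start:start + clip_len])
        # Pad the trailing clip so every clip has exactly clip_len samples.
        if len(clip) < clip_len:
            clip.extend([pad_value] * (clip_len - len(clip)))
        clips.append(clip)
    return clips


if __name__ == "__main__":
    # 12 seconds of silence at an assumed 16 kHz sample rate -> three 5-second clips.
    recording = [0.0] * (16000 * 12)
    clips = split_into_clips(recording, sample_rate=16000)
    print(len(clips), len(clips[0]))
```

Padding keeps the tail of a recording usable; dropping the partial clip instead is an equally reasonable choice when very short remainders hurt accuracy.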

Crowdsourced conversational Marathi speech from three user demographics (rural, urban, student).

Build a robust Marathi ASR model using the Microsoft-IITB corpus for enterprise voice applications.
Speech + Text
Microsoft Research India / IIT Bombay

Crowd-sourced read-speech recordings with validated transcriptions for Marathi.

Build a community-driven Marathi speech recognition model using validated Common Voice recordings.
Speech + Text
Mozilla Foundation

Crowd-sourced read-speech recordings with validated transcriptions for Marathi, with approximately 30 hours total and 21 hours validated, part of Mozilla's open voice dataset initiative.

Build a Marathi voice assistant for farmers.
Speech + Text
Mozilla Foundation

Multilingual and code-switching ASR challenge dataset with Marathi speech from diverse speaker groups (college students, rural and urban workers).

Build a competition-grade Marathi ASR model using MUCS 2021 challenge data for benchmarking against other systems.
Speech + Text
MediaEval / MUCS Challenge Organizers

Crowdsourced, high-quality multi-speaker Marathi speech corpus for TTS; female speakers only.

Build a baseline Marathi ASR model using the OpenSLR-64 corpus for comparison with larger training sets.
Speech + Text (TTS)
OpenSLR / Google

Part of the largest publicly available dialect-rich read-speech corpus for Indian languages, comprising 10,000+ hours of validated audio across 9 languages. The Marathi subset covers the agriculture and finance domains and includes dialect-aware phonetic lexicons and speaker metadata. It captures rural speech patterns that urban-centric datasets miss.

Build a dialect-robust Marathi ASR model that performs well across regional speech varieties in Maharashtra.
Audio
IISc Bangalore / SPIRE Lab