Speech & Audio - Awesome Marathi Datasets

Speech & Audio

Speech recognition, text-to-speech, and audio datasets covering Marathi and other Indian languages.

17 datasets

Largest Indic speech translation dataset with curated, web-mined, and synthetic speech-text pairs for 13 Indian languages

Build a Marathi speech-to-speech translation system for real-time interpretation in multilingual Maharashtra settings.

Speech + Text (Translation) | CC-BY 4.0

AI4Bharat, IIT Madras

AI4Bharat IndicVoices

Large-scale multilingual speech corpus with read, extempore, and conversational audio across 22 Indian languages including Marathi, totaling 7,348 hours.

Build a Marathi voice assistant that handles read, extempore, and conversational speech styles for agricultural advisory services.

speech+text | CC-BY-4.0

AI4Bharat

AI4Bharat IndicVoices-R

ASR-enhanced high-quality TTS corpus for 22 Indian languages; subset of IndicVoices optimized for speech synthesis

Build a Marathi text-to-speech model for formal read-speech scenarios such as news broadcasting and audiobook narration.

Speech + Text (TTS-ready) | CC-BY 4.0

AI4Bharat, IIT Madras

AI4Bharat Kathbath

Crowdsourced human-labeled ASR dataset for 12 Indian languages including Marathi, collected via the Karya platform

Build a Marathi ASR system with human-verified labels for high-accuracy speech recognition in formal settings.

Speech + Text | CC-0

AI4Bharat, IIT Madras

AI4Bharat Kathbath (Marathi)

Crowdsourced human-labeled ASR dataset for 12 Indian languages including Marathi, collected via the Karya platform with approximately 140 hours per language.

Build a production-ready Marathi speech-to-text API for customer service automation in Maharashtra banks and telecom companies.

speech+text | CC0-1.0

AI4Bharat

AI4Bharat Rasa (Marathi)

Expressive multilingual TTS dataset with neutral and emotional speech (6 Ekman emotions) for 22 Indian languages

Build a Marathi speech emotion recognition system for call center analytics to detect customer sentiment from voice.

Speech + Text (Expressive TTS) | CC-BY 4.0

AI4Bharat, IIT Madras

AI4Bharat Shrutilipi

Pseudo-labeled ASR corpus mined from All India Radio news bulletins for 12 Indian languages

Build a news-domain Marathi ASR model trained on radio broadcast speech for automated news transcription.

Speech + Text | CC-BY 4.0

AI4Bharat, IIT Madras

AI4Bharat Shrutilipi (Marathi)

Pseudo-labeled ASR corpus mined from All India Radio news bulletins for 12 Indian languages, with approximately 1,020 hours of Marathi speech data.

Build a Marathi news transcription service that automatically generates text from radio broadcasts for rural information dissemination.

speech+text | CC-BY-4.0

AI4Bharat

Google FLEURS (Marathi)

Few-shot speech benchmark derived from FLoRes MT benchmark; read-speech in 102 languages including Marathi (mr_in)

Benchmark Marathi ASR models against international standards using the FLEURS evaluation set.

Speech + Text | CC-BY 4.0

Google Research
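Benchmarking an ASR model on an evaluation set such as this one usually means reporting word error rate (WER). The following is a minimal sketch of that metric in plain Python, not an official FLEURS scoring script. The function name `word_error_rate` and the example sentences are illustrative assumptions, and the sketch splits on whitespace without any text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example pair: one substituted word out of four.
reference = "मी आज शाळेत जातो"
hypothesis = "मी आज शाळेत जाते"
print(word_error_rate(reference, hypothesis))  # 0.25
```

In practice, scores are only comparable across systems when the same text normalization (punctuation, numerals, Unicode form) is applied to both reference and hypothesis before scoring.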

Hindi-Marathi Code-Switching ASR Dataset

450-hour annotated dataset of Hindi-Marathi code-switching speech, including tag-switching, intra-sentential, and inter-sentential code-mixing patterns. Designed for automatic speech recognition in multilingual contexts common in Maharashtra where Hindi-Marathi mixing is prevalent.

Build an ASR system that handles Hindi-Marathi code-switching for real-world conversational settings in Maharashtra.

audio | Research

Research

IIT Madras IndicTTS (Marathi)

Studio-quality single-speaker TTS corpus with separate male and female recordings in Marathi, plus English recorded by native Marathi speakers

Build a natural-sounding Marathi text-to-speech system using IndicTTS studio recordings for voice assistant applications.

Speech + Text (TTS) | Custom academic license (request-based)

IIT Madras

Indian Languages Audio Dataset

5-second MP3 audio samples across 10 Indian languages including Marathi, sourced from YouTube regional videos. Designed for spoken language identification and audio classification tasks rather than ASR transcription.

Build a multilingual Indian language identification system from audio that includes Marathi detection.

audio | Apache-2.0

Independent researcher (Kaggle)

Microsoft-IITB Marathi Speech Corpus

Crowdsourced conversational Marathi speech from three user demographics (rural, urban, student)

Build a robust Marathi ASR research model using the Microsoft-IITB corpus, within its non-commercial license terms.

Speech + Text | Non-commercial research only

Microsoft Research India / IIT Bombay

Mozilla Common Voice (Marathi)

Crowd-sourced read-speech recordings with validated transcriptions for Marathi, with approximately 30 hours total and 21 hours validated, part of Mozilla's open voice dataset initiative.

Build a Marathi voice assistant for farmers.

speech+text | CC0-1.0

Mozilla Foundation

MUCS 2021 (Marathi)

Multilingual and code-switching ASR challenge dataset with Marathi speech from diverse speaker groups (college students, rural/urban workers)

Build a competition-grade Marathi ASR model using MUCS 2021 challenge data for benchmarking against other systems.

Speech + Text | CC-BY 4.0

MediaEval / MUCS Challenge Organizers

OpenSLR-64 (Marathi)

Crowdsourced high-quality multi-speaker Marathi speech corpus for TTS; female speakers only

Build a baseline Marathi ASR model using the OpenSLR-64 corpus for comparison with larger training sets.

Speech + Text (TTS) | CC-BY-SA 4.0

OpenSLR / Google

RESPIN Marathi Dialect-Rich Speech Corpus

Part of the largest publicly available dialect-rich read-speech corpus for Indian languages, comprising 10,000+ hours of validated audio across 9 languages. The Marathi subset covers the agriculture and finance domains, with dialect-aware phonetic lexicons and speaker metadata. It captures rural speech patterns that urban-centric datasets miss.

Build a dialect-robust Marathi ASR model that performs well across regional speech varieties in Maharashtra.

audio | Research (IISc / Gates Foundation)

IISc Bangalore / SPIRE Lab
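The license labels in this list are spelled inconsistently (for example `CC-BY 4.0` and `CC-BY-4.0`, or `CC-0` and `CC0-1.0`), so shortlisting datasets by license needs a small normalization step. The sketch below shows one possible way to do that for a handful of entries copied from the list. The `DatasetEntry` class, the helper names, and the choice of which label prefixes count as "open" are assumptions made for illustration; always check each dataset's actual license text before relying on it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    name: str
    modality: str
    license: str
    maintainer: str


# A few entries transcribed from the list above (not the full catalog).
CATALOG = [
    DatasetEntry("AI4Bharat Kathbath (Marathi)", "speech+text", "CC0-1.0", "AI4Bharat"),
    DatasetEntry("Google FLEURS (Marathi)", "Speech + Text", "CC-BY 4.0", "Google Research"),
    DatasetEntry("IIT Madras IndicTTS (Marathi)", "Speech + Text (TTS)",
                 "Custom academic license (request-based)", "IIT Madras"),
    DatasetEntry("Microsoft-IITB Marathi Speech Corpus", "Speech + Text",
                 "Non-commercial research only", "Microsoft Research India / IIT Bombay"),
    DatasetEntry("OpenSLR-64 (Marathi)", "Speech + Text (TTS)", "CC-BY-SA 4.0", "OpenSLR / Google"),
]

# Assumption: labels starting with these prefixes are treated as open licenses.
OPEN_LICENSE_PREFIXES = ("CC0", "CC-0", "CC-BY", "APACHE")


def normalize_license(label: str) -> str:
    """Uppercase the label and replace spaces so 'CC-BY 4.0' equals 'CC-BY-4.0'."""
    return label.strip().upper().replace(" ", "-")


def is_openly_licensed(label: str) -> bool:
    """Rough heuristic on the label text only; non-commercial variants are excluded."""
    normalized = normalize_license(label)
    if "-NC" in normalized:
        return False
    return normalized.startswith(OPEN_LICENSE_PREFIXES)


open_names = [entry.name for entry in CATALOG if is_openly_licensed(entry.license)]
print(open_names)
```

Note that this heuristic keeps CC-BY-SA entries such as OpenSLR-64, whose share-alike terms still place conditions on derived works.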