AI4Bharat IndicVoices

Large-scale multilingual speech corpus with read, extempore, and conversational audio across 22 Indian languages including Marathi, totaling 7,348 hours.

Build a Marathi voice assistant that handles read, extempore, and conversational speech styles for agricultural advisory services.

Quick Start

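from datasets import load_dataset

# Stream the Marathi ('mr') training split so the corpus is not downloaded up front.
ds = load_dataset('ai4bharat/IndicVoices', 'mr', split='train', streaming=True)

# Print the duration and the first 60 transcript characters of five examples.
for i, ex in enumerate(ds):
    print(f"Duration: {ex['duration']:.1f}s, Transcript: {ex['transcript'][:60]}...")
    if i >= 4:
        break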
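
As a follow-up to the snippet above, here is a minimal sketch of how one might summarise clip durations over the first few streamed examples. It assumes each example is a dict with a float `duration` field, as described in the schema on this page. `summarize_durations` is a hypothetical helper written for illustration, not part of the `datasets` library.

```python
from itertools import islice
from typing import Iterable, Mapping


def summarize_durations(examples: Iterable[Mapping], limit: int = 100) -> dict:
    """Return count, total seconds, and mean seconds for up to `limit` examples.

    Hypothetical helper: assumes each example has a float 'duration' field.
    """
    durations = [float(ex["duration"]) for ex in islice(examples, limit)]
    total = sum(durations)
    return {
        "count": len(durations),
        "total_seconds": total,
        "mean_seconds": total / len(durations) if durations else 0.0,
    }


# Stand-in records; with the real corpus, pass the streaming `ds` instead.
sample = [{"duration": 2.0}, {"duration": 4.0}, {"duration": 6.0}]
stats = summarize_durations(sample, limit=2)
```

Because the helper stops after `limit` examples, it can be pointed at a streaming dataset without reading the whole split.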

Modality: speech+text
Size: 7,348 hrs total (22 langs); ~334 hrs/lang on average
License:
Format: WAV
Language: mr (Marathi)
Update Frequency: static
Organization: AI4Bharat

Schema

Field       Type    Description
audio       audio   Audio waveform of the speech recording
transcript  string  Transcription of the spoken content
language    string  ISO language code (e.g., mr for Marathi)
speaker_id  string  Unique speaker identifier
duration    float   Duration of audio clip in seconds
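
To make the schema concrete, here is a minimal sketch of a per-example check against the fields listed above. The field names and types come from the table. `schema_problems` and `EXPECTED_FIELDS` are hypothetical names introduced for illustration, and the `audio` field is only checked for presence because its decoded Python type depends on the loader.

```python
from typing import Any, List, Mapping

# Field names and Python types taken from the schema table on this page.
# 'audio' is handled separately: only its presence is checked.
EXPECTED_FIELDS = {
    "transcript": str,
    "language": str,
    "speaker_id": str,
    "duration": float,
}


def schema_problems(example: Mapping[str, Any]) -> List[str]:
    """List human-readable schema violations for one example (empty if none)."""
    problems = []
    if "audio" not in example:
        problems.append("missing field: audio")
    for name, expected in EXPECTED_FIELDS.items():
        if name not in example:
            problems.append(f"missing field: {name}")
        elif not isinstance(example[name], expected):
            actual = type(example[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    return problems


# Stand-in records for illustration; the values are made up.
good = {"audio": object(), "transcript": "नमस्कार", "language": "mr",
        "speaker_id": "S001", "duration": 3.2}
bad = {"audio": object(), "transcript": "नमस्कार", "language": "mr",
       "duration": "3.2"}
```

Running such a check on a handful of streamed examples is a cheap way to confirm that the field names on this page match what the loader actually returns.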

Build With This

Create a Marathi-Hindi-English code-switching ASR model trained on the conversational speech subset for real-world call center applications
Develop a speaker diarization system for Marathi meetings and interviews using the multi-speaker conversational data
Build a speech style classifier that distinguishes read, spontaneous, and conversational Marathi speech for adaptive ASR pipelines
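
For any of the project ideas above, evaluation tends to be more meaningful when no speaker appears in both the training and the test data. The following is a minimal sketch of one way to get such a split, assuming each example carries the `speaker_id` field from the schema. `assign_split` and the 10% test fraction are illustrative choices, not something the dataset prescribes.

```python
import hashlib


def assign_split(speaker_id: str, test_fraction: float = 0.1) -> str:
    """Deterministically map a speaker to 'train' or 'test'.

    Hashing the speaker ID keeps every clip from one speaker in the same split.
    """
    digest = hashlib.sha256(speaker_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


# Stand-in speaker IDs for illustration; real IDs come from the dataset.
examples = [{"speaker_id": "S001"}, {"speaker_id": "S002"}, {"speaker_id": "S001"}]
splits = [assign_split(ex["speaker_id"]) for ex in examples]
```

Since the assignment depends only on the speaker ID, it works on a streaming dataset and gives the same result on every run.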

AI Use Cases

ASR training
Language identification
Speaker verification
Multilingual speech processing
Last verified: 2026-03-07