AI4Bharat IndicVoices

Large-scale multilingual speech corpus with read, extempore, and conversational audio across 22 Indian languages including Marathi, totaling 7,348 hours.

Build a Marathi voice assistant that handles read, extempore, and conversational speech styles for agricultural advisory services.


Quick Start

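from itertools import islice
from datasets import load_dataset

# Stream the Marathi ('mr') config so the full corpus is not downloaded up front.
ds = load_dataset('ai4bharat/IndicVoices', 'mr', split='train', streaming=True)
# Print the duration and the first 60 transcript characters of five examples.
for ex in islice(ds, 5):
    print(f"Duration: {ex['duration']:.1f}s, Transcript: {ex['transcript'][:60]}...")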

Modality

speech+text

Size

7,348 hrs total (22 langs); ~334 hrs/lang average

License

CC-BY-4.0

Format

WAV

Language

Update Frequency

static

Organization

AI4Bharat

Schema

Field	Type	Description
audio	audio	Audio waveform of the speech recording
transcript	string	Transcription of the spoken content
language	string	ISO language code (e.g., mr for Marathi)
speaker_id	string	Unique speaker identifier
duration	float	Duration of audio clip in seconds
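
The fields above can be exercised without downloading any audio. The following is a minimal sketch, assuming each example behaves like a plain dict carrying the `speaker_id` and `duration` fields listed in the table. `hours_per_speaker` is a hypothetical helper written for illustration, not part of the `datasets` library, and the sample records are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def hours_per_speaker(examples: Iterable[Mapping]) -> Dict[str, float]:
    """Sum clip durations (in seconds) per speaker_id and return hours."""
    seconds = defaultdict(float)
    for ex in examples:
        seconds[ex["speaker_id"]] += float(ex["duration"])
    return {spk: total / 3600.0 for spk, total in seconds.items()}


# Hand-made records that mimic the schema; the values are illustrative only.
sample = [
    {"speaker_id": "spk_a", "language": "mr", "duration": 1800.0, "transcript": "..."},
    {"speaker_id": "spk_a", "language": "mr", "duration": 1800.0, "transcript": "..."},
    {"speaker_id": "spk_b", "language": "mr", "duration": 900.0, "transcript": "..."},
]
totals = hours_per_speaker(sample)
print(totals)
```

Because the helper only iterates once, it should also accept a bounded slice of a streamed split, for example the first few thousand examples taken with `itertools.islice`.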

Build With This

Create a Marathi-Hindi-English code-switching ASR model trained on the conversational speech subset for real-world call center applications

Develop a speaker diarization system for Marathi meetings and interviews using the multi-speaker conversational data

Build a speech style classifier that distinguishes read, spontaneous, and conversational Marathi speech for adaptive ASR pipelines

AI Use Cases

ASR training, Language identification, Speaker verification, Multilingual speech processing
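
For ASR training and speaker verification, evaluation tends to be more meaningful when test speakers never appear in the training data. Below is a minimal sketch of a deterministic speaker-disjoint split, assuming the `speaker_id` field from the schema. `assign_split`, the 10% test fraction, and the speaker IDs are illustrative choices, not something the dataset prescribes.

```python
import hashlib


def assign_split(speaker_id: str, test_fraction: float = 0.1) -> str:
    """Map a speaker to 'train' or 'test' by hashing the ID.

    Hashing keeps every clip from one speaker in the same split and
    gives the same answer on every run, without storing a speaker list.
    """
    digest = hashlib.sha256(speaker_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to roughly [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


speakers = [f"spk_{i:04d}" for i in range(1000)]  # made-up IDs
splits = {spk: assign_split(spk) for spk in speakers}
n_test = sum(1 for s in splits.values() if s == "test")
print(f"{n_test} of {len(speakers)} speakers assigned to test")
```

The split is a function of the speaker ID alone, so it can be applied example by example while streaming, with no need for a first pass over the whole corpus.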

Related Datasets

AI4Bharat BhasaAnuvaad (Marathi)

Speech + Text (Translation)

AI4Bharat IndicVoices-R

Speech + Text (TTS-ready)

AI4Bharat Kathbath

Speech + Text

AI4Bharat Kathbath (Marathi)

Speech + Text

Last verified: 2026-03-07