Large-scale multilingual speech corpus with read, extempore, and conversational audio across 22 Indian languages including Marathi, totaling 7,348 hours.
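
    from datasets import load_dataset

    # Stream the Marathi ('mr') subset so the full corpus is not downloaded up front.
    ds = load_dataset('ai4bharat/IndicVoices', 'mr', split='train', streaming=True)
    # Print the duration and a 60-character transcript preview for the first five examples.
    for i, ex in enumerate(ds):
        print(f"Duration: {ex['duration']:.1f}s, Transcript: {ex['transcript'][:60]}...")
        if i >= 4:
            break

| Field | Type | Description |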
|---|---|---|
| audio | audio | Audio waveform of the speech recording |
| transcript | string | Transcription of the spoken content |
| language | string | ISO language code (e.g., mr for Marathi) |
| speaker_id | string | Unique speaker identifier |
| duration | float | Duration of audio clip in seconds |
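
As a rough illustration of working with these fields, the sketch below totals recorded hours per speaker. It assumes each example is a mapping with the `speaker_id` and `duration` keys listed in the table; the helper name `hours_per_speaker`, the `limit` parameter, and the sample records are hypothetical and exist only for this example.

```python
from collections import defaultdict
from itertools import islice
from typing import Dict, Iterable, Mapping


def hours_per_speaker(examples: Iterable[Mapping], limit: int = 1000) -> Dict[str, float]:
    """Sum clip durations, in hours, per speaker over at most `limit` examples."""
    totals: Dict[str, float] = defaultdict(float)
    # islice stops after `limit` items, so a streamed dataset is never read in full.
    for ex in islice(examples, limit):
        totals[ex["speaker_id"]] += ex["duration"] / 3600.0  # seconds -> hours
    return dict(totals)


# Made-up records shaped like the table above (not real corpus data).
sample = [
    {"speaker_id": "spk_a", "language": "mr", "duration": 1800.0},
    {"speaker_id": "spk_b", "language": "mr", "duration": 3600.0},
    {"speaker_id": "spk_a", "language": "mr", "duration": 900.0},
]

totals = hours_per_speaker(sample)
for speaker, hours in sorted(totals.items()):
    print(f"{speaker}: {hours:.2f} h")
```

Because the helper accepts any iterable of mappings, the streamed `ds` object from the loading snippet could be passed in place of `sample`, for example `hours_per_speaker(ds, limit=100)`, provided the examples carry the fields as documented here.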