Indian Languages Audio Dataset

5-second MP3 audio samples across 10 Indian languages including Marathi, sourced from YouTube regional videos. Designed for spoken language identification and audio classification tasks rather than ASR transcription.

Build a multilingual Indian language identification system from audio that includes Marathi detection.

Quick Start

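# Indian Languages Audio Dataset: list the Marathi clips and load one
# Assumption: the archive is extracted locally with one sub-folder per language (e.g. "Marathi")
from pathlib import Path
import torchaudio
DATA_DIR = Path("indian_languages_audio")  # hypothetical local path, adjust to your download
marathi_clips = sorted((DATA_DIR / "Marathi").glob("*.mp3"))  # filter for the Marathi (mr) subset
print(f"Found {len(marathi_clips)} Marathi clips")
if marathi_clips:
    waveform, sample_rate = torchaudio.load(str(marathi_clips[0]))  # tensor of shape (channels, frames)
    print(waveform.shape, sample_rate)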

Modality: audio
Size: ~748 MB; 5-second clips; 10 Indian languages including Marathi
License:
Format: MP3
Language: mr, hi, en
Update Frequency: static
Organization: Independent researcher (Kaggle)

Schema

Field     Type    Description
audio     audio   Audio recording in Indian language
text      string  Transcription text
language  string  Language identifier
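
The three fields above could be represented in code roughly as follows. This is a minimal sketch, not the dataset's official loader: the `Sample` class, the `filter_by_language` helper, and the idea that `audio` is stored as a file path are all assumptions made for illustration, and the rows are made up.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Sample:
    """One record following the schema: audio, text, language."""
    audio: str     # assumption: path to the MP3 clip on disk
    text: str      # transcription text (may be empty)
    language: str  # language identifier, e.g. "mr"


def filter_by_language(samples: Iterable[Sample], code: str) -> List[Sample]:
    """Return the samples whose language identifier matches `code` (case-insensitive)."""
    wanted = code.strip().lower()
    return [s for s in samples if s.language.strip().lower() == wanted]


# Made-up rows, for illustration only
rows = [
    Sample("clips/0001.mp3", "", "mr"),
    Sample("clips/0002.mp3", "", "hi"),
    Sample("clips/0003.mp3", "", "MR"),
]
marathi = filter_by_language(rows, "mr")
print(len(marathi))
```

Normalizing the identifier before comparing avoids missing rows that differ only in case or surrounding whitespace.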

Build With This

Create a language-specific audio router for multilingual call centers serving Maharashtra's diverse population
Develop an Indian language ASR meta-model that leverages cross-lingual transfer from this multi-language dataset
Build a dialectal variation study comparing Marathi audio features against other Indo-Aryan languages in the dataset
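
As a rough illustration of the call-routing idea, the routing step can be kept separate from the classifier itself. The sketch below assumes some language-identification model has already produced a score per language code; the queue names, the `route_call` function, and the 0.6 threshold are hypothetical choices, not part of the dataset.

```python
from typing import Dict

# Hypothetical mapping from language code to an agent queue
QUEUES: Dict[str, str] = {
    "mr": "marathi_agents",
    "hi": "hindi_agents",
    "en": "english_agents",
}
FALLBACK_QUEUE = "multilingual_agents"  # used when the prediction is uncertain


def route_call(scores: Dict[str, float], min_confidence: float = 0.6) -> str:
    """Pick a queue from per-language scores, falling back when confidence is low."""
    if not scores:
        return FALLBACK_QUEUE
    best_lang = max(scores, key=lambda lang: scores[lang])
    if scores[best_lang] < min_confidence or best_lang not in QUEUES:
        return FALLBACK_QUEUE
    return QUEUES[best_lang]


print(route_call({"mr": 0.82, "hi": 0.12, "en": 0.06}))
print(route_call({"mr": 0.40, "hi": 0.35, "en": 0.25}))
```

A fallback queue matters here because 5-second clips carry limited evidence, so low-confidence predictions are better handled by agents who speak several languages than by a forced guess.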

AI Use Cases

Spoken language identification (Marathi vs other Indian languages)
Audio classification and dialect detection
Multilingual call routing for contact centers
Language detection in mixed-language broadcast media
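
For the broadcast-media use case, one plausible approach is to cut a longer recording into windows that match the dataset's 5-second clip length and classify each window separately. The sketch below only computes the window boundaries; `segment_windows` and its parameters are illustrative names, and dropping the trailing partial window is an assumption rather than a requirement.

```python
from typing import List, Tuple


def segment_windows(
    duration_s: float, window_s: float = 5.0, hop_s: float = 5.0
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds for fixed-length windows.

    A trailing segment shorter than `window_s` is dropped, so every window
    has the same length as the training clips.
    """
    if window_s <= 0 or hop_s <= 0:
        raise ValueError("window_s and hop_s must be positive")
    windows: List[Tuple[float, float]] = []
    start = 0.0
    while start + window_s <= duration_s:
        windows.append((start, start + window_s))
        start += hop_s
    return windows


# A 12-second recording yields two full windows; the last 2 seconds are dropped
print(segment_windows(12.0))
# A smaller hop produces overlapping windows, which can localize language switches more finely
print(segment_windows(10.0, hop_s=2.5))
```

Each window can then be passed to a language-identification model, and consecutive windows with the same predicted label merged into longer single-language spans.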
Last verified: 2026-03-09