AI4Bharat Kathbath (Marathi)

Crowdsourced human-labeled ASR dataset for 12 Indian languages including Marathi, collected via the Karya platform with approximately 140 hours per language.

Build a production-ready Marathi speech-to-text API for customer service automation in Maharashtra banks and telecom companies.

Homepage HuggingFace

Quick Start

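from datasets import load_dataset

# Stream the Marathi training split instead of downloading the full dataset first.
ds = load_dataset('ai4bharat/Kathbath', 'mr', split='train', streaming=True)

# Print the first five transcriptions, truncated to 80 characters each.
for i, ex in enumerate(ds):
    print(f"Sentence: {ex['sentence'][:80]}...")
    if i >= 4:
        break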

Modality

speech+text

Size

~140 hrs Marathi (1,684 hrs total across 12 langs)

License

CC0-1.0

Format

WAV

Language

Marathi

Update Frequency

static

Organization

AI4Bharat

Schema

Field	Type	Description
audio	audio	Audio waveform of the speech recording (WAV format)
sentence	string	Human-verified transcription of the audio
locale	string	Language locale code (mr_IN for Marathi)
split	string	Dataset split (train, valid, test_known, test_unknown)
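
The schema above can be turned into a small sanity check to run on records before training. This is a minimal sketch: the field names and split values come from the table, while the `check_record` helper, its specific checks, and the sample record are hypothetical and exist only for illustration.

```python
from typing import Any, List, Mapping

# Field names and split values are copied from the schema table.
EXPECTED_FIELDS = ("audio", "sentence", "locale", "split")
KNOWN_SPLITS = {"train", "valid", "test_known", "test_unknown"}


def check_record(record: Mapping[str, Any], locale: str = "mr_IN") -> List[str]:
    """Return a list of problems found in one record (empty list means none)."""
    problems: List[str] = []
    for name in EXPECTED_FIELDS:
        if name not in record:
            problems.append(f"missing field: {name}")
    sentence = record.get("sentence")
    if "sentence" in record and (not isinstance(sentence, str) or not sentence.strip()):
        problems.append("sentence is empty or not a string")
    if "locale" in record and record["locale"] != locale:
        problems.append(f"unexpected locale: {record['locale']}")
    if "split" in record and record["split"] not in KNOWN_SPLITS:
        problems.append(f"unexpected split: {record['split']}")
    return problems


# Hypothetical record for illustration; not taken from the dataset.
sample = {
    "audio": {"array": [0.0, 0.1, -0.1], "sampling_rate": 16000},  # assumed layout
    "sentence": "नमस्कार",
    "locale": "mr_IN",
    "split": "train",
}
print(check_record(sample))
```

An empty list means the record passed every check; each string in a non-empty list names one problem, which makes the helper easy to use as a filter while streaming.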

Build With This

Create a Marathi dictation app for government officials to transcribe meeting notes and official correspondence in real time

Develop a voice-based form filling system for illiterate citizens accessing Maharashtra government services

Build an automated Marathi subtitling service for educational video content from Maharashtra state board schools
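
For the subtitling idea, one small building block is turning timed transcript segments into SubRip (SRT) text. The sketch below assumes an ASR system that already yields `(start_seconds, end_seconds, text)` tuples. Kathbath itself does not provide such timestamps, and the helper names `format_timestamp` and `to_srt` are hypothetical.

```python
from typing import Iterable, Tuple


def format_timestamp(seconds: float) -> str:
    """Render a time in seconds as the HH:MM:SS,mmm form used by SRT."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Join (start, end, text) segments into numbered SRT cues."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    # Cues are separated by one blank line.
    return "\n".join(blocks)


# Hypothetical segments for illustration only.
print(to_srt([(0.0, 2.5, "नमस्कार"), (2.5, 5.0, "आजचा धडा सुरू करूया")]))
```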

AI Use Cases

ASR training, Speech benchmarking, Voice-based applications, Marathi speech recognition
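
Speech benchmarking on a dataset like this usually means comparing model output against the `sentence` field with word error rate (WER). The following is a minimal sketch of WER as word-level edit distance divided by reference length. It splits on whitespace only and applies no text normalization, which is a simplifying assumption. The `word_error_rate` name and the example sentences are made up for illustration.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: the hypothesis drops the final word of a five-word reference.
reference = "मी उद्या बँकेत जाणार आहे"
hypothesis = "मी उद्या बँकेत जाणार"
print(word_error_rate(reference, hypothesis))
```

In practice, scores are typically averaged over a whole test split, and both strings are normalized (punctuation, Unicode form) before comparison.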

Related Datasets

AI4Bharat BhasaAnuvaad (Marathi)

Speech + Text (Translation)

AI4Bharat IndicVoices

speech+text

AI4Bharat IndicVoices-R

Speech + Text (TTS-ready)

AI4Bharat Kathbath

Speech + Text

Last verified: 2026-03-07