AI4Bharat Kathbath (Marathi)
Crowdsourced human-labeled ASR dataset for 12 Indian languages including Marathi, collected via the Karya platform with approximately 140 hours per language.
Build a production-ready Marathi speech-to-text API for customer service automation in Maharashtra banks and telecom companies.
Quick Start
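from datasets import load_dataset

# Stream the Marathi training split instead of downloading it in full.
ds = load_dataset('ai4bharat/Kathbath', 'mr', split='train', streaming=True)
# Print the first five transcriptions, truncated to 80 characters.
for i, ex in enumerate(ds):
    print(f"Sentence: {ex['sentence'][:80]}...")
    if i >= 4:
        break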
Size
~140 hrs Marathi (1,684 hrs total across 12 langs)
Schema
| Field | Type | Description |
|---|---|---|
| audio | audio | Audio waveform of the speech recording (WAV format) |
| sentence | string | Human-verified transcription of the audio |
| locale | string | Language locale code (mr_IN for Marathi) |
| split | string | Dataset split (train, valid, test_known, test_unknown) |
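The fields above can be combined to check clip lengths or estimate how many hours of audio a subset contains. The following is a minimal sketch, not part of the dataset's own tooling. It assumes each `audio` value decodes to a mapping with `array` and `sampling_rate` keys, as the Hugging Face `datasets` Audio feature commonly provides; the helper names `clip_seconds` and `total_hours` are hypothetical, and the `demo` record is synthetic rather than real Kathbath data.

```python
from typing import Any, Dict, Iterable


def clip_seconds(example: Dict[str, Any]) -> float:
    """Duration of one clip in seconds.

    Assumes example['audio'] exposes 'array' (samples) and 'sampling_rate' (Hz).
    """
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


def total_hours(examples: Iterable[Dict[str, Any]], locale: str = "mr_IN") -> float:
    """Sum clip durations, in hours, over records whose locale matches."""
    seconds = sum(clip_seconds(ex) for ex in examples if ex.get("locale") == locale)
    return seconds / 3600.0


# Synthetic record shaped like the schema table; not real Kathbath data.
demo = {
    "audio": {"array": [0.0] * 32000, "sampling_rate": 16000},
    "sentence": "placeholder transcription",
    "locale": "mr_IN",
    "split": "train",
}

print(clip_seconds(demo))  # 2.0 (32000 samples at 16000 Hz)
```

Because `total_hours` accepts any iterable of records, it should also work on a streamed split, for example `total_hours(itertools.islice(ds, 100))`, provided the records actually carry the fields listed in the table.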
Build With This
Create a Marathi dictation app for government officials to transcribe meeting notes and official correspondence in real time
Develop a voice-based form filling system for illiterate citizens accessing Maharashtra government services
Build an automated Marathi subtitling service for educational video content from Maharashtra state board schools
AI Use Cases
ASR training, Speech benchmarking, Voice-based applications, Marathi speech recognition
Last verified: 2026-03-07