AI4Bharat Kathbath (Marathi)

A crowdsourced, human-labeled ASR dataset covering 12 Indian languages, including Marathi, collected via the Karya platform. It averages roughly 140 hours of speech per language.

Build a production-ready Marathi speech-to-text API for customer service automation in Maharashtra banks and telecom companies.

Quick Start

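from datasets import load_dataset

# Stream the Marathi ('mr') training split instead of downloading it in full.
ds = load_dataset('ai4bharat/Kathbath', 'mr', split='train', streaming=True)

# Print the first five transcriptions, truncated to 80 characters each.
for i, ex in enumerate(ds):
    print(f"Sentence: {ex['sentence'][:80]}...")
    if i >= 4:
        break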

Modality: speech+text
Size: ~140 hrs Marathi (1,684 hrs total across 12 langs)
License: (not stated)
Format: WAV
Language: mr
Update Frequency: static
Organization: AI4Bharat

Schema

Field | Type | Description
audio | audio | Audio waveform of the speech recording (WAV format)
sentence | string | Human-verified transcription of the audio
locale | string | Language locale code (mr_IN for Marathi)
split | string | Dataset split (train, valid, test_known, test_unknown)
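
The schema above can be turned into a quick sanity check before records are fed into a training pipeline. The following is a minimal sketch, not part of the dataset's tooling: the function name `check_record` and the sample record are hypothetical, and the field names, locale code, and split names are taken from the Schema table as listed here, so they should be confirmed against the actual dataset.

```python
# Field names and allowed values come from the Schema table on this page;
# check_record and the sample record below are illustrative assumptions.
EXPECTED_FIELDS = {"audio", "sentence", "locale", "split"}
KNOWN_SPLITS = {"train", "valid", "test_known", "test_unknown"}


def check_record(record):
    """Return a list of problems found in one record (empty list if none)."""
    problems = []

    # Every schema field should be present as a key.
    missing = EXPECTED_FIELDS - set(record)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")

    # The transcription should be a non-empty string.
    sentence = record.get("sentence")
    if not isinstance(sentence, str) or not sentence.strip():
        problems.append("empty or non-string sentence")

    # Marathi records are expected to carry the mr_IN locale code.
    if record.get("locale") != "mr_IN":
        problems.append(f"unexpected locale: {record.get('locale')!r}")

    # The split should be one of the four listed in the schema.
    if record.get("split") not in KNOWN_SPLITS:
        problems.append(f"unknown split: {record.get('split')!r}")

    return problems


# A made-up record; the audio payload is left as None for illustration.
sample = {"audio": None, "sentence": "नमस्कार", "locale": "mr_IN", "split": "train"}
print(check_record(sample))
```

Returning a list of problems rather than raising on the first one makes it easier to log all issues for a record while streaming through the data.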

Build With This

Create a Marathi dictation app for government officials to transcribe meeting notes and official correspondence in real-time
Develop a voice-based form filling system for illiterate citizens accessing Maharashtra government services
Build an automated Marathi subtitling service for educational video content from Maharashtra state board schools

AI Use Cases

ASR training, Speech benchmarking, Voice-based applications, Marathi speech recognition
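
For the speech benchmarking use case, ASR output is commonly scored with word error rate (WER), comparing a model's hypothesis against the reference transcription in the `sentence` field. The sketch below is a plain-Python illustration of that metric under simple assumptions (whitespace tokenization, no text normalization). The function name `word_error_rate` is hypothetical, the example sentences are invented, and this page does not prescribe any particular scoring setup.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Invented reference/hypothesis pair, for illustration only.
reference = "माझे खाते तपासा"
hypothesis = "माझे खाते तपास"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

In practice, Marathi text usually needs consistent normalization (punctuation, numerals, Unicode forms) on both sides before scoring; that step is left out here.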
Last verified: 2026-03-07