AI4Bharat Kathbath

MH Specific

Crowdsourced, human-labeled ASR dataset covering 12 Indian languages, including Marathi, collected via the Karya platform.

Build a Marathi ASR system with human-verified labels for high-accuracy speech recognition in formal settings.
Homepage | HuggingFace

Quick Start

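from datasets import load_dataset

# Stream the Marathi ('mr') training split so the audio is not downloaded up front.
ds = load_dataset('ai4bharat/Kathbath', 'mr', split='train', streaming=True)

# Print the first 80 characters of the first five transcriptions.
for i, ex in enumerate(ds):
    print(f"Sentence: {ex['sentence'][:80]}...")
    if i >= 4:
        break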
Modality: Speech + Text
Size: 1,684 hrs total (12 langs); ~140 hrs/lang median
License: (not specified)
Format: WAV
Language: mr
Update Frequency: static
Organization: AI4Bharat, IIT Madras

Schema

Field | Type | Description
audio | audio | Speech audio recording
sentence | string | Human-verified transcription
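
To make the two schema fields concrete, here is a minimal sketch of inspecting one record. It assumes the audio field decodes to a dict with 'array' and 'sampling_rate' keys, as the Hugging Face datasets library has commonly done; newer library versions may return a decoder object instead, so check what your installed version yields. The helper name describe_example and the stand-in record are hypothetical, not part of the dataset.

```python
def describe_example(ex: dict) -> dict:
    """Summarize one record with 'audio' and 'sentence' fields.

    Assumption: ex['audio'] is a dict holding the decoded waveform under
    'array' and its sample rate under 'sampling_rate'.
    """
    audio = ex["audio"]
    n_samples = len(audio["array"])
    rate = audio["sampling_rate"]
    return {
        "duration_s": n_samples / rate,   # clip length in seconds
        "sampling_rate": rate,            # samples per second
        "n_chars": len(ex["sentence"]),   # transcription length in code points
    }


# Stand-in record (made up for illustration): 2 seconds of silence at 16 kHz.
fake = {
    "audio": {"array": [0.0] * 32000, "sampling_rate": 16000},
    "sentence": "नमस्कार",
}
info = describe_example(fake)
print(info)
```

With a real stream, the same helper could be called on each ex yielded by the Quick Start loop, for example to filter out very short or very long clips before training.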

Build With This

Create a Marathi dictation system for legal professionals transcribing court proceedings
Develop an educational speech assessment tool that evaluates Marathi pronunciation accuracy
Build a customer service call transcription system for Maharashtra-based call centers

AI Use Cases

ASR, Speech Benchmarking
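
For the benchmarking use case, ASR output is usually scored against the human-verified sentence field with word error rate (WER). The sketch below is a plain-Python illustration under simple assumptions: whitespace tokenization and no text normalization (punctuation, numerals, and spelling variants are compared as-is). The function name word_error_rate is hypothetical, and the example sentences are made up rather than taken from the dataset.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
wer = word_error_rate("मी शाळेत जातो", "मी शाळेत जाते")
print(f"WER: {wer:.3f}")
```

Published benchmark numbers typically apply a text normalization step before scoring, so values from this sketch are only comparable to each other, not to reported results.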
Last verified: 2026-03-07