AI4Bharat Kathbath

MH Specific

Crowdsourced, human-labeled ASR dataset covering 12 Indian languages, including Marathi, collected via the Karya platform.

Train a Marathi ASR system on human-verified transcriptions for high-accuracy speech recognition in formal settings.

Quick Start

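from itertools import islice

from datasets import load_dataset

# Stream the Marathi ('mr') training split instead of downloading it in full.
ds = load_dataset('ai4bharat/Kathbath', 'mr', split='train', streaming=True)

# Print the first five transcriptions, truncated to 80 characters each.
for ex in islice(ds, 5):
    print(f"Sentence: {ex['sentence'][:80]}...")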

Modality

Speech + Text

Size

1,684 hrs total across 12 languages; ~140 hrs per language (median)

License

CC0

Format

WAV

Language

Update Frequency

static

Organization

AI4Bharat, IIT Madras

Schema

Field	Type	Description
audio	audio	Speech audio recording
sentence	string	Human-verified transcription
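
As a rough illustration of how the two schema fields might be consumed, the sketch below summarizes a single record. It assumes the audio field decodes to a mapping with "array" and "sampling_rate" keys, which is how the Hugging Face datasets Audio feature commonly presents decoded audio; that layout is not confirmed for Kathbath here, and some library versions may return a decoder object instead. The helper name describe_example and the synthetic record are hypothetical.

```python
from typing import Any, Mapping


def describe_example(ex: Mapping[str, Any]) -> dict:
    """Summarize one record: audio duration in seconds and transcript word count.

    Assumes ex["audio"] is a mapping with "array" (samples) and
    "sampling_rate" (Hz). This layout is an assumption, not confirmed here.
    """
    audio = ex["audio"]
    n_samples = len(audio["array"])
    rate = audio["sampling_rate"]
    return {
        "duration_s": n_samples / rate,
        "n_words": len(ex["sentence"].split()),
    }


# Synthetic stand-in for a dataset record (not real Kathbath data):
# one second of silence at 16 kHz plus a placeholder transcript.
fake_example = {
    "audio": {"array": [0.0] * 16000, "sampling_rate": 16000},
    "sentence": "नमस्कार जग",
}
summary = describe_example(fake_example)
print(summary)
```

Running the same helper over the first few streamed records is a quick way to check clip lengths and spot empty transcripts before training.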

Build With This

Create a Marathi dictation system for legal professionals transcribing court proceedings

Develop an educational speech assessment tool that evaluates Marathi pronunciation accuracy

Build a customer service call transcription system for Maharashtra-based call centers

AI Use Cases

ASR

Speech Benchmarking
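
For the benchmarking use case, ASR output is usually scored with word error rate (WER) against the human-verified transcription. The sketch below is a minimal, dependency-free version: word-level edit distance divided by the number of reference words. It is not an official Kathbath evaluation script; the function name word_error_rate and the sample transcripts are hypothetical, and real evaluations typically normalize punctuation and spelling first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts for illustration only.
ref_text = "the court is now in session"
hyp_text = "the court is in session now"
wer = word_error_rate(ref_text, hyp_text)
print(f"WER: {wer:.3f}")
```

Because the function splits on whitespace, it works unchanged on Marathi text; a character-level variant (CER) can be obtained by passing lists of characters instead of words.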

Related Datasets

AI4Bharat BhasaAnuvaad (Marathi)

Speech + Text (Translation)

AI4Bharat IndicVoices

Speech + Text

AI4Bharat IndicVoices-R

Speech + Text (TTS-ready)

AI4Bharat Kathbath (Marathi)

Speech + Text

Last verified: 2026-03-07