Kathbath is a crowdsourced, human-labeled ASR dataset covering 12 Indian languages, including Marathi, collected via the Karya platform.
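
    from datasets import load_dataset

    # Stream the Marathi ('mr') training split instead of downloading it in full.
    ds = load_dataset('ai4bharat/Kathbath', 'mr', split='train', streaming=True)

    # Print the first 80 characters of the first five transcriptions.
    for i, ex in enumerate(ds):
        print(f"Sentence: {ex['sentence'][:80]}...")
        if i >= 4:
            break

| Field | Type | Description |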
|---|---|---|
| audio | audio | Speech audio recording |
| sentence | string | Human-verified transcription |
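
The two fields above can be checked on a single record before running a full pass over the stream. The sketch below is a minimal illustration, not part of the dataset's tooling: it assumes the `audio` field decodes to a mapping with `array` and `sampling_rate` keys, which is how the Hugging Face `datasets` audio feature has commonly behaved, though some library versions return a decoder object instead. The helper `clip_duration_seconds` and the synthetic `fake_example` record are hypothetical names introduced here so the snippet runs without downloading anything.

```python
from typing import Any, Mapping


def clip_duration_seconds(example: Mapping[str, Any]) -> float:
    """Return the clip length in seconds for one record.

    Assumes example["audio"] is a mapping with "array" (the samples)
    and "sampling_rate" (samples per second). This layout is an
    assumption, not something the dataset card states.
    """
    audio = example["audio"]
    sampling_rate = audio["sampling_rate"]
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    return len(audio["array"]) / sampling_rate


# Synthetic stand-in for a real record: 32000 silent samples at 16 kHz.
fake_example = {
    "audio": {"array": [0.0] * 32000, "sampling_rate": 16000},
    "sentence": "placeholder transcription",
}

print(f"{clip_duration_seconds(fake_example):.2f} s")  # prints "2.00 s"
```

If the assumed layout holds, the same helper could be called on each `ex` yielded by the streaming loop to filter out clips that are too short or too long.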