AI4Bharat Shrutilipi (Marathi)
Pseudo-labeled ASR corpus mined from All India Radio news bulletins for 12 Indian languages, with approximately 1,020 hours of Marathi speech data.
Build a Marathi news transcription service that automatically generates text from radio broadcasts for rural information dissemination.
Quick Start
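from datasets import load_dataset

# Stream the Marathi ('mr') training split instead of downloading it in full
ds = load_dataset('ai4bharat/Shrutilipi', 'mr', split='train', streaming=True)

# Print the duration and the start of the transcript for the first five clips
for i, ex in enumerate(ds):
    print(f"Duration: {ex['duration']:.1f}s")
    print(f"Text: {ex['sentence'][:80]}...\n")
    if i >= 4:
        break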
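Before training on a streamed split, it can help to check how much audio a small sample actually contains. The following is a minimal sketch, assuming each example exposes the `duration` field listed in the schema; `summarize_clips` is a hypothetical helper, and the records in `sample` are invented stand-ins so the snippet runs without downloading anything.

```python
from itertools import islice
from typing import Iterable, Mapping


def summarize_clips(examples: Iterable[Mapping], limit: int = 5) -> dict:
    """Collect simple duration statistics from the first `limit` examples."""
    count = 0
    total_seconds = 0.0
    # islice stops after `limit` items, so a streamed dataset is never fully read.
    for ex in islice(examples, limit):
        count += 1
        total_seconds += float(ex["duration"])
    mean_seconds = total_seconds / count if count else 0.0
    return {
        "clips": count,
        "total_seconds": total_seconds,
        "mean_seconds": mean_seconds,
    }


# Invented stand-in records (not real corpus data).
sample = [
    {"sentence": "उदाहरण वाक्य एक", "duration": 4.0},
    {"sentence": "उदाहरण वाक्य दोन", "duration": 6.0},
    {"sentence": "उदाहरण वाक्य तीन", "duration": 8.0},
]

stats = summarize_clips(sample, limit=2)
print(stats)
```

Any iterable of dict-like examples can be passed in place of `sample`, including a streamed split, as long as each example carries a numeric `duration`.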
Size
~1,020 hrs Marathi (6,400+ hrs total)
Schema
| Field | Type | Description |
|---|---|---|
| audio | audio | Audio waveform from All India Radio news bulletin |
| sentence | string | Pseudo-labeled transcription of the audio |
| language | string | Language code (mr for Marathi) |
| duration | float | Duration of audio clip in seconds |
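Because the transcriptions are pseudo-labeled, a light sanity check on each record may be worthwhile before training. The sketch below assumes the field names and types from the table above; `find_problems` is a hypothetical helper, the positive-duration and non-empty-sentence checks are extra assumptions rather than part of the schema, and the `audio` field is only checked for presence because its in-memory representation is not described here.

```python
from typing import Any, List, Mapping

# Field names and types as listed in the schema table.
EXPECTED_TYPES = {
    "sentence": str,
    "language": str,
    "duration": (int, float),
}


def find_problems(record: Mapping[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the record looks fine."""
    problems = []
    # The audio payload is only checked for presence (its exact type is an unknown here).
    if "audio" not in record:
        problems.append("missing field: audio")
    for name, expected in EXPECTED_TYPES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"wrong type for {name}")
    # Extra sanity checks (assumptions, not stated in the schema).
    duration = record.get("duration")
    if isinstance(duration, (int, float)) and duration <= 0:
        problems.append("non-positive duration")
    sentence = record.get("sentence")
    if isinstance(sentence, str) and not sentence.strip():
        problems.append("empty sentence")
    return problems


# Invented records for illustration only.
good = {"audio": object(), "sentence": "नमस्कार", "language": "mr", "duration": 3.2}
bad = {"sentence": "", "language": "mr", "duration": -1}

print(find_problems(good))
print(find_problems(bad))
```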
Build With This
Create a Marathi news summarization pipeline that transcribes radio bulletins and generates concise text summaries for mobile alerts
Develop a domain-adapted ASR model fine-tuned on formal Marathi news speech for government communication transcription
Build an accessibility tool that converts Marathi radio content into searchable text archives for hearing-impaired users
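As a starting point for the searchable-archive idea above, transcripts can be indexed by word. This is a rough sketch under stated assumptions: `build_index` and `search` are hypothetical helpers, tokenization is naive whitespace splitting with a small hand-picked punctuation set (a real system would likely need a proper Marathi tokenizer and normalization), and the sentences are invented examples rather than corpus data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Set

# Hand-picked punctuation to trim from tokens, including the Devanagari danda.
PUNCTUATION = ".,!?;:\"'()।"


def build_index(records: Iterable[Mapping]) -> Dict[str, Set[int]]:
    """Map each token to the positions of the records whose transcript contains it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for position, rec in enumerate(records):
        for token in rec["sentence"].split():
            token = token.strip(PUNCTUATION)
            if token:
                index[token].add(position)
    return index


def search(index: Mapping[str, Set[int]], query: str) -> List[int]:
    """Return sorted positions of records containing every word in the query."""
    terms = [t.strip(PUNCTUATION) for t in query.split()]
    terms = [t for t in terms if t]
    if not terms:
        return []
    # Intersect the posting sets; a term that was never seen yields an empty set.
    hits = set.intersection(*(set(index.get(t, set())) for t in terms))
    return sorted(hits)


# Invented transcripts for illustration only.
records = [
    {"sentence": "मुंबईत आज जोरदार पाऊस."},
    {"sentence": "पुण्यात आज शाळा बंद."},
    {"sentence": "मुंबईत उद्या पाऊस नाही."},
]

index = build_index(records)
print(search(index, "पाऊस"))
print(search(index, "आज पाऊस"))
```

An exact-match index like this only finds identical surface forms, so inflected variants of the same Marathi word would be treated as different terms.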
AI Use Cases
ASR pre-training
News domain ASR
Speech-to-text for Marathi
Semi-supervised speech learning
Last verified: 2026-03-07