AI4Bharat Shrutilipi

MH Specific

Pseudo-labeled ASR corpus mined from All India Radio news bulletins for 12 Indian languages

Build a news-domain Marathi ASR model trained on radio broadcast speech for automated news transcription.
Homepage | HuggingFace

Quick Start

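from datasets import load_dataset

# Stream the Marathi ('mr') training split without downloading the full corpus.
ds = load_dataset('ai4bharat/Shrutilipi', 'mr', split='train', streaming=True)
# Print the first 80 characters of the first five transcriptions.
for i, ex in enumerate(ds):
    print(f"Text: {ex['sentence'][:80]}...")
    if i >= 4:
        break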
Modality: Speech + Text
Size: 6,400+ hrs total; ~1,020 hrs Marathi
License:
Format: WAV
Language: mr
Update Frequency: static
Organization: AI4Bharat, IIT Madras

Schema

Field | Type | Description
audio | audio | Speech audio from All India Radio
sentence | string | Pseudo-labeled transcription
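
As a rough illustration of the two fields, the sketch below summarizes one record. It assumes the audio field decodes to a mapping with "array" and "sampling_rate" keys, which is how the Hugging Face datasets Audio feature has typically exposed decoded audio. That layout is an assumption here and should be checked against the actual dataset. The summarize_record helper and the synthetic record are hypothetical.

```python
def summarize_record(record):
    """Return duration (seconds) and transcript length for one record.

    Assumes record["audio"] is a mapping with "array" (samples) and
    "sampling_rate" (Hz). This layout is an assumption, not something
    confirmed by the dataset card.
    """
    audio = record["audio"]
    num_samples = len(audio["array"])
    duration_s = num_samples / audio["sampling_rate"]
    return {
        "duration_s": duration_s,
        "num_chars": len(record["sentence"]),
    }


# Synthetic stand-in for a real example: 2 seconds of silence at 16 kHz.
fake_record = {
    "audio": {"array": [0.0] * 32000, "sampling_rate": 16000},
    "sentence": "नमस्कार",
}
summary = summarize_record(fake_record)
print(summary)
```

With a streamed example from the Quick Start loop, the same helper could be called as summarize_record(ex), provided the decoded audio follows the assumed layout.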

Build With This

Create an automated radio monitoring system that transcribes and indexes Marathi radio broadcasts
Develop a broadcast-quality Marathi speech dataset by filtering high-confidence pseudo-labels from Shrutilipi
Build a Marathi news keyword spotter that detects mentions of specific topics in radio broadcasts in real time
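
The schema lists no confidence score, so filtering pseudo-labels needs some proxy. The sketch below uses two simple heuristics (the share of Devanagari characters and the number of characters per second) and then looks for keywords in the transcripts that pass. The thresholds, helper names, and sample records are all illustrative assumptions, not part of Shrutilipi. The keyword step is plain text matching on transcripts rather than acoustic keyword spotting, and a real pipeline would more likely rely on model confidence or alignment scores.

```python
def devanagari_ratio(text):
    """Fraction of non-space characters in the Devanagari block (U+0900 to U+097F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return hits / len(chars)


def looks_reliable(text, duration_s, min_ratio=0.9, min_cps=3.0, max_cps=25.0):
    """Heuristic proxy for label quality. Thresholds are illustrative guesses."""
    if duration_s <= 0:
        return False
    chars_per_second = len(text) / duration_s
    return (
        devanagari_ratio(text) >= min_ratio
        and min_cps <= chars_per_second <= max_cps
    )


def spot_keywords(text, keywords):
    """Return the keywords that occur as substrings of the transcript."""
    return [kw for kw in keywords if kw in text]


# Hypothetical (transcript, duration in seconds) pairs standing in for real data.
samples = [
    ("आज मुंबईत जोरदार पाऊस झाला", 3.0),  # plausible script and speaking rate
    ("hello world", 2.0),                  # wrong script, rejected
    ("पाऊस", 60.0),                        # far too little text for 60 s, rejected
]
kept = [text for text, dur in samples if looks_reliable(text, dur)]
matches = {text: spot_keywords(text, ["पाऊस", "निवडणूक"]) for text in kept}
print(matches)
```

Both thresholds would need tuning on a small hand-checked sample before being trusted for dataset curation.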

AI Use Cases

ASR Pre-training
News Domain ASR
Last verified: 2026-03-07