AI4Bharat Shrutilipi (Marathi)

Pseudo-labeled ASR corpus mined from All India Radio news bulletins for 12 Indian languages, with approximately 1,020 hours of Marathi speech data.

Build a Marathi news transcription service that automatically generates text from radio broadcasts for rural information dissemination.
Homepage | HuggingFace

Quick Start

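from datasets import load_dataset

# Stream the 'mr' (Marathi) configuration so the corpus is not downloaded up front.
ds = load_dataset('ai4bharat/Shrutilipi', 'mr', split='train', streaming=True)

# Print the duration and the first 80 characters of the first five transcripts.
for i, ex in enumerate(ds):
    print(f"Duration: {ex['duration']:.1f}s")
    print(f"Text: {ex['sentence'][:80]}...\n")
    if i >= 4:
        break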
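Streamed clips can vary widely in length, so a common first step is to keep only clips inside a duration window. The sketch below is one way to do that, assuming each example is a dict with the `duration` field described in the schema. The helper name `clips_within` and the 1 to 20 second window are illustrative assumptions, not part of the dataset or the `datasets` library, and the toy records are made up so the snippet runs offline.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator


def clips_within(examples: Iterable[Dict], min_s: float, max_s: float) -> Iterator[Dict]:
    """Yield only examples whose 'duration' lies in [min_s, max_s] seconds."""
    for ex in examples:
        if min_s <= ex["duration"] <= max_s:
            yield ex


# Toy records standing in for the streamed dataset (all values are made up).
toy_stream = [
    {"sentence": "clip a", "duration": 0.4},
    {"sentence": "clip b", "duration": 6.2},
    {"sentence": "clip c", "duration": 31.0},
    {"sentence": "clip d", "duration": 12.5},
]

# Keep at most five clips between 1 and 20 seconds (arbitrary thresholds).
kept = list(islice(clips_within(toy_stream, min_s=1.0, max_s=20.0), 5))
total_seconds = sum(ex["duration"] for ex in kept)
```

Because the helper is a generator, it should also work on the streaming `ds` object from the Quick Start in place of `toy_stream`, provided the examples expose `duration` as listed in the schema.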

Modality: speech+text
Size: ~1,020 hrs Marathi (6,400+ hrs total)
License: (not listed)
Format: WAV
Language: mr
Update Frequency: static
Organization: AI4Bharat

Schema

Field | Type | Description
audio | audio | Audio waveform from All India Radio news bulletin
sentence | string | Pseudo-labeled transcription of the audio
language | string | Language code (mr for Marathi)
duration | float | Duration of audio clip in seconds
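Since the transcriptions are pseudo-labeled, it can be worth sanity-checking records against the schema before training on them. The following is a minimal sketch under the assumption that examples arrive as plain dicts with the four fields listed above. The function `schema_problems` is a hypothetical helper, and the `audio` field is only checked for presence because its decoded form depends on the loader.

```python
from typing import Any, Dict, List

# Expected Python types for the non-audio fields, taken from the schema table.
EXPECTED_TYPES = {"sentence": str, "language": str, "duration": float}


def schema_problems(example: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable schema violations (empty if none)."""
    problems = []
    # 'audio' is only checked for presence, not for its internal structure.
    if "audio" not in example:
        problems.append("missing field: audio")
    for name, expected in EXPECTED_TYPES.items():
        if name not in example:
            problems.append(f"missing field: {name}")
        elif not isinstance(example[name], expected):
            problems.append(f"wrong type for {name}: {type(example[name]).__name__}")
    # A clip with zero or negative length is almost certainly a bad record.
    if isinstance(example.get("duration"), float) and example["duration"] <= 0:
        problems.append("duration must be positive")
    return problems


# Made-up records for illustration only.
good = {"audio": object(), "sentence": "उदाहरण", "language": "mr", "duration": 4.2}
bad = {"sentence": "उदाहरण", "language": "mr", "duration": "4.2"}
```

Here `schema_problems(good)` returns an empty list, while `schema_problems(bad)` reports the missing `audio` field and the string-typed `duration`.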

Build With This

Create a Marathi news summarization pipeline that transcribes radio bulletins and generates concise text summaries for mobile alerts
Develop a domain-adapted ASR model fine-tuned on formal Marathi news speech for government communication transcription
Build an accessibility tool that converts Marathi radio content into searchable text archives for hearing-impaired users
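As a starting point for the searchable-archive idea, the transcripts in the `sentence` field could be placed in a simple inverted index. The sketch below assumes whitespace tokenization with exact token matching, which ignores Marathi inflection and is only a rough baseline. The names `build_index` and `search` are hypothetical, and the three transcripts are invented examples rather than dataset content.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

# Punctuation stripped from token edges, including the Devanagari danda.
PUNCTUATION = ".,!?;:\"'()।"


def build_index(sentences: Iterable[str]) -> Dict[str, Set[int]]:
    """Map each token to the set of transcript ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, sentence in enumerate(sentences):
        for token in sentence.split():
            token = token.strip(PUNCTUATION)
            if token:
                index[token].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Return sorted ids of transcripts containing every query token."""
    tokens = [t.strip(PUNCTUATION) for t in query.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(hits)


# Invented transcripts standing in for the 'sentence' field.
transcripts = [
    "आज मुंबईत जोरदार पाऊस.",
    "पुण्यात आज सभा होणार आहे.",
    "मुंबईत नवीन रुग्णालय सुरू.",
]
index = build_index(transcripts)
```

A query such as `search(index, "आज मुंबईत")` returns only the ids of transcripts that contain both words. A production archive would likely need stemming or a proper search engine on top of this.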

AI Use Cases

ASR pre-training
News domain ASR
Speech-to-text for Marathi
Semi-supervised speech learning
Last verified: 2026-03-07