AI4Bharat Shrutilipi (Marathi)

Pseudo-labeled ASR corpus mined from All India Radio news bulletins for 12 Indian languages, with approximately 1,020 hours of Marathi speech data.

Build a Marathi news transcription service that automatically generates text from radio broadcasts for rural information dissemination.

Homepage HuggingFace

Quick Start

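from datasets import load_dataset
# Stream the Marathi subset so the full corpus is not downloaded up front.
ds = load_dataset('ai4bharat/Shrutilipi', 'mr', split='train', streaming=True)
# Print the duration and the first 80 characters of the first five transcripts.
for i, ex in enumerate(ds):
    if i >= 5:
        break
    print(f"Duration: {ex['duration']:.1f}s")
    print(f"Text: {ex['sentence'][:80]}...\n")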

Modality

speech+text

Size

~1,020 hrs Marathi (6,400+ hrs total)

License

CC-BY-4.0

Format

WAV

Language

Marathi (mr)

Update Frequency

static

Organization

AI4Bharat

Schema

Field	Type	Description
audio	audio	Audio waveform from All India Radio news bulletin
sentence	string	Pseudo-labeled transcription of the audio
language	string	Language code (mr for Marathi)
duration	float	Duration of audio clip in seconds
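
The fields above are enough for a quick sanity check on a sample before training. The following is a minimal sketch, assuming each record is a dict that carries the `duration` and `sentence` fields listed in the table. The helper name `summarize_clips` and the default duration bounds are illustrative choices, not part of the dataset or its tooling.

```python
def summarize_clips(records, min_s=1.0, max_s=30.0):
    """Count clips with a non-empty transcript and a duration in [min_s, max_s].

    `records` is any iterable of dicts with `duration` (seconds) and
    `sentence` (transcript) keys, as described in the schema table.
    The bounds are hypothetical defaults chosen for illustration.
    """
    kept = 0
    total_seconds = 0.0
    for rec in records:
        duration = float(rec["duration"])
        text = rec["sentence"].strip()
        # Skip clips with blank transcripts or out-of-range durations.
        if text and min_s <= duration <= max_s:
            kept += 1
            total_seconds += duration
    return {"clips": kept, "hours": total_seconds / 3600.0}
```

Because the function accepts any iterable, it could be applied to a slice of a streamed dataset, for example `summarize_clips(itertools.islice(ds, 1000))`, without loading the full corpus.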

Build With This

Create a Marathi news summarization pipeline that transcribes radio bulletins and generates concise text summaries for mobile alerts

Develop a domain-adapted ASR model fine-tuned on formal Marathi news speech for government communication transcription

Build an accessibility tool that converts Marathi radio content into searchable text archives for hearing-impaired users
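
The searchable-archive idea can be sketched with a small inverted index over transcripts. This is an illustration only: the clip ids are hypothetical (the schema above has no id field), tokenization is a plain whitespace split, and a real archive would likely need Marathi-aware normalization and punctuation handling.

```python
from collections import defaultdict


def build_index(transcripts):
    """Map each whitespace-separated token to the set of clip ids containing it.

    `transcripts` is a dict of {clip_id: transcript_text}; the ids are
    hypothetical and would have to be assigned by the archive itself.
    """
    index = defaultdict(set)
    for clip_id, text in transcripts.items():
        for token in text.split():
            index[token].add(clip_id)
    return index


def search(index, query):
    """Return sorted ids of clips whose transcripts contain every query token."""
    tokens = query.split()
    if not tokens:
        return []
    matches = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        # Intersect so that only clips containing all tokens remain.
        matches &= index.get(token, set())
    return sorted(matches)
```

A query such as `search(build_index(transcripts), "आज पाऊस")` would return only the clips whose transcripts contain both words as exact tokens.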

AI Use Cases

ASR pre-training

News domain ASR

Speech-to-text for Marathi

Semi-supervised speech learning
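
ASR work on a corpus like this is commonly scored with word error rate (WER). Below is a minimal sketch of WER as word-level edit distance divided by reference length. The function name is illustrative, and tokenization is a plain whitespace split with no text normalization. Because the transcripts here are pseudo-labels, a WER computed against them measures agreement with a noisy reference rather than with human ground truth.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many more words than the reference.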

Related Datasets

AI4Bharat BhasaAnuvaad (Marathi)

Speech + Text (Translation)

AI4Bharat IndicVoices

speech+text

AI4Bharat IndicVoices-R

Speech + Text (TTS-ready)

AI4Bharat Kathbath

Speech + Text

Last verified: 2026-03-07