Mozilla Common Voice (Marathi)

MH Specific

Crowd-sourced read-speech recordings with validated transcriptions for Marathi

Build a community-driven Marathi speech recognition model using validated Common Voice recordings.

Quick Start

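from datasets import load_dataset

# Stream the Marathi ('mr') train split so nothing is downloaded up front.
ds = load_dataset('mozilla-foundation/common_voice_17_0', 'mr', split='train', streaming=True)

# Print the transcript and validation votes for the first five examples.
for i, ex in enumerate(ds):
    print(f"Text: {ex['sentence']}, Votes: +{ex['up_votes']}/-{ex['down_votes']}")
    if i >= 4:
        break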

Modality

Speech + Text

Size

~30 hrs total, ~21 hrs validated

License

CC0 1.0 (public domain dedication)

Format

MP3 (Common Voice distributes its clips as MP3 files)

Language

mr

Update Frequency

static

Organization

Mozilla Foundation

Schema

Field	Type	Description
audio	audio	Crowd-sourced speech recording
sentence	string	Reference text that was read
up_votes	int	Validation up-votes
down_votes	int	Validation down-votes

Build With This

Create a Marathi speech data quality pipeline that filters Common Voice by validation scores for cleaner training

Develop a Marathi pronunciation dictionary from Common Voice audio-text alignments

Build a Marathi voice interface SDK for app developers with pre-trained models on Common Voice data
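
The first idea above, filtering by validation scores, can be sketched in a few lines. This is a minimal illustration, not an official Common Voice tool: the function name filter_by_votes and the thresholds min_up_votes and max_down_votes are assumptions chosen for the example, and the sample records are made-up placeholders that only mirror the up_votes and down_votes fields from the schema.

```python
from typing import Any, Dict, Iterable, Iterator


def filter_by_votes(
    examples: Iterable[Dict[str, Any]],
    min_up_votes: int = 2,      # hypothetical threshold, tune for your data
    max_down_votes: int = 0,    # hypothetical threshold, tune for your data
) -> Iterator[Dict[str, Any]]:
    """Yield only examples whose validation votes meet both thresholds."""
    for ex in examples:
        if ex["up_votes"] >= min_up_votes and ex["down_votes"] <= max_down_votes:
            yield ex


# Placeholder records shaped like the schema (no audio, invented values).
sample = [
    {"sentence": "sentence one", "up_votes": 2, "down_votes": 0},
    {"sentence": "sentence two", "up_votes": 1, "down_votes": 1},
    {"sentence": "sentence three", "up_votes": 3, "down_votes": 1},
]

kept = [ex["sentence"] for ex in filter_by_votes(sample)]
print(kept)
```

Because the function accepts any iterable of dict-like records, it should also work lazily on the streamed ds object from the Quick Start, for example filter_by_votes(ds). Stricter thresholds trade dataset size for cleaner transcripts, which matters with only about 21 validated hours.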

AI Use Cases

ASR, Speaker ID, Pronunciation Modeling

Related Datasets

AI4Bharat BhasaAnuvaad (Marathi)

Speech + Text (Translation)

AI4Bharat IndicVoices

AI4Bharat IndicVoices-R

Speech + Text (TTS-ready)

AI4Bharat Kathbath

Last verified: 2026-03-07