Mozilla Common Voice (Marathi)

MH Specific

Crowd-sourced read-speech recordings with validated transcriptions for Marathi

Build a community-driven Marathi speech recognition model using validated Common Voice recordings.
Homepage | HuggingFace

Quick Start

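from datasets import load_dataset

# Stream the Marathi ('mr') train split instead of downloading the whole dataset.
ds = load_dataset('mozilla-foundation/common_voice_17_0', 'mr', split='train', streaming=True)

# Print the prompt text and validation votes for the first five examples.
for i, ex in enumerate(ds):
    print(f"Text: {ex['sentence']}, Votes: +{ex['up_votes']}/-{ex['down_votes']}")
    if i >= 4:
        break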

Modality: Speech + Text
Size: ~30 hrs total, ~21 hrs validated
License: not specified
Format: WAV
Language: mr
Update Frequency: static
Organization: Mozilla Foundation

Schema

Field       Type    Description
audio       audio   Crowd-sourced speech recording
sentence    string  Reference text that was read
up_votes    int     Validation up-votes
down_votes  int     Validation down-votes

Build With This

Create a Marathi speech data quality pipeline that filters Common Voice by validation scores for cleaner training
Develop a Marathi pronunciation dictionary from Common Voice audio-text alignments
Build a Marathi voice interface SDK for app developers with pre-trained models on Common Voice data
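
The first idea above, filtering by validation scores, can be sketched in a few lines. This is a minimal illustration rather than a reference pipeline: the helper names (vote_margin, filter_by_votes), the thresholds (at least 2 up-votes, no down-votes), and the sample records are all assumptions made for the example. Only the up_votes and down_votes field names come from the schema.

```python
from typing import Iterable, Iterator, Mapping


def vote_margin(example: Mapping) -> int:
    """Net validation score: up-votes minus down-votes."""
    return int(example["up_votes"]) - int(example["down_votes"])


def filter_by_votes(
    examples: Iterable[Mapping],
    min_up: int = 2,    # assumed threshold, tune for your use case
    max_down: int = 0,  # assumed threshold, tune for your use case
) -> Iterator[Mapping]:
    """Yield only examples whose votes meet both thresholds."""
    for ex in examples:
        if ex["up_votes"] >= min_up and ex["down_votes"] <= max_down:
            yield ex


# Hypothetical records shaped like the schema (audio omitted for brevity).
sample = [
    {"sentence": "clip A", "up_votes": 2, "down_votes": 0},
    {"sentence": "clip B", "up_votes": 2, "down_votes": 1},
    {"sentence": "clip C", "up_votes": 1, "down_votes": 0},
    {"sentence": "clip D", "up_votes": 3, "down_votes": 0},
]

kept = list(filter_by_votes(sample))
for ex in kept:
    print(ex["sentence"], vote_margin(ex))
```

Because filter_by_votes accepts any iterable of dict-like records, it should also work when given the streaming dataset from Quick Start directly, since that dataset yields one example at a time.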

AI Use Cases

ASR, Speaker ID, Pronunciation Modeling
Last verified: 2026-03-07