Mozilla Common Voice (Marathi)
MH Specific
Crowd-sourced read-speech recordings with validated transcriptions for Marathi
Build a community-driven Marathi speech recognition model using validated Common Voice recordings.
Quick Start
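from datasets import load_dataset
# Stream the Marathi ('mr') train split instead of downloading it in full
ds = load_dataset('mozilla-foundation/common_voice_17_0', 'mr', split='train', streaming=True)
# Print the transcript and validation votes for the first five examples
for i, ex in enumerate(ds):
    print(f"Text: {ex['sentence']}, Votes: +{ex['up_votes']}/-{ex['down_votes']}")
    if i >= 4:
        break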
Size
~30 hrs total, ~21 hrs validated
Organization
Mozilla Foundation
Schema
| Field | Type | Description |
|---|---|---|
| audio | audio | Crowd-sourced speech recording |
| sentence | string | Reference text that was read |
| up_votes | int | Validation up-votes |
| down_votes | int | Validation down-votes |
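To make the schema concrete, here is a minimal sketch of measuring a clip's length from the `audio` field. It assumes the decoded audio behaves like a mapping with `array` (the samples) and `sampling_rate` keys, which is how many versions of the `datasets` library expose audio; other versions may return a different decoder object. The helper name `clip_duration_seconds` and the record below are hypothetical and synthetic, not taken from the dataset.

```python
from typing import Any, Mapping


def clip_duration_seconds(audio: Mapping[str, Any]) -> float:
    """Return the clip length in seconds.

    Assumption: `audio` has an 'array' of samples and an integer
    'sampling_rate' in Hz. This layout is not guaranteed by the card.
    """
    rate = audio["sampling_rate"]
    if rate <= 0:
        raise ValueError("sampling_rate must be positive")
    return len(audio["array"]) / rate


# Synthetic record shaped like the schema table (not real dataset content).
record = {
    "audio": {"array": [0.0] * 48000, "sampling_rate": 48000},
    "sentence": "placeholder sentence",
    "up_votes": 2,
    "down_votes": 0,
}

print(clip_duration_seconds(record["audio"]))  # 1.0
```

Summing this value over a split is one way to sanity-check the hours reported under Size, subject to the same assumption about the audio layout.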
Build With This
Create a Marathi speech data quality pipeline that filters Common Voice by validation scores for cleaner training
Develop a Marathi pronunciation dictionary from Common Voice audio-text alignments
Build a Marathi voice interface SDK for app developers with pre-trained models on Common Voice data
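The first idea above, filtering by validation scores, can be sketched with the `up_votes` and `down_votes` fields from the schema. The function names `is_clean` and `filter_by_votes` and the default thresholds are illustrative assumptions, not Common Voice's official validation rule, and the sample records are synthetic.

```python
from typing import Any, Dict, Iterable, Iterator

Example = Dict[str, Any]


def is_clean(example: Example, min_up_votes: int = 2, max_down_votes: int = 0) -> bool:
    """Keep a clip only if it has enough up-votes and few enough down-votes.

    The thresholds are hypothetical defaults chosen for illustration.
    """
    return (
        example["up_votes"] >= min_up_votes
        and example["down_votes"] <= max_down_votes
    )


def filter_by_votes(
    examples: Iterable[Example],
    min_up_votes: int = 2,
    max_down_votes: int = 0,
) -> Iterator[Example]:
    """Lazily yield the examples that pass the vote thresholds."""
    for example in examples:
        if is_clean(example, min_up_votes, max_down_votes):
            yield example


# Synthetic records for demonstration only.
sample = [
    {"sentence": "clip A", "up_votes": 3, "down_votes": 0},
    {"sentence": "clip B", "up_votes": 2, "down_votes": 1},
    {"sentence": "clip C", "up_votes": 1, "down_votes": 0},
]

kept = [ex["sentence"] for ex in filter_by_votes(sample)]
print(kept)  # ['clip A']
```

Because the generator is lazy, it can wrap the streaming `ds` from Quick Start directly, for example `filter_by_votes(ds)`. Loosening `max_down_votes` trades label quality for more training hours, so the right thresholds depend on how much data the model needs.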
AI Use Cases
ASR, Speaker ID, Pronunciation Modeling
Last verified: 2026-03-07