Crowd-sourced read-speech recordings of Marathi with community-validated transcriptions. The corpus contains approximately 30 hours of audio in total, of which about 21 hours are validated. It is part of Mozilla's Common Voice open voice dataset initiative.
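
```python
from datasets import load_dataset

# Load the Marathi ("mr") training split of Common Voice 17.0
ds = load_dataset("mozilla-foundation/common_voice_17_0", "mr", split="train")

# Inspect the first example: its transcription and audio clip path
sample = ds[0]
print(sample["sentence"], sample["path"])
```

| Field | Type | Description |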
|---|---|---|
| client_id | string | Unique hashed speaker identifier |
| path | string | Relative path to the audio clip (MP3/WAV) |
| sentence | string | Transcribed Marathi text for the audio clip |
| up_votes | integer | Number of listener votes confirming the clip as correct |
| down_votes | integer | Number of listener votes marking the clip as incorrect |
| age | string | Self-reported age bracket of the speaker |
| gender | string | Self-reported gender of the speaker |
| accent | string | Self-reported accent or dialect |
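
The `up_votes` and `down_votes` fields can be used to select clips that listeners agreed on. The following is a minimal sketch, not the dataset's official validation rule: the function name `filter_by_votes`, the `min_up_votes` threshold, and the sample rows are all hypothetical, chosen only to illustrate the fields in the table above. It works on plain dictionaries, so it runs without downloading any audio.

```python
from typing import Dict, Iterable, List


def filter_by_votes(rows: Iterable[Dict], min_up_votes: int = 2) -> List[Dict]:
    """Keep rows with at least `min_up_votes` up-votes and more up- than down-votes.

    The threshold is an illustrative assumption, not a documented rule.
    """
    kept = []
    for row in rows:
        enough_support = row["up_votes"] >= min_up_votes
        majority_agrees = row["up_votes"] > row["down_votes"]
        if enough_support and majority_agrees:
            kept.append(row)
    return kept


# Made-up rows that mimic the schema; paths and sentences are placeholders.
rows = [
    {"path": "clip_a.mp3", "sentence": "placeholder sentence 1", "up_votes": 2, "down_votes": 0},
    {"path": "clip_b.mp3", "sentence": "placeholder sentence 2", "up_votes": 1, "down_votes": 0},
    {"path": "clip_c.mp3", "sentence": "placeholder sentence 3", "up_votes": 2, "down_votes": 3},
]

accepted = filter_by_votes(rows)
print([row["path"] for row in accepted])
```

With the `datasets` library, the same predicate could be applied to a loaded split via `ds.filter(...)`, which accepts a function that takes one example and returns a boolean.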