Multilingual (incl. Marathi) — Human-generated, human-annotated assistant-style conversation corpus in 35 languages including Marathi conversation trees with quality ratings
from datasets import load_dataset
ds = load_dataset('OpenAssistant/oasst2')
print(f'Total messages: {len(ds["train"])}')
mr = [ex for ex in ds['train'] if ex.get('lang') == 'mr']
print(f'Marathi messages: {len(mr)}')| Field | Type | Description |
|---|---|---|
| text | string | Conversation message text |
| role | string | Message role (prompter, assistant) |
| lang | string | Language code |
| rank | int | Quality rank from human evaluation |