Evaluation benchmarks, NLP toolkits, dialect resources, and fairness datasets for Marathi.
15 datasets
728 stereotypes, each paired with contrasting statements, available in parallel across 16 languages including Marathi. Annotated with regional and demographic features for evaluating LLM bias. The only bias/fairness evaluation dataset available in Marathi, which makes it critical for responsible AI development.
Marathi portion of a human-translated machine translation evaluation benchmark covering 200+ languages, with 3,001 sentences drawn from diverse web articles
Collection of 50+ freely licensed Devanagari typefaces from Google Fonts, including Tiro Devanagari Marathi (designed specifically for Marathi typographic conventions), Noto Sans/Serif Devanagari (comprehensive Unicode coverage), and Anek Devanagari (variable-weight font), among dozens of others. Combined with 283+ fonts from DevanagariFonts.net, these provide the font diversity needed to generate synthetic OCR training data, which helps OCR models generalize across the wide range of fonts used in Marathi printing.
Python library for Indian language text processing including tokenisation, normalisation, script conversion, and transliteration with full support for Devanagari/Marathi
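The script conversion mentioned above can be illustrated with a minimal sketch. Unicode lays out the major Indic scripts in parallel 128-code-point blocks, so a rough Devanagari-to-Gujarati conversion can be approximated by shifting code points. The function name `devanagari_to_gujarati` is hypothetical and this is not the library's own API; a production converter must also handle characters that have no counterpart in the target block.

```python
# Sketch only: offset-based script conversion between parallel Unicode blocks.
# Assumption: a plain code point shift is acceptable for illustration; some
# Devanagari code points have no assigned Gujarati equivalent.

DEVANAGARI_START = 0x0900
DEVANAGARI_END = 0x097F
GUJARATI_START = 0x0A80

# The danda and double danda are shared punctuation across Indic scripts,
# so they are left unchanged instead of being shifted.
SHARED_PUNCTUATION = {0x0964, 0x0965}


def devanagari_to_gujarati(text: str) -> str:
    """Shift Devanagari code points into the Gujarati block."""
    offset = GUJARATI_START - DEVANAGARI_START
    converted = []
    for ch in text:
        cp = ord(ch)
        in_block = DEVANAGARI_START <= cp <= DEVANAGARI_END
        if in_block and cp not in SHARED_PUNCTUATION:
            converted.append(chr(cp + offset))
        else:
            # Leave spaces, digits, Latin text, and shared punctuation as-is.
            converted.append(ch)
    return "".join(converted)


if __name__ == "__main__":
    print(devanagari_to_gujarati("मराठी"))
```

Characters outside the Devanagari block pass through untouched, so mixed-script input stays readable.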
Marathi subset of a natural language understanding benchmark for 11 Indian languages, covering tasks such as news categorisation, headline prediction, and paraphrase detection
Natural Language Inference (NLI) dataset for 11 Indic languages including Marathi, created by high-quality machine translation of the English XNLI dataset. Contains premise-hypothesis pairs with entailment, contradiction, and neutral labels for evaluating Marathi language understanding.
Marathi portion of a comprehensive NLU benchmark of 9 tasks across 20 Indian languages, covering classification, structure prediction, QA, and sentence retrieval
Deep learning-based NLP library supporting Marathi with pre-trained language models, text generation, tokenisation, sentence embeddings, and data augmentation
Comprehensive Marathi NLP library including the MahaBERT, MahaALBERT, and MahaRoBERTa language models, MahaFT word embeddings, and tools for tokenisation, sentiment analysis, NER, and hate speech detection
Evaluation results and benchmark scores for MahaBERT (L3Cube) and IndicBERT (AI4Bharat) models on Marathi NLU tasks including sentiment, NER, and text classification
Regional dialect data and linguistic documentation for major Marathi dialect varieties including Varhadi (Vidarbha), Malvani (Konkan coast), and Deshi (Western Maharashtra)
Standardized synthetic OCR benchmark dataset containing 90,000 images with ground truth across 23 Indic languages including Marathi. Designed for evaluating and comparing OCR system performance across Indian scripts in a controlled setting. Uses synthetic rendering with known ground truth, so evaluation metrics are not distorted by annotation errors. Includes varied font sizes, styles, and text lengths. Useful for benchmarking Marathi OCR models against results for other Indic languages and for tracking progress toward the state of the art.
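OCR output is commonly scored against ground truth using character error rate (CER). The following is a minimal sketch of that metric, not this benchmark's official scoring script; the function names are hypothetical, and the choice to apply NFC normalisation and to count Unicode code points (so a Devanagari vowel sign counts as its own character) is an assumption.

```python
import unicodedata


def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance over code points, using two rolling rows."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion from the reference
                    current[j - 1] + 1,  # insertion into the reference
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the reference length, after NFC normalisation."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical OCR output that dropped the final vowel sign.
    print(character_error_rate("मराठी", "मराठ"))
```

Because CER is normalised by reference length, it can exceed 1.0 when the hypothesis is much longer than the ground truth.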
Open-source 7B-parameter vision-language model trained specifically for Indian document understanding, from government forms to handwritten pages. Handles the varied structure of scanned and photographed Indian documents, including Devanagari text. Achieves strong scores on DocVQA (0.855), VisualMRC (0.851), and the custom Patram-Bench. Can be fine-tuned for specific Marathi document types (7/12 extracts, certificates, forms). Presented as the current state of the art in open-source Indian document AI.
Translated MMLU (Massive Multitask Language Understanding) benchmark in 10 Indian languages including Marathi. Contains multiple-choice questions spanning science, humanities, social sciences, and more. Standard benchmark for measuring how an LLM's performance on Marathi questions compares with its performance on the English originals.
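Multiple-choice benchmarks of this kind are usually scored by accuracy, often broken down by subject. Below is a minimal sketch of that scoring, assuming each record is a hypothetical `(subject, gold_answer, predicted_answer)` tuple; it does not reflect this dataset's actual file format or any official evaluation harness.

```python
from collections import defaultdict


def accuracy_by_subject(records):
    """Return {subject: accuracy} for (subject, gold, predicted) records."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for subject, gold, predicted in records:
        totals[subject] += 1
        if gold == predicted:
            correct[subject] += 1
    return {subject: correct[subject] / totals[subject] for subject in totals}


def overall_accuracy(records):
    """Return the fraction of records whose prediction matches the gold answer."""
    if not records:
        raise ValueError("at least one record is required")
    hits = sum(1 for _, gold, predicted in records if gold == predicted)
    return hits / len(records)


if __name__ == "__main__":
    # Hypothetical model predictions on three Marathi questions.
    marathi_records = [
        ("science", "A", "A"),
        ("science", "B", "C"),
        ("humanities", "D", "D"),
    ]
    print(accuracy_by_subject(marathi_records))
    print(overall_accuracy(marathi_records))
```

Running the same scoring on the English originals and on the Marathi translations gives a per-subject view of where a model loses accuracy.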
Open-source synthetic OCR dataset generation tool validated on Devanagari and other low-resource scripts. Generates unlimited synthetic training images by rendering text from Unicode corpora using diverse TTF/OTF fonts with configurable degradation effects (blur, noise, skew, low DPI, ink bleed, paper texture). Combined with Marathi text corpora and 283+ Devanagari fonts, this tool can generate millions of labeled training images for printed Marathi OCR without manual annotation. Essential for data augmentation in low-resource OCR scenarios.
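The pairing of text, fonts, and degradation settings described above can be sketched as a render manifest. This is an illustration only and not the tool's own interface: `RenderJob`, `build_manifest`, the parameter ranges, and the font file names are all hypothetical, and the actual image rendering step is left out.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RenderJob:
    """One synthetic sample to render: a text line, a font, and degradation settings."""
    text: str
    font_path: str
    blur_radius: float
    skew_degrees: float
    dpi: int


def build_manifest(lines, font_paths, n_samples, seed=0):
    """Sample text/font/degradation combinations reproducibly."""
    if not lines or not font_paths:
        raise ValueError("need at least one text line and one font")
    rng = random.Random(seed)  # fixed seed so the manifest can be regenerated
    jobs = []
    for _ in range(n_samples):
        jobs.append(
            RenderJob(
                text=rng.choice(lines),
                font_path=rng.choice(font_paths),
                blur_radius=round(rng.uniform(0.0, 2.0), 2),  # assumed range
                skew_degrees=round(rng.uniform(-3.0, 3.0), 2),  # assumed range
                dpi=rng.choice([72, 150, 300]),  # assumed DPI options
            )
        )
    return jobs


if __name__ == "__main__":
    corpus = ["मराठी भाषा", "महाराष्ट्र राज्य"]
    fonts = ["fonts/ExampleSerif.ttf", "fonts/ExampleSans.ttf"]  # placeholder paths
    for job in build_manifest(corpus, fonts, n_samples=3):
        print(job)
```

Since each job's text is known in advance, the ground-truth label comes for free once the image is rendered.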