Benchmarks, Tools & Dialects

Evaluation benchmarks, NLP toolkits, dialect resources, and fairness datasets for Marathi.

15 datasets

728 stereotypes with contrasts, provided in parallel across 16 languages including Marathi. Annotated with regional and demographic features for evaluating LLM bias. Described as the only bias/fairness evaluation dataset available for Marathi, which makes it important for responsible AI development.

Build a fairness auditing tool for Marathi NLP models that measures bias across caste, religion, and gender dimensions.
Text
LanguageShades
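
One way to turn stereotype/contrast pairs into an audit metric is to check how often a model scores the stereotyped sentence as more likely than its contrast. The sketch below is only an illustration: the `score` callable (for example, a sentence log-likelihood from the model under test) and the record fields `stereotype`, `contrast`, and `category` are hypothetical and are not taken from the dataset's actual schema.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Mapping


def stereotype_preference_rate(
    pairs: Iterable[Mapping[str, str]],
    score: Callable[[str], float],
) -> Dict[str, float]:
    """Return, per category, the fraction of pairs where the model scores
    the stereotyped sentence strictly higher than its contrast.

    A rate near 0.5 suggests no systematic preference; a rate near 1.0
    suggests the model favours the stereotyped sentences.
    """
    preferred = defaultdict(int)
    total = defaultdict(int)
    for pair in pairs:
        category = pair["category"]
        total[category] += 1
        if score(pair["stereotype"]) > score(pair["contrast"]):
            preferred[category] += 1
    return {category: preferred[category] / total[category] for category in total}


# Toy placeholder data; real pairs would be loaded from the dataset.
toy_pairs = [
    {"category": "gender", "stereotype": "S1", "contrast": "C1"},
    {"category": "gender", "stereotype": "S2", "contrast": "C2"},
    {"category": "religion", "stereotype": "S3", "contrast": "C3"},
]
# Hypothetical scores standing in for model log-likelihoods.
toy_scores = {"S1": -1.0, "C1": -2.0, "S2": -3.0, "C2": -2.5, "S3": -1.0, "C3": -4.0}
rates = stereotype_preference_rate(toy_pairs, toy_scores.__getitem__)
```

Reporting the rate per category (caste, religion, gender, and so on) keeps a strong bias in one dimension from being averaged away by the others.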

FLORES-200 (Marathi): Human-translated evaluation benchmark for machine translation covering 200+ languages including Marathi, with 3,001 sentences from diverse web articles

Benchmark Marathi machine translation quality against 200 languages using the FLORES-200 evaluation set.
Text (parallel, Marathi)
Meta AI
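
Translation quality on a set like this is often reported with character n-gram metrics, which suit morphologically rich languages such as Marathi. The following is a simplified, chrF-style sketch in plain Python; it is an assumption-laden illustration and will not reproduce the exact numbers of an established implementation. The function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams (Unicode code points) after removing whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: average n-gram precision and recall over orders
    1..max_n, then combine them into an F-beta score in [0, 1]."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

With `beta = 2.0` recall is weighted more heavily than precision, which penalises translations that drop content. For published comparisons, an established metric implementation is the safer choice.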

Collection of 50+ freely licensed Devanagari typefaces from Google Fonts including Tiro Devanagari Marathi (designed specifically for Marathi typographic conventions), Noto Sans/Serif Devanagari (comprehensive Unicode coverage), Anek Devanagari (variable weight font), and dozens more. Combined with 283+ fonts from DevanagariFonts.net, these provide the font diversity needed for synthetic OCR training data generation. Critical for training OCR models that generalize across the wide range of fonts used in Marathi printing.

Build a Devanagari font sampler that renders Marathi text across all available fonts for visual comparison and OCR testing.
Font files (TTF/OTF)
Google / SIL International / Community

Python library for Indian language text processing including tokenisation, normalisation, script conversion, and transliteration with full support for Devanagari/Marathi

Build a Marathi text preprocessing pipeline using Indic NLP Library for tokenization, normalization, and script conversion.
Tools (Python)
AI4Bharat / Anuvaad
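
To make the shape of such a preprocessing pipeline concrete, here is a standard-library-only sketch of the normalise-then-tokenise steps. It deliberately does not call the Indic NLP Library, so it is a stand-in for illustration rather than that library's API; the function names and the tokenisation rule are assumptions.

```python
import re
import unicodedata
from typing import List

# Zero-width space and byte-order mark. ZWJ/ZWNJ are kept on purpose,
# because they can change how Devanagari conjuncts are rendered.
_INVISIBLE = re.compile("[\u200b\ufeff]")
_WHITESPACE = re.compile(r"\s+")
# A token is a run of Devanagari letters, signs, or digits (danda marks
# U+0964/U+0965 excluded), a run of ASCII letters or digits, or a single
# other non-space character such as punctuation.
_TOKEN = re.compile("[\u0900-\u0963\u0966-\u097f\u200c\u200d]+|[A-Za-z0-9]+|\\S")


def normalize_marathi(text: str) -> str:
    """Apply Unicode NFC, drop invisible characters, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = _INVISIBLE.sub("", text)
    return _WHITESPACE.sub(" ", text).strip()


def tokenize_marathi(text: str) -> List[str]:
    """Normalise the text, then split it into word and punctuation tokens."""
    return _TOKEN.findall(normalize_marathi(text))
```

A dedicated library handles many script-specific cases (nukta variants, visually identical code point sequences) that this sketch ignores.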

IndicGLUE (Marathi subset): Natural language understanding benchmark for 11 Indian languages including Marathi, covering tasks like news categorisation, headline prediction, and paraphrase detection

Run comprehensive NLU benchmarks on Marathi models using IndicGLUE to identify areas needing improvement.
Text (Marathi)
AI4Bharat, IIT Madras
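
For classification tasks such as news categorisation, macro-averaged F1 is a common way to see which classes a model handles poorly, since every class counts equally regardless of its size. A minimal sketch, assuming gold and predicted labels are already available as parallel lists (the function name is hypothetical):

```python
from typing import Hashable, Sequence


def macro_f1(gold: Sequence[Hashable], predicted: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    labels = set(gold) | set(predicted)
    if not labels:
        return 0.0
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        # F1 = 2TP / (2TP + FP + FN); the denominator is never zero here
        # because each label occurs at least once in gold or predicted.
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(f1_scores) / len(f1_scores)
```

Comparing macro F1 with plain accuracy on the same predictions is a quick way to spot a model that does well only on the most frequent categories.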

Natural Language Inference (NLI) dataset for 11 Indic languages including Marathi, created by high-quality machine translation of the English XNLI dataset. Contains premise-hypothesis pairs with entailment, contradiction, and neutral labels for evaluating Marathi language understanding.

Build a Marathi fact-checking assistant that uses natural language inference to verify claims against known facts.
Text
AI4Bharat
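
In a fact-checking setting, a claim is typically compared against several known facts, giving one NLI label per (fact, claim) pair. The sketch below shows one possible way to fold those labels into a single verdict. The precedence rule (any contradiction wins over entailment) and the verdict strings are assumptions chosen for illustration, not part of the dataset.

```python
from typing import Iterable


def aggregate_verdict(nli_labels: Iterable[str]) -> str:
    """Combine per-fact NLI labels for one claim into a single verdict.

    Expects the labels "entailment", "contradiction", and "neutral".
    A single contradiction marks the claim as refuted; otherwise a single
    entailment marks it as supported; otherwise evidence is insufficient.
    """
    labels = list(nli_labels)
    if "contradiction" in labels:
        return "refuted"
    if "entailment" in labels:
        return "supported"
    return "not enough evidence"
```

A production system would likely also weigh model confidence and the reliability of each fact source rather than relying on hard labels alone.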

IndicXTREME (Marathi): Comprehensive NLU benchmark of 9 tasks across 20 Indian languages including Marathi, covering classification, structure prediction, QA, and sentence retrieval

Evaluate Marathi language models on IndicXTREME's diverse task suite for comprehensive performance assessment.
Text (Marathi)
AI4Bharat, IIT Madras

Deep learning-based NLP library supporting Marathi with pre-trained language models, text generation, tokenisation, sentence embeddings, and data augmentation

Build a Marathi NLP application using iNLTK's pre-trained models for text generation and classification.
Tools (Python)
iNLTK Community

Comprehensive Marathi NLP library including the MahaBERT, MahaALBERT, and MahaRoBERTa language models, MahaFT word embeddings, and tools for tokenisation, sentiment analysis, NER, and hate speech detection

Build an end-to-end Marathi NLP pipeline using L3Cube models for text classification, NER, and sentiment analysis.
Models, Tools (Python)
L3Cube, Pune

Evaluation results and benchmark scores for MahaBERT (L3Cube) and IndicBERT (AI4Bharat) models on Marathi NLU tasks including sentiment analysis, NER, and text classification

Build a Marathi model comparison framework using MahaBERT/IndicBERT benchmarks to guide model selection.
Benchmarks (tables)
L3Cube, Pune

Regional dialect data and linguistic documentation for major Marathi dialect varieties including Varhadi (Vidarbha), Malvani (Konkan coast), and Deshi (Western Maharashtra)

Build a Marathi dialect identification system that classifies text by regional dialect for sociolinguistic research.
Text (Marathi dialects)
Various Research Institutions

Standardized synthetic OCR benchmark dataset containing 90,000 images with ground truth across 23 Indic languages including Marathi. Designed for evaluating and comparing OCR system performance across Indian scripts in a controlled setting. Uses synthetic rendering with known ground truth to provide clean evaluation metrics. Includes varied font sizes, styles, and text lengths. Useful for benchmarking Marathi OCR models against results for other Indic languages and for tracking progress toward the state of the art.

Benchmark your Marathi OCR model against standardized synthetic test sets to measure progress and compare with published results.
Image (synthetic text renders with ground truth)
Research community
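
OCR output is usually compared with ground truth through the character error rate (CER): the edit distance between prediction and reference, divided by the reference length. A minimal sketch follows; it counts Unicode code points, so a Devanagari consonant and its vowel sign count as two characters, which is an assumption that may differ from how published results were computed.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over Unicode code points (two-row dynamic programme)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b), # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(prediction: str, ground_truth: str) -> float:
    """Edit distance divided by ground-truth length; can exceed 1.0."""
    if not ground_truth:
        raise ValueError("ground truth must be non-empty")
    return edit_distance(prediction, ground_truth) / len(ground_truth)
```

For example, a prediction that drops one vowel sign from a seven-code-point word has a CER of 1/7.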

Open-source 7B-parameter vision-language model trained specifically for Indian document understanding, from government forms to handwritten pages. Handles the varied structure of scanned and photographed Indian documents, including Devanagari text. Reports scores of 0.855 on DocVQA and 0.851 on VisualMRC, and is also evaluated on the custom Patram-Bench. Can be fine-tuned for specific Marathi document types (7/12 land-record extracts, certificates, forms). Presented as the current state of the art in open-source Indian document AI.

Fine-tune Patram-7B on Maharashtra government documents (7/12 extracts, certificates) for automated field extraction.
Model (Vision-Language Model for document understanding)
BharatGenAI

Translated MMLU (Massive Multitask Language Understanding) benchmark in 10 Indian languages including Marathi. Contains multiple-choice questions spanning science, humanities, social sciences, and more. Standard benchmark for evaluating how well Marathi LLMs compare to English ones.

Benchmark Marathi language models on MMLU-Indic to measure knowledge and reasoning capabilities in Marathi.
Text
Sarvam AI
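
Multiple-choice benchmarks of this kind are usually scored as accuracy, and a per-subject breakdown shows where a model's Marathi knowledge is weakest. The sketch below assumes each record carries hypothetical `subject`, `answer`, and `predicted` fields holding option letters; the benchmark's real field names may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def accuracy_by_subject(records: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Return the fraction of correctly answered questions per subject.

    Option letters are compared case-insensitively after trimming spaces.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        subject = record["subject"]
        total[subject] += 1
        if record["predicted"].strip().upper() == record["answer"].strip().upper():
            correct[subject] += 1
    return {subject: correct[subject] / total[subject] for subject in total}
```

Running the same function over the English originals gives a like-for-like comparison of a model's Marathi and English performance.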

Open-source synthetic OCR dataset generation tool validated on Devanagari and other low-resource scripts. Generates unlimited synthetic training images by rendering text from Unicode corpora using diverse TTF/OTF fonts with configurable degradation effects (blur, noise, skew, low DPI, ink bleed, paper texture). Combined with Marathi text corpora and 283+ Devanagari fonts, this tool can generate millions of labeled training images for printed Marathi OCR without manual annotation. Essential for data augmentation in low-resource OCR scenarios.

Generate 1M+ synthetic Marathi word images using L3Cube MahaCorpus text and 283+ Devanagari fonts.
Tool (synthetic image generator)
Research community
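
Before rendering anything, a generation run can be planned as a reproducible list of jobs, each pairing a word with a font and a set of degradation parameters. The sketch below covers only that planning step with the standard library; it is not the tool's own interface. The `RenderJob` fields and the parameter ranges are assumptions for illustration, and the actual rendering would need an imaging library.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class RenderJob:
    """One synthetic image to render (hypothetical structure)."""
    text: str
    font_path: str
    blur_radius: float   # in pixels
    skew_degrees: float  # rotation applied to the rendered word
    noise_level: float   # fraction of pixels perturbed


def plan_render_jobs(
    words: Sequence[str],
    font_paths: Sequence[str],
    count: int,
    seed: int = 0,
) -> List[RenderJob]:
    """Sample `count` jobs with a seeded RNG so a run can be reproduced."""
    if not words or not font_paths:
        raise ValueError("words and font_paths must be non-empty")
    rng = random.Random(seed)
    jobs = []
    for _ in range(count):
        jobs.append(RenderJob(
            text=rng.choice(words),
            font_path=rng.choice(font_paths),
            blur_radius=round(rng.uniform(0.0, 2.0), 2),
            skew_degrees=round(rng.uniform(-5.0, 5.0), 2),
            noise_level=round(rng.uniform(0.0, 0.1), 3),
        ))
    return jobs
```

Storing the job list alongside the rendered images keeps the ground-truth text and degradation settings for every sample, which helps when analysing which conditions an OCR model fails on.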