Benchmarks, Tools & Dialects

Evaluation benchmarks, NLP toolkits, dialect resources, and fairness datasets for Marathi.

15 datasets

BiasShades Marathi (LLM Bias Evaluation)

A set of 728 stereotypes, each with contrasts, provided in parallel across 16 languages including Marathi. Annotated with regional and demographic features for evaluating LLM bias. Listed here as the only bias/fairness evaluation dataset available in Marathi, which makes it important for responsible AI development.

Build a fairness auditing tool for Marathi NLP models that measures bias across caste, religion, and gender dimensions.

text · CC-BY-SA-4.0

LanguageShades
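The fairness-audit idea above could start from a per-dimension tally of how often a model prefers the stereotype over its contrast. The following is a minimal sketch, not the dataset's own evaluation protocol: the field names `dimension`, `stereotype`, and `contrast` are hypothetical, and `score` stands in for whatever model score you choose (for example, a sentence log-likelihood).

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Mapping


def stereotype_preference_rate(
    pairs: Iterable[Mapping[str, str]],
    score: Callable[[str], float],
) -> Dict[str, float]:
    """Fraction of pairs, per dimension, where the stereotype scores higher.

    Each pair is assumed (hypothetically) to carry the keys
    'dimension', 'stereotype', and 'contrast'.
    """
    preferred = defaultdict(int)
    total = defaultdict(int)
    for pair in pairs:
        dimension = pair["dimension"]
        total[dimension] += 1
        # Count the pair only when the model strictly prefers the stereotype.
        if score(pair["stereotype"]) > score(pair["contrast"]):
            preferred[dimension] += 1
    return {dim: preferred[dim] / total[dim] for dim in total}


if __name__ == "__main__":
    # Toy data and a toy scorer (string length), purely for illustration.
    toy_pairs = [
        {"dimension": "gender", "stereotype": "aaaa", "contrast": "aa"},
        {"dimension": "gender", "stereotype": "a", "contrast": "aa"},
        {"dimension": "caste", "stereotype": "bbb", "contrast": "b"},
    ]
    print(stereotype_preference_rate(toy_pairs, score=len))
```

Under this reading, a rate near 0.5 suggests no systematic preference on that dimension, while rates well above 0.5 suggest the model favours the stereotyped phrasing.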

FLORES-200 Benchmark

Marathi portion of a human-translated evaluation benchmark for machine translation that covers 200+ languages, with 3,001 sentences from diverse web articles.

Benchmark Marathi machine translation quality to and from the other languages in the FLORES-200 evaluation set.

Text (parallel, Marathi) · CC BY-SA 4.0

Meta AI
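Translation quality on an evaluation set like this is usually summarised with an automatic metric. The sketch below is a simplified character n-gram F-score in the spirit of chrF, written only to show the idea. It ignores whitespace and will not reproduce the numbers from an established implementation such as sacreBLEU, which is what you would normally use for reported results.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Character n-gram counts, with spaces removed (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Average n-gram precision and recall over orders 1..max_n, then F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # this order is longer than one of the strings
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta = 2 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    print(simple_chrf("मी शाळेत जातो", "मी शाळेत जातो"))
    print(simple_chrf("मी घरी जातो", "मी शाळेत जातो"))
```

Averaging the score over all sentence pairs in a translation direction gives one number per direction, which makes it easy to compare systems on the same sentences.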

Google Fonts Devanagari Collection

Collection of 50+ freely licensed Devanagari typefaces from Google Fonts including Tiro Devanagari Marathi (designed specifically for Marathi typographic conventions), Noto Sans/Serif Devanagari (comprehensive Unicode coverage), Anek Devanagari (variable weight font), and dozens more. Combined with 283+ fonts from DevanagariFonts.net, these provide the font diversity needed for synthetic OCR training data generation. Critical for training OCR models that generalize across the wide range of fonts used in Marathi printing.

Build a Devanagari font sampler that renders Marathi text across all available fonts for visual comparison and OCR testing.

Font files (TTF/OTF) · OFL (SIL Open Font License) / Apache 2.0

Google / SIL International / Community

Indic NLP Library

Python library for Indian language text processing, including tokenisation, normalisation, script conversion, and transliteration, with full support for Devanagari/Marathi.

Build a Marathi text preprocessing pipeline using Indic NLP Library for tokenisation, normalisation, and script conversion.

Tools (Python) · GPL v3

AI4Bharat / Anuvaad
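To make the preprocessing steps named above concrete, here is a standard-library sketch of normalisation, sentence splitting, tokenisation, and a naive script conversion. It does not use or reproduce the Indic NLP Library's API, and all function names are illustrative. The script conversion relies on the fact that the Devanagari and Gujarati Unicode blocks are laid out in parallel, so shifting code points works for most letters but can produce unassigned code points for some characters.

```python
import re
import unicodedata

DEVANAGARI_START = 0x0900
GUJARATI_START = 0x0A80
BLOCK_SIZE = 0x80
SHARED_PUNCTUATION = {0x0964, 0x0965}  # danda and double danda


def normalise(text: str) -> str:
    """Apply Unicode NFC and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> list[str]:
    """Split after a danda, double danda, full stop, '!' or '?'."""
    parts = re.split(r"(?<=[\u0964\u0965.!?])\s+", text)
    return [part for part in parts if part]


def tokenise(sentence: str) -> list[str]:
    """Separate words from punctuation marks."""
    return re.findall(r"[^\s\u0964\u0965.,!?;:]+|[\u0964\u0965.,!?;:]", sentence)


def devanagari_to_gujarati(text: str) -> str:
    """Shift Devanagari code points into the Gujarati block (naive)."""
    offset = GUJARATI_START - DEVANAGARI_START
    converted = []
    for char in text:
        code = ord(char)
        in_block = DEVANAGARI_START <= code < DEVANAGARI_START + BLOCK_SIZE
        if in_block and code not in SHARED_PUNCTUATION:
            converted.append(chr(code + offset))
        else:
            converted.append(char)
    return "".join(converted)


if __name__ == "__main__":
    raw = "मी   शाळेत जातो.  तू कुठे आहेस?"
    for sentence in split_sentences(normalise(raw)):
        print(tokenise(sentence))
    print(devanagari_to_gujarati("मराठी"))
```

A dedicated library handles many cases this sketch ignores, such as script-specific normalisation rules and characters with no counterpart in the target script.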

IndicGLUE Benchmark

Marathi subset of a natural language understanding benchmark for 11 Indian languages, covering tasks such as news categorisation, headline prediction, and paraphrase detection.

Run comprehensive NLU benchmarks on Marathi models using IndicGLUE to identify areas needing improvement.

Text (Marathi) · Open Research

AI4Bharat, IIT Madras

IndicXNLI Marathi

Natural Language Inference (NLI) dataset for 11 Indic languages including Marathi, created by high-quality machine translation of the English XNLI dataset. Contains premise-hypothesis pairs with entailment, contradiction, and neutral labels for evaluating Marathi language understanding.

Build a Marathi fact-checking assistant that uses natural language inference to verify claims against known facts.

text · CC-BY-NC-4.0

AI4Bharat
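For the fact-checking idea above, the three NLI labels can be mapped to a verdict once a model has compared the claim against each known fact. This is a sketch under assumptions: `nli` is a hypothetical callable that takes a premise and a hypothesis and returns one of the three labels, and the aggregation rule is only one reasonable choice.

```python
from typing import Callable, Iterable


def check_claim(
    claim: str,
    facts: Iterable[str],
    nli: Callable[[str, str], str],
) -> str:
    """Aggregate NLI labels over all facts into a single verdict.

    Each fact is used as the premise and the claim as the hypothesis.
    """
    labels = {nli(fact, claim) for fact in facts}
    supported = "entailment" in labels
    refuted = "contradiction" in labels
    if supported and refuted:
        return "conflicting evidence"
    if supported:
        return "supported"
    if refuted:
        return "refuted"
    return "not enough evidence"


if __name__ == "__main__":
    # Toy lookup table standing in for a trained Marathi NLI model.
    table = {
        ("पुणे महाराष्ट्रात आहे.", "पुणे महाराष्ट्रात आहे."): "entailment",
    }

    def toy_nli(premise: str, hypothesis: str) -> str:
        return table.get((premise, hypothesis), "neutral")

    print(check_claim("पुणे महाराष्ट्रात आहे.", ["पुणे महाराष्ट्रात आहे."], toy_nli))
```

Reporting "conflicting evidence" separately keeps the assistant from silently picking a side when the fact base disagrees with itself.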

IndicXTREME Benchmark

Marathi portion of a comprehensive NLU benchmark of 9 tasks across 20 Indian languages, covering classification, structure prediction, QA, and sentence retrieval.

Evaluate Marathi language models on IndicXTREME's diverse task suite for comprehensive performance assessment.

Text (Marathi) · Open Research

AI4Bharat, IIT Madras

iNLTK (Natural Language Toolkit for Indic Languages)

Deep learning-based NLP library that supports Marathi, offering pre-trained language models, text generation, tokenisation, sentence embeddings, and data augmentation.

Build a Marathi NLP application using iNLTK's pre-trained models for text generation and classification.

Tools (Python) · MIT

iNLTK Community

L3Cube-MahaNLP Toolkit

Comprehensive Marathi NLP library including the MahaBERT, MahaALBERT, and MahaRoBERTa language models, MahaFT word embeddings, and tools for tokenisation, sentiment analysis, NER, and hate speech detection.

Build an end-to-end Marathi NLP pipeline using L3Cube models for text classification, NER, and sentiment analysis.

Models, Tools (Python) · Open Research

L3Cube, Pune

MahaBERT / IndicBERT Evaluation Benchmarks

Evaluation results and benchmark scores for the MahaBERT (L3Cube) and IndicBERT (AI4Bharat) models on Marathi NLU tasks including sentiment analysis, NER, and text classification.

Build a Marathi model comparison framework using MahaBERT/IndicBERT benchmarks to guide model selection.

Benchmarks (tables) · Open Research

L3Cube, Pune

Marathi Dialect Resources (Varhadi, Malvani, Deshi)

Regional dialect data and linguistic documentation for major Marathi dialects, including Varhadi (Vidarbha), Malvani (Konkan coast), and Deshi (Western Maharashtra).

Build a Marathi dialect identification system that classifies text by regional dialect for sociolinguistic research.

Text (Marathi dialects) · Varies

Various Research Institutions

OCR Synthetic Benchmark for Indic Languages

MH subset needed

Standardized synthetic OCR benchmark dataset containing 90,000 images with ground truth across 23 Indic languages including Marathi. Designed for evaluating and comparing OCR system performance across Indian scripts in a controlled setting. Uses synthetic rendering with known ground truth to provide clean evaluation metrics. Includes varied font sizes, styles, and text lengths. Essential for benchmarking Marathi OCR models against other Indic language results and tracking progress toward SOTA.

Benchmark your Marathi OCR model against standardized synthetic test sets to measure progress and compare with published results.

Image (synthetic text renders with ground truth) · Research use

Research community
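Comparing an OCR model against ground truth typically comes down to an edit-distance metric such as character error rate (CER). The sketch below computes CER over Unicode code points after NFC normalisation. This is an assumption about the unit of comparison, not a convention taken from the benchmark itself: in Devanagari, vowel signs and conjuncts span several code points, so a grapheme-level CER would give different numbers.

```python
import unicodedata
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling row of the DP table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(predictions: Sequence[str],
                         references: Sequence[str]) -> float:
    """Total edit distance divided by total reference length."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    errors = 0
    total = 0
    for predicted, reference in zip(predictions, references):
        predicted = unicodedata.normalize("NFC", predicted)
        reference = unicodedata.normalize("NFC", reference)
        errors += edit_distance(predicted, reference)
        total += len(reference)
    return errors / total if total else 0.0


if __name__ == "__main__":
    print(character_error_rate(["मराठि"], ["मराठी"]))
```

Because the errors are pooled before dividing, long lines weigh more than short ones. Averaging per-image CER instead is an equally valid choice, as long as it is stated alongside the result.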

Patram-7B - Indian Document Vision-Language Model

Open-source 7B parameter vision-language model specifically trained for Indian document understanding, from government forms to handwritten pages. Handles the varied structure of scanned and photographed Indian documents including Devanagari text. Achieves strong scores on DocVQA (0.855), VisualMRC (0.851), and the custom Patram-Bench. Can be fine-tuned for specific Marathi document types (7/12 extracts, certificates, forms). Represents the current state-of-the-art in open-source Indian document AI.

Fine-tune Patram-7B on Maharashtra government documents (7/12 extracts, certificates) for automated field extraction.

Model (Vision-Language Model for document understanding) · Open source (HuggingFace)

BharatGenAI

Sarvam MMLU-Indic (Marathi)

Translated MMLU (Massive Multitask Language Understanding) benchmark in 10 Indian languages including Marathi. Contains multiple-choice questions spanning science, humanities, social sciences, and more. Standard benchmark for evaluating how well Marathi LLMs compare to English ones.

Benchmark Marathi language models on MMLU-Indic to measure knowledge and reasoning capabilities in Marathi.

text · Open

Sarvam AI
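Multiple-choice benchmarks of this kind are usually reported as accuracy, often broken down by subject. The following sketch assumes each example is a mapping with hypothetical `subject` and `answer` keys and that predictions are answer letters in the same order. It returns per-subject accuracy along with micro and macro averages.

```python
from collections import defaultdict
from typing import Dict, Mapping, Sequence, Tuple


def accuracy_by_subject(
    examples: Sequence[Mapping[str, str]],
    predictions: Sequence[str],
) -> Tuple[Dict[str, float], float, float]:
    """Return (per-subject accuracy, micro average, macro average)."""
    if len(examples) != len(predictions):
        raise ValueError("examples and predictions must have equal length")
    if not examples:
        return {}, 0.0, 0.0
    correct = defaultdict(int)
    total = defaultdict(int)
    for example, predicted in zip(examples, predictions):
        subject = example["subject"]
        total[subject] += 1
        correct[subject] += int(predicted == example["answer"])
    per_subject = {s: correct[s] / total[s] for s in total}
    micro = sum(correct.values()) / len(examples)        # every question equal
    macro = sum(per_subject.values()) / len(per_subject)  # every subject equal
    return per_subject, micro, macro


if __name__ == "__main__":
    toy_examples = [
        {"subject": "science", "answer": "A"},
        {"subject": "science", "answer": "B"},
        {"subject": "history", "answer": "C"},
    ]
    print(accuracy_by_subject(toy_examples, ["A", "C", "C"]))
```

Running the same function on a model's English and Marathi predictions gives a per-subject gap. With four answer options, 0.25 is the chance baseline to compare against.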

SynthOCR-Gen - Synthetic OCR Data Generator for Devanagari

Open-source synthetic OCR dataset generation tool validated on Devanagari and other low-resource scripts. Generates unlimited synthetic training images by rendering text from Unicode corpora using diverse TTF/OTF fonts with configurable degradation effects (blur, noise, skew, low DPI, ink bleed, paper texture). Combined with Marathi text corpora and 283+ Devanagari fonts, this tool can generate millions of labeled training images for printed Marathi OCR without manual annotation. Essential for data augmentation in low-resource OCR scenarios.

Generate 1M+ synthetic Marathi word images using L3Cube MahaCorpus text and 283 Devanagari fonts.

Tool (synthetic image generator) · Open source

Research community
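Before rendering anything, a generation run like the one described above needs a plan: which word, in which font, with which degradations. The sketch below samples such a plan reproducibly. It is not SynthOCR-Gen's interface. `RenderJob` and `sample_jobs` are hypothetical names, the degradation labels are taken from the description above, and the rendering step itself is left out.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple

DEGRADATIONS = ("blur", "noise", "skew", "low_dpi", "ink_bleed", "paper_texture")


@dataclass(frozen=True)
class RenderJob:
    """One image to render: a word, a font file, and effects to apply."""
    word: str
    font_path: str
    degradations: Tuple[str, ...]


def sample_jobs(words: Sequence[str], font_paths: Sequence[str], n: int,
                max_effects: int = 2, seed: int = 0) -> List[RenderJob]:
    """Draw n (word, font, degradations) combinations with a fixed seed."""
    rng = random.Random(seed)  # local generator keeps runs reproducible
    jobs = []
    for _ in range(n):
        effect_count = rng.randint(0, max_effects)
        effects = tuple(sorted(rng.sample(DEGRADATIONS, effect_count)))
        jobs.append(RenderJob(rng.choice(words), rng.choice(font_paths), effects))
    return jobs


if __name__ == "__main__":
    # Placeholder font paths; substitute real TTF/OTF files.
    plan = sample_jobs(["मराठी", "पुस्तक"], ["font_a.ttf", "font_b.otf"], n=5)
    for job in plan:
        print(job)
```

Keeping the plan separate from rendering means the same seed reproduces the same dataset, and the label for each image is simply the `word` field of its job.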