OCR Synthetic Benchmark for Indic Languages

MH Subset Needed

A standardized synthetic OCR benchmark of 90,000 images with ground-truth text across 23 Indic languages, including Marathi. It is designed for evaluating and comparing OCR systems across Indian scripts in a controlled setting: because every image is rendered synthetically from known text, the ground truth is exact and error metrics are not distorted by annotation noise. Samples vary in font size, style, and text length. Useful for benchmarking Marathi OCR models against results for other Indic languages and for tracking progress toward the state of the art (SOTA).

Benchmark your Marathi OCR model against standardized synthetic test sets to measure progress and compare with published results.

Maharashtra (Marathi) subset not yet extracted. This is a multilingual dataset that includes Marathi. The schema lists no geographic fields, so the subset would be extracted by filtering on the language field (for example, mr) rather than on geographic coordinates or administrative boundaries.
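A minimal sketch of extracting the Marathi subset, assuming samples are available as dictionaries keyed by the schema fields (text, language, font_size) and that Marathi is tagged with the identifier "mr". Both are assumptions: this page does not document a loader or the exact identifier values, and the helper name select_language and the toy records are hypothetical.

```python
def select_language(samples, language_code="mr"):
    """Return the samples whose `language` field equals language_code.

    Assumes each sample is a dict with a `language` key, following the
    schema on this page. The identifier "mr" is an assumption.
    """
    return [s for s in samples if s.get("language") == language_code]


# Toy records with made-up values, standing in for real benchmark samples.
samples = [
    {"text": "नमस्कार", "language": "mr", "font_size": 24},
    {"text": "नमस्ते", "language": "hi", "font_size": 18},
    {"text": "पुणे", "language": "mr", "font_size": 32},
]

marathi = select_language(samples)
print(f"{len(marathi)} of {len(samples)} toy samples are Marathi")
```

If the real data uses script names or longer language tags instead of two-letter codes, only the language_code argument needs to change.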

Homepage Paper

Quick Start

# OCR Synthetic Benchmark for Indic Languages
# Paper: https://arxiv.org/abs/2205.02543
from PIL import Image

# Load Marathi subset of the benchmark
# Evaluate OCR models using Character Error Rate (CER) and Word Error Rate (WER)
print("Benchmark: 90K synthetic images across 23 Indic languages")
print("Metrics: CER (Character Error Rate), WER (Word Error Rate)")
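The two metrics named above can be computed with a plain Levenshtein distance. The sketch below is one common formulation, not necessarily the scoring script used for published results: it counts characters as Unicode code points (so a Devanagari consonant and its matra count separately) and splits words on whitespace. The function names edit_distance, cer, and wer are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Insertions, deletions, and substitutions each cost 1.
    Works on strings (characters) and on lists (words).
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character edits divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word edits divided by reference word count."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else 1.0
    return edit_distance(ref_words, hyp_words) / len(ref_words)


print(cer("abcd", "abxd"), wer("a b c d", "a b x d"))
```

Both rates can exceed 1.0 when the hypothesis is much longer than the reference. When comparing against published numbers, check whether those were computed per code point or per grapheme cluster, since the two give different CER values for Indic scripts.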

Modality

Image (synthetic text renders with ground truth)

Size

90K images; 23 Indic languages

License

Research use

Format

PNG/JPEG with text labels

Language

mr, hi, bn, ta, te, kn, ml, gu, pa, or, as

Update Frequency

static

Organization

Research community

Schema

Field	Type	Description
image	image	Synthetically rendered text image
text	string	Ground-truth text
language	string	Language/script identifier
font_size	int	Font size used for rendering

Build With This

Create a Marathi OCR leaderboard tracking model performance on this standardized benchmark over time

Develop an error analysis tool that categorizes OCR mistakes by character type (vowel, consonant, conjunct, matra) on benchmark results

Build an automated OCR regression testing pipeline that runs benchmarks on every model update
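The error analysis idea above could start from a per-character classifier like the sketch below. It uses simplified Devanagari code point ranges (core independent vowels, consonants, and dependent vowel signs only), which is an assumption that leaves out nukta forms, anusvara, and extended letters. The comparison in missing_by_category is a coarse count-based proxy rather than an alignment of reference and hypothesis, and all function names are hypothetical.

```python
from collections import Counter


def char_category(ch):
    """Classify one code point using simplified Devanagari ranges."""
    cp = ord(ch)
    if 0x0904 <= cp <= 0x0914:
        return "vowel"      # independent vowels
    if 0x0915 <= cp <= 0x0939:
        return "consonant"
    if 0x093E <= cp <= 0x094C:
        return "matra"      # dependent vowel signs
    if cp == 0x094D:
        return "virama"
    return "other"


def count_conjuncts(text):
    """Count consonant + virama + consonant joins.

    A three-consonant cluster contains two joins and is counted twice.
    """
    cats = [char_category(c) for c in text]
    return sum(
        1
        for i in range(len(cats) - 2)
        if cats[i:i + 3] == ["consonant", "virama", "consonant"]
    )


def category_profile(text):
    """Count how many characters of each category a string contains."""
    return Counter(char_category(c) for c in text)


def missing_by_category(reference, hypothesis):
    """Categories that appear more often in the reference than in the OCR output."""
    return category_profile(reference) - category_profile(hypothesis)


# Example: the OCR output dropped the matra from a consonant + matra pair.
print(missing_by_category("\u0915\u093f", "\u0915"))
```

A fuller tool would align the two strings first (for example with an edit-distance backtrace) so that each substitution, insertion, and deletion can be attributed to a category.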

AI Use Cases

Marathi OCR model benchmarking

Cross-script OCR performance comparison

Standardized evaluation for Indic language OCR

Regression testing for OCR model updates

Related Datasets

BiasShades Marathi (LLM Bias Evaluation)

Text

FLORES-200 Benchmark

Text (parallel, Marathi)

Google Fonts Devanagari Collection

Font files (TTF/OTF)

Indic NLP Library

Tools (Python)

Last verified: 2026-03-12