OCR Synthetic Benchmark for Indic Languages

MH Subset Needed

A standardized synthetic OCR benchmark of 90,000 images with ground-truth text across 23 Indic languages, including Marathi. It is designed for evaluating and comparing OCR systems across Indian scripts in a controlled setting. Because every image is rendered synthetically from known text, the ground truth is exact and the evaluation metrics are clean. Images vary in font size, style, and text length. The benchmark is useful for comparing Marathi OCR models against results for other Indic languages and for tracking progress toward the state of the art.

Benchmark your Marathi OCR model against standardized synthetic test sets to measure progress and compare with published results.
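Maharashtra subset not yet extracted. This is a multi-language dataset that includes data relevant to Maharashtra (Marathi). Since the schema carries a language field and no geographic fields, the subset would be extracted by filtering on language (mr) rather than on geographic coordinates or administrative boundaries.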

Quick Start

# OCR Synthetic Benchmark for Indic Languages
# Paper: https://arxiv.org/abs/2205.02543
from PIL import Image

# Load Marathi subset of the benchmark
# Evaluate OCR models using Character Error Rate (CER) and Word Error Rate (WER)
print("Benchmark: 90K synthetic images across 23 Indic languages")
print("Metrics: CER (Character Error Rate), WER (Word Error Rate)")
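
The two metrics named in the Quick Start can be computed with a plain edit distance. The following is a minimal sketch, not the benchmark's official scoring script: the helper names `edit_distance`, `cer`, and `wer` are illustrative, and counting CER over Unicode code points is an assumption (a Devanagari matra counts as its own character here; published results may normalize or segment text differently).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference item
                            curr[j - 1] + 1,      # insert a hypothesis item
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character edits divided by reference length.

    Assumption: characters are Unicode code points, not grapheme clusters.
    """
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word edits divided by reference word count."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else 1.0
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

Both functions take the ground-truth string first and the OCR output second. Averaging per-image scores and dividing total edits by total reference length give different corpus-level numbers, so a leaderboard should state which one it reports.
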

Modality: Image (synthetic text renders with ground truth)
Size: 90K images; 23 Indic languages
License: not specified
Format: PNG/JPEG with text labels
Language: mr, hi, bn, ta, te, kn, ml, gu, pa, or, as
Update Frequency: static
Organization: Research community

Schema

Field      Type    Description
image      image   Synthetically rendered text image
text       string  Ground-truth text
language   string  Language/script identifier
font_size  int     Font size used for rendering
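
Given the schema above, a Marathi subset could be selected by filtering on the `language` field. This is a sketch under stated assumptions: the `Sample` class and `filter_language` helper are hypothetical, the records are in-memory stand-ins for the real files, and the identifier `"mr"` is taken from the language list on this page (the actual files may encode language or script differently).

```python
from dataclasses import dataclass
from typing import Any, Iterable, Iterator


@dataclass
class Sample:
    """One benchmark record, mirroring the four schema fields."""
    image: Any       # e.g. a PIL image; untyped so the sketch needs no imaging library
    text: str        # ground-truth text
    language: str    # language/script identifier (assumed to look like "mr")
    font_size: int   # font size used for rendering


def filter_language(samples: Iterable[Sample], code: str = "mr") -> Iterator[Sample]:
    """Yield only the samples whose language identifier equals `code`."""
    for sample in samples:
        if sample.language == code:
            yield sample


# Hypothetical records for illustration only; not taken from the dataset.
samples = [
    Sample(image=None, text="नमस्कार", language="mr", font_size=24),
    Sample(image=None, text="नमस्ते", language="hi", font_size=18),
    Sample(image=None, text="வணக்கம்", language="ta", font_size=24),
]
marathi = list(filter_language(samples))
```

Marathi and Hindi share the Devanagari script, so filtering on a script-level identifier alone would not separate them; that is why the sketch filters on a language code.
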

Build With This

Create a Marathi OCR leaderboard tracking model performance on this standardized benchmark over time
Develop an error analysis tool that categorizes OCR mistakes by character type (vowel, consonant, conjunct, matra) on benchmark results
Build an automated OCR regression testing pipeline that runs benchmarks on every model update
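
The error analysis idea in the list above could start from something like the following sketch for Devanagari text. It is an illustration, not an established tool: the function names are hypothetical, the code point ranges cover only the core Devanagari vowels, consonants, and vowel signs (nukta forms and extended letters fall into "other"), and the alignment uses `difflib.SequenceMatcher`, which is convenient but does not guarantee a minimal edit alignment.

```python
from collections import Counter
from difflib import SequenceMatcher

VIRAMA = "\u094d"  # Devanagari sign virama (halant), which joins consonants


def char_category(ch):
    """Coarse category of a single code point (core Devanagari ranges only)."""
    cp = ord(ch)
    if 0x0904 <= cp <= 0x0914:
        return "vowel"       # independent vowels
    if 0x0915 <= cp <= 0x0939:
        return "consonant"
    if 0x093E <= cp <= 0x094C:
        return "matra"       # dependent vowel signs
    if ch == VIRAMA:
        return "virama"
    return "other"


def count_conjuncts(text):
    """Count consonant + virama + consonant joins, a simple proxy for conjuncts."""
    return sum(
        1
        for i in range(1, len(text) - 1)
        if text[i] == VIRAMA
        and char_category(text[i - 1]) == "consonant"
        and char_category(text[i + 1]) == "consonant"
    )


def errors_by_category(reference, hypothesis):
    """Tally misrecognized characters by category.

    Replaced or deleted characters are counted from the reference;
    inserted characters are counted from the hypothesis.
    """
    tally = Counter()
    matcher = SequenceMatcher(None, reference, hypothesis, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            tally.update(char_category(c) for c in reference[i1:i2])
        elif tag == "insert":
            tally.update(char_category(c) for c in hypothesis[j1:j2])
    return tally
```

Summing these tallies over the Marathi subset would show, for example, whether a model loses more accuracy on matras than on base consonants.
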

AI Use Cases

Marathi OCR model benchmarking
Cross-script OCR performance comparison
Standardized evaluation for Indic language OCR
Regression testing for OCR model updates
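
For the regression testing use case, a check could compare per-language CER from a new model run against a stored baseline and flag any language that got worse. This is a minimal sketch: `find_regressions`, the tolerance value, and the numbers in the example dictionaries are all hypothetical and are not measured results.

```python
def find_regressions(baseline, current, tolerance=0.005):
    """Return {language: (baseline_cer, current_cer)} for each language whose
    CER rose by more than `tolerance` (an absolute difference) over the baseline.

    Languages missing from `current` are skipped rather than treated as failures.
    """
    regressions = {}
    for language, base_cer in baseline.items():
        new_cer = current.get(language)
        if new_cer is None:
            continue  # not evaluated in this run
        if new_cer - base_cer > tolerance:
            regressions[language] = (base_cer, new_cer)
    return regressions


# Made-up CER values for illustration only.
baseline = {"mr": 0.042, "hi": 0.038}
current = {"mr": 0.051, "hi": 0.037}
regressed = find_regressions(baseline, current)
```

In a CI pipeline, a non-empty result would typically fail the build. The tolerance exists so that small run-to-run fluctuations do not block every model update.
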
Last verified: 2026-03-12