Standardized synthetic OCR benchmark containing 90,000 images with ground-truth text across 23 Indic languages, including Marathi. It is designed for evaluating and comparing OCR systems across Indian scripts in a controlled setting: because every image is rendered synthetically, the ground truth is known exactly, which gives clean evaluation metrics. Images vary in font size, style, and text length. Useful for benchmarking Marathi OCR models against results for other Indic languages and for tracking progress toward the state of the art.
# OCR Synthetic Benchmark for Indic Languages
# Paper: https://arxiv.org/abs/2205.02543
from PIL import Image
# Load Marathi subset of the benchmark
# Evaluate OCR models using Character Error Rate (CER) and Word Error Rate (WER)
print("Benchmark: 90K synthetic images across 23 Indic languages")
print("Metrics: CER (Character Error Rate), WER (Word Error Rate)")

| Field | Type | Description |
|---|---|---|
| image | image | Synthetically rendered text image |
| text | string | Ground-truth text |
| language | string | Language/script identifier |
| font_size | int | Font size used for rendering |
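
The CER and WER metrics named above are commonly computed as an edit distance divided by the reference length. The following is a minimal sketch under that assumption, not the benchmark's official scoring script. The helper names (`edit_distance`, `cer`, `wer`), the `prediction` key, and the `"marathi"` language value are hypothetical and chosen for illustration. Only `text` and `language` come from the field table.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions, substitutions all cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate over Unicode code points of the reference."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(ref: str, hyp: str) -> float:
    """Word Error Rate over whitespace-separated tokens of the reference."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)


# Hypothetical records shaped like the field table, plus an assumed
# "prediction" key holding the OCR model's output for each image.
samples = [
    {"text": "माझे नाव", "language": "marathi", "prediction": "माझ नाव"},
    {"text": "hello world", "language": "english", "prediction": "hello world"},
]

# Keep only the Marathi records; the identifier value is an assumption.
marathi = [s for s in samples if s["language"] == "marathi"]
for s in marathi:
    print(f"CER={cer(s['text'], s['prediction']):.3f} "
          f"WER={wer(s['text'], s['prediction']):.3f}")
```

Note that this sketch counts Unicode code points, so a Devanagari vowel sign or virama counts as its own character. Scoring by grapheme cluster instead would give different CER values, and whichever convention is used should be kept fixed when comparing results across languages.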