SynthOCR-Gen - Synthetic OCR Data Generator for Devanagari

Open-source tool for generating synthetic OCR datasets, validated on Devanagari and other low-resource scripts. It renders text from Unicode corpora in diverse TTF/OTF fonts and applies configurable degradation effects (blur, noise, skew, low DPI, ink bleed, paper texture), so the number of training images is limited only by the input text and fonts. Combined with Marathi text corpora and 283+ Devanagari fonts, it can produce millions of labeled training images for printed Marathi OCR without manual annotation, which makes it useful for bootstrapping and augmenting training data in low-resource OCR settings.

Generate 1M+ synthetic Marathi word images using L3Cube MahaCorpus text and 283+ Devanagari fonts.


Quick Start

# SynthOCR-Gen for Devanagari synthetic data generation
# Paper: https://arxiv.org/abs/2601.16113
# Also consider: TextRecognitionDataGenerator (TRDG)
# https://github.com/Belval/TextRecognitionDataGenerator

# Requirements: Marathi text corpus + Devanagari fonts
# Text: L3Cube MahaCorpus (24.8M sentences)
# Fonts: https://fonts.google.com/?subset=devanagari (50+)
#        https://devanagarifonts.net/ (283+ fonts)
print("Generate unlimited Marathi OCR training data synthetically")
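The comments above list the inputs (a text corpus and font files) but not how they combine. The following is a minimal sketch, not SynthOCR-Gen's actual API: `RenderJob`, `sample_jobs`, and the parameter ranges in `DEGRADATION_RANGES` are hypothetical names and values chosen for illustration. It pairs corpus words with fonts and randomly drawn degradation settings; the rendering step itself is left out.

```python
import random
from dataclasses import dataclass

# Illustrative ranges only (assumption); the effect names come from the
# description above, the numbers are not SynthOCR-Gen defaults.
DEGRADATION_RANGES = {
    "blur_radius": (0.0, 2.0),    # pixels
    "noise_sigma": (0.0, 15.0),   # grey levels
    "skew_degrees": (-3.0, 3.0),  # rotation
}


@dataclass
class RenderJob:
    """One image to render: what text, in which font, with which effects."""
    text: str
    font_path: str
    degradation: dict


def sample_jobs(words, font_paths, n, seed=0):
    """Draw n (word, font, degradation) combinations reproducibly."""
    rng = random.Random(seed)
    jobs = []
    for _ in range(n):
        params = {
            name: round(rng.uniform(lo, hi), 2)
            for name, (lo, hi) in DEGRADATION_RANGES.items()
        }
        jobs.append(RenderJob(rng.choice(words), rng.choice(font_paths), params))
    return jobs


if __name__ == "__main__":
    # Placeholder font paths; substitute real Devanagari TTF/OTF files.
    for job in sample_jobs(["नमस्कार", "पुस्तक"], ["fonts/a.ttf", "fonts/b.ttf"], 3):
        print(job)
```

Fixing the seed keeps a generated dataset reproducible, which helps when comparing OCR models trained on the same synthetic split.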

Modality

Tool (synthetic image generator)

Size

Unlimited generation capacity; requires input text corpus + fonts

License

Open source

Format

Tool (generates PNG/JPEG + text labels)

Language

mr, hi, ne, sa

Update Frequency

static

Organization

Research community

Schema

Field	Type	Description
image	image	Synthetically rendered text image
text	string	Ground-truth text used for rendering
font	string	Font name used for rendering
degradation	string	Applied degradation effects (blur, noise, skew, etc.)
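
As a rough illustration of the schema above, the sketch below models one record and stores labels as JSON Lines. It assumes the image is referenced by file path and that the degradation effects are written as a comma-separated string; the tool's real on-disk label format is not specified here, and `Sample`, `to_jsonl`, and `from_jsonl` are hypothetical names.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Sample:
    image: str        # path to the rendered PNG/JPEG (assumption)
    text: str         # ground-truth text used for rendering
    font: str         # font name used for rendering
    degradation: str  # applied effects, e.g. "blur,skew" (assumed encoding)


def to_jsonl(samples):
    """Serialize records one per line, keeping Devanagari readable."""
    return "\n".join(json.dumps(asdict(s), ensure_ascii=False) for s in samples)


def from_jsonl(payload):
    """Parse a JSON Lines string back into Sample records."""
    return [Sample(**json.loads(line)) for line in payload.split("\n") if line.strip()]


if __name__ == "__main__":
    record = Sample("images/000001.png", "महाराष्ट्र", "Noto Sans Devanagari", "blur,skew")
    print(to_jsonl([record]))
```

Passing `ensure_ascii=False` keeps the labels human-readable instead of escaping every Devanagari code point.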

Build With This

Create a Marathi OCR training data factory combining SynthOCR-Gen with real annotated data for optimal model performance

Develop a targeted synthetic data pipeline that oversamples rare Marathi conjunct characters to improve recognition of hard cases

Build an automated curriculum learning system that generates progressively harder synthetic examples as the OCR model improves
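
The conjunct-oversampling idea above can be sketched with the standard library alone. This is an illustrative approach, not a method taken from SynthOCR-Gen: it treats a conjunct as a consonant followed by one or more virama-plus-consonant pairs (a simplification that ignores ZWJ/ZWNJ forms), and the weighting formula and the names `conjuncts` and `sampling_weights` are assumptions.

```python
import random
import re
from collections import Counter

# Devanagari consonants (U+0915-U+0939, U+0958-U+095F), optional nukta
# (U+093C), joined by virama (U+094D).
CONSONANT = "[\u0915-\u0939\u0958-\u095F]"
CONJUNCT_RE = re.compile(f"{CONSONANT}\u093C?(?:\u094D{CONSONANT}\u093C?)+")


def conjuncts(word):
    """Return the conjunct clusters found in a word, in order."""
    return CONJUNCT_RE.findall(word)


def sampling_weights(words, boost=1.0):
    """Weight each word by how rare its rarest conjunct is in the corpus.

    Words without conjuncts get weight 1.0; a word whose rarest conjunct
    occurs k times in the corpus gets 1.0 + boost / k.
    """
    counts = Counter(c for w in words for c in conjuncts(w))
    weights = []
    for w in words:
        rarest = min((counts[c] for c in conjuncts(w)), default=None)
        weights.append(1.0 if rarest is None else 1.0 + boost / rarest)
    return weights


if __name__ == "__main__":
    corpus = ["नमस्कार", "नमस्कार", "क्षत्रिय", "घर"]
    weights = sampling_weights(corpus, boost=2.0)
    # Draw words for rendering, favouring those with rare conjuncts.
    print(random.Random(0).choices(corpus, weights=weights, k=5))
```

Raising `boost` shifts more of the rendering budget toward rare clusters; how far to push it is best decided by measuring recognition accuracy on real text.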

AI Use Cases

Synthetic training data generation for Marathi OCR

Font diversity augmentation for robust text recognition

Degradation simulation for historical document OCR

Low-resource OCR data bootstrapping

Related Datasets

BiasShades Marathi (LLM Bias Evaluation)

Text

FLORES-200 Benchmark

Text (parallel, Marathi)

Google Fonts Devanagari Collection

Font files (TTF/OTF)

Indic NLP Library

Tools (Python)

Last verified: 2026-03-12