SynthOCR-Gen - Synthetic OCR Data Generator for Devanagari

SynthOCR-Gen is an open-source tool for generating synthetic OCR datasets, validated on Devanagari and other low-resource scripts. It renders text from Unicode corpora in a range of TTF/OTF fonts and applies configurable degradation effects (blur, noise, skew, low DPI, ink bleed, paper texture), so the number of training images is limited only by the input corpus and fonts. Combined with Marathi text corpora and 283+ Devanagari fonts, it can produce millions of labeled training images for printed Marathi OCR without manual annotation, which makes it well suited to data augmentation in low-resource OCR settings.

Generate 1M+ synthetic Marathi word images using L3Cube MahaCorpus text and 283 Devanagari fonts.

Quick Start

# SynthOCR-Gen for Devanagari synthetic data generation
# Paper: https://arxiv.org/abs/2601.16113
# Also consider: TextRecognitionDataGenerator (TRDG)
# https://github.com/Belval/TextRecognitionDataGenerator

# Requirements: Marathi text corpus + Devanagari fonts
# Text: L3Cube MahaCorpus (24.8M sentences)
# Fonts: https://fonts.google.com/?subset=devanagari (50+)
#        https://devanagarifonts.net/ (283+ fonts)
print("Generate unlimited Marathi OCR training data synthetically")
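
The inputs listed above (a text corpus, a set of fonts, and the degradation effects) can be combined into a reproducible generation plan before any image is rendered. The following is a minimal sketch of that idea, not SynthOCR-Gen's own API: `SamplePlan`, `plan_samples`, the effect identifiers, and the font file names are all hypothetical, and rendering itself is left out.

```python
import random
from dataclasses import dataclass

# Effect names follow the tool description; how SynthOCR-Gen actually
# names or parameterizes them is an assumption.
EFFECTS = ["blur", "noise", "skew", "low_dpi", "ink_bleed", "paper_texture"]


@dataclass(frozen=True)
class SamplePlan:
    text: str           # ground-truth string to render
    font: str           # font file to render it with
    degradation: tuple  # effect names to apply after rendering


def plan_samples(words, fonts, n, max_effects=2, seed=0):
    """Draw n (text, font, effects) combinations reproducibly."""
    rng = random.Random(seed)
    plans = []
    for _ in range(n):
        k = rng.randint(0, max_effects)  # 0 means a clean, undegraded image
        effects = tuple(sorted(rng.sample(EFFECTS, k)))
        plans.append(SamplePlan(rng.choice(words), rng.choice(fonts), effects))
    return plans


words = ["मराठी", "पुस्तक", "विद्यार्थी"]
fonts = ["FontA.ttf", "FontB.ttf"]  # hypothetical file names
plans = plan_samples(words, fonts, n=5, seed=42)
for plan in plans:
    print(plan)
```

Fixing the seed means the same plan can be regenerated later, which helps when a dataset has to be rebuilt with a different renderer or image size.
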

Modality: Tool (synthetic image generator)
Size: Unlimited generation capacity; requires input text corpus + fonts
License: (not specified)
Format: Tool (generates PNG/JPEG + text labels)
Language: mr, hi, ne, sa
Update Frequency: static
Organization: Research community

Schema

Field         Type     Description
image         image    Synthetically rendered text image
text          string   Ground-truth text used for rendering
font          string   Font name used for rendering
degradation   string   Applied degradation effects (blur, noise, skew, etc.)
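
One way to store records with these four fields is a JSON Lines manifest kept next to the rendered images. The sketch below assumes that layout; the tool's actual label format is not documented here, and `write_manifest`, the image path, and the font name are hypothetical.

```python
import io
import json


def write_manifest(records, fh):
    """Write one JSON object per line using the four schema fields."""
    count = 0
    for rec in records:
        row = {
            "image": rec["image"],  # path to the rendered PNG/JPEG (assumption)
            "text": rec["text"],
            "font": rec["font"],
            # The schema types degradation as a string, so join the effect names.
            "degradation": ",".join(rec["degradation"]),
        }
        # ensure_ascii=False keeps Devanagari readable instead of \u escapes.
        fh.write(json.dumps(row, ensure_ascii=False) + "\n")
        count += 1
    return count


records = [
    {
        "image": "img/000001.png",  # hypothetical path
        "text": "मराठी",
        "font": "FontA.ttf",        # hypothetical font
        "degradation": ["blur", "skew"],
    }
]
buf = io.StringIO()
written = write_manifest(records, buf)
rows = [json.loads(line) for line in buf.getvalue().splitlines()]
print(rows[0])
```

When writing to a real file instead of `io.StringIO`, open it with `encoding="utf-8"` so the labels round-trip on every platform.
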

Build With This

Create a Marathi OCR training data factory combining SynthOCR-Gen with real annotated data for optimal model performance
Develop a targeted synthetic data pipeline that oversamples rare Marathi conjunct characters to improve recognition of hard cases
Build an automated curriculum learning system that generates progressively harder synthetic examples as the OCR model improves
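
The conjunct-oversampling idea can be sketched as a weighted sampler over the word list. This is an illustration under simplifying assumptions: a conjunct is approximated as consonant + virama (U+094D) + consonant, nukta forms and ZWJ/ZWNJ sequences are ignored, and `count_conjuncts`, `conjunct_weights`, and the `boost` parameter are hypothetical names.

```python
import random

VIRAMA = "\u094d"  # Devanagari sign virama (halant)


def is_consonant(ch):
    """True for Devanagari consonants KA..HA and the QA..YYA block."""
    return "\u0915" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095f"


def count_conjuncts(word):
    """Count consonant + virama + consonant joins in a word."""
    return sum(
        1
        for a, b, c in zip(word, word[1:], word[2:])
        if is_consonant(a) and b == VIRAMA and is_consonant(c)
    )


def conjunct_weights(words, boost=3.0):
    """Weight 1 for plain words, plus `boost` for every conjunct join."""
    return [1.0 + boost * count_conjuncts(w) for w in words]


words = ["मराठी", "पुस्तक", "विद्यार्थी"]
weights = conjunct_weights(words)
rng = random.Random(0)
batch = rng.choices(words, weights=weights, k=10)
print(list(zip(words, weights)))
print(batch)
```

Raising `boost` shifts more of each batch toward conjunct-heavy words; setting it to 0 recovers uniform sampling.
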

AI Use Cases

Synthetic training data generation for Marathi OCR
Font diversity augmentation for robust text recognition
Degradation simulation for historical document OCR
Low-resource OCR data bootstrapping
Last verified: 2026-03-12