SynthOCR-Gen is an open-source tool for generating synthetic OCR datasets, validated on Devanagari and other low-resource scripts. It renders text from Unicode corpora in diverse TTF/OTF fonts and applies configurable degradation effects (blur, noise, skew, low DPI, ink bleed, paper texture), so it can produce an effectively unlimited number of synthetic training images. Combined with Marathi text corpora and 283+ Devanagari fonts, it can generate millions of labeled training images for printed Marathi OCR without manual annotation, which makes it well suited to data augmentation in low-resource OCR scenarios.
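To make the degradation step concrete, here is a minimal pure-Python sketch of three of the effects listed above (noise, blur, low DPI) applied to a grayscale image stored as a list of pixel rows. It is not SynthOCR-Gen's implementation. The function names, the effect names `"noise"`, `"blur"`, and `"low_dpi"`, and the default strengths are all assumptions chosen for illustration. Text rendering itself needs a font library and is left out.

```python
import random

def add_noise(img, sigma, rng):
    """Add Gaussian noise to every pixel and clamp the result to 0..255."""
    return [[min(255, max(0, round(p + rng.gauss(0, sigma)))) for p in row]
            for row in img]

def box_blur(img):
    """3x3 mean filter; border pixels average over the neighbours that exist."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(round(sum(vals) / len(vals)))
        out.append(row)
    return out

def low_dpi(img, factor):
    """Simulate low resolution: keep one pixel per factor x factor block
    and repeat it, so the image size stays the same."""
    h, w = len(img), len(img[0])
    return [[img[(y // factor) * factor][(x // factor) * factor]
             for x in range(w)]
            for y in range(h)]

def degrade(img, effects, seed=0):
    """Apply the named effects in order (hypothetical effect names)."""
    rng = random.Random(seed)  # seeded so a sample can be regenerated
    for name in effects:
        if name == "noise":
            img = add_noise(img, 10.0, rng)
        elif name == "blur":
            img = box_blur(img)
        elif name == "low_dpi":
            img = low_dpi(img, 2)
        else:
            raise ValueError(f"unknown effect: {name}")
    return img

# A white 6x8 page with one black row standing in for a line of text.
page = [[255] * 8 for _ in range(6)]
page[3] = [0] * 8
degraded = degrade(page, ["blur", "noise", "low_dpi"])
```

Seeding the random generator per sample is a design choice that keeps each degraded image reproducible from its label record. A real pipeline would more likely use an image library than nested lists.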
```python
# SynthOCR-Gen for Devanagari synthetic data generation
# Paper: https://arxiv.org/abs/2601.16113
# Also consider: TextRecognitionDataGenerator (TRDG)
#   https://github.com/Belval/TextRecognitionDataGenerator
# Requirements: Marathi text corpus + Devanagari fonts
#   Text:  L3Cube MahaCorpus (24.8M sentences)
#   Fonts: https://fonts.google.com/?subset=devanagari (50+)
#          https://devanagarifonts.net/ (283+ fonts)
print("Generate unlimited Marathi OCR training data synthetically")
```

| Field | Type | Description |
|---|---|---|
| image | image | Synthetically rendered text image |
| text | string | Ground-truth text used for rendering |
| font | string | Font name used for rendering |
| degradation | string | Applied degradation effects (blur, noise, skew, etc.) |
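
As an illustration of how records with these four fields might be assembled, the sketch below samples a text line, a font, and a set of degradation effects for each record. It is a hedged example, not the tool's actual API. The `Sample` class, the `plan_samples` function, the comma-separated encoding of `degradation`, and the example font names are assumptions. The `image` field is left as `None` because rendering is out of scope here.

```python
import random
from dataclasses import dataclass

@dataclass
class Sample:
    image: object      # rendered image would go here; None in this sketch
    text: str          # ground-truth text used for rendering
    font: str          # font name used for rendering
    degradation: str   # assumed encoding: comma-separated effect names

def plan_samples(lines, fonts, effects, n, seed=0):
    """Draw n (text, font, degradation) combinations at random."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Pick between zero and all effects, without repeats.
        chosen = rng.sample(effects, rng.randint(0, len(effects)))
        samples.append(Sample(
            image=None,
            text=rng.choice(lines),
            font=rng.choice(fonts),
            degradation=",".join(chosen),
        ))
    return samples

# Illustrative inputs; a real run would read lines from a Marathi corpus
# and font names from a directory of Devanagari TTF/OTF files.
lines = ["नमस्कार", "मराठी भाषा"]
fonts = ["Noto Sans Devanagari", "Mukta"]
effects = ["blur", "noise", "skew"]
samples = plan_samples(lines, fonts, effects, n=5)
```

Because every record stores the text, font, and effects that produced it, the label comes for free and any image can be regenerated from its row.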