DevChar - Extensive Dataset for Devanagari Character OCR

A large-scale handwritten Devanagari character dataset containing approximately 4 million character samples, explicitly designed to address the limitations of existing datasets, which fall short on text containing matras (vowel modifiers) and conjuncts (jodakshara). It covers characters with the varying combinations of matras and conjunct forms that appear in real Marathi and Hindi text. Its scale and explicit focus on matras and conjuncts make it well suited to training robust Devanagari OCR systems.

Pre-train a Devanagari character recognizer on DevChar's 4M samples for transfer to word-level Marathi OCR.

Quick Start

# DevChar dataset
# Paper: https://link.springer.com/chapter/10.1007/978-981-16-2911-2_13
# Contact authors for access to DevChar2020

print("DevChar: ~4M handwritten Devanagari character images")
print("Explicitly covers matras and conjunct characters")
print("Addresses key weakness of standard 46-class datasets")

Modality: Image (handwritten character crops with matra/conjunct labels)
Size: ~4 million character images
License: (not stated)
Format: PNG/JPEG
Language: mr, hi
Update Frequency: static
Organization: Research community

Schema

| Field | Type | Description |
| --- | --- | --- |
| image | image | Handwritten Devanagari character image (may include matra modifiers) |
| character_label | string | Unicode character with matra/conjunct annotation |
| has_matra | boolean | Whether character includes a matra modifier |
| is_conjunct | boolean | Whether character is a conjunct form |
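
The two boolean fields can in principle be cross-checked against the label text. The sketch below assumes that character_label holds a plain Unicode string, which this page does not confirm, and it only recognizes the common dependent vowel signs (U+093E to U+094C) and conjuncts of the form consonant + virama + consonant. The helper names are hypothetical and are not part of any DevChar tooling.

```python
# Hypothetical helpers: derive matra/conjunct flags from a Unicode label.
# Assumption: labels are plain Unicode strings; DevChar's real format may differ.

VIRAMA = "\u094D"  # halant, the sign that joins consonants into a conjunct

# Common dependent vowel signs (matras): U+093E (aa) through U+094C (au).
MATRAS = {chr(code) for code in range(0x093E, 0x094D)}


def is_consonant(ch: str) -> bool:
    """True for the basic Devanagari consonants ka (U+0915) to ha (U+0939)."""
    return "\u0915" <= ch <= "\u0939"


def has_matra(label: str) -> bool:
    """True if any code point in the label is a common matra."""
    return any(ch in MATRAS for ch in label)


def is_conjunct(label: str) -> bool:
    """True if the label contains consonant + virama + consonant."""
    return any(
        is_consonant(a) and b == VIRAMA and is_consonant(c)
        for a, b, c in zip(label, label[1:], label[2:])
    )


def flags(label: str) -> dict:
    """Bundle both checks in the shape of the schema fields above."""
    return {"has_matra": has_matra(label), "is_conjunct": is_conjunct(label)}
```

A check like this is useful mainly for spotting rows where the stored flags disagree with the label text. It does not cover rarer vowel signs or nukta consonants.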

Build With This

Create a matra error detector identifying which vowel modifier combinations are most problematic for OCR systems
Develop a conjunct-focused fine-tuning pipeline using DevChar's labeled conjunct samples for targeted OCR improvement
Build a Devanagari character difficulty ranker scoring character classes by OCR error rates to guide data collection priorities
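
The difficulty ranker in the last idea can be sketched as a per-class error-rate tally. This is a minimal illustration, assuming you already have (true label, predicted label) pairs from some OCR model evaluated on labeled samples; the function name and the tie-breaking rule are hypothetical choices, not something prescribed by the dataset.

```python
from collections import Counter


def rank_by_error_rate(pairs):
    """Rank character classes from hardest to easiest.

    pairs: iterable of (true_label, predicted_label) tuples.
    Returns a list of (label, error_rate, sample_count) tuples.
    """
    totals = Counter()
    errors = Counter()
    for true_label, predicted_label in pairs:
        totals[true_label] += 1
        if predicted_label != true_label:
            errors[true_label] += 1

    ranked = [
        (label, errors[label] / totals[label], totals[label])
        for label in totals
    ]
    # Highest error rate first; ties go to the class with more samples,
    # then to label order so the result is deterministic.
    ranked.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return ranked
```

Classes at the top of the returned list are candidates for additional data collection or targeted fine-tuning. With few samples per class the rates are noisy, so a minimum sample count filter may be worth adding.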

AI Use Cases

Matra-aware Devanagari character recognition
Conjunct character recognition at scale
Robust Devanagari OCR model pre-training
Character-level error analysis for OCR systems
Last verified: 2026-03-12