DevChar - Extensive Dataset for Devanagari Character OCR

Large-scale handwritten Devanagari character dataset containing approximately 4 million character samples, explicitly designed to address a limitation of existing datasets: recognizers trained on them fail on text containing matras (vowel modifiers) and conjuncts (jodakshara). It covers characters with varying combinations of matras and conjunct forms that appear in real Marathi/Hindi text. Its scale and explicit focus on matras and conjuncts make it a strong resource for training robust Devanagari OCR systems.

Pre-train a Devanagari character recognizer on DevChar's 4M samples for transfer to word-level Marathi OCR.

Quick Start

# DevChar dataset
# Paper: https://link.springer.com/chapter/10.1007/978-981-16-2911-2_13
# Contact authors for access to DevChar2020

print("DevChar: ~4M handwritten Devanagari character images")
print("Explicitly covers matras and conjunct characters")
print("Addresses key weakness of standard 46-class datasets")
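
Once access has been granted, a natural first step is to enumerate the image files. The sketch below assumes, hypothetically, that the images sit somewhere under a single local directory. The actual DevChar2020 layout is not documented on this page, so `list_image_files` and the directory name in the usage note are illustrative only. The PNG/JPEG suffixes follow the Format field listed below.

```python
from pathlib import Path

# Suffixes taken from the Format field (PNG/JPEG); matching is case-insensitive.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def list_image_files(root):
    """Return a sorted list of image paths found anywhere under `root`.

    `root` is a hypothetical local directory holding the dataset files.
    """
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )
```

A call such as `list_image_files("devchar2020/")` (a placeholder path) would return `Path` objects that can then be handed to whichever image library the training pipeline uses.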

Modality

Image (handwritten character crops with matra/conjunct labels)

Size

~4 million character images

License

Research use (DevChar2020)

Format

PNG/JPEG

Language

mr, hi

Update Frequency

static

Organization

Research community

Schema

Field	Type	Description
image	image	Handwritten Devanagari character image (may include matra modifiers)
character_label	string	Unicode character with matra/conjunct annotation
has_matra	boolean	Whether character includes a matra modifier
is_conjunct	boolean	Whether character is a conjunct form
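
The schema can be mirrored in a small record type. The sketch below is an assumption-laden illustration, not the dataset's own tooling: `DevCharRecord`, `make_record`, and the `image_path` field (a stand-in for the `image` column) are hypothetical names. It also shows one way the two boolean flags could be derived from `character_label` using Unicode code points, which is useful as a consistency check if the distributed labels follow standard Unicode encoding. The conjunct test is deliberately simple (consonant, virama, consonant) and ignores nukta and joiner characters.

```python
from dataclasses import dataclass

VIRAMA = "\u094D"  # Devanagari sign virama (halant)


def is_consonant(ch: str) -> bool:
    # Consonants KA..HA (U+0915-U+0939) and the additional consonants U+0958-U+095F.
    return "\u0915" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095F"


def is_matra(ch: str) -> bool:
    # Common dependent vowel signs: U+093E-U+094C, plus vocalic L/LL at U+0962-U+0963.
    return "\u093E" <= ch <= "\u094C" or "\u0962" <= ch <= "\u0963"


def label_has_matra(label: str) -> bool:
    return any(is_matra(ch) for ch in label)


def label_is_conjunct(label: str) -> bool:
    # Simplification: a conjunct is any consonant + virama + consonant run.
    return any(
        is_consonant(a) and b == VIRAMA and is_consonant(c)
        for a, b, c in zip(label, label[1:], label[2:])
    )


@dataclass
class DevCharRecord:
    image_path: str        # hypothetical stand-in for the `image` field
    character_label: str   # Unicode string, possibly with matra/conjunct
    has_matra: bool
    is_conjunct: bool


def make_record(image_path: str, character_label: str) -> DevCharRecord:
    """Build a record, deriving both flags from the Unicode label."""
    return DevCharRecord(
        image_path=image_path,
        character_label=character_label,
        has_matra=label_has_matra(character_label),
        is_conjunct=label_is_conjunct(character_label),
    )
```

If the dataset ships its own `has_matra` and `is_conjunct` values, comparing them against the derived flags is a cheap way to spot labeling or encoding inconsistencies before training.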

Build With This

Create a matra error detector identifying which vowel modifier combinations are most problematic for OCR systems

Develop a conjunct-focused fine-tuning pipeline using DevChar's labeled conjunct samples for targeted OCR improvement

Build a Devanagari character difficulty ranker scoring character classes by OCR error rates to guide data collection priorities
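
The difficulty ranker idea can be sketched in a few lines. This is a minimal illustration under stated assumptions: the input is a list of `(true_label, predicted_label)` pairs produced by some OCR model on held-out samples, and `rank_classes_by_error_rate` and `min_samples` are hypothetical names, not part of any DevChar tooling.

```python
from collections import Counter


def rank_classes_by_error_rate(pairs, min_samples=1):
    """Rank character classes from hardest to easiest.

    `pairs` is an iterable of (true_label, predicted_label) tuples.
    Returns a list of (label, error_rate, sample_count) tuples, sorted by
    descending error rate, then descending sample count, then label.
    Classes with fewer than `min_samples` samples are dropped.
    """
    totals = Counter()
    errors = Counter()
    for true_label, predicted_label in pairs:
        totals[true_label] += 1
        if predicted_label != true_label:
            errors[true_label] += 1

    ranking = [
        (label, errors[label] / count, count)
        for label, count in totals.items()
        if count >= min_samples
    ]
    ranking.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return ranking
```

Raising `min_samples` keeps rare classes with only a handful of evaluation samples from dominating the top of the list. The same function also serves the matra and conjunct ideas above if the labels passed in are restricted to samples flagged `has_matra` or `is_conjunct`.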

AI Use Cases

Matra-aware Devanagari character recognition

Conjunct character recognition at scale

Robust Devanagari OCR model pre-training

Character-level error analysis for OCR systems

Related Datasets

AIKOSH IIT Bombay Indic Datasets (IndiaAI)

multimodal

Bharat Scene Text Dataset (BSTD)

Image (scene text)

CHIPS - Corpus of Handwritten Indic Scripts (Page-Level OCR)

Image (full-page handwritten documents with detection + recognition annotations)

CMATERdb - Devanagari-Roman Mixed-Script Handwritten Documents

Image (handwritten mixed-script document pages with word-level annotations)

Last verified: 2026-03-12