IIIT-INDIC-HW-WORDS - Large-Scale Handwritten Indic Words

MH Subset Needed

Large-scale handwritten word dataset containing 872,000 word instances across 10 Indic scripts, including Devanagari, written by 135 writers. That works out to an average of roughly 6,460 word instances per writer. The dataset includes word-level bounding box annotations and Unicode transcriptions. Its scale and writer diversity make it well suited to training handwritten text recognition systems that generalize across writing styles. The Devanagari subset is directly applicable to Marathi handwriting recognition.

Train a writer-independent Marathi handwriting recognizer using the Devanagari subset of this large-scale dataset.
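A writer-independent recognizer is usually evaluated on writers it never saw during training, so the split has to be made on writer identity rather than on individual word images. The following is a minimal sketch of such a split. It assumes each sample is a dict carrying the `writer_id` field from the schema; the function name `split_by_writer`, the fractions, and the toy data are illustrative and not part of the dataset's tooling.

```python
import random


def split_by_writer(samples, val_frac=0.1, test_frac=0.1, seed=0):
    """Partition samples so that no writer_id appears in more than one split."""
    # Shuffle the unique writer IDs deterministically, then carve off held-out writers.
    writers = sorted({s["writer_id"] for s in samples})
    random.Random(seed).shuffle(writers)
    n_test = max(1, round(len(writers) * test_frac))
    n_val = max(1, round(len(writers) * val_frac))
    test_writers = set(writers[:n_test])
    val_writers = set(writers[n_test:n_test + n_val])

    parts = {"train": [], "val": [], "test": []}
    for s in samples:
        if s["writer_id"] in test_writers:
            parts["test"].append(s)
        elif s["writer_id"] in val_writers:
            parts["val"].append(s)
        else:
            parts["train"].append(s)
    return parts


# Toy data (hypothetical): 10 writers with 5 word samples each.
samples = [{"writer_id": f"w{i % 10}", "text": "शब्द"} for i in range(50)]
parts = split_by_writer(samples)
print({name: len(items) for name, items in parts.items()})
```

Splitting on writers instead of samples keeps a writer's style from leaking between training and evaluation, which is what "writer-independent" is meant to measure.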

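Maharashtra subset not yet extracted. This is a multi-script dataset that includes Devanagari, the script used to write Marathi. The schema lists no geographic fields, so the relevant subset is obtained by filtering on the script identifier (Devanagari) rather than on geographic coordinates or administrative boundaries.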
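For this dataset the practical selector is the `script` field listed in the schema. A minimal sketch of that filter follows, assuming the records have already been loaded as dicts with the schema's field names; the helper `filter_by_script` and the sample records are hypothetical. Note that the filter selects by script, so Hindi and Marathi words written in Devanagari are both kept.

```python
def filter_by_script(records, script="Devanagari"):
    """Keep only records whose script field matches, ignoring letter case."""
    wanted = script.casefold()
    return [r for r in records if r.get("script", "").casefold() == wanted]


# Hypothetical records using the field names from the schema.
records = [
    {"image": "a.png", "text": "नमस्ते", "script": "Devanagari", "writer_id": "w1"},
    {"image": "b.png", "text": "வணக்கம்", "script": "Tamil", "writer_id": "w2"},
    {"image": "c.png", "text": "पुस्तक", "script": "devanagari", "writer_id": "w3"},
]
devanagari = filter_by_script(records)
print(len(devanagari), "Devanagari records")
```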

Homepage

Quick Start

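# Request access from CVIT, IIIT Hyderabad
# https://cvit.iiit.ac.in/research/projects/cvit-projects/iiit-indic-hw-words
from pathlib import Path

from PIL import Image

# After downloading, filter Devanagari (applicable to Marathi)
# Assumed layout: devanagari_words/ contains word images with transcription files
root = Path("devanagari_words")
images = sorted(p for p in root.rglob("*") if p.suffix.lower() in {".png", ".jpg", ".jpeg"})
print(f"Found {len(images)} Devanagari word images for Marathi OCR training")
if images:
    # Open the first crop to confirm the files are readable
    with Image.open(images[0]) as img:
        print(images[0].name, img.size)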

Modality

Image (handwritten word crops with transcriptions)

Size

872K word instances; 135 writers; 10 Indic scripts

License

Research use (CVIT/IIIT-H)

Format

PNG/JPEG with text labels

Language

mr, hi, bn, ta, te, kn, ml, gu, pa, or

Update Frequency

static

Organization

CVIT, IIIT Hyderabad

Schema

Field	Type	Description
image	image	Cropped handwritten word image
text	string	Unicode ground-truth transcription
script	string	Script identifier (Devanagari for Marathi/Hindi)
writer_id	string	Unique writer identifier
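The schema above can be mirrored as a small typed record, which also makes it easy to sanity-check that a transcription really is Devanagari before it enters a Marathi training set. This is a sketch under assumptions: the class name `WordSample`, the choice to store the image as a file path, and the sample values are illustrative. The only external fact relied on is that the Unicode Devanagari block spans U+0900 to U+097F.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordSample:
    image_path: str  # path to the cropped word image (assumed to be stored on disk)
    text: str        # Unicode ground-truth transcription
    script: str      # script identifier, e.g. "Devanagari"
    writer_id: str   # unique writer identifier


def is_devanagari(text: str) -> bool:
    """True if every non-space character lies in the Devanagari block U+0900-U+097F.

    Joiner characters such as ZWJ/ZWNJ fall outside this block and are rejected
    by this strict check; relax it if the transcriptions contain them.
    """
    chars = [c for c in text if not c.isspace()]
    return bool(chars) and all("\u0900" <= c <= "\u097F" for c in chars)


# Hypothetical sample for illustration.
sample = WordSample("word_0001.png", "मराठी", "Devanagari", "writer_001")
print(is_devanagari(sample.text))
```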

Build With This

Create a handwriting-to-text API for Marathi documents using IIIT-INDIC-HW-WORDS as the primary training corpus

Develop a writer identification system for Marathi handwriting that can distinguish between 100+ writing styles

Build a handwriting difficulty analyzer scoring legibility of Marathi handwritten samples for quality control

AI Use Cases

Large-scale handwritten Marathi word recognition

Writer-independent handwriting model training

Cross-script transfer learning for handwritten OCR

Handwriting style analysis and generation

Related Datasets

AIKOSH IIT Bombay Indic Datasets (IndiaAI)

multimodal

Bharat Scene Text Dataset (BSTD)

Image (scene text)

CHIPS - Corpus of Handwritten Indic Scripts (Page-Level OCR)

Image (full-page handwritten documents with detection + recognition annotations)

CMATERdb - Devanagari-Roman Mixed-Script Handwritten Documents

Image (handwritten mixed-script document pages with word-level annotations)

Last verified: 2026-03-12