MODI-HChar - Historical Modi Script Handwritten Character Dataset

MH Specific

MODI-HChar is a large-scale handwritten Modi script character dataset containing 575,920 character images across 57 classes (10 numerals, 12 vowels, 35 consonants). Modi script was the primary writing system for Marathi from the 12th to the 20th century. Because the dataset is labeled at the character level, it supports training classifiers for the character recognition stage that underlies OCR of historical Marathi documents. At 575,920 images (roughly 10,100 per class on average), it offers enough variety to train classifiers that generalize across different historical writing styles.

Train a Modi script character classifier achieving high accuracy across all 57 character classes.
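
Because the goal is accuracy across all 57 classes rather than only on average, per-class accuracy is a useful check alongside the overall figure. The following is a minimal sketch, not part of the dataset's tooling; the function names `per_class_accuracy` and `macro_accuracy` are hypothetical, and labels are assumed to be integer class IDs.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred, num_classes=57):
    """Return one accuracy per class; None for classes absent from y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return [correct[c] / totals[c] if totals[c] else None
            for c in range(num_classes)]


def macro_accuracy(per_class):
    """Unweighted mean over the classes that appear in the evaluation set."""
    seen = [a for a in per_class if a is not None]
    return sum(seen) / len(seen) if seen else 0.0


# Toy example with 3 classes (illustrative labels, not dataset results)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc = per_class_accuracy(y_true, y_pred, num_classes=3)
print(acc)  # [0.5, 1.0, 0.5]
print(macro_accuracy(acc))
```

A low value for any single class points to characters that the classifier confuses, which an overall accuracy number can hide.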

Homepage

Quick Start

# Download from IEEE DataPort or Kaggle, then extract to modi_hchar/
# https://www.kaggle.com/datasets/msd6013/modi-hdc-historical-handwritten-modi-script
from torchvision import datasets, transforms

# Convert each crop to a single-channel 32x32 tensor with values in [0, 1]
transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
])

# Load as an image folder dataset (expects 57 class subdirectories)
dataset = datasets.ImageFolder('modi_hchar/', transform=transform)
print(f"Total images: {len(dataset)}, Classes: {len(dataset.classes)}")
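
The tensors produced by the Quick Start transform have shape 1x32x32, so a small convolutional network is one plausible starting point. The sketch below is a hypothetical baseline, not an architecture published with the dataset; the class name `SmallModiCNN` and the layer sizes are assumptions chosen for illustration.

```python
import torch
from torch import nn


class SmallModiCNN(nn.Module):
    """Hypothetical baseline: two conv blocks followed by a linear classifier."""

    def __init__(self, num_classes: int = 57):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # 1x32x32 -> 32x32x32
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 32x16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # -> 64x16x16
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 64x8x8
        )
        self.classifier = nn.Linear(64 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, start_dim=1))


model = SmallModiCNN()
dummy_batch = torch.zeros(4, 1, 32, 32)  # stands in for a DataLoader batch
logits = model(dummy_batch)
print(logits.shape)  # torch.Size([4, 57])
```

In practice the dummy batch would be replaced by batches from a `torch.utils.data.DataLoader` wrapping the `ImageFolder` dataset, with `nn.CrossEntropyLoss` applied to the logits.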

Modality

Image (handwritten character crops)

Size

575,920 character images; 57 classes

License

CC BY 4.0

Format

PNG/JPEG

Language

Marathi (Modi script)

Update Frequency

static

Organization

Research community

Schema

Field	Type	Description
image	image	Cropped handwritten Modi script character image
character_class	string	Modi character label (vowel, consonant, or numeral)
class_id	int	Numeric class identifier (0-56)
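
Since every record carries a `class_id`, a train/validation split can be stratified so that each of the 57 classes keeps roughly the same proportion in both parts. The sketch below uses only the standard library; the function name `stratified_split` and the 20% validation fraction are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(class_ids, val_fraction=0.2, seed=0):
    """Split sample indices into train/val, preserving per-class proportions."""
    by_class = defaultdict(list)
    for index, class_id in enumerate(class_ids):
        by_class[class_id].append(index)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train_idx, val_idx = [], []
    for class_id in sorted(by_class):
        indices = by_class[class_id]
        rng.shuffle(indices)
        n_val = int(round(len(indices) * val_fraction))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy labels: 3 classes with 10 samples each (not real dataset labels)
toy_ids = [c for c in range(3) for _ in range(10)]
train_idx, val_idx = stratified_split(toy_ids, val_fraction=0.2)
print(len(train_idx), len(val_idx))  # 24 6
```

With the `ImageFolder` dataset from the Quick Start, `dataset.targets` can serve as `class_ids` and `torch.utils.data.Subset(dataset, train_idx)` wraps the result. Note that `ImageFolder` assigns indices by sorted directory name, which may not match the dataset's own `class_id` numbering.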

Build With This

Create a Modi-Devanagari character mapping tool that visually maps Modi characters to their Devanagari equivalents

Develop a Modi script handwriting recognition model combining character classification with word-level context

Build a historical document transcription assistant that segments Modi text and classifies individual characters

AI Use Cases

Modi script character classification

Historical Marathi OCR character recognition stage

Script identification (Modi vs. Devanagari)

Historical handwriting analysis

Related Datasets

AIKOSH IIT Bombay Indic Datasets (IndiaAI)

multimodal

Bharat Scene Text Dataset (BSTD)

Image (scene text)

CHIPS - Corpus of Handwritten Indic Scripts (Page-Level OCR)

Image (full-page handwritten documents with detection + recognition annotations)

CMATERdb - Devanagari-Roman Mixed-Script Handwritten Documents

Image (handwritten mixed-script document pages with word-level annotations)

Last verified: 2026-03-12