A large-scale dataset of handwritten Modi script characters, containing 575,920 character images across 57 classes (10 numerals, 12 vowels, 35 consonants). Modi script was the primary writing system for Marathi from the 12th to the 20th century. As a character-level dataset, it supports training classifiers for the character recognition stage that OCR of historical Marathi documents builds on. With roughly 576K images, it is large enough to capture substantial variation in handwriting across historical writing styles.

```python
# Download from IEEE DataPort or Kaggle
# https://www.kaggle.com/datasets/msd6013/modi-hdc-historical-handwritten-modi-script
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Grayscale(),
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
])

# Load as image folder dataset (57 class subdirectories)
dataset = datasets.ImageFolder('modi_hchar/', transform=transform)
print(f"Total images: {len(dataset)}, Classes: {len(dataset.classes)}")
```

| Field | Type | Description |
|---|---|---|
| image | image | Cropped handwritten Modi script character image |
| character_class | string | Modi character label (vowel, consonant, or numeral) |
| class_id | int | Numeric class identifier (0-56) |
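
When the images are loaded with torchvision's `ImageFolder`, two practical details are worth handling before training: a validation split that keeps all 57 classes represented, and the relationship between `ImageFolder`'s indices and the `class_id` column above. The following is a minimal sketch, not part of the dataset's own tooling. The helper names `stratified_split` and `numeric_label_map` are hypothetical, and the second helper assumes the class subdirectories are named with integers, which is not confirmed by the source. `ImageFolder` assigns indices by sorting folder names as strings, so under that assumption a folder named `10` would come before `2`.

```python
import random
from collections import defaultdict


def stratified_split(targets, val_fraction=0.1, seed=0):
    """Split sample indices into train/val lists, keeping class proportions.

    `targets` is a sequence of integer class labels, one per sample
    (for example, the `targets` attribute of an ImageFolder dataset).
    """
    by_class = defaultdict(list)
    for index, label in enumerate(targets):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # At least one validation sample per class; a class with a single
        # sample therefore ends up in validation only.
        n_val = max(1, round(len(indices) * val_fraction))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


def numeric_label_map(class_names):
    """Map ImageFolder indices to integer folder names.

    Assumption: every folder name parses as an integer. `class_names`
    is the string-sorted list that ImageFolder exposes as `classes`.
    """
    return {idx: int(name) for idx, name in enumerate(class_names)}
```

With the `dataset` object from the loading example, `stratified_split(dataset.targets)` returns index lists that can be wrapped with `torch.utils.data.Subset(dataset, train_idx)` and `Subset(dataset, val_idx)`. If the folder names are not numeric, `dataset.class_to_idx` still gives the name-to-index mapping that `ImageFolder` actually used.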