MODI-HHDoc - Historical Modi Script Handwritten Document Dataset

MH Specific

A collection of 3,350 handwritten historical Modi script document images for document-level recognition research. Modi script was used to write Marathi for over 700 years, and vast archives of administrative, legal, and literary documents remain undigitized. The dataset provides full-page scans suitable for training document-level detection and recognition models for historical Marathi manuscripts.

Build a historical Modi script document preprocessing pipeline handling binarization, noise removal, and line segmentation.

Quick Start

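# Download from https://data.mendeley.com/datasets/sg337vf6wn/1
# Also available on IEEE DataPort
import os

from PIL import Image

# Folder containing the extracted images (adjust to your local path)
img_dir = 'modi_hhdoc/'

# Match extensions case-insensitively and sort for a reproducible order
images = sorted(f for f in os.listdir(img_dir) if f.lower().endswith(('.jpg', '.jpeg', '.png')))
print(f"Historical Modi documents: {len(images)}")

# Open the first scan, if any, and report its (width, height) in pixels
if images:
    with Image.open(os.path.join(img_dir, images[0])) as img:
        print(f"Image size: {img.size}")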

Modality: Image (historical handwritten document scans)
Size: 3,350 document images
License: (not specified)
Format: JPEG/PNG
Language: mr (Marathi)
Update Frequency: static
Organization: Research community

Schema

Field        Type    Description
image        image   Full-page historical Modi script document scan
document_id  string  Unique document identifier
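
The two schema fields can be mirrored in a small record type when indexing a local copy. The following is a minimal sketch, not the dataset's official loader: the names `ModiDocument` and `iter_documents` are hypothetical, and it assumes that `document_id` corresponds to the image file name without its extension, which the dataset page does not confirm.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator

# Extensions accepted as scans (the card lists JPEG/PNG as the format)
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


@dataclass(frozen=True)
class ModiDocument:
    """One record following the schema: an image plus its identifier."""
    document_id: str  # ASSUMPTION: file name without extension
    image_path: Path  # location of the full-page scan on disk


def iter_documents(img_dir: str) -> Iterator[ModiDocument]:
    """Yield one record per image file, sorted by path for reproducibility."""
    for path in sorted(Path(img_dir).iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            yield ModiDocument(document_id=path.stem, image_path=path)
```

Calling `list(iter_documents('modi_hhdoc/'))` on the extracted folder should give one record per scan; comparing the count against the stated 3,350 images is a quick completeness check.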

Build With This

Create a Modi script document digitization pipeline combining image enhancement, line segmentation, and character recognition
Develop an archive management system for historical Marathi documents with automatic cataloguing and indexing
Build a document age and condition estimator for historical Marathi manuscripts to prioritize conservation efforts
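
As a starting point for the image enhancement step in the first project idea, the sketch below binarizes a grayscale page with a global Otsu threshold. This is one common choice, not a method prescribed by the dataset; heavily degraded manuscripts may need locally adaptive thresholding instead. The function names are hypothetical, and the code assumes 8-bit grayscale pixels with dark ink on a lighter background.

```python
from typing import List, Sequence


def load_gray_rows(path: str) -> List[List[int]]:
    """Load an image as rows of 8-bit grayscale values (requires Pillow)."""
    from PIL import Image  # imported lazily so the functions below stay stdlib-only

    with Image.open(path) as img:
        gray = img.convert("L")
        width, height = gray.size
        raw = gray.tobytes()  # one byte per pixel, row-major, for mode "L"
    return [list(raw[y * width:(y + 1) * width]) for y in range(height)]


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the threshold t (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]          # number of pixels with value <= t
        sum_dark += t * hist[t]    # sum of those pixel values
        if w_dark == 0:
            continue               # no pixels in the dark class yet
        w_light = total - w_dark
        if w_light == 0:
            break                  # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(rows: Sequence[Sequence[int]]) -> List[List[int]]:
    """Map each pixel to 1 (ink, value <= threshold) or 0 (background)."""
    t = otsu_threshold([p for row in rows for p in row])
    return [[1 if p <= t else 0 for p in row] for row in rows]
```

For a real scan, `binarize(load_gray_rows(path))` returns a 0/1 matrix that later pipeline stages can consume.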

AI Use Cases

Historical Modi script document recognition
Marathi manuscript digitization pipeline development
Document image binarization and enhancement for degraded manuscripts
Historical script detection and segmentation
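
For the segmentation use case, a horizontal projection profile is a simple baseline for splitting a page into text lines. The sketch below is illustrative only: it assumes the page has already been binarized to a matrix where 1 marks ink and 0 marks background, and that lines are roughly horizontal and do not touch. The function name `segment_lines` and its `min_ink` and `min_height` parameters are hypothetical; skewed or overlapping handwriting would need deskewing or a learned segmenter.

```python
from typing import List, Sequence, Tuple


def segment_lines(
    binary: Sequence[Sequence[int]],
    min_ink: int = 1,
    min_height: int = 2,
) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges for text lines, bottom exclusive.

    A row belongs to a line when it contains at least `min_ink` ink pixels.
    Runs shorter than `min_height` rows are discarded as likely noise.
    """
    # Projection profile: number of ink pixels in each row
    profile = [sum(row) for row in binary]

    lines: List[Tuple[int, int]] = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                       # a run of inked rows begins
        elif ink < min_ink and start is not None:
            if y - start >= min_height:
                lines.append((start, y))    # the run ends at a blank row
            start = None
    # Close a run that extends to the bottom edge of the page
    if start is not None and len(profile) - start >= min_height:
        lines.append((start, len(profile)))
    return lines
```

Each returned pair can be used to crop a line image from the original scan before passing it to a recognizer.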
Last verified: 2026-03-12