CHIPS is a page-level handwritten OCR dataset for Indic scripts with both text detection and recognition annotations. It is part of the PLATTER (Page-Level Handwritten Text Recognition) project. Unlike word-level datasets, CHIPS provides full-page handwritten document images with bounding box annotations for text regions plus Unicode transcriptions, which supports training end-to-end page-level OCR systems that handle detection and recognition jointly. It covers multiple Indic scripts, including Devanagari.
# PLATTER project - page-level handwritten OCR
# Paper: https://arxiv.org/abs/2502.06172
# Contact authors for dataset access
print("CHIPS: Page-level handwritten Indic OCR dataset")
print("Supports detection + recognition jointly")
print("Filter Devanagari script pages for Marathi OCR")

| Field | Type | Description |
|---|---|---|
| image | image | Full-page handwritten document scan |
| text_regions | json | Bounding box coordinates for text line regions |
| transcriptions | array | Unicode transcriptions for each detected text region |
| script | string | Script identifier |
| writer_id | string | Writer identifier |
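
The fields above suggest a simple way to work with a page record: pair each bounding box with its transcription, and filter pages by `script`. The following is a minimal sketch under several assumptions that the table does not confirm: that `text_regions` holds a JSON list of `[x_min, y_min, x_max, y_max]` boxes, that `transcriptions` is in the same order as the boxes, and that Devanagari pages carry the script value `"Devanagari"`. The helper names `pair_regions` and `filter_by_script` and the sample records are hypothetical, not part of any released loader.

```python
import json
from typing import Any, Dict, Iterable, List, Tuple

# Assumed box layout (not confirmed by the dataset): [x_min, y_min, x_max, y_max].
Box = Tuple[int, ...]


def pair_regions(record: Dict[str, Any]) -> List[Tuple[Box, str]]:
    """Pair each bounding box with its transcription, assuming matching order."""
    regions = record["text_regions"]
    # The field is typed as json; accept either a serialized string or a parsed list.
    if isinstance(regions, str):
        regions = json.loads(regions)
    texts = record["transcriptions"]
    if len(regions) != len(texts):
        raise ValueError("text_regions and transcriptions differ in length")
    return [(tuple(box), text) for box, text in zip(regions, texts)]


def filter_by_script(
    records: Iterable[Dict[str, Any]], script: str
) -> List[Dict[str, Any]]:
    """Keep records whose script field matches, ignoring case."""
    wanted = script.casefold()
    return [r for r in records if str(r["script"]).casefold() == wanted]


# Made-up records for illustration only; not taken from CHIPS.
sample_records = [
    {
        "image": None,
        "text_regions": "[[10, 20, 300, 60], [10, 70, 280, 110]]",
        "transcriptions": ["नमस्ते", "धन्यवाद"],
        "script": "Devanagari",
        "writer_id": "w001",
    },
    {
        "image": None,
        "text_regions": "[[5, 5, 200, 40]]",
        "transcriptions": ["sample"],
        "script": "Other",
        "writer_id": "w002",
    },
]

devanagari_pages = filter_by_script(sample_records, "devanagari")
for box, text in pair_regions(devanagari_pages[0]):
    print(box, text)
```

The length check is there because a mismatch between boxes and transcriptions would silently misalign labels if `zip` were used alone. If the actual box format differs (for example polygons or `x, y, width, height`), only the `Box` interpretation needs to change.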