MDIW-13 is a multi-script document dataset containing 1,135 document images with 13,979 text lines and 86,655 words across 13 scripts, including Devanagari and Roman (Latin). It was used in the ICDAR 2021 Script Identification in the Wild (SIW) competition. Sources include printed newspapers and handwritten letters, with script annotations at the word and line level. The dataset supports training script identification models that distinguish Devanagari from Latin and other scripts in real-world mixed-script documents, a critical preprocessing step for bilingual Marathi-English OCR pipelines.
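The preprocessing role described above can be sketched as a routing step that sends each word to a script-specific recognizer. This is only an illustration and not part of the dataset's tooling: the `route_words` helper, the engine names, and the `(word_id, script_label)` tuples are hypothetical, and the label strings are assumed to match the dataset's script classes.

```python
from collections import defaultdict

# Hypothetical mapping from script label to a recognizer name.
# Neither the engine names nor this mapping come from the dataset.
ENGINE_FOR_SCRIPT = {
    "Devanagari": "marathi_ocr",
    "Latin": "english_ocr",
}


def route_words(words, fallback="unsupported"):
    """Group word IDs by the recognizer that should handle their script."""
    routed = defaultdict(list)
    for word_id, script_label in words:
        engine = ENGINE_FOR_SCRIPT.get(script_label, fallback)
        routed[engine].append(word_id)
    return dict(routed)


# Toy input: predicted script labels for four words on one page.
predictions = [
    ("w0", "Devanagari"),
    ("w1", "Latin"),
    ("w2", "Devanagari"),
    ("w3", "Arabic"),
]
routed = route_words(predictions)
print(routed)
```

Words in scripts that the bilingual pipeline does not cover fall into the fallback bucket rather than being passed to the wrong recognizer, which is the failure a script identification step is meant to prevent.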
```python
# Download from Zenodo: https://zenodo.org/records/6376096
# Also available on IEEE DataPort
from PIL import Image

# Load multi-script document images with script annotations
print("MDIW-13: 1,135 documents, 86,655 words, 13 scripts")
print("ICDAR 2021 Script Identification in the Wild (SIW)")
```

| Field | Type | Description |
|---|---|---|
| document_image | image | Full document page image |
| word_bbox | json | Word bounding box coordinates |
| script_label | string | Script class label (Devanagari, Latin, Arabic, etc.) |
| text_line | json | Line-level bounding box with script label |
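
Word-level annotations with the fields above might be handled as in the following sketch. The on-disk annotation format is not specified here, so the JSON layout, the `[x_min, y_min, x_max, y_max]` box convention, and the helper name `bbox_size` are all assumptions made for illustration.

```python
import json
from collections import Counter

# Hypothetical word-level rows using the field names from the table.
# The box layout [x_min, y_min, x_max, y_max] is an assumption.
rows_json = """
[
  {"word_bbox": [10, 12, 90, 40], "script_label": "Devanagari"},
  {"word_bbox": [100, 12, 150, 40], "script_label": "Latin"},
  {"word_bbox": [10, 50, 70, 80], "script_label": "Devanagari"}
]
"""


def bbox_size(bbox):
    """Return (width, height) of an [x_min, y_min, x_max, y_max] box."""
    x_min, y_min, x_max, y_max = bbox
    return x_max - x_min, y_max - y_min


rows = json.loads(rows_json)

# Count how many words carry each script label.
script_counts = Counter(row["script_label"] for row in rows)

# Compute the pixel size of every word box.
sizes = [bbox_size(row["word_bbox"]) for row in rows]

print(script_counts)
print(sizes)
```

If the boxes do follow this convention, the same four values could be passed as a tuple to Pillow's `Image.crop`, which takes a `(left, upper, right, lower)` box, to extract word images for a script classifier.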