MDIW-13 - Multi-Script Document Identification in the Wild

MH Subset Needed

Multi-script document dataset containing 1,135 document images with 13,979 text lines and 86,655 words across 13 scripts, including Devanagari and Roman (Latin). Used in the ICDAR 2021 Script Identification in the Wild (SIW) competition. Sources include printed newspapers and handwritten letters, annotated with script labels at both word level and line level. The dataset supports training robust script identification models that distinguish Devanagari from Latin and other scripts in real-world mixed-script documents, which is a critical preprocessing step for bilingual Marathi-English OCR pipelines.

Train a multi-script classifier that preprocesses documents by identifying Devanagari vs. Latin regions before running language-specific OCR.
Maharashtra subset not yet extracted. This is a global dataset that contains data covering Maharashtra. A regional subset can be extracted by filtering on geographic coordinates or administrative boundaries.

Quick Start

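# Download from Zenodo: https://zenodo.org/records/6376096
# Also available on IEEE DataPort
from pathlib import Path
from PIL import Image

# Load multi-script document images (annotation loading is not shown).
# Assumption: the archive is extracted to ./MDIW-13; the actual layout may differ.
for path in sorted(Path("MDIW-13").rglob("*")):
    if path.suffix.lower() in {".jpg", ".jpeg", ".png"}:
        with Image.open(path) as img:
            print(path.name, img.size)

print("MDIW-13: 1,135 documents, 86,655 words, 13 scripts")
print("ICDAR 2021 Script Identification in the Wild (SIW)")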

Modality: Image (document pages with script identification annotations)
Size: 1,135 documents; 13,979 text lines; 86,655 words; 13 scripts
License: (not listed)
Format: JPEG/PNG with word-level script labels
Language: hi, en, ar, bn, ja, ko, ta, te
Update Frequency: static
Organization: ICDAR 2021 SIW Competition

Schema

Field | Type | Description
document_image | image | Full document page image
word_bbox | json | Word bounding box coordinates
script_label | string | Script class label (Devanagari, Latin, Arabic, etc.)
text_line | json | Line-level bounding box with script label
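
As a rough illustration of how word-level entries following this schema might be held in Python, the sketch below tallies words per script. The names `WordAnnotation` and `count_words_by_script` are hypothetical, the `[x, y, width, height]` bounding-box layout is an assumption, and the sample records are made up; the actual annotation files in the archive may be organized differently.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class WordAnnotation:
    """One word-level entry; field names mirror the schema fields."""
    word_bbox: List[int]   # assumed [x, y, width, height]; real format may differ
    script_label: str      # e.g. "Devanagari", "Latin", "Arabic"


def count_words_by_script(words: List[WordAnnotation]) -> Counter:
    """Tally how many annotated words carry each script label."""
    return Counter(w.script_label for w in words)


# Made-up sample records for illustration only (not taken from the dataset).
sample_words = [
    WordAnnotation([10, 12, 80, 24], "Devanagari"),
    WordAnnotation([95, 12, 60, 24], "Latin"),
    WordAnnotation([10, 40, 70, 24], "Devanagari"),
]
script_counts = count_words_by_script(sample_words)
print(script_counts)
```

A per-script tally like this is a quick way to check class balance before training or evaluating a script classifier.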

Build With This

Create a script-aware OCR router that automatically selects Marathi or English OCR models based on detected script regions
Develop a bilingual document segmenter splitting mixed Marathi-English pages into script-homogeneous zones for targeted OCR
Build a script identification benchmark for Indian documents comparing deep learning approaches on the MDIW-13 Devanagari subset
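
The first two ideas can be sketched together under some simplifying assumptions: per-word script labels are already available in reading order, and a zone is a run of consecutive words sharing one script. The function names and the model identifiers `marathi-ocr` and `english-ocr` are placeholders invented for this example, not real packages or part of the dataset.

```python
from itertools import groupby
from typing import Dict, List, Tuple

# Hypothetical mapping from script label to an OCR model identifier.
# The model names are placeholders, not real packages.
OCR_MODEL_FOR_SCRIPT: Dict[str, str] = {
    "Devanagari": "marathi-ocr",
    "Latin": "english-ocr",
}


def split_into_zones(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Merge consecutive words sharing a script into (script, start, end) zones.

    `labels` holds per-word script labels in reading order; `end` is exclusive.
    """
    zones = []
    start = 0
    for script, run in groupby(labels):
        length = len(list(run))
        zones.append((script, start, start + length))
        start += length
    return zones


def route_zone(script: str, default: str = "skip") -> str:
    """Pick the OCR model for a zone; unknown scripts fall back to `default`."""
    return OCR_MODEL_FOR_SCRIPT.get(script, default)


# Made-up line of word labels for illustration.
line_labels = ["Devanagari", "Devanagari", "Latin", "Latin", "Latin", "Devanagari"]
zones = split_into_zones(line_labels)
routes = [(start, end, route_zone(script)) for script, start, end in zones]
print(routes)
```

Grouping by runs keeps the router independent of any particular classifier: any model that emits one script label per word could feed `split_into_zones`. A real pipeline would likely also merge zones spatially using the bounding boxes, which this sketch leaves out.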

AI Use Cases

Multi-script identification in documents
Preprocessing for bilingual Marathi-English OCR
Script-aware document layout analysis
Multilingual document routing and classification
Last verified: 2026-03-12