MDIW-13 - Multi-Script Document Identification in the Wild

MH Subset Needed

Multi-script document dataset containing 1,135 document images with 13,979 text lines and 86,655 words across 13 scripts, including Devanagari and Roman (Latin). It was used in the ICDAR 2021 Script Identification in the Wild (SIW) competition. Sources include printed newspapers and handwritten letters, with word-level and line-level script annotations. The dataset supports training robust script identification models that distinguish Devanagari from Latin and other scripts in real-world mixed-script documents, a critical preprocessing step for bilingual Marathi-English OCR pipelines.

Train a multi-script classifier that preprocesses documents by identifying Devanagari vs. Latin regions before running language-specific OCR.
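
The routing step described above can be pictured with a small sketch. It rests on assumptions: the script label for each region is taken as already predicted by a classifier, and `OCR_BACKENDS` with its two placeholder functions is hypothetical, standing in for real OCR engines that this page does not name.

```python
from typing import Callable, Dict, List, Tuple


# Placeholder OCR backends (hypothetical): a real pipeline would call an
# actual OCR engine configured for Marathi or English here.
def ocr_marathi(region_id: str) -> str:
    return f"<marathi text from {region_id}>"


def ocr_english(region_id: str) -> str:
    return f"<english text from {region_id}>"


# Map predicted script label -> OCR function (label names are assumptions)
OCR_BACKENDS: Dict[str, Callable[[str], str]] = {
    "Devanagari": ocr_marathi,
    "Latin": ocr_english,
}


def route_regions(regions: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Send each (region_id, script_label) pair to the matching OCR backend.

    Regions whose script has no registered backend are skipped rather than
    forced through the wrong model.
    """
    results = []
    for region_id, script in regions:
        backend = OCR_BACKENDS.get(script)
        if backend is None:
            continue
        results.append((region_id, backend(region_id)))
    return results


# Illustrative input: script labels as a classifier might predict them
regions = [("r1", "Devanagari"), ("r2", "Latin"), ("r3", "Arabic")]
for region_id, text in route_regions(regions):
    print(region_id, text)
```

Skipping unsupported scripts is one possible design choice; a production pipeline might instead log them or fall back to a generic model.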

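Maharashtra subset not yet extracted. This is a global dataset that contains data covering Maharashtra. The schema lists no geographic fields, so a regional subset would most plausibly be extracted by filtering on script label (for example, keeping Devanagari and Latin samples) rather than on geographic coordinates or administrative boundaries.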

Quick Start

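# Download from Zenodo: https://zenodo.org/records/6376096
# Also available on IEEE DataPort
from pathlib import Path
from PIL import Image

# MDIW-13: 1,135 documents, 86,655 words, 13 scripts (ICDAR 2021 SIW)
# Assumption: the archive was extracted to ./MDIW-13 (hypothetical path; adjust to your layout)
root = Path("MDIW-13")
images = sorted(p for p in root.rglob("*") if p.suffix.lower() in {".jpg", ".jpeg", ".png"})
print(f"Found {len(images)} image files under {root}")
if images:
    with Image.open(images[0]) as img:  # inspect the first image as a sanity check
        print(images[0].name, img.size, img.mode)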

Modality

Image (document pages with script identification annotations)

Size

1,135 documents; 13,979 text lines; 86,655 words; 13 scripts

License

Open access

Format

JPEG/PNG with word-level script labels

Language

hi, en, ar, bn, ja, ko, ta, te

Update Frequency

static

Organization

ICDAR 2021 SIW Competition

Schema

Field	Type	Description
document_image	image	Full document page image
word_bbox	json	Word bounding box coordinates
script_label	string	Script class label (Devanagari, Latin, Arabic, etc.)
text_line	json	Line-level bounding box with script label
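
The fields above could be modelled roughly as follows. This is a minimal sketch, not the dataset's official loader: the `WordAnnotation` class, the `(x, y, width, height)` box layout, and the `filter_by_script` helper are all assumptions made for illustration, so check the downloaded annotation files for the actual format.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class WordAnnotation:
    """One word-level record (hypothetical layout, not the official format)."""
    bbox: Tuple[int, int, int, int]  # assumed (x, y, width, height) in pixels
    script_label: str                # e.g. "Devanagari", "Latin", "Arabic"


def filter_by_script(words: Iterable[WordAnnotation],
                     keep: Iterable[str]) -> List[WordAnnotation]:
    """Return only the words whose script label is in `keep` (case-insensitive)."""
    wanted = {label.lower() for label in keep}
    return [w for w in words if w.script_label.lower() in wanted]


# Made-up example records, for illustration only
words = [
    WordAnnotation((10, 12, 80, 30), "Devanagari"),
    WordAnnotation((100, 12, 60, 30), "Latin"),
    WordAnnotation((170, 12, 75, 30), "Arabic"),
]
subset = filter_by_script(words, {"Devanagari", "Latin"})
print([w.script_label for w in subset])
```

A filter of this kind is one way to keep only the Devanagari and Latin samples relevant to Marathi-English work.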

Build With This

Create a script-aware OCR router that automatically selects Marathi or English OCR models based on detected script regions

Develop a bilingual document segmenter splitting mixed Marathi-English pages into script-homogeneous zones for targeted OCR

Build a script identification benchmark for Indian documents comparing deep learning approaches on the MDIW-13 Devanagari subset
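
As a starting point for the segmenter idea, the sketch below merges consecutive same-script words on a text line into script-homogeneous zones. It assumes words arrive in reading order with `(x_min, y_min, x_max, y_max)` boxes and a script label each; the `script_zones` and `union` helpers and the sample coordinates are hypothetical, not part of the dataset.

```python
from itertools import groupby
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # assumed (x_min, y_min, x_max, y_max)


def union(boxes: List[Box]) -> Box:
    """Smallest box enclosing all the given boxes."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def script_zones(line_words: List[Tuple[Box, str]]) -> List[Tuple[str, Box]]:
    """Group consecutive same-script words (in reading order) into zones."""
    zones = []
    for script, run in groupby(line_words, key=lambda w: w[1]):
        zones.append((script, union([box for box, _ in run])))
    return zones


# Made-up line: two Devanagari words followed by two Latin words
line = [
    ((10, 10, 60, 40), "Devanagari"),
    ((65, 12, 120, 40), "Devanagari"),
    ((130, 10, 180, 38), "Latin"),
    ((185, 11, 240, 39), "Latin"),
]
for script, box in script_zones(line):
    print(script, box)
```

Each resulting zone can then be cropped and passed to the OCR model for its script. Grouping only adjacent words keeps a line such as Devanagari, Latin, Devanagari as three zones instead of merging the two Devanagari runs across the Latin text.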

AI Use Cases

Multi-script identification in documents

Preprocessing for bilingual Marathi-English OCR

Script-aware document layout analysis

Multilingual document routing and classification

Related Datasets

AIKOSH IIT Bombay Indic Datasets (IndiaAI)

multimodal

Bharat Scene Text Dataset (BSTD)

Image (scene text)

CHIPS - Corpus of Handwritten Indic Scripts (Page-Level OCR)

Image (full-page handwritten documents with detection + recognition annotations)

CMATERdb - Devanagari-Roman Mixed-Script Handwritten Documents

Image (handwritten mixed-script document pages with word-level annotations)

Last verified: 2026-03-12