A benchmark dataset of 150 handwritten document pages in which Devanagari and Roman (Latin/English) text are intermixed on the same page, with word-level script annotations. It contains 15,528 annotated Devanagari words and 10,331 annotated Roman words (25,859 annotated words); 44,790 word images were extracted in total. It is the only publicly available mixed-script document dataset featuring Devanagari-Latin co-occurrence, which makes it essential for training script identification modules in bilingual OCR pipelines that handle mixed Marathi-English documents. The reported baseline reaches 95.30% word-level script identification accuracy.
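As a rough illustration of what a word-level script identification accuracy figure measures, the sketch below scores predicted script labels against annotated ones, one label per word image. The function name `script_id_accuracy` and the toy labels are assumptions made for this example; they are not part of the dataset or of the baseline's evaluation code.

```python
def script_id_accuracy(true_labels, predicted_labels):
    """Return word-level accuracy as a percentage.

    Hypothetical helper: each list holds one script label per word image,
    either "Devanagari" or "Roman".
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one word is required")
    # A word counts as correct when the predicted script matches the annotation.
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


# Toy data only; these are not results from the dataset.
true_labels = ["Devanagari", "Roman", "Devanagari", "Roman"]
predicted_labels = ["Devanagari", "Roman", "Roman", "Roman"]

accuracy = script_id_accuracy(true_labels, predicted_labels)
print(f"Word-level script ID accuracy: {accuracy:.2f}%")
```

On the toy data, three of the four words are labelled correctly, so the function returns 75.0.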
```python
# CMATERdb Devanagari-Roman mixed-script dataset
# Download from: https://code.google.com/archive/p/cmaterdb/
from PIL import Image
import os

# Load mixed-script document pages
# Each page contains both Devanagari and Roman text
print("CMATERdb: 150 mixed-script pages, 25K+ annotated words")
print("Script ID accuracy baseline: 95.30%")
```

| Field | Type | Description |
|---|---|---|
| page_image | image | Full-page mixed-script handwritten document scan |
| word_image | image | Cropped word image |
| script_label | string | Script identifier (Devanagari or Roman) |
| word_bbox | json | Word bounding box coordinates |
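
The fields in the table above could be represented in code roughly as follows. This is a minimal sketch under stated assumptions: the table does not specify the JSON layout of `word_bbox`, so the keys `x`, `y`, `width`, and `height` are hypothetical, as are the `WordRecord` class and the `parse_word_record` helper. The sample rows are invented for illustration.

```python
import json
from collections import Counter
from dataclasses import dataclass

# Script labels named in the schema table.
VALID_SCRIPTS = {"Devanagari", "Roman"}


@dataclass(frozen=True)
class WordRecord:
    """One annotated word (hypothetical container, not an official API)."""
    script_label: str
    x: int
    y: int
    width: int
    height: int

    def crop_box(self):
        # (left, upper, right, lower), the box order used by Pillow's Image.crop.
        return (self.x, self.y, self.x + self.width, self.y + self.height)


def parse_word_record(script_label, word_bbox_json):
    """Validate the label and decode the bounding box.

    Assumes word_bbox is a JSON object with x, y, width, height keys;
    the real layout may differ.
    """
    if script_label not in VALID_SCRIPTS:
        raise ValueError(f"unexpected script label: {script_label!r}")
    box = json.loads(word_bbox_json)
    return WordRecord(script_label, box["x"], box["y"], box["width"], box["height"])


# Invented sample rows: (script_label, word_bbox).
rows = [
    ("Devanagari", '{"x": 10, "y": 20, "width": 60, "height": 30}'),
    ("Roman", '{"x": 80, "y": 20, "width": 45, "height": 28}'),
    ("Devanagari", '{"x": 10, "y": 60, "width": 52, "height": 31}'),
]

records = [parse_word_record(label, bbox) for label, bbox in rows]
script_counts = Counter(record.script_label for record in records)

print(script_counts)
print(records[0].crop_box())
```

With a page loaded through Pillow, the tuple returned by `crop_box()` could be passed to `Image.crop` to recover a `word_image` from its `page_image`, provided the bounding box really uses this layout.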