CMATERdb - Devanagari-Roman Mixed-Script Handwritten Documents

MH Subset Needed

Benchmark dataset of 150 handwritten document pages in which Devanagari and Roman (Latin/English) text is intermixed on the same page, with word-level script annotations. It contains 15,528 annotated Devanagari words and 10,331 Roman words (44,790 total extracted word images). It is described as the only publicly available mixed-script document dataset featuring Devanagari-Latin co-occurrence, which makes it useful for training script identification modules in bilingual OCR pipelines that handle Marathi-English mixed documents. Reported baseline: 95.30% word-level script identification accuracy.
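Since the two script classes are not equally represented, the 95.30% figure is easier to interpret next to a majority-class baseline. The following is a minimal sketch that uses only the two annotated word counts quoted above; the function and variable names are illustrative and not part of the dataset or any official tooling.

```python
# Annotated word counts quoted in the dataset description.
DEVANAGARI_WORDS = 15_528
ROMAN_WORDS = 10_331


def class_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each label's fraction of the total annotated words."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


counts = {"Devanagari": DEVANAGARI_WORDS, "Roman": ROMAN_WORDS}
annotated_total = sum(counts.values())
shares = class_shares(counts)

# A classifier that always predicts the most frequent script
# would reach this accuracy on the annotated words.
majority_label = max(shares, key=shares.get)

print(f"Annotated words: {annotated_total:,}")
for label, share in shares.items():
    print(f"{label}: {share:.2%}")
print(f"Majority-class baseline ({majority_label}): {shares[majority_label]:.2%}")
```

Any script identifier trained on this data should clear the majority-class figure by a wide margin before its accuracy is considered meaningful.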

Build a bilingual script identifier that classifies words as Devanagari or Latin in mixed Marathi-English documents.
Maharashtra (MH) subset not yet extracted. This is a general dataset, not specific to Maharashtra, though it contains data covering the state. A regional subset would have to be extracted by filtering on geographic coordinates or administrative boundaries; note that the schema below lists no such fields.

Quick Start

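# CMATERdb Devanagari-Roman mixed-script dataset
# Download from: https://code.google.com/archive/p/cmaterdb/
import os
from PIL import Image

# Assumption: the archive was extracted to this (hypothetical) local folder
DATA_DIR = "cmaterdb_mixed_script"

# Load mixed-script document pages
# Each page contains both Devanagari and Roman text
if os.path.isdir(DATA_DIR):
    for name in sorted(os.listdir(DATA_DIR)):
        if name.lower().endswith((".jpg", ".jpeg", ".png")):
            with Image.open(os.path.join(DATA_DIR, name)) as page:
                print(name, page.size)  # file name and (width, height)

print("CMATERdb: 150 mixed-script pages, 25K+ annotated words")
print("Script ID accuracy baseline: 95.30%")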
Modality: Image (handwritten mixed-script document pages with word-level annotations)
Size: 150 document pages; 15,528 Devanagari + 10,331 Roman word images
License: (not specified)
Format: JPEG/PNG with word-level annotations
Language: hi, en
Update Frequency: static
Organization: CMATER Lab, Jadavpur University

Schema

Field         Type    Description
page_image    image   Full-page mixed-script handwritten document scan
word_image    image   Cropped word image
script_label  string  Script identifier (Devanagari or Roman)
word_bbox     json    Word bounding box coordinates
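The schema above can be mirrored in code as a small typed record. This is a sketch under stated assumptions: the listing does not say how images are stored or how the bounding box JSON is laid out, so the file paths and the `x`/`y`/`width`/`height` keys below are hypothetical, as are the `WordRecord` and `parse_record` names.

```python
import json
from dataclasses import dataclass

# The two label values named in the schema description.
VALID_SCRIPTS = {"Devanagari", "Roman"}


@dataclass(frozen=True)
class WordRecord:
    page_image: str    # assumption: path to the full-page scan
    word_image: str    # assumption: path to the cropped word image
    script_label: str  # "Devanagari" or "Roman"
    word_bbox: dict    # assumption: decoded JSON object with box coordinates

    def __post_init__(self):
        if self.script_label not in VALID_SCRIPTS:
            raise ValueError(f"unknown script label: {self.script_label!r}")


def parse_record(raw: dict) -> WordRecord:
    """Build a WordRecord, decoding word_bbox if it arrives as a JSON string."""
    bbox = raw["word_bbox"]
    if isinstance(bbox, str):
        bbox = json.loads(bbox)
    return WordRecord(raw["page_image"], raw["word_image"], raw["script_label"], bbox)


# Hypothetical example row; the values are made up for illustration.
example = parse_record({
    "page_image": "pages/page_001.png",
    "word_image": "words/page_001_w0001.png",
    "script_label": "Devanagari",
    "word_bbox": '{"x": 120, "y": 48, "width": 96, "height": 40}',
})
print(example.script_label, example.word_bbox["width"])
```

Validating `script_label` at construction time surfaces annotation typos early, before they silently become a third class during training.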

Build With This

Create a Marathi-English mixed-document OCR pipeline using CMATERdb for script identification training
Develop a synthetic mixed-script document generator creating realistic bilingual Marathi-English pages at scale
Build a bilingual handwriting recognition system handling both Devanagari and Latin scripts in the same document
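One way to picture the pipeline ideas above is as a router: a script identifier labels each word, and the word is then passed to a recognizer for that script. The sketch below shows only this control flow. The identifier and both recognizers are placeholder callables, and none of the names correspond to a real OCR library or to tooling shipped with CMATERdb.

```python
from typing import Callable

Word = dict  # assumption: each word is a dict carrying at least an "id"


def route_words(
    words: list[Word],
    identify_script: Callable[[Word], str],
    recognizers: dict[str, Callable[[Word], str]],
) -> list[tuple[str, str]]:
    """Label each word's script, then run the matching recognizer.

    Reading order is preserved so the page text can be reassembled.
    """
    results = []
    for word in words:
        script = identify_script(word)
        recognizer = recognizers.get(script)
        if recognizer is None:
            raise KeyError(f"no recognizer registered for script {script!r}")
        results.append((script, recognizer(word)))
    return results


# Placeholder components: a real system would use a trained script
# classifier and two script-specific handwriting recognizers here.
words = [
    {"id": "w1", "script_label": "Devanagari"},
    {"id": "w2", "script_label": "Roman"},
]
routed = route_words(
    words,
    identify_script=lambda w: w["script_label"],  # stand-in: reads the annotation
    recognizers={
        "Devanagari": lambda w: f"<devanagari:{w['id']}>",
        "Roman": lambda w: f"<roman:{w['id']}>",
    },
)
print(routed)
```

Keeping the identifier and the recognizers as separate callables lets the script identification stage be trained and evaluated on this dataset independently of whichever recognizers are plugged in later.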

AI Use Cases

Bilingual Devanagari-English script identification
Mixed-script document OCR pipeline training
Handwritten mixed-script word segmentation
Code-mixed document processing
Last verified: 2026-03-12