CMATERdb - Devanagari-Roman Mixed-Script Handwritten Documents

MH Subset Needed

Benchmark dataset of 150 handwritten document pages in which Devanagari and Roman (Latin/English) text are intermixed on the same page, with word-level script annotations. It contains 15,528 annotated Devanagari words and 10,331 annotated Roman words (44,790 total extracted word images). It is the only publicly available mixed-script document dataset featuring Devanagari-Latin co-occurrence, which makes it essential for training script identification modules in bilingual OCR pipelines that handle Marathi-English mixed documents. The reported baseline reaches 95.30% word-level script identification accuracy.
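Word-level script identification accuracy is the fraction of word images whose predicted script matches the annotation. The sketch below shows one way to compute it, overall and per script. The function name and the toy labels are illustrative assumptions, and the snippet does not reproduce the 95.30% baseline.

```python
from collections import Counter


def script_id_accuracy(true_labels, predicted_labels):
    """Return (overall accuracy, per-script accuracy) at word level."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("need at least one word")
    # Number of words per annotated script.
    totals = Counter(true_labels)
    # Number of correctly classified words per annotated script.
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    overall = sum(correct.values()) / len(true_labels)
    per_script = {script: correct[script] / n for script, n in totals.items()}
    return overall, per_script


# Toy labels for illustration only, not taken from the dataset.
truth = ["Devanagari", "Devanagari", "Roman", "Roman", "Devanagari"]
preds = ["Devanagari", "Roman", "Roman", "Roman", "Devanagari"]
overall, per_script = script_id_accuracy(truth, preds)
print(f"overall={overall:.2f}", per_script)
```

Reporting the per-script values alongside the overall figure helps when the two classes are not equally frequent, as is the case here.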

Build a bilingual script identifier that classifies words as Devanagari or Latin in mixed Marathi-English documents.
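As a starting point for such an identifier, one classic cue is the shirorekha, the headline stroke that joins the characters of a Devanagari word along its top. The following is a minimal rule-based sketch, assuming each word image has already been binarized into rows of 0/1 values. The function names and the 0.7 threshold are hypothetical, and this is not the method behind the reported baseline. Handwritten headlines are often broken or slanted, so a trained classifier would be expected to do better.

```python
def headline_strength(binary_word):
    """Largest fraction of ink pixels found in any row of the upper half.

    binary_word: list of equal-length rows, where 1 = ink and 0 = background.
    """
    if not binary_word or not binary_word[0]:
        raise ValueError("empty word image")
    width = len(binary_word[0])
    upper_half = binary_word[: max(1, len(binary_word) // 2)]
    return max(sum(row) / width for row in upper_half)


def guess_script(binary_word, threshold=0.7):
    """Label a word Devanagari if a near-continuous headline row is present."""
    if headline_strength(binary_word) >= threshold:
        return "Devanagari"
    return "Roman"


# Toy 6x8 word images for illustration only.
devanagari_like = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],  # continuous headline
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
roman_like = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
print(guess_script(devanagari_like), guess_script(roman_like))
```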

Maharashtra subset not yet extracted. This is a global dataset that contains data covering Maharashtra. A regional subset can be extracted by filtering on geographic coordinates or administrative boundaries.


Quick Start

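# CMATERdb Devanagari-Roman mixed-script dataset
# Download from: https://code.google.com/archive/p/cmaterdb/
import os
from PIL import Image

# Assumption: the pages were unpacked into this local folder; adjust to your copy.
PAGE_DIR = "cmaterdb_mixed_script/pages"

# Load mixed-script document pages
# Each page contains both Devanagari and Roman text
if os.path.isdir(PAGE_DIR):
    for name in sorted(os.listdir(PAGE_DIR)):
        if name.lower().endswith((".jpg", ".jpeg", ".png")):
            with Image.open(os.path.join(PAGE_DIR, name)) as page:
                print(name, page.size)  # (width, height) in pixels
print("CMATERdb: 150 mixed-script pages, 25K+ annotated words")
print("Script ID accuracy baseline: 95.30%")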

Modality

Image (handwritten mixed-script document pages with word-level annotations)

Size

150 document pages; 15,528 Devanagari + 10,331 Roman word images
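The two annotated counts sum to 25,859 words, split roughly 60/40 in favour of Devanagari. The sketch below works through that arithmetic and derives a majority-class baseline plus inverse-frequency class weights, which are one common way to compensate for the imbalance during training. The variable names are illustrative, and the weighting scheme is an assumption rather than something the dataset prescribes.

```python
DEVANAGARI_WORDS = 15_528
ROMAN_WORDS = 10_331
NUM_CLASSES = 2

total = DEVANAGARI_WORDS + ROMAN_WORDS
devanagari_share = DEVANAGARI_WORDS / total
roman_share = ROMAN_WORDS / total

# Accuracy of a classifier that always predicts the more frequent script.
majority_baseline = max(devanagari_share, roman_share)

# Inverse-frequency weights: total / (number of classes * class count).
class_weights = {
    "Devanagari": total / (NUM_CLASSES * DEVANAGARI_WORDS),
    "Roman": total / (NUM_CLASSES * ROMAN_WORDS),
}

print(f"total words: {total}")
print(f"Devanagari share: {devanagari_share:.2%}, Roman share: {roman_share:.2%}")
print(f"majority-class baseline: {majority_baseline:.2%}")
print(class_weights)
```

Any script identifier trained on these words should be judged against the majority-class baseline of about 60%, not against 50%.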

License

Free for non-commercial research (CMATER lab, Jadavpur University)

Format

JPEG/PNG with word-level annotations

Language

hi, en

Update Frequency

static

Organization

CMATER Lab, Jadavpur University

Schema

Field	Type	Description
page_image	image	Full-page mixed-script handwritten document scan
word_image	image	Cropped word image
script_label	string	Script identifier (Devanagari or Roman)
word_bbox	json	Word bounding box coordinates
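A minimal sketch of how one record under this schema might be represented and validated in code follows. The schema does not specify the layout of word_bbox, so the x, y, w, h keys below are an assumption, as are the file paths, the class name, and the helper names.

```python
import json
from dataclasses import dataclass

VALID_SCRIPTS = {"Devanagari", "Roman"}


@dataclass(frozen=True)
class WordRecord:
    page_image: str    # path to the full-page scan
    word_image: str    # path to the cropped word image
    script_label: str  # "Devanagari" or "Roman"
    word_bbox: dict    # assumed keys: x, y, w, h (pixels)

    def crop_box(self):
        """Return (left, upper, right, lower), the order PIL's Image.crop expects."""
        b = self.word_bbox
        return (b["x"], b["y"], b["x"] + b["w"], b["y"] + b["h"])


def parse_record(row):
    """Build a WordRecord from a dict of raw field values."""
    if row["script_label"] not in VALID_SCRIPTS:
        raise ValueError(f"unexpected script label: {row['script_label']!r}")
    return WordRecord(
        page_image=row["page_image"],
        word_image=row["word_image"],
        script_label=row["script_label"],
        word_bbox=json.loads(row["word_bbox"]),
    )


# Hypothetical row for illustration only.
row = {
    "page_image": "pages/page_001.png",
    "word_image": "words/page_001_w0001.png",
    "script_label": "Devanagari",
    "word_bbox": '{"x": 120, "y": 48, "w": 96, "h": 40}',
}
record = parse_record(row)
print(record.script_label, record.crop_box())
```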

Build With This

Create a Marathi-English mixed-document OCR pipeline using CMATERdb for script identification training

Develop a synthetic mixed-script document generator creating realistic bilingual Marathi-English pages at scale

Build a bilingual handwriting recognition system handling both Devanagari and Latin scripts in the same document
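The pipeline ideas above share one structure: identify the script of each word, then hand the word to a recognizer for that script. A minimal sketch of that routing step follows. The function name, the callable interfaces, and the stand-in recognizers are all hypothetical; real word images and trained models would take their place.

```python
def recognize_page(words, identify_script, recognizers):
    """Route each word to the recognizer registered for its script.

    words: word images in reading order (any objects the callables accept).
    identify_script: callable mapping a word to a script label.
    recognizers: dict mapping a script label to a recognizer callable.
    Returns a list of (script, text) pairs; text is None when no
    recognizer is registered for the predicted script.
    """
    results = []
    for word in words:
        script = identify_script(word)
        recognizer = recognizers.get(script)
        text = recognizer(word) if recognizer is not None else None
        results.append((script, text))
    return results


# Stand-ins for illustration: each "word image" is a dict that already
# carries its script and transcription, so the stubs just look them up.
page_words = [
    {"script": "Devanagari", "text": "नमस्कार"},
    {"script": "Roman", "text": "hello"},
]
stub_identifier = lambda word: word["script"]
stub_recognizers = {
    "Devanagari": lambda word: word["text"],
    "Roman": lambda word: word["text"],
}
page_result = recognize_page(page_words, stub_identifier, stub_recognizers)
print(page_result)
```

Keeping the script identifier separate from the recognizers means a classifier trained on this dataset can be swapped in without touching the recognition models.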

AI Use Cases

Bilingual Devanagari-English script identification

Mixed-script document OCR pipeline training

Handwritten mixed-script word segmentation

Code-mixed document processing
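For code-mixed processing on the text side, for example when checking transcriptions against script labels, the script of a token can be read from its Unicode code points, since Devanagari occupies the block U+0900 to U+097F. The sketch below shows this. The function name, the "Mixed" and "Other" labels, and the sample sentence are illustrative assumptions and are not part of the dataset.

```python
def token_script(token):
    """Classify a text token by the scripts of its characters."""
    has_devanagari = any("\u0900" <= ch <= "\u097f" for ch in token)
    has_roman = any(ch.isascii() and ch.isalpha() for ch in token)
    if has_devanagari and has_roman:
        return "Mixed"
    if has_devanagari:
        return "Devanagari"
    if has_roman:
        return "Roman"
    return "Other"  # digits, punctuation, or other scripts


# Illustrative code-mixed Marathi-English sentence, not from the dataset.
sentence = "मी आज office ला जातो"
tagged = [(token, token_script(token)) for token in sentence.split()]
print(tagged)
```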

Related Datasets

AIKOSH IIT Bombay Indic Datasets (IndiaAI)

multimodal

Bharat Scene Text Dataset (BSTD)

Image (scene text)

CHIPS - Corpus of Handwritten Indic Scripts (Page-Level OCR)

Image (full-page handwritten documents with detection + recognition annotations)

COCO Captions Marathi

Text (caption pairs)

Last verified: 2026-03-12