Mozhi - Printed Document OCR Dataset

MH Subset Needed

Large-scale printed document OCR dataset from IIIT Hyderabad's CVIT lab and the NLTM-Bhashini project containing 1.2 million annotated word images and approximately 120,000 text line images across 13 Indian languages including Marathi. Sourced from scanned books, textbooks, and printed documents. Provides word-level and line-level cropped images paired with Unicode ground-truth transcriptions. The largest publicly available printed Indic OCR dataset, essential for training robust printed text recognizers.

Train a production-grade Marathi printed text recognizer using Mozhi's large-scale word image corpus.

Maharashtra subset not yet extracted. This is a multi-language dataset that includes Marathi data. The schema carries no geographic coordinates, so a Maharashtra-relevant subset would be extracted by filtering on the language field (mr) rather than on administrative boundaries.


Quick Start

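# Request access from CVIT, IIIT Hyderabad
# https://cvit.iiit.ac.in/usodi/tdocrmil.php
import os
from PIL import Image

# After downloading, load word images.
# The directory layout and sidecar .txt label files are assumed; adjust to the actual release.
img_dir = 'mozhi/marathi/words/'
# Keep only image files so that label files are not opened as images
img_files = sorted(f for f in os.listdir(img_dir) if f.endswith('.png'))
for img_file in img_files[:5]:
    img = Image.open(os.path.join(img_dir, img_file))
    label_path = os.path.join(img_dir, os.path.splitext(img_file)[0] + '.txt')
    # Transcriptions are Unicode text; UTF-8 encoding is assumed here
    with open(label_path, encoding='utf-8') as f:
        label = f.read().strip()
    print(f"{img_file}: {label}")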
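
Training a recognizer on these transcriptions usually starts with a character vocabulary. The following is a minimal sketch, assuming each Unicode code point is treated as one symbol and that id 0 is reserved (for example for a CTC blank). The function names build_vocab and encode are hypothetical and not part of any Mozhi tooling. Devanagari conjuncts span several code points, so other tokenizations may suit a given model better.

```python
def build_vocab(labels):
    """Map each distinct Unicode code point to an integer id.

    Ids start at 1; id 0 is left free (assumption: reserved for a blank symbol).
    """
    symbols = sorted({ch for label in labels for ch in label})
    return {ch: idx for idx, ch in enumerate(symbols, start=1)}


def encode(label, vocab):
    """Convert a transcription into a list of integer ids."""
    return [vocab[ch] for ch in label]


# Illustrative labels only; real labels come from the dataset's transcriptions
labels = ["नमस्ते", "पुस्तक"]
vocab = build_vocab(labels)
print(len(vocab), encode(labels[0], vocab))
```

The vocabulary should be built from the training split only, so that characters unseen at training time are handled explicitly rather than silently added.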

Modality

Image (printed text word/line crops with transcriptions)

Size

1.2M+ word images; ~120K line images; 13 languages

License

Research use (NLTM/Bhashini)

Format

PNG/JPEG with text labels

Language

mr, hi, en, bn, ta, te, kn, ml, gu, pa, or, as, ur

Update Frequency

static

Organization

CVIT, IIIT Hyderabad / NLTM-Bhashini

Schema

Field	Type	Description
image	image	Cropped word or line image from printed document
text	string	Unicode ground-truth transcription
language	string	Script/language of the text
level	string	Annotation level (word or line)
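
The schema above can be mirrored in a small record type, which also makes the Marathi selection explicit. This is a sketch under assumptions: the class name MozhiRecord and the image_path attribute (a file path standing in for the image field) are hypothetical, and the language values are assumed to match the codes listed under Language (for example "mr").

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class MozhiRecord:
    """One annotated crop, following the four schema fields (names assumed)."""
    image_path: str  # stand-in for the `image` field
    text: str        # Unicode ground-truth transcription
    language: str    # assumed to hold a code such as "mr" or "hi"
    level: str       # "word" or "line"


def select(records: Iterable[MozhiRecord],
           language: str = "mr",
           level: str = "word") -> Iterator[MozhiRecord]:
    """Yield only the records matching the given language and annotation level."""
    for rec in records:
        if rec.language == language and rec.level == level:
            yield rec


# Illustrative records only; paths and texts are made up for the example
sample = [
    MozhiRecord("a.png", "पुस्तक", "mr", "word"),
    MozhiRecord("b.png", "किताब", "hi", "word"),
    MozhiRecord("c.png", "हे एक पुस्तक आहे", "mr", "line"),
]
marathi_words = list(select(sample))
print([rec.image_path for rec in marathi_words])
```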

Build With This

Create a Marathi book digitization pipeline combining Mozhi-trained recognizer with layout analysis

Run a cross-lingual transfer learning study comparing OCR performance when pre-training on Hindi vs. training directly on Marathi (both languages are written in Devanagari)

Build a font-robust Marathi OCR system by augmenting Mozhi data with synthetic multi-font renderings
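
Comparing recognizers across the projects above needs a shared metric. Character error rate (CER) is a common choice for OCR; the sketch below computes it with a plain Levenshtein distance over Unicode code points. The function names are hypothetical, and counting code points rather than grapheme clusters is an assumption that affects the absolute numbers.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete from ref
                            curr[j - 1] + 1,      # insert into ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars if total_chars else 0.0


# Illustrative strings only; real inputs are ground truth and model predictions
print(cer(["पुस्तक"], ["पुस्तक"]))
```

Evaluating the Hindi-pretrained and Marathi-only models on the same held-out Marathi split keeps the comparison fair.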

AI Use Cases

Printed Marathi text recognition model training

Multi-script OCR system development

Transfer learning from Hindi to Marathi OCR

Document digitization pipeline training

Related Datasets

AIKOSH IIT Bombay Indic Datasets (IndiaAI)

multimodal

Bharat Scene Text Dataset (BSTD)

Image (scene text)

CHIPS - Corpus of Handwritten Indic Scripts (Page-Level OCR)

Image (full-page handwritten documents with detection + recognition annotations)

CMATERdb - Devanagari-Roman Mixed-Script Handwritten Documents

Image (handwritten mixed-script document pages with word-level annotations)

Last verified: 2026-03-12