Mozhi - Printed Document OCR Dataset

MH Subset Needed

Mozhi is a large-scale printed document OCR dataset from the CVIT lab at IIIT Hyderabad and the NLTM-Bhashini project. It contains 1.2 million annotated word images and approximately 120,000 text line images across 13 languages, including Marathi, sourced from scanned books, textbooks, and other printed documents. Each word-level or line-level crop is paired with a Unicode ground-truth transcription. It is described as the largest publicly available printed Indic OCR dataset, which makes it well suited to training robust printed text recognizers.

Train a production-grade Marathi printed text recognizer using Mozhi's large-scale word image corpus.
Maharashtra subset not yet extracted. Mozhi is a multi-language dataset that includes Marathi. The samples carry a language field rather than geographic coordinates, so a regional subset can be extracted by filtering on language (mr).

Quick Start

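# Request access from CVIT, IIIT Hyderabad
# https://cvit.iiit.ac.in/usodi/tdocrmil.php
import os
from PIL import Image

# After downloading, load the first five word images and their labels
img_dir = 'mozhi/marathi/words/'
# Keep only image files so the .txt label files are not opened as images
img_files = sorted(f for f in os.listdir(img_dir) if f.endswith('.png'))
for img_file in img_files[:5]:
    label_path = os.path.join(img_dir, os.path.splitext(img_file)[0] + '.txt')
    # Labels are Unicode transcriptions; UTF-8 encoding is assumed here
    with open(label_path, encoding='utf-8') as f:
        label = f.read().strip()
    with Image.open(os.path.join(img_dir, img_file)) as img:
        print(f"{img_file}: {img.size} {label}")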
Modality: Image (printed text word/line crops with transcriptions)
Size: 1.2M+ word images; ~120K line images; 13 languages
License: not specified
Format: PNG/JPEG with text labels
Language: mr, hi, en, bn, ta, te, kn, ml, gu, pa, or, as, ur
Update Frequency: static
Organization: CVIT, IIIT Hyderabad / NLTM-Bhashini

Schema

Field | Type | Description
image | image | Cropped word or line image from printed document
text | string | Unicode ground-truth transcription
language | string | Script/language of the text
level | string | Annotation level (word or line)
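
As a rough illustration of how files on disk might be mapped onto the schema above, the sketch below walks one language/level folder and yields one record per image. The directory layout (`<root>/<language>/<words|lines>/`), the `lines` folder name, the same-stem `.txt` label files, and the helper name `iter_samples` are all assumptions for illustration; the actual release may be organized differently.

```python
from pathlib import Path
from typing import Dict, Iterator

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}
# Assumed folder names for each annotation level (hypothetical layout)
LEVEL_DIRS = {"word": "words", "line": "lines"}


def iter_samples(root: str, language: str, level: str) -> Iterator[Dict[str, str]]:
    """Yield schema-shaped records for one language folder and annotation level."""
    folder = Path(root) / language / LEVEL_DIRS[level]
    for image_path in sorted(folder.iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # ignore label files and anything that is not an image
        label_path = image_path.with_suffix(".txt")
        if not label_path.exists():
            continue  # skip images that have no transcription next to them
        yield {
            "image": str(image_path),  # path only; open with PIL when needed
            "text": label_path.read_text(encoding="utf-8").strip(),
            "language": language,
            "level": level,
        }
```

Calling `list(iter_samples("mozhi", "marathi", "word"))` would then give a list of dictionaries whose keys match the four schema fields, with the image left as a path so that pixel data is loaded only when it is needed.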

Build With This

Create a Marathi book digitization pipeline combining a Mozhi-trained recognizer with layout analysis
Develop a cross-script transfer learning study comparing OCR performance when pre-training on Hindi vs. directly training on Marathi
Build a font-robust Marathi OCR system by augmenting Mozhi data with synthetic multi-font renderings
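
Comparing recognizers, as in the transfer learning idea above, needs a shared metric. The card does not name one; the sketch below assumes character error rate (CER), computed as total edit distance divided by total reference length. It counts Unicode code points, so a Devanagari matra or virama counts as its own character; a grapheme-based variant would need a different tokenization. The function names are illustrative.

```python
from typing import Iterable, Tuple


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level CER over (reference, hypothesis) pairs."""
    pairs = list(pairs)
    total_edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total_chars = sum(len(ref) for ref, _ in pairs)
    return total_edits / total_chars if total_chars else 0.0
```

Running both the Hindi-pretrained and the directly trained model over the same held-out Marathi word images and passing their (ground truth, prediction) pairs to `character_error_rate` gives two numbers that can be compared directly.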

AI Use Cases

Printed Marathi text recognition model training
Multi-script OCR system development
Transfer learning from Hindi to Marathi OCR
Document digitization pipeline training
Last verified: 2026-03-12