A large-scale printed-document OCR dataset from the CVIT lab at IIIT Hyderabad and the NLTM-Bhashini project. It contains 1.2 million annotated word images and approximately 120,000 text-line images across 13 Indian languages, including Marathi, sourced from scanned books, textbooks, and other printed documents. Each word-level or line-level cropped image is paired with a Unicode ground-truth transcription. It is the largest publicly available printed Indic OCR dataset, which makes it well suited to training robust printed-text recognizers.
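
    # Request access from CVIT, IIIT Hyderabad:
    # https://cvit.iiit.ac.in/usodi/tdocrmil.php
    import os

    from PIL import Image

    # After downloading, load the first few word images with their labels.
    # Assumes each .png has a same-named .txt transcription next to it.
    img_dir = 'mozhi/marathi/words/'
    # Keep only images so the .txt label files are not passed to Image.open.
    img_files = sorted(f for f in os.listdir(img_dir) if f.endswith('.png'))
    for img_file in img_files[:5]:
        img = Image.open(os.path.join(img_dir, img_file))
        label_path = os.path.join(img_dir, os.path.splitext(img_file)[0] + '.txt')
        # Transcriptions are Unicode text, so read them explicitly as UTF-8.
        with open(label_path, encoding='utf-8') as f:
            label = f.read().strip()
        print(f"{img_file}: {label}")

| Field | Type | Description |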
|---|---|---|
| image | image | Cropped word or line image from printed document |
| text | string | Unicode ground-truth transcription |
| language | string | Script/language of the text |
| level | string | Annotation level (word or line) |
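
The fields in the table can be assembled from a downloaded folder with a small helper. The following is a minimal sketch, not an official loader: it assumes the same layout as the loading snippet (each `.png` has a same-named `.txt` transcription beside it), and the names `OCRRecord` and `iter_records` are hypothetical. It stores the image path rather than a decoded image so that it runs with the standard library alone, and the demo uses placeholder files, not real dataset content.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator


@dataclass
class OCRRecord:
    """Hypothetical record type mirroring the field table."""
    image_path: Path   # path to the cropped word or line image
    text: str          # Unicode ground-truth transcription
    language: str      # script/language of the text
    level: str         # annotation level: "word" or "line"


def iter_records(img_dir: Path, language: str, level: str) -> Iterator[OCRRecord]:
    """Yield one record per .png that has a same-named .txt label (assumed layout)."""
    if level not in ("word", "line"):
        raise ValueError(f"level must be 'word' or 'line', got {level!r}")
    for img_path in sorted(img_dir.glob("*.png")):
        label_path = img_path.with_suffix(".txt")
        if not label_path.exists():
            continue  # skip images that have no transcription
        text = label_path.read_text(encoding="utf-8").strip()
        yield OCRRecord(img_path, text, language, level)


# Demo on a throwaway directory with placeholder files (not real dataset content).
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    for name, label in [("0001", "नमस्कार"), ("0002", "पुस्तक")]:
        (demo_dir / f"{name}.png").write_bytes(b"")  # stand-in for an image file
        (demo_dir / f"{name}.txt").write_text(label + "\n", encoding="utf-8")
    (demo_dir / "0003.png").write_bytes(b"")  # image without a label, skipped
    records = list(iter_records(demo_dir, language="marathi", level="word"))

for rec in records:
    print(rec.image_path.name, rec.text, rec.language, rec.level)
```

To decode the images as well, each `rec.image_path` can be passed to `Image.open` from Pillow. Keeping paths in the record avoids holding every image in memory at once, which matters at the scale of a million word crops.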