Marathi-OCR-Dataset

MH Specific

A collection of ~12K Marathi word images with corresponding UTF-8 text labels, sourced from 12 Marathi books across various genres. Images are binarized, thresholded, and resized to 96 dpi for direct neural network input.
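
New page images at inference time would need preprocessing similar to the steps described above. The following is a minimal sketch, not the dataset authors' pipeline: it assumes Otsu's method for thresholding and a fixed 32-pixel target height, neither of which is documented for this dataset. The function names `otsu_threshold`, `binarize`, and `resize_to_height` are illustrative. Because dpi is image metadata rather than a pixel size, the sketch normalizes height instead.

```python
import numpy as np


def otsu_threshold(gray: np.ndarray) -> int:
    """Return the Otsu threshold t for a uint8 image; pixels > t are the bright class."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    levels = np.arange(256)
    w0 = np.cumsum(hist)                # pixel count at or below each level
    w1 = hist.sum() - w0                # pixel count above each level
    sum0 = np.cumsum(hist * levels)     # intensity sum at or below each level
    valid = (w0 > 0) & (w1 > 0)         # both classes must be non-empty
    mu0 = np.where(valid, sum0 / np.maximum(w0, 1), 0.0)
    mu1 = np.where(valid, (sum0[-1] - sum0) / np.maximum(w1, 1), 0.0)
    between = np.where(valid, w0 * w1 * (mu0 - mu1) ** 2, 0.0)
    return int(np.argmax(between))      # level that maximizes between-class variance


def binarize(gray: np.ndarray) -> np.ndarray:
    """Map pixels above the Otsu threshold to 255 and all others to 0."""
    t = otsu_threshold(gray)
    return np.where(gray > t, 255, 0).astype(np.uint8)


def resize_to_height(img: np.ndarray, height: int = 32) -> np.ndarray:
    """Nearest-neighbour resize to a fixed height, preserving aspect ratio.

    The default of 32 pixels is an assumption, not a documented dataset value.
    """
    h, w = img.shape
    width = max(1, round(w * height / h))
    rows = np.arange(height) * h // height   # source row for each output row
    cols = np.arange(width) * w // width     # source column for each output column
    return img[rows][:, cols]


# Synthetic stand-in for a scanned word: dark "ink" on a light background.
gray = np.full((16, 48), 200, dtype=np.uint8)
gray[4:12, 8:40] = 30

word = resize_to_height(binarize(gray), height=32)
print(word.shape)  # (32, 96)
```

Nearest-neighbour sampling keeps the output strictly two-valued, which interpolating resizers would not unless the image is thresholded again afterwards.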

Build a production-grade Marathi OCR engine for digitizing printed government documents and books.

Quick Start

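# Marathi OCR Dataset: printed Devanagari word images with ground-truth text
import cv2

# Placeholder path (an assumption): point it at a word image from your local copy
image = cv2.imread("path/to/word_image.png", cv2.IMREAD_GRAYSCALE)
# cv2.imread returns None instead of raising when the file cannot be read
print("Image shape:", None if image is None else image.shape)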

Modality

Image (printed text)

Size

~12K word images

License

Not specified

Format

PNG/JPEG

Language

Marathi

Update Frequency

Static

Organization

IIT Bombay / IIIT Hyderabad

Schema

Field	Type	Description
image	image	Word image containing printed Marathi text
text	string	UTF-8 ground-truth label for the word image
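
The two-field schema above can be modeled as a small record type. This is a hedged sketch only: the dataset's on-disk layout is not specified here, so the tab-separated label file (image path, then text), the `WordSample` class, and the sample paths are all assumptions made for illustration.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class WordSample:
    image_path: str  # path to the word image (the schema's "image" field)
    text: str        # UTF-8 ground-truth label (the schema's "text" field)


def is_devanagari(text: str) -> bool:
    """Sanity check: every character lies in the Devanagari block (U+0900 to U+097F)."""
    return bool(text) and all("\u0900" <= ch <= "\u097f" for ch in text)


def read_samples(handle) -> list[WordSample]:
    """Parse 'image_path<TAB>text' rows; rows with any other shape are skipped."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [WordSample(row[0], row[1]) for row in reader if len(row) == 2]


# Hypothetical label file contents, held in memory so the sketch is self-contained.
demo = io.StringIO("words/0001.png\tमराठी\nwords/0002.png\tपुस्तक\n")
samples = read_samples(demo)

for sample in samples:
    print(sample.image_path, sample.text, is_devanagari(sample.text))
```

Real labels may also contain punctuation, Latin digits, or joiner characters outside the Devanagari block, so `is_devanagari` is best treated as a warning signal rather than a hard filter.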

Build With This

Create a Marathi document scanning app that extracts text from photographed pages for search and editing

Develop an automated Marathi form processor that extracts field values from scanned government forms

Build a Marathi textbook digitizer for converting printed educational materials to accessible digital formats

AI Use Cases

Devanagari OCR, printed text recognition
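
Recognition models for these use cases are commonly scored with character error rate (CER): the edit distance between predicted and ground-truth text, divided by the ground-truth length. The sketch below is one plain-Python way to compute it, assuming NFC normalization and counting Unicode code points. The function names are illustrative and no particular evaluation protocol is prescribed for this dataset.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over code points (insert, delete, substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("मराठी", "मराठी"))  # 0.0
print(cer("मराठी", "मराठ"))   # 0.2 (one missing vowel sign out of five code points)
```

Counting code points means a dropped vowel sign (matra) is one error. Scoring by grapheme cluster instead would treat the whole syllable as wrong, so reported numbers should state which unit was used.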

Related Datasets

AIKOSH IIT Bombay Indic Datasets (IndiaAI)

Multimodal

Bharat Scene Text Dataset (BSTD)

Image (scene text)

CHIPS - Corpus of Handwritten Indic Scripts (Page-Level OCR)

Image (full-page handwritten documents with detection + recognition annotations)

CMATERdb - Devanagari-Roman Mixed-Script Handwritten Documents

Image (handwritten mixed-script document pages with word-level annotations)

Last verified: 2026-03-07