Marathi-OCR-Dataset

MH Specific

A collection of ~12K Marathi word images with corresponding UTF-8 text labels, sourced from 12 Marathi books across various genres. The images are binarized, thresholded, and resized to 96 dpi so they can be used directly as neural network input.
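
The binarization step mentioned above can be illustrated with a minimal, dependency-free sketch. The source does not say which thresholding method was used to prepare the images, so Otsu's method here is an assumption chosen for illustration, and the function names `otsu_threshold` and `binarize` are hypothetical.

```python
def otsu_threshold(pixels):
    """Return an Otsu threshold for an iterable of 8-bit gray values (0-255)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels with value <= t
    sum_bg = 0  # sum of those pixel values
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is already in the background class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor of 1 / total**2).
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map a grayscale image (list of rows) to 0/255 using Otsu's threshold.

    Assumes dark ink on light paper: ink becomes 0, paper becomes 255.
    """
    t = otsu_threshold(p for row in image for p in row)
    return [[255 if p > t else 0 for p in row] for row in image]


if __name__ == "__main__":
    # Tiny made-up grayscale patch, not taken from the dataset.
    sample = [[10, 12, 200], [11, 205, 210]]
    print(binarize(sample))
```

Since the images are described as already binarized, a step like this would mainly matter when preparing new scans to match the dataset's appearance.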

Build a production-grade Marathi OCR engine for digitizing printed government documents and books.

Quick Start

# Marathi OCR Dataset
import cv2  # OpenCV; only needed once you start reading the word images

print('Marathi OCR Dataset')
print('Printed Devanagari text images with ground truth')

Modality: Image (printed text)
Size: ~12K word images
License: (not listed)
Format: PNG/JPEG
Language: mr
Update Frequency: static
Organization: IIT Bombay / IIIT Hyderabad

Schema

Field    Type      Description
image    image     Document image containing Marathi text
text     string    OCR ground truth text
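
As a rough sketch of how one record of this schema might be handled in code, the example below pairs an image reference with its label, normalizes the label to Unicode NFC, and checks that it falls inside the Devanagari block. The `WordSample` class, the helper names, and the file name `word_0001.png` are hypothetical; the source does not describe the on-disk layout or any loader API.

```python
import unicodedata
from dataclasses import dataclass

# Unicode Devanagari block; ZWNJ/ZWJ can legitimately occur inside words.
DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F
JOINERS = {"\u200c", "\u200d"}


@dataclass(frozen=True)
class WordSample:
    image_path: str  # hypothetical: path to a PNG/JPEG word image
    text: str        # ground-truth label, stored NFC-normalized


def normalize_label(raw_text: str) -> str:
    """Strip surrounding whitespace and apply Unicode NFC normalization."""
    return unicodedata.normalize("NFC", raw_text.strip())


def is_devanagari_word(text: str) -> bool:
    """True if text is non-empty and uses only Devanagari code points or joiners."""
    return bool(text) and all(
        DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END or ch in JOINERS
        for ch in text
    )


def make_sample(image_path: str, raw_text: str) -> WordSample:
    """Build a sample with a cleaned label; validity is left to the caller."""
    return WordSample(image_path, normalize_label(raw_text))


if __name__ == "__main__":
    sample = make_sample("word_0001.png", " मराठी\n")
    print(sample.text, is_devanagari_word(sample.text))
```

The check is kept separate from construction because labels in printed books may also contain digits or punctuation outside the Devanagari block, and whether to keep such samples is a project-level decision.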

Build With This

Create a Marathi document scanning app that extracts text from photographed pages for search and editing
Develop an automated Marathi form processor that extracts field values from scanned government forms
Build a Marathi textbook digitizer for converting printed educational materials to accessible digital formats

AI Use Cases

Devanagari OCR: printed text recognition
Last verified: 2026-03-07