Vision, OCR & Multimodal

Computer vision, optical character recognition, and multimodal datasets for Devanagari and Marathi content.

29 datasets

AIKOSH IIT Bombay Indic Datasets (IndiaAI)

Collection of 16 Indic language datasets from IIT Bombay hosted on IndiaAI's AIKOSH platform as part of the BharatGen initiative. Includes handwritten and printed Devanagari script images, scanned table recognition data, 78+ hours of multilingual audio, QA pairs, and math word problems. Covers Marathi plus 9 other Indian languages.

Build a multimodal Marathi document understanding system using AIKOSH vision-language datasets.

Multimodal | Open (IndiaAI Mission)

IIT Bombay / IndiaAI Mission

Bharat Scene Text Dataset (BSTD)

Large-scale scene text dataset for 11 Indian languages plus English, sourced from Wikimedia images of Indian signboards and street scenes. Includes 5,113 Marathi word annotations with polygon bounding boxes.

Build a Devanagari scene text recognition system for reading Marathi shop signs and street nameplates in urban Maharashtra.

Image (scene text) | Apache-2.0 (images: CC BY-SA 4.0)

IIIT Hyderabad
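
Many detectors expect axis-aligned boxes rather than the polygon annotations described above. The sketch below shows one way to derive a box from a polygon. The point-list format and the sample coordinates are assumptions for illustration, not the dataset's actual annotation schema.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_to_bbox(polygon: List[Point]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) of the box enclosing the polygon."""
    if not polygon:
        raise ValueError("polygon must contain at least one point")
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical polygon for one slanted word on a signboard.
word_polygon = [(12.0, 40.0), (118.0, 31.0), (121.0, 62.0), (15.0, 71.0)]
word_bbox = polygon_to_bbox(word_polygon)
```

The enclosing box loses the slant information, so recognizers that rectify text may be better served by the original polygon.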

CHIPS - Corpus of Handwritten Indic Scripts (Page-Level OCR)

MH subset needed

Page-level handwritten OCR dataset for Indic scripts with both text detection and recognition annotations. Part of the PLATTER (Page-Level Handwritten Text Recognition) project. Unlike word-level datasets, CHIPS provides full-page handwritten document images with bounding box annotations for text regions plus Unicode transcriptions, enabling training end-to-end page-level OCR systems that handle detection and recognition jointly. Covers multiple Indic scripts including Devanagari.

Build a page-level Marathi handwritten OCR system that processes entire document pages without manual line segmentation.

Image (full-page handwritten documents with detection + recognition annotations) | Research use

Multi-institutional (PLATTER project)

CMATERdb - Devanagari-Roman Mixed-Script Handwritten Documents

MH subset needed

Benchmark dataset of 150 handwritten document pages containing intermixed Devanagari and Roman (Latin/English) text within the same page, with word-level script annotations. Contains 15,528 annotated Devanagari words and 10,331 Roman words (44,790 total extracted word images). The only publicly available mixed-script document dataset featuring Devanagari-Latin co-occurrence. Essential for training script identification modules in bilingual OCR pipelines handling Marathi-English mixed documents. The reported word-level script identification accuracy on this benchmark is 95.30%.

Build a bilingual script identifier that classifies words as Devanagari or Latin in mixed Marathi-English documents.

Image (handwritten mixed-script document pages with word-level annotations) | Free for non-commercial research (CMATER lab, Jadavpur University)

CMATER Lab, Jadavpur University
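
The dataset itself is image-based, so a real script identifier would classify word images. On the text side, though, script labels can be derived or sanity-checked from transcriptions using Unicode ranges. The following is a minimal sketch of that heuristic; the function name and label strings are hypothetical.

```python
def word_script(word: str) -> str:
    """Label a transcribed word as 'devanagari', 'latin', 'mixed', or 'other'."""
    # The Devanagari Unicode block spans U+0900 to U+097F.
    has_devanagari = any("\u0900" <= ch <= "\u097f" for ch in word)
    # Only basic ASCII letters are treated as Latin in this sketch.
    has_latin = any(("a" <= ch <= "z") or ("A" <= ch <= "Z") for ch in word)
    if has_devanagari and has_latin:
        return "mixed"
    if has_devanagari:
        return "devanagari"
    if has_latin:
        return "latin"
    return "other"


labels = [word_script(w) for w in ["पुणे", "Pune", "2024", "टीव्ही-TV"]]
```

Digits and punctuation fall into "other" here; a production pipeline would likely inherit the script of neighbouring words instead.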

COCO Captions Marathi

Marathi translations of MS-COCO image captions, verified by native Marathi speakers for linguistic accuracy and contextual integrity. Useful for training image captioning and cross-lingual retrieval models.

Build a Marathi image captioning model that generates natural Marathi descriptions of photographs.

Text (caption pairs) | CC BY-NC-ND 4.0

Microsoft COCO / AI4Bharat

CVQA

Culturally-diverse Multilingual Visual Question Answering benchmark with questions from 30 countries in 31 languages including Marathi. Images and questions are annotated by native speakers familiar with local culture.

Build a culturally-aware Marathi visual question answering system that understands Indian visual contexts.

Image + Text | CC BY-SA (varies per image)

CVQA Benchmark Authors

Devanagari Handwritten Character Dataset

Handwritten character image database with 46 classes (36 characters + 10 digits) of Devanagari script. Each grayscale image is 32x32 pixels. Applicable to Marathi character recognition since Marathi uses Devanagari.

Build a Devanagari handwriting recognition model for digitizing handwritten Marathi documents and forms.

Image (handwritten characters) | CC BY 4.0

UCI Machine Learning Repository

DevChar - Extensive Dataset for Devanagari Character OCR

Large-scale handwritten Devanagari character dataset containing approximately 4 million character samples, explicitly designed to address the limitations of existing datasets that fail on text containing matras (vowel modifiers) and conjuncts (jodakshara). Covers characters with varying combinations of matras and conjunct forms that appear in real Marathi/Hindi text. The scale and explicit focus on matras and conjuncts make this one of the most important datasets for training robust Devanagari OCR systems.

Pre-train a Devanagari character recognizer on DevChar's 4M samples for transfer to word-level Marathi OCR.

Image (handwritten character crops with matra/conjunct labels) | Research use (DevChar2020)

Research community

Digital Library of India - Scanned Marathi Books & Periodicals

MH subset needed MH

Thousands of scanned Marathi books, periodicals, and historical publications hosted on Internet Archive as part of the Digital Library of India project. Contains high-resolution page scans in TIFF/JPEG/PDF format covering literature, government publications, religious texts, and historical periodicals. These are raw unannotated scans without OCR ground-truth transcriptions — they represent a massive source of real-world printed Marathi page images suitable for OCR training data creation, document layout annotation, and historical text digitization projects. Includes materials from 19th and 20th century Marathi publishing.

Build a semi-automated annotation pipeline to create OCR ground truth from DLI Marathi scans using existing OCR + human correction.

Image (scanned book/periodical pages, unannotated) | Public domain / Out of copyright

Digital Library of India / Internet Archive

ICDAR MLT-2019

ICDAR 2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition. Covers 10 languages across 7 scripts including Devanagari, applicable to Marathi scene text.

Build a multilingual scene text detector that handles Devanagari alongside other scripts in Indian street scenes.

Image (scene text) | Research use (competition)

ICDAR MLT Organizers

IIIT-HW-Dev

Large-scale Devanagari handwritten word dataset from CVIT, IIIT Hyderabad. Contains word-level images with corrected segmentation. Applicable to Marathi handwriting recognition as Marathi uses Devanagari script.

Build a handwritten Marathi word recognition system for digitizing handwritten government documents and records.

Image (handwritten words) | Research use (CVIT/IIIT-H)

IIIT Hyderabad

IIIT-INDIC-HW-WORDS - Large-Scale Handwritten Indic Words

MH subset needed

Massive handwritten word dataset containing 872,000 word instances across 10 Indic scripts including Devanagari, written by 135 writers. Each writer contributed approximately 6,460 word instances. Includes word-level bounding box annotations and Unicode transcriptions. The scale and writer diversity make this essential for training robust handwritten text recognition systems that generalize across writing styles. Devanagari subset directly applicable to Marathi handwriting recognition.

Train a writer-independent Marathi handwriting recognizer using the Devanagari subset of this large-scale dataset.

Image (handwritten word crops with transcriptions) | Research use (CVIT/IIIT-H)

CVIT, IIIT Hyderabad
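
A writer-independent recognizer is usually evaluated on writers it never saw during training. The sketch below shows one way to build such a split by partitioning on writer ID rather than on individual images. The `(writer_id, image_path)` tuple layout and file names are assumptions; the dataset's real metadata format may differ.

```python
import random
from typing import List, Tuple

Sample = Tuple[str, str]  # (writer_id, image_path); hypothetical layout


def writer_disjoint_split(
    samples: List[Sample], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples so that no writer appears in both train and test."""
    writers = sorted({writer for writer, _ in samples})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(writers)
    n_test = max(1, round(len(writers) * test_fraction))
    test_writers = set(writers[:n_test])
    train = [s for s in samples if s[0] not in test_writers]
    test = [s for s in samples if s[0] in test_writers]
    return train, test


# Toy data: 10 hypothetical writers with 3 word images each.
samples = [
    (f"writer_{i:03d}", f"img_{i:03d}_{j}.png")
    for i in range(10)
    for j in range(3)
]
train, test = writer_disjoint_split(samples)
```

Splitting on images instead of writers would leak handwriting styles into the test set and tend to overstate accuracy.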

IndicDLP - Indic Document Layout Parsing

MH subset needed

Large-scale document layout analysis dataset covering 119,809 annotated document images across 11 Indic languages plus English, including Marathi. Spans 12 document domains (novels, textbooks, magazines, newspapers, government acts and rules, forms, question papers, certificates, annual reports, handwritten documents, scientific papers, and mixed-content pages). Annotated with 42 physical and logical layout classes including paragraphs, headings, tables, figures, captions, headers, footers, page numbers, equations, and footnotes. Recognized as Best Student Paper Runner-Up at ICDAR 2025.

Build a Marathi document structure analyzer that detects paragraphs, tables, headings, and figures for automated digitization.

Image (document pages with layout annotations) | CC BY 4.0

AI4Bharat / IIT Madras

IndicSTR12

Indic Scene Text Recognition dataset covering 12 major Indian languages including Marathi. Word images collected from natural scenes such as signboards, shop nameplates, railway stations, advertisements, and banners.

Build a robust Devanagari scene text recognition system for reading Marathi text in natural images.

Image (scene text) | Research use (CVIT/IIIT-H)

AI4Bharat, IIT Madras

ISIDCHAR - ISI Kolkata Devanagari Character Database

MH subset needed

Handwritten Devanagari character dataset from the CVPR Unit at Indian Statistical Institute (ISI), Kolkata. Contains 36,172 grayscale character images across 47 character classes covering all basic Devanagari consonants, vowels, and numerals. Collected from multiple writers with natural handwriting variation. One of the earliest and most cited Indian script character recognition benchmark datasets. Also includes a separate ISI Devanagari Numeral Database with 22,556 numeral images from 1,049 writers.

Benchmark modern deep learning character classifiers against this classic ISI Kolkata dataset.

Image (handwritten character crops) | Research use (ISI Kolkata)

CVPR Unit, Indian Statistical Institute (ISI), Kolkata

MahaBhulekh - Maharashtra 7/12 Satbara Land Record Extracts

MH subset needed MH

Maharashtra's digitized land record system containing 2.11 crore (21.1 million) 7/12 satbara extracts across 358 talukas. Each extract follows a standardized Marathi template with structured fields for survey number, land area, landowner details, crop information, and encumbrances. The records are dynamically generated in Marathi and represent one of the largest standardized Marathi document sources available. Raw unannotated source requiring OCR ground-truth annotation, but the consistent template format makes automated annotation feasible. A community scraping tool exists for aggregation.

Build an annotation pipeline converting MahaBhulekh 7/12 extracts into field-level OCR training data with bounding boxes and transcriptions.

Document images (standardized Marathi form template, unannotated) | Gov Open

Maharashtra Revenue Department, Government of Maharashtra

Marathi Handwritten Text Dataset

2,500+ images of full handwritten Marathi text (sentences and paragraphs, not isolated characters). Native speakers wrote pre-designed text covering nearly all Marathi characters, words, and diacritical marks. Fills the gap between character-level datasets (like Devanagari HWR) and real-world handwritten text recognition (HTR).

Build an end-to-end handwritten Marathi text recognition system for digitizing handwritten documents and forms.

Image | CC0 (Public Domain)

Independent researcher (Kaggle)

Marathi-OCR-Dataset

Collection of ~12K Marathi word images with corresponding UTF-8 text labels, sourced from 12 Marathi books across various genres. Images are binarized, thresholded, and resized to 96 dpi for direct neural network input.

Build a production-grade Marathi OCR engine for digitizing printed government documents and books.

Image (printed text) | Not specified

IIT Bombay / IIIT Hyderabad

MDIW-13 - Multi-Script Document Identification in the Wild

MH subset needed

Multi-script document dataset containing 1,135 document images with 13,979 text lines and 86,655 words across 13 scripts including Devanagari and Roman (Latin). Used in the ICDAR 2021 Script Identification in the Wild (SIW) competition. Sources include printed newspapers and handwritten letters with word-level and line-level script annotations. Enables training robust script identification models that can distinguish Devanagari from Latin and other scripts in real-world mixed-script documents — a critical preprocessing step for bilingual Marathi-English OCR pipelines.

Train a multi-script classifier that preprocesses documents by identifying Devanagari vs. Latin regions before running language-specific OCR.

Image (document pages with script identification annotations) | Open access

ICDAR 2021 SIW Competition

MKI-26 Devanagari Handwritten Characters with Compound Characters

Handwritten Devanagari character dataset specifically designed to include compound/conjunct characters (jodakshara) alongside basic characters. Contains 36,000 images across 60 classes (10 numerals, 13 vowels, 17 similar-looking consonants, and 20 compound character classes) with 600 balanced images per class. One of the few publicly available datasets that explicitly addresses conjunct character recognition, a major challenge for Marathi/Devanagari OCR where characters like क्ष, ज्ञ, त्र merge into single glyphs. A 2D CNN is reported to reach 99.66% accuracy on this dataset.

Build a Devanagari character recognizer that handles both basic and compound characters for robust Marathi OCR.

Image (handwritten character crops with class labels) | Research use

Research community

MLRD-20: Marathi Lip Reading Dataset

Marathi lip reading dataset containing video recordings of speakers pronouncing Marathi words and phrases, designed for visual speech recognition and lip-reading AI systems. One of the few lip-reading datasets for any Indian language.

Build a Marathi lip reading model for silent speech recognition in noisy environments or for hearing-impaired users.

Video | Unknown

Independent researcher (Kaggle)

MoDeTrans - Modi Script Document Transliteration Dataset

Dataset of 2,043 historical Modi script document images paired with Devanagari transliterations. Modi was the official script for writing Marathi from the 12th century until the British colonial period when Devanagari replaced it. This dataset enables training vision-language models (MoScNet architecture) to transliterate Modi documents into modern Devanagari, unlocking centuries of Marathi historical records including Peshwa-era administrative documents, Shivaji Maharaj's correspondence, and Maratha empire legal records.

Build a Modi-to-Devanagari transliteration tool that makes historical Marathi documents readable to modern Marathi speakers.

Image (historical document scans with Devanagari transliterations) | Research use

Research community

MODI-HChar - Historical Modi Script Handwritten Character Dataset

Large-scale handwritten Modi script character dataset containing 575,920 character images across 57 classes (10 numerals, 12 vowels, 35 consonants). Modi script was the primary writing system for Marathi from the 12th to 20th century. This character-level dataset enables training classifiers for the foundational character recognition stage of historical Marathi document OCR. The scale (575K images) provides sufficient variety for robust recognition across different historical writing styles.

Train a Modi script character classifier achieving high accuracy across all 57 character classes.

Image (handwritten character crops) | CC BY 4.0

Research community

MODI-HHDoc - Historical Modi Script Handwritten Document Dataset

Collection of 3,350 handwritten historical Modi script document images for document-level recognition research. Modi script was used for writing Marathi for over 700 years and vast archives of administrative, legal, and literary documents remain undigitized. This dataset provides full-page document scans suitable for training document-level detection and recognition models for historical Marathi manuscripts.

Build a historical Modi script document preprocessing pipeline handling binarization, noise removal, and line segmentation.

Image (historical handwritten document scans) | CC BY 4.0

Research community

Mozhi - Printed Document OCR Dataset

MH subset needed

Large-scale printed document OCR dataset from IIIT Hyderabad's CVIT lab and the NLTM-Bhashini project containing 1.2 million annotated word images and approximately 120,000 text line images across 13 Indian languages including Marathi. Sourced from scanned books, textbooks, and printed documents. Provides word-level and line-level cropped images paired with Unicode ground-truth transcriptions. The largest publicly available printed Indic OCR dataset, essential for training robust printed text recognizers.

Train a production-grade Marathi printed text recognizer using Mozhi's large-scale word image corpus.

Image (printed text word/line crops with transcriptions) | Research use (NLTM/Bhashini)

CVIT, IIIT Hyderabad / NLTM-Bhashini

Roboflow Indian Identity Document Detection Datasets

Collection of annotated Indian identity document datasets on Roboflow Universe covering Aadhaar cards (2,645 images with field-level bounding boxes), Voter ID cards (1,274 images), PAN cards, and Driving Licenses. Annotations include object detection bounding boxes for key fields (name, number, date of birth, address, photo, gender). While not Marathi-specific, many documents contain Devanagari text fields. These are among the few publicly available annotated Indian government document datasets suitable for training field detection and extraction models.

Build a multi-document Indian KYC processor that detects document type and extracts key fields from identity documents.

Image (identity documents with bounding box annotations) | Varies by dataset (check individual Roboflow pages)

Community (Roboflow Universe)

RoundTripOCR - Post-OCR Error Correction Dataset

Large-scale post-OCR error correction dataset containing 1.58 million Marathi sentence pairs (noisy OCR output paired with corrected ground truth) for training OCR post-processing models. Generated using a round-trip translation approach through Hindi/Nepali to create realistic OCR-like errors. Enables training mBART, mT5, and other sequence-to-sequence models to automatically correct Devanagari OCR errors including character substitutions, missing matras, broken conjuncts, and segmentation artifacts.

Train a Marathi OCR post-processor that automatically corrects common recognition errors in Devanagari text.

Text (parallel noisy-clean sentence pairs) | Not specified

IIT Delhi
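
A post-OCR corrector trained on noisy/clean pairs is commonly scored with character error rate (CER), the edit distance between output and ground truth divided by the ground-truth length. Below is a minimal sketch of that metric. It counts Unicode code points, which is an assumption; grapheme-cluster counting may suit Devanagari better, and the sample strings are illustrative rather than taken from the dataset.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of hypothesis against reference."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(hypothesis, reference) / len(reference)


reference = "मराठी"  # 5 code points
noisy = "मराठ"       # final matra dropped, a typical OCR error
error_rate = cer(noisy, reference)
```

Comparing CER before and after correction on held-out pairs gives a quick read on whether the post-processor helps.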

Sanskrit Letter Dataset (602 Character Classes)

Handwritten Devanagari character dataset with the widest class diversity available: 602 character classes covering basic vowels, consonants, modifiers, and hundreds of conjunct/compound character combinations found in Sanskrit texts. Contains 7,702 images (~12.8 per class). While the per-class sample count is low, the class inventory is invaluable as a reference for which conjuncts actually appear in real Devanagari text. Many Sanskrit conjuncts carry over into Marathi vocabulary (e.g., विद्या, संस्कृत, शास्त्र). Essential for building comprehensive conjunct recognition models.

Use the 602-class inventory to build a comprehensive conjunct coverage test suite for Marathi OCR evaluation.

Image (handwritten character crops) | Research use

Research community (DAS 2018)
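
For a conjunct coverage test suite, one starting point is to extract conjunct clusters from evaluation text and compare them against a target inventory. The sketch below treats a conjunct as consonants joined by the virama (U+094D). This is a simplification that ignores ZWJ/ZWNJ and rendering-specific forms, and the small `inventory` set is a hypothetical stand-in for the dataset's class list.

```python
import re
from collections import Counter

# Devanagari consonants KA..HA (U+0915 to U+0939), optionally with nukta (U+093C).
CONSONANT = "[\u0915-\u0939]\u093c?"
# A conjunct: a consonant followed by one or more (virama + consonant) pairs.
CONJUNCT_RE = re.compile(f"{CONSONANT}(?:\u094d{CONSONANT})+")


def conjuncts(text: str) -> list:
    """Return every virama-joined consonant cluster found in text."""
    return CONJUNCT_RE.findall(text)


words = ["विद्या", "संस्कृत", "शास्त्र"]
coverage = Counter(c for w in words for c in conjuncts(w))

# Hypothetical target inventory; a real one would come from the class list.
inventory = {"द्य", "क्ष"}
missing = inventory - set(coverage)
```

Conjuncts left in `missing` point to clusters the evaluation text never exercises, which is where additional test sentences would be needed.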

TrOCR Marathi Printed Text Dataset

Curated dataset of printed Marathi text images for training TrOCR (Transformer-based OCR) models. Contains 2,671 line-level and 8,077 word-level PNG images extracted from printed Marathi documents with corresponding Unicode ground-truth transcriptions. Specifically designed for fine-tuning pre-trained vision-language models on Marathi printed text recognition. Includes diverse font styles and document types.

Fine-tune a TrOCR model on this Marathi dataset for high-accuracy printed Devanagari text recognition.

Image (printed text line/word crops with transcriptions) | Not specified

Community