Vision, OCR & Multimodal

Computer vision, optical character recognition, and multimodal datasets for Devanagari and Marathi content.

29 datasets

Collection of 16 Indic language datasets from IIT Bombay hosted on IndiaAI's AIKOSH platform as part of the BharatGen initiative. Includes handwritten and printed Devanagari script images, scanned table recognition data, 78+ hours of multilingual audio, QA pairs, and math word problems. Covers Marathi plus 9 other Indian languages.

Build a multimodal Marathi document understanding system using AIKosh vision-language datasets.
Multimodal
IIT Bombay / IndiaAI Mission

Large-scale scene text dataset for 11 Indian languages plus English, sourced from Wikimedia images of Indian signboards and street scenes. Includes 5,113 Marathi word annotations with polygon bounding boxes.

Build a Devanagari scene text recognition system for reading Marathi shop signs and street nameplates in urban Maharashtra.
Image (scene text)
IIIT Hyderabad
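
Polygon word annotations like those in the scene text entry above are often reduced to axis-aligned boxes before training a detector. A minimal sketch, assuming each polygon is a list of (x, y) pixel coordinates; the dataset's actual annotation format is not described here, and the function names and sample coordinates are illustrative only.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_to_bbox(polygon: List[Point]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) of the box enclosing the polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def polygon_area(polygon: List[Point]) -> float:
    """Area of a simple (non self-intersecting) polygon via the shoelace formula."""
    n = len(polygon)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]  # wrap around to close the polygon
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


# Hypothetical quadrilateral around a slanted signboard word.
word_polygon = [(10, 20), (110, 30), (108, 60), (8, 50)]
print(polygon_to_bbox(word_polygon))  # (8, 20, 110, 60)
print(polygon_area(word_polygon))     # 3020.0
```

Comparing the polygon area with the box area gives a rough measure of how much background a rectangular crop would include for slanted text.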

Page-level handwritten OCR dataset for Indic scripts with both text detection and recognition annotations. Part of the PLATTER (Page-Level Handwritten Text Recognition) project. Unlike word-level datasets, CHIPS provides full-page handwritten document images with bounding box annotations for text regions plus Unicode transcriptions, enabling training end-to-end page-level OCR systems that handle detection and recognition jointly. Covers multiple Indic scripts including Devanagari.

Build a page-level Marathi handwritten OCR system that processes entire document pages without manual line segmentation.
Image (full-page handwritten documents with detection + recognition annotations)
Multi-institutional (PLATTER project)

Benchmark dataset of 150 handwritten document pages containing intermixed Devanagari and Roman (Latin/English) text within the same page, with word-level script annotations. Contains 15,528 annotated Devanagari words and 10,331 Roman words (44,790 total extracted word images). The only publicly available mixed-script document dataset featuring Devanagari-Latin co-occurrence. Essential for training script identification modules in bilingual OCR pipelines handling Marathi-English mixed documents. The reported word-level script identification accuracy on this benchmark is 95.30%.

Build a bilingual script identifier that classifies words as Devanagari or Latin in mixed Marathi-English documents.
Image (handwritten mixed-script document pages with word-level annotations)
CMATER Lab, Jadavpur University
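
The dataset above targets script identification from word images, which needs a trained image classifier. The same two-way decision can be sketched on transcribed text using Unicode ranges, for example to derive or sanity-check word-level script labels. This is an illustration under that assumption, not the dataset authors' method; `classify_script` is a hypothetical helper.

```python
def classify_script(word: str) -> str:
    """Label a transcribed word as 'devanagari', 'latin', or 'other'.

    Counts code points in the Devanagari block (U+0900 to U+097F) and
    ASCII letters, then picks the majority script.
    """
    deva = sum(1 for ch in word if 0x0900 <= ord(ch) <= 0x097F)
    latin = sum(1 for ch in word if ch.isascii() and ch.isalpha())
    if deva == 0 and latin == 0:
        return "other"  # digits, punctuation, or another script
    return "devanagari" if deva >= latin else "latin"


for token in ["नमस्कार", "hello", "2024"]:
    print(token, classify_script(token))
```

Ties go to Devanagari here, which is an arbitrary choice for the sketch; a real pipeline would decide how to treat mixed tokens explicitly.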

Marathi translations of MS-COCO image captions, verified by native Marathi speakers for linguistic accuracy and contextual integrity. Useful for training image captioning and cross-lingual retrieval models.

Build a Marathi image captioning model that generates natural Marathi descriptions of photographs.
Text (caption pairs)
Microsoft COCO / AI4Bharat
CVQA

Culturally diverse multilingual visual question answering benchmark with questions from 30 countries in 31 languages, including Marathi. Images and questions are annotated by native speakers familiar with local culture.

Build a culturally-aware Marathi visual question answering system that understands Indian visual contexts.
Image + Text
CVQA Benchmark Authors

Handwritten character image database with 46 classes (36 characters + 10 digits) of Devanagari script. Each grayscale image is 32x32 pixels. Applicable to Marathi character recognition, since Marathi uses Devanagari.

Build a Devanagari handwriting recognition model for digitizing handwritten Marathi documents and forms.
Image (handwritten characters)
UCI Machine Learning Repository

Large-scale handwritten Devanagari character dataset containing approximately 4 million character samples, explicitly designed to address the limitations of existing datasets that fail on text containing matras (vowel modifiers) and conjuncts (jodakshara). Covers characters with varying combinations of matras and conjunct forms that appear in real Marathi/Hindi text. The scale and explicit focus on matras and conjuncts make this one of the most important datasets for training robust Devanagari OCR systems.

Pre-train a Devanagari character recognizer on DevChar's 4M samples for transfer to word-level Marathi OCR.
Image (handwritten character crops with matra/conjunct labels)
Research community

Thousands of scanned Marathi books, periodicals, and historical publications hosted on Internet Archive as part of the Digital Library of India project. Contains high-resolution page scans in TIFF/JPEG/PDF format covering literature, government publications, religious texts, and historical periodicals. These are raw unannotated scans without OCR ground-truth transcriptions — they represent a massive source of real-world printed Marathi page images suitable for OCR training data creation, document layout annotation, and historical text digitization projects. Includes materials from 19th and 20th century Marathi publishing.

Build a semi-automated annotation pipeline to create OCR ground truth from DLI Marathi scans using existing OCR + human correction.
Image (scanned book/periodical pages, unannotated)
Digital Library of India / Internet Archive

ICDAR 2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition. Covers 10 languages across 7 scripts, including Devanagari, and is therefore applicable to Marathi scene text.

Build a multilingual scene text detector that handles Devanagari alongside other scripts in Indian street scenes.
Image (scene text)
ICDAR MLT Organizers

Large-scale Devanagari handwritten word dataset from CVIT, IIIT Hyderabad. Contains word-level images with corrected segmentation. Applicable to Marathi handwriting recognition, as Marathi uses Devanagari script.

Build a handwritten Marathi word recognition system for digitizing handwritten government documents and records.
Image (handwritten words)
IIIT Hyderabad

Massive handwritten word dataset containing 872,000 word instances across 10 Indic scripts including Devanagari, written by 135 writers. Each writer contributed approximately 6,460 word instances. Includes word-level bounding box annotations and Unicode transcriptions. The scale and writer diversity make this essential for training robust handwritten text recognition systems that generalize across writing styles. Devanagari subset directly applicable to Marathi handwriting recognition.

Train a writer-independent Marathi handwriting recognizer using the Devanagari subset of this large-scale dataset.
Image (handwritten word crops with transcriptions)
CVIT, IIIT Hyderabad

Large-scale document layout analysis dataset covering 119,809 annotated document images across 11 Indic languages plus English, including Marathi. Spans 12 document domains (novels, textbooks, magazines, newspapers, government acts and rules, forms, question papers, certificates, annual reports, handwritten documents, scientific papers, and mixed-content pages). Annotated with 42 physical and logical layout classes including paragraphs, headings, tables, figures, captions, headers, footers, page numbers, equations, and footnotes. The accompanying paper was the ICDAR 2025 Best Student Paper Runner-Up.

Build a Marathi document structure analyzer that detects paragraphs, tables, headings, and figures for automated digitization.
Image (document pages with layout annotations)
AI4Bharat / IIT Madras

Indic Scene Text Recognition dataset covering 12 major Indian languages, including Marathi. Word images collected from natural scenes such as signboards, shop nameplates, railway stations, advertisements, and banners.

Build a robust Devanagari scene text recognition system for reading Marathi text in natural images.
Image (scene text)
AI4Bharat, IIT Madras

Handwritten Devanagari character dataset from the CVPR Unit at Indian Statistical Institute (ISI), Kolkata. Contains 36,172 grayscale character images across 47 character classes covering all basic Devanagari consonants, vowels, and numerals. Collected from multiple writers with natural handwriting variation. One of the earliest and most cited Indian script character recognition benchmark datasets. Also includes a separate ISI Devanagari Numeral Database with 22,556 numeral images from 1,049 writers.

Benchmark modern deep learning character classifiers against this classic ISI Kolkata dataset.
Image (handwritten character crops)
CVPR Unit, Indian Statistical Institute (ISI), Kolkata

Maharashtra's digitized land record system containing 2.11 crore (21.1 million) 7/12 satbara extracts across 358 talukas. Each extract follows a standardized Marathi template with structured fields for survey number, land area, landowner details, crop information, and encumbrances. The records are dynamically generated in Marathi and represent one of the largest standardized Marathi document sources available. Raw unannotated source requiring OCR ground-truth annotation, but the consistent template format makes automated annotation feasible. A community scraping tool exists for aggregation.

Build an annotation pipeline converting MahaBhulekh 7/12 extracts into field-level OCR training data with bounding boxes and transcriptions.
Document images (standardized Marathi form template, unannotated)
Maharashtra Revenue Department, Government of Maharashtra
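
Because every 7/12 extract follows the same template, field boxes can in principle be projected from one hand-labelled layout onto each rendered page. A minimal sketch of that idea, assuming the page renders at a consistent aspect ratio; the field names come from the description above, but every coordinate in `TEMPLATE` is a made-up placeholder, not a measurement of the real form.

```python
from typing import Dict, Tuple

RelBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) as page fractions
PixelBox = Tuple[int, int, int, int]

# Hypothetical layout: relative positions of each field on the template.
TEMPLATE: Dict[str, RelBox] = {
    "survey_number": (0.05, 0.10, 0.45, 0.16),
    "land_area": (0.55, 0.10, 0.95, 0.16),
    "landowner_details": (0.05, 0.20, 0.95, 0.40),
    "crop_information": (0.05, 0.45, 0.95, 0.70),
    "encumbrances": (0.05, 0.75, 0.95, 0.90),
}


def project_template(template: Dict[str, RelBox], width: int, height: int) -> Dict[str, PixelBox]:
    """Scale relative field boxes to pixel coordinates for one page image."""
    return {
        name: (round(x0 * width), round(y0 * height), round(x1 * width), round(y1 * height))
        for name, (x0, y0, x1, y1) in template.items()
    }


boxes = project_template(TEMPLATE, width=1000, height=2000)
print(boxes["survey_number"])  # (50, 200, 450, 320)
```

Projected boxes would still need human spot checks, since variable-length fields can overflow a fixed region.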

2,500+ images of full handwritten Marathi text (sentences and paragraphs, not isolated characters). Native speakers wrote pre-designed text covering nearly all Marathi characters, words, and diacritical marks. Fills the gap between character-level datasets (like Devanagari HWR) and real-world handwritten text recognition (HTR).

Build an end-to-end handwritten Marathi text recognition system for digitizing handwritten documents and forms.
Image (handwritten sentences and paragraphs)
Independent researcher (Kaggle)

Collection of ~12K Marathi word images with corresponding UTF-8 text labels, sourced from 12 Marathi books across various genres. Images are binarized, thresholded, and resized to 96 dpi for direct neural network input.

Build a production-grade Marathi OCR engine for digitizing printed government documents and books.
Image (printed text)
IIT Bombay / IIIT Hyderabad

Multi-script document dataset containing 1,135 document images with 13,979 text lines and 86,655 words across 13 scripts including Devanagari and Roman (Latin). Used in the ICDAR 2021 Script Identification in the Wild (SIW) competition. Sources include printed newspapers and handwritten letters with word-level and line-level script annotations. Enables training robust script identification models that can distinguish Devanagari from Latin and other scripts in real-world mixed-script documents — a critical preprocessing step for bilingual Marathi-English OCR pipelines.

Train a multi-script classifier that preprocesses documents by identifying Devanagari vs. Latin regions before running language-specific OCR.
Image (document pages with script identification annotations)
ICDAR 2021 SIW Competition

Handwritten Devanagari character dataset specifically designed to include compound/conjunct characters (jodakshara) alongside basic characters. Contains 36,000 images across 60 classes (10 numerals, 13 vowels, 17 similar-looking consonants, and 20 compound character classes) with 600 balanced images per class. One of the few publicly available datasets that explicitly addresses conjunct character recognition, a major challenge for Marathi/Devanagari OCR where characters like क्ष, ज्ञ, त्र merge into single glyphs. A 2D CNN is reported to reach 99.66% accuracy on this dataset.

Build a Devanagari character recognizer that handles both basic and compound characters for robust Marathi OCR.
Image (handwritten character crops with class labels)
Research community
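
In Unicode text, a conjunct such as क्ष is encoded as consonant + virama (U+094D) + consonant, even though it renders as one glyph. The sketch below finds such sequences in transcriptions, which can help check how many conjuncts a label set contains. It is a simplification: it covers the main consonant range plus an optional nukta and ignores ZWJ/ZWNJ and other rendering controls. `find_conjuncts` is an illustrative name.

```python
import re
from typing import List

# Devanagari consonants (U+0915 to U+0939, U+0958 to U+095F), optional nukta (U+093C).
CONSONANT = "[\u0915-\u0939\u0958-\u095F]\u093C?"
# One consonant followed by one or more (virama + consonant) pairs.
CONJUNCT_RE = re.compile(f"{CONSONANT}(?:\u094D{CONSONANT})+")


def find_conjuncts(text: str) -> List[str]:
    """Return every consonant cluster joined by virama, in order of appearance."""
    return CONJUNCT_RE.findall(text)


for word in ["क्षत्रिय", "ज्ञान", "मराठी"]:
    print(word, find_conjuncts(word))
```

A word with no virama-joined cluster, such as the last sample, yields an empty list.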

Marathi lip reading dataset containing video recordings of speakers pronouncing Marathi words and phrases, designed for visual speech recognition and lip-reading AI systems. One of the few lip-reading datasets for any Indian language.

Build a Marathi lip reading model for silent speech recognition in noisy environments or for hearing-impaired users.
Video
Independent researcher (Kaggle)

Dataset of 2,043 historical Modi script document images paired with Devanagari transliterations. Modi was the official script for writing Marathi from the 12th century until the British colonial period when Devanagari replaced it. This dataset enables training vision-language models (MoScNet architecture) to transliterate Modi documents into modern Devanagari, unlocking centuries of Marathi historical records including Peshwa-era administrative documents, Shivaji Maharaj's correspondence, and Maratha empire legal records.

Build a Modi-to-Devanagari transliteration tool that makes historical Marathi documents readable to modern Marathi speakers.
Image (historical document scans with Devanagari transliterations)
Research community

Large-scale handwritten Modi script character dataset containing 575,920 character images across 57 classes (10 numerals, 12 vowels, 35 consonants). Modi script was the primary writing system for Marathi from the 12th to 20th century. This character-level dataset enables training classifiers for the foundational character recognition stage of historical Marathi document OCR. The scale (575K images) provides sufficient variety for robust recognition across different historical writing styles.

Train a Modi script character classifier achieving high accuracy across all 57 character classes.
Image (handwritten character crops)
Research community

Collection of 3,350 handwritten historical Modi script document images for document-level recognition research. Modi script was used for writing Marathi for over 700 years and vast archives of administrative, legal, and literary documents remain undigitized. This dataset provides full-page document scans suitable for training document-level detection and recognition models for historical Marathi manuscripts.

Build a historical Modi script document preprocessing pipeline handling binarization, noise removal, and line segmentation.
Image (historical handwritten document scans)
Research community
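
The line segmentation step in the preprocessing idea above is commonly approached with a horizontal projection profile: count ink pixels per row and split at empty rows. A minimal sketch on an already binarized page stored as nested lists (1 = ink, 0 = background); `segment_lines` and `min_ink` are illustrative names, and real historical scans would typically need deskewing and noise removal first.

```python
from typing import List, Tuple


def segment_lines(binary: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (start_row, end_row_exclusive) for each run of rows containing ink."""
    profile = [sum(row) for row in binary]  # ink pixels per row
    lines: List[Tuple[int, int]] = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y  # a text line begins
        elif ink < min_ink and start is not None:
            lines.append((start, y))  # the text line ended on the previous row
            start = None
    if start is not None:  # page ends while still inside a text line
        lines.append((start, len(profile)))
    return lines


# Tiny toy page: two text lines separated by a blank row.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]
print(segment_lines(page))  # [(1, 3), (4, 5)]
```

Raising `min_ink` makes the split tolerant of isolated specks, at the risk of cutting through faint strokes.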

Large-scale printed document OCR dataset from IIIT Hyderabad's CVIT lab and the NLTM-Bhashini project containing 1.2 million annotated word images and approximately 120,000 text line images across 13 Indian languages including Marathi. Sourced from scanned books, textbooks, and printed documents. Provides word-level and line-level cropped images paired with Unicode ground-truth transcriptions. The largest publicly available printed Indic OCR dataset, essential for training robust printed text recognizers.

Train a production-grade Marathi printed text recognizer using Mozhi's large-scale word image corpus.
Image (printed text word/line crops with transcriptions)
CVIT, IIIT Hyderabad / NLTM-Bhashini

Collection of annotated Indian identity document datasets on Roboflow Universe covering Aadhaar cards (2,645 images with field-level bounding boxes), Voter ID cards (1,274 images), PAN cards, and Driving Licenses. Annotations include object detection bounding boxes for key fields (name, number, date of birth, address, photo, gender). While not Marathi-specific, many documents contain Devanagari text fields. These are among the few publicly available annotated Indian government document datasets suitable for training field detection and extraction models.

Build a multi-document Indian KYC processor that detects document type and extracts key fields from identity documents.
Image (identity documents with bounding box annotations)
Community (Roboflow Universe)

Large-scale post-OCR error correction dataset containing 1.58 million Marathi sentence pairs (noisy OCR output paired with corrected ground truth) for training OCR post-processing models. Generated using a round-trip translation approach through Hindi/Nepali to create realistic OCR-like errors. Enables training mBART, mT5, and other sequence-to-sequence models to automatically correct Devanagari OCR errors including character substitutions, missing matras, broken conjuncts, and segmentation artifacts.

Train a Marathi OCR post-processor that automatically corrects common recognition errors in Devanagari text.
Text (parallel noisy-clean sentence pairs)
IIT Delhi
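
Noisy/clean sentence pairs like these are usually scored with character error rate (CER): edit distance between the OCR output and the ground truth, divided by the ground-truth length. A minimal sketch at the Unicode code point level, so a dropped matra counts as one error; this is a generic metric implementation, not code from the dataset's authors.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of the hypothesis against the reference."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(hypothesis, reference) / len(reference)


reference = "मराठी"   # 5 code points
noisy = "मरठी"        # the vowel sign after the second consonant is missing
print(cer(noisy, reference))
```

Computing CER before and after a correction model runs on the same pairs gives a direct measure of how much the post-processor helps.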

Handwritten Devanagari character dataset with the widest class diversity available: 602 character classes covering basic vowels, consonants, modifiers, and hundreds of conjunct/compound character combinations found in Sanskrit texts. Contains 7,702 images (~12.8 per class). While the per-class sample count is low, the class inventory is invaluable as a reference for which conjuncts actually appear in real Devanagari text. Many Sanskrit conjuncts carry over into Marathi vocabulary (e.g., विद्या, संस्कृत, शास्त्र). Essential for building comprehensive conjunct recognition models.

Use the 602-class inventory to build a comprehensive conjunct coverage test suite for Marathi OCR evaluation.
Image (handwritten character crops)
Research community (DAS 2018)

Curated dataset of printed Marathi text images for training TrOCR (Transformer-based OCR) models. Contains 2,671 line-level and 8,077 word-level PNG images extracted from printed Marathi documents with corresponding Unicode ground-truth transcriptions. Specifically designed for fine-tuning pre-trained vision-language models on Marathi printed text recognition. Includes diverse font styles and document types.

Fine-tune a TrOCR model on this Marathi dataset for high-accuracy printed Devanagari text recognition.
Image (printed text line/word crops with transcriptions)
Community