Computer vision, optical character recognition, and multimodal datasets for Devanagari and Marathi content.
29 datasets
Collection of 16 Indic language datasets from IIT Bombay hosted on IndiaAI's AIKOSH platform as part of the BharatGen initiative. Includes handwritten and printed Devanagari script images, scanned table recognition data, 78+ hours of multilingual audio, QA pairs, and math word problems. Covers Marathi plus 9 other Indian languages.
Large-scale scene text dataset for 11 Indian languages plus English, sourced from Wikimedia images of Indian signboards and street scenes. Includes 5,113 Marathi word annotations with polygon bounding boxes
Page-level handwritten OCR dataset for Indic scripts with both text detection and recognition annotations. Part of the PLATTER (Page-Level Handwritten Text Recognition) project. Unlike word-level datasets, CHIPS provides full-page handwritten document images with bounding box annotations for text regions plus Unicode transcriptions, enabling training end-to-end page-level OCR systems that handle detection and recognition jointly. Covers multiple Indic scripts including Devanagari.
Benchmark dataset of 150 handwritten document pages containing intermixed Devanagari and Roman (Latin/English) text within the same page, with word-level script annotations. Contains 15,528 annotated Devanagari words and 10,331 Roman words (44,790 total extracted word images). The only publicly available mixed-script document dataset featuring Devanagari-Latin co-occurrence. Essential for training script identification modules in bilingual OCR pipelines handling Marathi-English mixed documents. Achieves 95.30% word-level script ID accuracy.
Marathi translations of MS-COCO image captions, verified by native Marathi speakers for linguistic accuracy and contextual integrity. Useful for training image captioning and cross-lingual retrieval models
Culturally-diverse Multilingual Visual Question Answering benchmark with questions from 30 countries in 31 languages including Marathi. Images and questions are annotated by native speakers familiar with local culture
Handwritten character image database with 46 classes (36 characters + 10 digits) of Devanagari script. Each grayscale image is 32x32 pixels. Applicable to Marathi character recognition since Marathi uses Devanagari
Large-scale handwritten Devanagari character dataset containing approximately 4 million character samples, explicitly designed to address the limitations of existing datasets that fail on text containing matras (vowel modifiers) and conjuncts (jodakshara). Covers characters with varying combinations of matras and conjunct forms that appear in real Marathi/Hindi text. The scale and explicit focus on matras and conjuncts makes this one of the most important datasets for training robust Devanagari OCR systems.
Thousands of scanned Marathi books, periodicals, and historical publications hosted on Internet Archive as part of the Digital Library of India project. Contains high-resolution page scans in TIFF/JPEG/PDF format covering literature, government publications, religious texts, and historical periodicals. These are raw unannotated scans without OCR ground-truth transcriptions — they represent a massive source of real-world printed Marathi page images suitable for OCR training data creation, document layout annotation, and historical text digitization projects. Includes materials from 19th and 20th century Marathi publishing.
ICDAR 2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition. Covers 10 languages across 7 scripts including Devanagari, applicable to Marathi scene text
Large-scale Devanagari handwritten word dataset from CVIT, IIIT Hyderabad. Contains word-level images with corrected segmentation. Applicable to Marathi handwriting recognition as Marathi uses Devanagari script
Massive handwritten word dataset containing 872,000 word instances across 10 Indic scripts including Devanagari, written by 135 writers. Each writer contributed approximately 6,460 word instances. Includes word-level bounding box annotations and Unicode transcriptions. The scale and writer diversity make this essential for training robust handwritten text recognition systems that generalize across writing styles. Devanagari subset directly applicable to Marathi handwriting recognition.
Large-scale document layout analysis dataset covering 119,809 annotated document images across 11 Indic languages plus English, including Marathi. Spans 12 document domains (novels, textbooks, magazines, newspapers, government acts and rules, forms, question papers, certificates, annual reports, handwritten documents, scientific papers, and mixed-content pages). Annotated with 42 physical and logical layout classes including paragraphs, headings, tables, figures, captions, headers, footers, page numbers, equations, and footnotes. Winner of ICDAR 2025 Best Student Paper Runner-Up.
Indic Scene Text Recognition dataset covering 12 major Indian languages including Marathi. Word images collected from natural scenes such as signboards, shop nameplates, railway stations, advertisements, and banners
Handwritten Devanagari character dataset from the CVPR Unit at Indian Statistical Institute (ISI), Kolkata. Contains 36,172 grayscale character images across 47 character classes covering all basic Devanagari consonants, vowels, and numerals. Collected from multiple writers with natural handwriting variation. One of the earliest and most cited Indian script character recognition benchmark datasets. Also includes a separate ISI Devanagari Numeral Database with 22,556 numeral images from 1,049 writers.
Maharashtra's digitized land record system containing 2.11 crore (21.1 million) 7/12 satbara extracts across 358 talukas. Each extract follows a standardized Marathi template with structured fields for survey number, land area, landowner details, crop information, and encumbrances. The records are dynamically generated in Marathi and represent one of the largest standardized Marathi document sources available. Raw unannotated source requiring OCR ground-truth annotation, but the consistent template format makes automated annotation feasible. A community scraping tool exists for aggregation.
2,500+ images of full handwritten Marathi text (sentences and paragraphs, not isolated characters). Native speakers wrote pre-designed text covering nearly all Marathi characters, words, and diacritical marks. Fills the gap between character-level datasets (like Devanagari HWR) and real-world handwritten text recognition (HTR).
Collection of ~12K Marathi word images with corresponding UTF-8 text labels, sourced from 12 Marathi books across various genres. Images are binarized, thresholded, and resized to 96 dpi for direct neural network input
Multi-script document dataset containing 1,135 document images with 13,979 text lines and 86,655 words across 13 scripts including Devanagari and Roman (Latin). Used in the ICDAR 2021 Script Identification in the Wild (SIW) competition. Sources include printed newspapers and handwritten letters with word-level and line-level script annotations. Enables training robust script identification models that can distinguish Devanagari from Latin and other scripts in real-world mixed-script documents — a critical preprocessing step for bilingual Marathi-English OCR pipelines.
Handwritten Devanagari character dataset specifically designed to include compound/conjunct characters (jodakshara) alongside basic characters. Contains 36,000 images across 60 classes (10 numerals, 13 vowels, 17 similar-looking consonants, and 20 compound character classes) with 600 balanced images per class. One of the few publicly available datasets that explicitly addresses conjunct character recognition — a major challenge for Marathi/Devanagari OCR where characters like क्ष, ज्ञ, त्र merge into single glyphs. Achieves 99.66% accuracy with CNN 2D.
Marathi lip reading dataset containing video recordings of speakers pronouncing Marathi words and phrases, designed for visual speech recognition and lip-reading AI systems. One of the few lip-reading datasets for any Indian language.
Dataset of 2,043 historical Modi script document images paired with Devanagari transliterations. Modi was the official script for writing Marathi from the 12th century until the British colonial period when Devanagari replaced it. This dataset enables training vision-language models (MoScNet architecture) to transliterate Modi documents into modern Devanagari, unlocking centuries of Marathi historical records including Peshwa-era administrative documents, Shivaji Maharaj's correspondence, and Maratha empire legal records.
Large-scale handwritten Modi script character dataset containing 575,920 character images across 57 classes (10 numerals, 12 vowels, 35 consonants). Modi script was the primary writing system for Marathi from the 12th to 20th century. This character-level dataset enables training classifiers for the foundational character recognition stage of historical Marathi document OCR. The scale (575K images) provides sufficient variety for robust recognition across different historical writing styles.
Collection of 3,350 handwritten historical Modi script document images for document-level recognition research. Modi script was used for writing Marathi for over 700 years and vast archives of administrative, legal, and literary documents remain undigitized. This dataset provides full-page document scans suitable for training document-level detection and recognition models for historical Marathi manuscripts.
Large-scale printed document OCR dataset from IIIT Hyderabad's CVIT lab and the NLTM-Bhashini project containing 1.2 million annotated word images and approximately 120,000 text line images across 13 Indian languages including Marathi. Sourced from scanned books, textbooks, and printed documents. Provides word-level and line-level cropped images paired with Unicode ground-truth transcriptions. The largest publicly available printed Indic OCR dataset, essential for training robust printed text recognizers.
Collection of annotated Indian identity document datasets on Roboflow Universe covering Aadhaar cards (2,645 images with field-level bounding boxes), Voter ID cards (1,274 images), PAN cards, and Driving Licenses. Annotations include object detection bounding boxes for key fields (name, number, date of birth, address, photo, gender). While not Marathi-specific, many documents contain Devanagari text fields. These are among the few publicly available annotated Indian government document datasets suitable for training field detection and extraction models.
Large-scale post-OCR error correction dataset containing 1.58 million Marathi sentence pairs (noisy OCR output paired with corrected ground truth) for training OCR post-processing models. Generated using a round-trip translation approach through Hindi/Nepali to create realistic OCR-like errors. Enables training mBART, mT5, and other sequence-to-sequence models to automatically correct Devanagari OCR errors including character substitutions, missing matras, broken conjuncts, and segmentation artifacts.
Handwritten Devanagari character dataset with the widest class diversity available — 602 character classes covering basic vowels, consonants, modifiers, AND hundreds of conjunct/compound character combinations found in Sanskrit texts. Contains 7,702 images (~12.8 per class). While the per-class sample count is low, the class inventory is invaluable as a reference for which conjuncts actually appear in real Devanagari text. Many Sanskrit conjuncts carry over into Marathi vocabulary (e.g., विद्या, संस्कृत, शास्त्र). Essential for building comprehensive conjunct recognition models.
Curated dataset of printed Marathi text images for training TrOCR (Transformer-based OCR) models. Contains 2,671 line-level and 8,077 word-level PNG images extracted from printed Marathi documents with corresponding Unicode ground-truth transcriptions. Specifically designed for fine-tuning pre-trained vision-language models on Marathi printed text recognition. Includes diverse font styles and document types.