Digital Library of India - Scanned Marathi Books & Periodicals

MH Specific MH Subset Needed

Thousands of scanned Marathi books, periodicals, and historical publications hosted on the Internet Archive as part of the Digital Library of India project. The collection contains high-resolution page scans in TIFF, JPEG, and PDF formats covering literature, government publications, religious texts, and historical periodicals from 19th and 20th century Marathi publishing. The scans are raw and unannotated, with no OCR ground-truth transcriptions, which makes them a large source of real-world printed Marathi page images for OCR training data creation, document layout annotation, and historical text digitization projects.

Build a semi-automated annotation pipeline to create OCR ground truth from DLI Marathi scans using existing OCR + human correction.
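One way to picture the "existing OCR + human correction" loop is a triage step that drafts every page with an OCR engine and flags weak drafts for a human corrector. The sketch below is a minimal illustration, not a prescribed pipeline: the `PageDraft` type, the `draft_page` function, the 0.85 threshold, and the `fake_ocr` stand-in are all hypothetical, and a real OCR engine would be plugged in through the `ocr` callable.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# One recognised word and the engine's confidence, assumed to lie in [0, 1].
OcrWord = Tuple[str, float]


@dataclass
class PageDraft:
    page_id: str
    text: str
    mean_confidence: float
    needs_review: bool


def draft_page(page_id: str, image_path: str,
               ocr: Callable[[str], List[OcrWord]],
               review_below: float = 0.85) -> PageDraft:
    """Run the supplied OCR callable and flag weak pages for human correction."""
    words = ocr(image_path)
    if not words:
        # An empty result is treated as a failure and always sent to a human.
        return PageDraft(page_id, "", 0.0, True)
    mean_conf = sum(conf for _, conf in words) / len(words)
    text = " ".join(word for word, _ in words)
    return PageDraft(page_id, text, mean_conf, mean_conf < review_below)


def fake_ocr(image_path: str) -> List[OcrWord]:
    # Hypothetical stand-in for a real engine so the sketch runs without dependencies.
    return [("मराठी", 0.95), ("पुस्तक", 0.60)]


draft = draft_page("vol1_p001", "vol1_p001.jpg", fake_ocr)
print(draft.needs_review, round(draft.mean_confidence, 3))
```

Passing the engine in as a callable keeps the triage logic independent of any particular OCR library. The threshold would need tuning against a hand-checked sample of pages.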

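Maharashtra subset not yet extracted. The Digital Library of India collection spans many languages and regions, so the material relevant to Maharashtra has to be filtered out of the full collection. Because the items are scanned books rather than georeferenced records, the practical filter is catalogue metadata such as the language field, as in the Quick Start query.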


Quick Start

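import requests

# Search the Digital Library of India collection for Marathi books
# https://archive.org/details/digitallibraryindia
# Individual pages can then be downloaded as JPEG/TIFF for OCR annotation.
url = "https://archive.org/advancedsearch.php"
params = {
    "q": "language:Marathi AND collection:digitallibraryindia",
    "fl[]": ["identifier", "title"],
    "rows": 10,
    "output": "json",
}
resp = requests.get(url, params=params, timeout=30)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc["identifier"], "-", doc.get("title", ""))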
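Once an item identifier is known, its file listing can be turned into page-image download URLs. The sketch below assumes a metadata record shaped like the JSON returned by `https://archive.org/metadata/<identifier>` (a `files` list whose entries carry a `name`). The `page_image_urls` helper, the suffix list, and the sample record are hypothetical and only illustrate the idea.

```python
from typing import Dict, List
from urllib.parse import quote

# Extensions treated as page images; PDF is left out because it bundles a whole volume.
IMAGE_SUFFIXES = (".tif", ".tiff", ".jpg", ".jpeg")


def page_image_urls(identifier: str, metadata: Dict) -> List[str]:
    """Build download URLs for the image files listed in an item's metadata record."""
    urls = []
    for entry in metadata.get("files", []):
        name = entry.get("name", "")
        if name.lower().endswith(IMAGE_SUFFIXES):
            # quote() escapes spaces and other unsafe characters in file names.
            urls.append(f"https://archive.org/download/{identifier}/{quote(name)}")
    return sorted(urls)


# Hypothetical record; a real one would come from the metadata endpoint.
sample = {"files": [{"name": "page_0002.jpg"}, {"name": "book.pdf"}, {"name": "page_0001.tif"}]}
for url in page_image_urls("example-item", sample):
    print(url)
```

If an item exposes only a PDF or a zipped image archive, this helper returns an empty list and the pages would have to be extracted from that file instead.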

Modality

Image (scanned book/periodical pages, unannotated)

Size

Thousands of scanned Marathi volumes; millions of page images

License

Public domain / Out of copyright

Format

TIFF / JPEG / PDF (scanned pages)

Language

mr, sa

Update Frequency

static

Organization

Digital Library of India / Internet Archive

Schema

Field	Type	Description
page_image	image	Scanned page image (unannotated, needs OCR ground-truth creation)
volume_title	string	Book or periodical title
publication_year	int	Year of publication
publisher	string	Publisher name
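
The four fields above map naturally onto a small record type. The following is a sketch only: the `PageRecord` class and the `parse_year` helper are hypothetical names, and storing `page_image` as a path or URL string (rather than decoded pixels) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PageRecord:
    page_image: str                  # path or URL of the scanned page image
    volume_title: str
    publication_year: Optional[int]  # None when the year is missing or unparseable
    publisher: str


def parse_year(raw: str) -> Optional[int]:
    """Return a four-digit year if the raw catalogue value starts with one."""
    digits = raw.strip()[:4]
    return int(digits) if len(digits) == 4 and digits.isdigit() else None


record = PageRecord("scans/vol1/p001.tif", "Example Volume", parse_year("1923"), "Example Press")
print(record)
```

Keeping `publication_year` optional avoids inventing a year for volumes whose catalogue entry lacks one.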

Build With This

Create a Marathi OCR training corpus by running existing OCR on DLI scans and crowdsourcing corrections

Develop a historical Marathi font catalog by extracting and classifying typefaces across publishing eras from DLI scans

Build a document quality scorer that rates scan clarity and recommends preprocessing steps for each DLI volume
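
As a rough starting point for the quality scorer idea, the sketch below rates a grayscale page on two simple signals, contrast and edge sharpness, and suggests preprocessing steps when either falls short. Everything here is an assumption made for illustration: the function names, the thresholds, the equal weighting, and the use of plain nested lists in place of a real image library. It also assumes a non-empty image.

```python
from typing import List, Tuple

Gray = List[List[int]]  # rows of 0-255 grayscale values


def contrast(img: Gray) -> float:
    """Standard deviation of pixel intensities; low values suggest a faded scan."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    return (sum((p - mean) ** 2 for p in pixels) / len(pixels)) ** 0.5


def sharpness(img: Gray) -> float:
    """Mean absolute difference between horizontal neighbours; low values suggest blur."""
    diffs = [abs(row[x + 1] - row[x]) for row in img for x in range(len(row) - 1)]
    return sum(diffs) / len(diffs) if diffs else 0.0


def recommend(img: Gray, min_contrast: float = 40.0,
              min_sharpness: float = 10.0) -> Tuple[float, List[str]]:
    """Return a 0-1 score and a list of suggested preprocessing steps."""
    c, s = contrast(img), sharpness(img)
    steps = []
    if c < min_contrast:
        steps.append("contrast stretching or binarization")
    if s < min_sharpness:
        steps.append("sharpening or rescanning")
    # Each signal contributes half of the score, capped once it meets its threshold.
    score = 0.5 * min(c / min_contrast, 1.0) + 0.5 * min(s / min_sharpness, 1.0)
    return score, steps


crisp = [[0, 255, 0, 255], [255, 0, 255, 0]]
faded = [[120, 125, 120, 125], [125, 120, 125, 120]]
print(recommend(crisp))
print(recommend(faded))
```

A production scorer would likely add skew, bleed-through, and noise measures, and would calibrate thresholds per volume against pages that human reviewers have rated.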

AI Use Cases

Raw source material for Marathi OCR training data creation

Historical Marathi document digitization

Font diversity sampling across printing eras

Document degradation modeling from real aged scans

Related Datasets

AIKOSH IIT Bombay Indic Datasets (IndiaAI)

multimodal

Bharat Scene Text Dataset (BSTD)

Image (scene text)

CHIPS - Corpus of Handwritten Indic Scripts (Page-Level OCR)

Image (full-page handwritten documents with detection + recognition annotations)

CMATERdb - Devanagari-Roman Mixed-Script Handwritten Documents

Image (handwritten mixed-script document pages with word-level annotations)

Last verified: 2026-03-12