Digital Library of India - Scanned Marathi Books & Periodicals


MH Specific | MH Subset Needed

Thousands of scanned Marathi books, periodicals, and historical publications hosted on the Internet Archive as part of the Digital Library of India (DLI) project. The collection contains high-resolution page scans in TIFF, JPEG, and PDF formats covering literature, government publications, religious texts, and historical periodicals from 19th- and 20th-century Marathi publishing. These are raw, unannotated scans without OCR ground-truth transcriptions. They are a large source of real-world printed Marathi page images suitable for creating OCR training data, annotating document layout, and digitizing historical texts.

Build a semi-automated annotation pipeline to create OCR ground truth from DLI Marathi scans using existing OCR + human correction.
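Maharashtra subset not yet extracted. This is a global collection that includes material covering Maharashtra. Because the items are scanned books rather than georeferenced records, a regional subset would likely need to be selected through item metadata (for example language or publisher) rather than geographic coordinates or administrative boundaries.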
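The semi-automated annotation pipeline suggested above could start with a triage step like the one sketched here. This is a minimal illustration under stated assumptions, not a reference implementation: `ocr_fn` is a hypothetical callable that wraps whatever OCR engine is available and returns a `(text, confidence)` pair, and the `0.85` threshold is an arbitrary starting value that would need tuning on real scans.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class OcrDraft:
    page_id: str
    text: str
    confidence: float  # assumed to be a mean confidence in [0, 1]


def triage(
    pages: Iterable[str],
    ocr_fn: Callable[[str], Tuple[str, float]],  # hypothetical OCR wrapper
    threshold: float = 0.85,                     # assumed cut-off, tune on real data
) -> Tuple[List[OcrDraft], List[OcrDraft]]:
    """Split OCR drafts into a spot-check pile and a full-correction queue."""
    spot_check, needs_correction = [], []
    for page_id in pages:
        text, confidence = ocr_fn(page_id)
        draft = OcrDraft(page_id, text, confidence)
        if confidence >= threshold:
            spot_check.append(draft)
        else:
            needs_correction.append(draft)
    # Lowest-confidence pages first, so reviewers see the worst output early
    needs_correction.sort(key=lambda d: d.confidence)
    return spot_check, needs_correction
```

Drafts in the first list still need sampling-based human checks before they are treated as ground truth; the split only decides how much reviewer time each page gets.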

Quick Start

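import requests

# Search the Digital Library of India collection for Marathi items
# https://archive.org/details/digitallibraryindia
params = {
    "q": "language:Marathi AND collection:digitallibraryindia",
    "fl[]": ["identifier", "title"],  # fields to return for each item
    "rows": 10,
    "output": "json",
}
resp = requests.get("https://archive.org/advancedsearch.php", params=params, timeout=30)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    # Scan files for each item are served under https://archive.org/download/<identifier>/
    print(doc["identifier"], "-", doc.get("title", ""))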
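Once item identifiers are known, the scan files for one item can be located from its Internet Archive metadata record. The sketch below assumes the JSON returned by `https://archive.org/metadata/<identifier>` has already been fetched and carries a `files` list whose entries have a `name` key. The helper name `scan_file_urls` and the extension list are illustrative choices, and which formats a given DLI item actually provides will vary.

```python
from typing import List
from urllib.parse import quote

# Assumed set of extensions worth keeping for OCR annotation work
SCAN_EXTENSIONS = (".tif", ".tiff", ".jpg", ".jpeg", ".pdf")


def scan_file_urls(identifier: str, metadata: dict) -> List[str]:
    """Return download URLs for the scan files listed in an item's metadata."""
    urls = []
    for entry in metadata.get("files", []):
        name = entry.get("name", "")
        if name.lower().endswith(SCAN_EXTENSIONS):
            # File names may contain spaces, so percent-encode them
            urls.append(f"https://archive.org/download/{identifier}/{quote(name)}")
    return urls
```

A typical call would pass `requests.get(f"https://archive.org/metadata/{identifier}", timeout=30).json()` as the `metadata` argument.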
Modality: Image (scanned book/periodical pages, unannotated)
Size: Thousands of scanned Marathi volumes; millions of page images
License: not specified
Format: TIFF / JPEG / PDF (scanned pages)
Language: mr, sa
Update Frequency: static
Organization: Digital Library of India / Internet Archive

Schema

Field             Type    Description
page_image        image   Scanned page image (unannotated, needs OCR ground-truth creation)
volume_title      string  Book or periodical title
publication_year  int     Year of publication
publisher         string  Publisher name
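For downstream tooling, the schema above could be mirrored as a small typed record. This is only a sketch: the class name `DliPageRecord` is hypothetical, `page_image` is assumed to be stored as a path or URL, and the 1800 to 2000 year check is an assumption drawn from the stated 19th and 20th century coverage, so it should be loosened if catalog metadata turns out to be less tidy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DliPageRecord:
    page_image: str                          # path or URL to the scanned page (assumed)
    volume_title: str                        # book or periodical title
    publication_year: Optional[int] = None   # may be missing in catalog metadata
    publisher: Optional[str] = None

    def __post_init__(self):
        if not self.page_image:
            raise ValueError("page_image is required")
        year = self.publication_year
        # Assumed sanity range based on the collection's stated coverage
        if year is not None and not (1800 <= year <= 2000):
            raise ValueError(f"publication_year out of expected range: {year}")
```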

Build With This

Create a Marathi OCR training corpus by running existing OCR on DLI scans and crowdsourcing corrections
Develop a historical Marathi font catalog by extracting and classifying typefaces across publishing eras from DLI scans
Build a document quality scorer that rates scan clarity and recommends preprocessing steps for each DLI volume
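As a starting point for the document quality scorer idea, the sketch below rates a page by the variance of a 4-neighbour Laplacian (a common blur heuristic) and by its intensity range. It is written in pure Python over a 2D list of grayscale values so that it runs without extra libraries; a real pipeline would more likely use an image library. Both thresholds and the recommendation strings are assumptions that would need calibration on actual DLI scans.

```python
from statistics import pvariance
from typing import List


def laplacian_variance(gray: List[List[int]]) -> float:
    """Variance of the 4-neighbour Laplacian over interior pixels."""
    height, width = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    return pvariance(responses) if responses else 0.0


def recommend_preprocessing(
    gray: List[List[int]],
    blur_threshold: float = 100.0,   # assumed, calibrate on real scans
    contrast_threshold: int = 60,    # assumed, calibrate on real scans
) -> List[str]:
    """Return suggested preprocessing steps for one grayscale page."""
    flat = [value for row in gray for value in row]
    steps = []
    if laplacian_variance(gray) < blur_threshold:
        steps.append("sharpen or rescan")
    if max(flat) - min(flat) < contrast_threshold:
        steps.append("stretch contrast before binarization")
    return steps
```

A per-volume score could then be an aggregate, such as the share of pages that come back with an empty list.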

AI Use Cases

Raw source material for Marathi OCR training data creation
Historical Marathi document digitization
Font diversity sampling across printing eras
Document degradation modeling from real aged scans
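Font diversity sampling across printing eras could be approximated by stratifying volumes on publication decade, as in the sketch below. It assumes each record is a dict with a `publication_year` key as in the schema, skips records where the year is missing, and uses a fixed seed so that a sample can be reproduced. The function name `sample_by_decade` is illustrative.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List


def sample_by_decade(
    records: Iterable[dict], per_decade: int, seed: int = 0
) -> Dict[int, List[dict]]:
    """Pick up to `per_decade` records from each publication decade."""
    buckets = defaultdict(list)
    for record in records:
        year = record.get("publication_year")
        if year is not None:
            buckets[(year // 10) * 10].append(record)

    rng = random.Random(seed)  # seeded for reproducible samples
    return {
        decade: rng.sample(buckets[decade], min(per_decade, len(buckets[decade])))
        for decade in sorted(buckets)
    }
```

Decade is only a proxy for typeface variety; grouping by publisher as well may spread the sample across more printing houses.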
Last verified: 2026-03-12