This collection comprises thousands of scanned Marathi books, periodicals, and historical publications hosted on the Internet Archive as part of the Digital Library of India project. It contains high-resolution page scans in TIFF, JPEG, and PDF formats, covering literature, government publications, religious texts, and historical periodicals from 19th- and 20th-century Marathi publishing. The scans are raw and unannotated, with no OCR ground-truth transcriptions. They are a large source of real-world printed Marathi page images suitable for OCR training data creation, document layout annotation, and historical text digitization projects.
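
    import requests

    # Search the Digital Library of India collection for Marathi items.
    # Collection page: https://archive.org/details/digitallibraryindia
    # This queries the Internet Archive JSON search endpoint rather than scraping the HTML search page.
    url = "https://archive.org/advancedsearch.php"
    params = {"q": "language:Marathi AND collection:digitallibraryindia",
              "fl[]": ["identifier", "title"], "rows": 10, "output": "json"}
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    for doc in response.json()["response"]["docs"]:
        print(doc["identifier"], doc.get("title", ""))
    # Individual pages can then be downloaded as JPEG/TIFF for OCR annotation.

| Field | Type | Description |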
|---|---|---|
| page_image | image | Scanned page image (unannotated, needs OCR ground-truth creation) |
| volume_title | string | Book or periodical title |
| publication_year | int | Year of publication |
| publisher | string | Publisher name |
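
The fields above could be held in a small record type while building an annotation set. The following is a minimal sketch, not part of the collection itself: `ScanRecord`, `in_stated_period`, and `download_url` are hypothetical names, `page_image` is assumed to be a local file path, and the URL helper assumes the usual `https://archive.org/download/<identifier>/<filename>` pattern for Internet Archive items.

```python
from dataclasses import dataclass
from pathlib import Path
from urllib.parse import quote


@dataclass(frozen=True)
class ScanRecord:
    """One scanned page plus the volume-level metadata listed in the table."""
    page_image: Path        # assumption: the scan has been saved to local disk
    volume_title: str
    publication_year: int
    publisher: str

    def in_stated_period(self) -> bool:
        # The collection is described as 19th and 20th century material,
        # so years outside 1801-2000 are worth a manual metadata check.
        return 1801 <= self.publication_year <= 2000


def download_url(identifier: str, filename: str) -> str:
    """Build a download URL for one file of an Internet Archive item."""
    # quote() percent-encodes spaces and other unsafe characters.
    return f"https://archive.org/download/{quote(identifier)}/{quote(filename)}"
```

A record whose `in_stated_period()` returns `False` is not necessarily wrong, since catalogue years in scanned collections are often entered by hand, but it is a cheap way to surface entries for review before annotation begins.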