IndicDLP - Indic Document Layout Parsing

MH Subset Needed

Large-scale document layout analysis dataset of 119,809 annotated document images in 11 Indic languages (including Marathi) plus English. It spans 12 document domains: novels, textbooks, magazines, newspapers, government acts and rules, forms, question papers, certificates, annual reports, handwritten documents, scientific papers, and mixed-content pages. Pages are annotated with 42 physical and logical layout classes, including paragraphs, headings, tables, figures, captions, headers, footers, page numbers, equations, and footnotes. The accompanying paper received the ICDAR 2025 Best Student Paper Runner-Up award.

Build a Marathi document structure analyzer that detects paragraphs, tables, headings, and figures for automated digitization.

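Maharashtra subset not yet extracted. This is a multi-language dataset that includes material relevant to Maharashtra. The schema listed on this page has no geographic fields, so a regional subset is most practically approximated by filtering on the language field (mr for Marathi) rather than on coordinates or administrative boundaries.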


Quick Start

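from datasets import load_dataset

# Repository ID as given on this page; field names follow the Schema section below.
ds = load_dataset("IndicDLP/IndicDLP-dataset")

# Filter for Marathi documents. With input_columns, only the language value
# is passed to the predicate instead of the whole record.
mr_docs = ds["train"].filter(lambda lang: lang == "mr", input_columns=["language"])
print(f"Marathi document pages: {len(mr_docs)}")

# Collect the layout classes present (assumes annotations is a list of dicts).
classes = {a["category"] for anns in mr_docs["annotations"] for a in anns}
print(f"Layout classes: {sorted(classes)}")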

Modality

Image (document pages with layout annotations)

Size

119,809 annotated document pages; 12 domains; 42 layout classes

License

CC BY 4.0

Format

PNG/JPEG with COCO-format annotations
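
In COCO-format annotation files, each bounding box is stored as [x, y, width, height] measured from the top-left corner of the image, and each annotation refers to its class through a category_id. The sketch below shows how such a file might be read into labelled corner boxes. The sample dictionary, file name, and category IDs are invented for illustration, and whether the IndicDLP release follows this exact layout is an assumption based on the "COCO-format" description above.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]


def xywh_to_xyxy(bbox: List[float]) -> Box:
    """Convert a COCO [x, y, width, height] box to (x0, y0, x1, y1) corners."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


def label_boxes(coco: Dict) -> List[Tuple[str, Box]]:
    """Pair every annotation's class name with its corner-format box."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    return [
        (names[a["category_id"]], xywh_to_xyxy(a["bbox"]))
        for a in coco["annotations"]
    ]


# Hypothetical sample in the standard COCO layout (not real IndicDLP data).
sample = {
    "images": [{"id": 1, "file_name": "page_0001.png", "width": 1000, "height": 1400}],
    "categories": [{"id": 1, "name": "paragraph"}, {"id": 2, "name": "table"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [50, 100, 900, 200]},
        {"id": 11, "image_id": 1, "category_id": 2, "bbox": [50, 350, 900, 400]},
    ],
}

labelled = label_boxes(sample)
for name, box in labelled:
    print(name, box)
```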

Language

mr, hi, en, bn, ta, te, kn, ml, gu, pa, or, as

Update Frequency

static

Organization

AI4Bharat / IIT Madras

Schema

Field	Type	Description
image	image	Full-page document scan
annotations	json	COCO-format bounding boxes with layout class labels
category	string	Layout element class (paragraph, heading, table, figure, etc.)
language	string	Document language
domain	string	Document type (novel, textbook, newspaper, form, etc.)
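
To make the schema concrete, the sketch below tallies layout classes per domain for one language, using records shaped like the table above. It is an illustration under assumptions: the records are invented, and because the annotations field is typed as json, the helper accepts either a JSON string or an already-parsed list. The helper names as_list and class_counts are hypothetical.

```python
import json
from collections import Counter


def as_list(annotations):
    """Accept annotations either as a JSON string or as a parsed list of dicts."""
    return json.loads(annotations) if isinstance(annotations, str) else annotations


def class_counts(records, language):
    """Count (domain, category) pairs across all records in one language."""
    counts = Counter()
    for rec in records:
        if rec["language"] != language:
            continue
        for ann in as_list(rec["annotations"]):
            counts[(rec["domain"], ann["category"])] += 1
    return counts


# Invented records that follow the field names in the schema table.
records = [
    {
        "language": "mr",
        "domain": "newspaper",
        "annotations": '[{"category": "heading"}, {"category": "paragraph"}]',
    },
    {
        "language": "mr",
        "domain": "form",
        "annotations": [{"category": "table"}, {"category": "paragraph"}],
    },
    {
        "language": "hi",
        "domain": "novel",
        "annotations": [{"category": "paragraph"}],
    },
]

counts = class_counts(records, "mr")
for (domain, category), n in sorted(counts.items()):
    print(domain, category, n)
```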

Build With This

Create a Marathi document OCR pipeline combining IndicDLP layout detection with text recognition for end-to-end digitization

Develop a government form field extractor using layout analysis to automatically parse structured Marathi forms

Build a Marathi newspaper article segmenter separating headlines, body text, images, and advertisements
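
The first idea above, pairing layout detection with text recognition, needs detected regions to be handed to the recognizer in a sensible order. The following is a minimal sketch of one way to do that, assuming COCO-style [x, y, width, height] boxes and a single-column page. The Region type, the row_tolerance parameter, and the sample coordinates are hypothetical and not part of IndicDLP.

```python
from typing import List, NamedTuple, Tuple


class Region(NamedTuple):
    category: str
    x: float
    y: float
    w: float
    h: float


def reading_order(regions: List[Region], row_tolerance: float = 10.0) -> List[Region]:
    """Order regions top-to-bottom, then left-to-right within a row band.

    Regions whose top edges lie within row_tolerance pixels of the first
    region in the current band are treated as belonging to the same row.
    """
    rows: List[List[Region]] = []
    for r in sorted(regions, key=lambda r: (r.y, r.x)):
        if rows and abs(r.y - rows[-1][0].y) <= row_tolerance:
            rows[-1].append(r)
        else:
            rows.append([r])
    return [r for row in rows for r in sorted(row, key=lambda r: r.x)]


def crop_box(r: Region) -> Tuple[int, int, int, int]:
    """Return (left, upper, right, lower) integer pixel bounds for a region."""
    return (int(r.x), int(r.y), int(r.x + r.w), int(r.y + r.h))


# Invented detections for one page, deliberately out of order.
regions = [
    Region("paragraph", 50, 300, 400, 200),
    Region("footer", 50, 1300, 900, 40),
    Region("figure", 500, 295, 450, 200),
    Region("heading", 50, 100, 900, 60),
]

ordered = reading_order(regions)
for r in ordered:
    print(r.category, crop_box(r))
```

The tuple returned by crop_box follows the (left, upper, right, lower) order that Pillow's Image.crop expects. Multi-column pages such as newspapers would need column detection before this sort.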

AI Use Cases

Document layout analysis for Marathi texts

Table detection and extraction from Marathi documents

Document structure understanding for OCR pipelines

Marathi document digitization preprocessing
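
For detection use cases such as these, predicted boxes are commonly compared with ground-truth boxes using intersection over union (IoU). The sketch below computes IoU for two COCO-style [x, y, width, height] boxes. It is a generic illustration, not the evaluation protocol used by the IndicDLP authors, and the sample boxes are invented.

```python
from typing import Sequence


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two [x, y, width, height] boxes."""
    ax0, ay0, ax1, ay1 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx0, by0, bx1, by1 = b[0], b[1], b[0] + b[2], b[1] + b[3]

    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h

    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


# Invented ground-truth and predicted table boxes.
ground_truth = [0, 0, 10, 10]
prediction = [5, 0, 10, 10]

score = iou(ground_truth, prediction)
print(f"IoU: {score:.3f}")
```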

Related Datasets

AIKOSH IIT Bombay Indic Datasets (IndiaAI)

multimodal

Bharat Scene Text Dataset (BSTD)

Image (scene text)

CHIPS - Corpus of Handwritten Indic Scripts (Page-Level OCR)

Image (full-page handwritten documents with detection + recognition annotations)

CMATERdb - Devanagari-Roman Mixed-Script Handwritten Documents

Image (handwritten mixed-script document pages with word-level annotations)

Last verified: 2026-03-12