IndicDLP - Indic Document Layout Parsing

MH Subset Needed

Large-scale document layout analysis dataset with 119,809 annotated document images across 11 Indic languages (including Marathi) plus English. Spans 12 document domains (novels, textbooks, magazines, newspapers, government acts and rules, forms, question papers, certificates, annual reports, handwritten documents, scientific papers, and mixed-content pages). Annotated with 42 physical and logical layout classes, including paragraphs, headings, tables, figures, captions, headers, footers, page numbers, equations, and footnotes. Recognized as ICDAR 2025 Best Student Paper Runner-Up.

Build a Marathi document structure analyzer that detects paragraphs, tables, headings, and figures for automated digitization.
Maharashtra subset not yet extracted. This is a multilingual dataset that includes Marathi documents. The schema has no geographic fields, so a Maharashtra-relevant subset is extracted by filtering on the language field (mr) rather than on coordinates or administrative boundaries.

Quick Start

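from datasets import load_dataset

# Load the dataset from the Hugging Face Hub
ds = load_dataset("IndicDLP/IndicDLP-dataset")

# Filter for Marathi documents; only the language column is passed to the predicate
mr_docs = ds["train"].filter(lambda lang: lang == "mr", input_columns="language")
print(f"Marathi document pages: {len(mr_docs)}")

# Collect distinct layout classes (assumes each annotation is a dict with a 'category' key)
classes = {a["category"] for anns in mr_docs["annotations"] for a in anns}
print(f"Layout classes: {sorted(classes)}")
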
Modality: Image (document pages with layout annotations)
Size: 119,809 annotated document pages; 12 domains; 42 layout classes
License: (not specified)
Format: PNG/JPEG with COCO-format annotations
Language: mr, hi, en, bn, ta, te, kn, ml, gu, pa, or, as
Update Frequency: static
Organization: AI4Bharat / IIT Madras

Schema

| Field | Type | Description |
|---|---|---|
| image | image | Full-page document scan |
| annotations | json | COCO-format bounding boxes with layout class labels |
| category | string | Layout element class (paragraph, heading, table, figure, etc.) |
| language | string | Document language |
| domain | string | Document type (novel, textbook, newspaper, form, etc.) |
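
To make the annotations field more concrete, the sketch below groups boxes by layout class and converts them to corner coordinates. COCO boxes are stored as [x, y, width, height] with the origin at the top-left of the page. The `bbox` key name and the sample values are assumptions for illustration, since the schema does not spell out the per-annotation keys.

```python
from collections import defaultdict


def coco_to_corners(bbox):
    """Convert a COCO [x, y, width, height] box to (left, top, right, bottom)."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


def boxes_by_category(annotations):
    """Group corner-format boxes by layout class.

    Assumes each annotation is a dict with 'category' and 'bbox' keys;
    the 'bbox' key name is an assumption, not confirmed by the dataset card.
    """
    grouped = defaultdict(list)
    for ann in annotations:
        grouped[ann["category"]].append(coco_to_corners(ann["bbox"]))
    return dict(grouped)


# Made-up annotations for one page, for illustration only
sample = [
    {"category": "heading", "bbox": [50, 40, 500, 60]},
    {"category": "paragraph", "bbox": [50, 120, 500, 300]},
    {"category": "paragraph", "bbox": [50, 440, 500, 200]},
]
regions = boxes_by_category(sample)
print(regions["paragraph"])
```

The (left, top, right, bottom) tuples follow the box order that Pillow's `Image.crop` accepts, so if the image field decodes to a PIL image, each tuple could be passed to `crop` to cut out a region.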

Build With This

Create a Marathi document OCR pipeline combining IndicDLP layout detection with text recognition for end-to-end digitization
Develop a government form field extractor using layout analysis to automatically parse structured Marathi forms
Build a Marathi newspaper article segmenter separating headlines, body text, images, and advertisements
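
The first idea above, layout detection followed by text recognition, might be wired together roughly as follows. This is a minimal sketch under stated assumptions: regions are dicts with `category` and a COCO-style `bbox`, the reading order is a naive single-column heuristic, and `recognize` is a hypothetical callable standing in for whichever OCR engine is chosen. None of these names come from IndicDLP itself.

```python
def reading_order(regions, line_tolerance=20):
    """Sort regions top-to-bottom, then left-to-right.

    Top edges are bucketed into horizontal bands of `line_tolerance` pixels;
    regions in the same band are ordered by their left edge. This is a
    naive heuristic that ignores multi-column layouts.
    """
    return sorted(
        regions,
        key=lambda r: (r["bbox"][1] // line_tolerance, r["bbox"][0]),
    )


def digitize_page(regions, recognize, text_classes=("heading", "paragraph")):
    """Run `recognize` on each text region in reading order.

    `recognize` is a hypothetical callable (region -> str) standing in for
    an OCR engine; it is not part of IndicDLP.
    """
    return [
        (r["category"], recognize(r))
        for r in reading_order(regions)
        if r["category"] in text_classes
    ]


def stub_ocr(region):
    """Placeholder recognizer that only reports where the region starts."""
    return f"<text at y={region['bbox'][1]}>"


# Made-up regions for one page, for illustration only
page = [
    {"category": "paragraph", "bbox": [50, 120, 500, 300]},
    {"category": "figure", "bbox": [50, 440, 500, 200]},
    {"category": "heading", "bbox": [50, 40, 500, 60]},
]
result = digitize_page(page, stub_ocr)
print(result)
```

Keeping the recognizer as a plain callable means the layout step and the OCR step can be swapped independently. Newspaper pages with several columns would need a column-aware ordering instead of the band heuristic used here.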

AI Use Cases

Document layout analysis for Marathi texts
Table detection and extraction from Marathi documents
Document structure understanding for OCR pipelines
Marathi document digitization preprocessing
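
For use cases such as table detection, predicted boxes are usually compared with the annotated boxes by intersection over union (IoU). The sketch below computes IoU for two COCO-style [x, y, width, height] boxes. The box values are made up, and any acceptance threshold (0.5 is a common convention) is an assumption rather than something the dataset prescribes.

```python
def iou(box_a, box_b):
    """Intersection over union of two COCO [x, y, width, height] boxes."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh

    # Overlap is zero when the boxes do not intersect on either axis
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


truth = [100, 100, 200, 100]      # made-up ground-truth table box
predicted = [150, 100, 200, 100]  # made-up detector output
score = iou(truth, predicted)
print(f"IoU = {score:.2f}")
```
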
Last verified: 2026-03-12