Large-scale document layout analysis dataset covering 119,809 annotated document images across 11 Indic languages (including Marathi) plus English. Spans 12 document domains (novels, textbooks, magazines, newspapers, government acts and rules, forms, question papers, certificates, annual reports, handwritten documents, scientific papers, and mixed-content pages). Annotated with 42 physical and logical layout classes, including paragraphs, headings, tables, figures, captions, headers, footers, page numbers, equations, and footnotes. The accompanying paper received the Best Student Paper Runner-Up award at ICDAR 2025.

```python
from datasets import load_dataset

ds = load_dataset("IndicDLP/IndicDLP-dataset")

# Filter for Marathi documents
mr_docs = [x for x in ds['train'] if x['language'] == 'mr']
print(f"Marathi document pages: {len(mr_docs)}")

# Collect the distinct layout classes that appear on the Marathi pages
layout_classes = {a['category'] for doc in mr_docs for a in doc['annotations']}
print(f"Layout classes: {layout_classes}")
```

| Field | Type | Description |
|---|---|---|
| image | image | Full-page document scan |
| annotations | json | COCO-format bounding boxes with layout class labels |
| category | string | Layout element class (paragraph, heading, table, figure, etc.) |
| language | string | Document language |
| domain | string | Document type (novel, textbook, newspaper, form, etc.) |
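
The fields above can be combined to summarise which layout classes occur in each document domain. The following is a minimal sketch rather than an official utility. It assumes each record's `annotations` decodes to a list of dicts with a `category` key. It also assumes a `bbox` key holding the standard COCO `[x, y, width, height]` box, which the table does not list. The helper names and the sample records are hypothetical and only illustrate the record shape.

```python
from collections import Counter


def coco_to_corners(bbox):
    """Convert a COCO [x, y, width, height] box to (x_min, y_min, x_max, y_max)."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


def class_counts_by_domain(records):
    """Count how often each layout class appears within each document domain."""
    counts = {}
    for rec in records:
        domain_counter = counts.setdefault(rec["domain"], Counter())
        domain_counter.update(a["category"] for a in rec["annotations"])
    return counts


# Hypothetical records that mimic the documented fields (images omitted,
# box values made up; the "bbox" key is an assumption based on COCO).
sample = [
    {"language": "mr", "domain": "newspaper", "annotations": [
        {"category": "heading", "bbox": [40, 30, 500, 60]},
        {"category": "paragraph", "bbox": [40, 100, 500, 300]},
    ]},
    {"language": "mr", "domain": "newspaper", "annotations": [
        {"category": "paragraph", "bbox": [40, 30, 500, 700]},
    ]},
    {"language": "en", "domain": "form", "annotations": [
        {"category": "table", "bbox": [10, 20, 300, 150]},
    ]},
]

counts = class_counts_by_domain(sample)
print(counts["newspaper"])
print(coco_to_corners(sample[0]["annotations"][0]["bbox"]))
```

If the real records follow this shape, the same `class_counts_by_domain` call should work on `ds['train']` directly, since it only reads the `domain` and `annotations` fields.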