A curated Marathi text corpus of 2,157,109 words across 678 titles, collected from books, magazines, and newspapers by the Linguistic Data Consortium for Indian Languages (LDC-IL), which is hosted at the Central Institute of Indian Languages (CIIL) under the Ministry of Education. The newspaper-sourced text can serve as ground truth for synthetic OCR training data: render the text in Devanagari fonts, apply degradation effects, and pair each rendered image with its source text. The corpus is also useful as language-model training text for post-OCR beam search decoding and error correction.
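
To illustrate the error-correction use, the following is a minimal sketch of a character-bigram language model with add-one smoothing that ranks competing OCR hypotheses. It is not part of the LDC-IL distribution: the function names `train_char_bigrams` and `log_prob` are hypothetical, and the three short strings are stand-ins for real corpus lines. A production decoder would more likely use a larger n-gram or neural model trained on the full corpus.

```python
import math
from collections import Counter


def train_char_bigrams(lines):
    """Count bigram contexts and character bigrams, using ^ and $ as line boundaries."""
    unigrams, bigrams = Counter(), Counter()
    for line in lines:
        padded = "^" + line + "$"
        unigrams.update(padded[:-1])             # every character that starts a bigram
        bigrams.update(zip(padded, padded[1:]))  # adjacent character pairs
    return unigrams, bigrams


def log_prob(text, unigrams, bigrams, vocab_size):
    """Add-one smoothed log-probability of `text` under the bigram model."""
    padded = "^" + text + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        total += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
    return total


# Stand-in lines (assumption): in practice these would come from the corpus text.
corpus_lines = ["मराठी भाषा", "मराठी वृत्तपत्र", "भाषा आणि साहित्य"]
unigrams, bigrams = train_char_bigrams(corpus_lines)
vocab_size = len({ch for line in corpus_lines for ch in line} | {"$"})

# Two hypothetical OCR outputs for the same word image; keep the likelier one.
candidates = ["मराठी", "मराठि"]
best = max(candidates, key=lambda c: log_prob(c, unigrams, bigrams, vocab_size))
print(best)
```

In a beam search decoder, the same score would typically be added, with a tunable weight, to the recognizer's own score for each partial hypothesis.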
```python
# Download the corpus from https://data.ldcil.org/
# (registration is required for access to LDC-IL resources).
import xml.etree.ElementTree as ET

# Parse the LDC-IL Marathi corpus once it has been downloaded.
# tree = ET.parse('ldcil_marathi_corpus.xml')
print("LDC-IL Marathi: 2.16M words from books, magazines, newspapers")
print("Use as ground-truth text for synthetic OCR data generation")
```

| Field | Type | Description |
|---|---|---|
| text | string | Marathi text content |
| source_type | string | Source type (book, magazine, newspaper) |
| title | string | Source title |
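
As a sketch of how these fields might feed synthetic OCR data generation, the snippet below models one record and extracts short ground-truth lines from newspaper records only. The `CorpusRecord` class, the `newspaper_lines` helper, the `max_chars` parameter, and the sample records are all assumptions made for illustration; the actual LDC-IL file layout may differ. Rendering and degradation are left out, since they depend on the chosen imaging library and fonts.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusRecord:
    """One corpus entry, mirroring the three fields in the table above."""
    text: str
    source_type: str  # "book", "magazine", or "newspaper"
    title: str


def newspaper_lines(records, max_chars=40):
    """Yield NFC-normalized lines of at most `max_chars` code points from newspaper records.

    Lines break only at whitespace, so a single word longer than `max_chars`
    is kept whole rather than split.
    """
    for rec in records:
        if rec.source_type != "newspaper":
            continue
        line = ""
        for word in unicodedata.normalize("NFC", rec.text).split():
            candidate = f"{line} {word}".strip()
            if line and len(candidate) > max_chars:
                yield line
                line = word
            else:
                line = candidate
        if line:
            yield line


# Placeholder records (assumption): real text and titles come from the corpus.
records = [
    CorpusRecord("मराठी भाषा आणि साहित्य", "book", "placeholder book title"),
    CorpusRecord("आज शहरात पाऊस पडला", "newspaper", "placeholder newspaper title"),
]
lines = list(newspaper_lines(records, max_chars=10))
for line in lines:
    print(line)
```

Each yielded line would then be rendered to an image and stored alongside the line itself as the paired label. Normalizing to NFC first keeps the labels consistent when the same Devanagari sequence is encoded in different ways.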