LDC-IL Marathi Gold Standard Text Corpus

MH Specific

A curated Marathi text corpus of 2,157,109 words across 678 titles, collected from books, magazines, and newspapers by the Linguistic Data Consortium for Indian Languages (LDC-IL) under the Ministry of Electronics and IT. The newspaper-sourced text can serve as ground truth for synthetic OCR training data: render it in Devanagari fonts, apply degradation effects, and keep each rendered image paired with its source text. The corpus is also useful for training a language model for beam search decoding during OCR and for post-OCR error correction.

Generate millions of synthetic Marathi OCR training images by rendering LDC-IL text in diverse Devanagari fonts with degradation.
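Before any rendering, the corpus text has to be cut into line-sized strings that become the labels of the image-text pairs. The following is a minimal sketch of that text-side step only, assuming whitespace-delimited words and a hypothetical 30-character line limit; the function name, the sample sentence, and the image file naming are illustrative and not part of the LDC-IL distribution. Font rendering and degradation are left to whatever imaging library you use.

```python
import textwrap
import unicodedata


def to_render_lines(text: str, max_chars: int = 40) -> list[str]:
    """Split text into short lines suitable as OCR ground-truth labels."""
    # Normalize so visually identical strings share one code point sequence.
    normalized = unicodedata.normalize("NFC", text)
    # Collapse newlines, tabs, and repeated spaces into single spaces.
    collapsed = " ".join(normalized.split())
    # Break only at whitespace so no word (and no conjunct) is split mid-way.
    return textwrap.wrap(
        collapsed,
        width=max_chars,
        break_long_words=False,
        break_on_hyphens=False,
    )


# Illustrative sample sentence, not taken from the corpus.
sample = "महाराष्ट्र हे भारतातील एक राज्य आहे. मुंबई ही महाराष्ट्राची राजधानी आहे."
for index, line in enumerate(to_render_lines(sample, max_chars=30)):
    # Hypothetical naming scheme: one image file per ground-truth line.
    print(f"line_{index:04d}.png\t{line}")
```

Note that `max_chars` counts Unicode code points, not rendered glyph width, so the limit is only a rough proxy for how wide a line will be once rendered.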

Quick Start

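# Download from https://data.ldcil.org/
# Register for access to LDC-IL resources
import xml.etree.ElementTree as ET
from pathlib import Path

# Parse one LDC-IL Marathi corpus file.
# 'ldcil_marathi_corpus.xml' is a placeholder; use the path of a file from your download.
corpus_path = Path("ldcil_marathi_corpus.xml")
if corpus_path.exists():
    root = ET.parse(str(corpus_path)).getroot()
    # Gather all text nodes without assuming any particular tag names
    text = " ".join(chunk.strip() for chunk in root.itertext() if chunk.strip())
    print(f"Loaded {len(text.split()):,} whitespace-separated tokens")
else:
    print("Corpus file not found; download it from https://data.ldcil.org/ first")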
Modality: Text (Marathi)
Size: 2,157,109 words; 678 titles
License: (not specified)
Format: XML / Text
Language: mr
Update Frequency: static
Organization: LDC-IL, Ministry of Electronics and IT, Government of India

Schema

Field        Type    Description
text         string  Marathi text content
source_type  string  Source type (book, magazine, newspaper)
title        string  Source title
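To show how these three fields might be carried through a pipeline, here is a small sketch of an in-memory record type with a filter on source_type. The class name, the helper, and the two sample records are hypothetical; the on-disk XML layout is not documented here, so mapping the downloaded files onto this type is left open.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class CorpusRecord:
    """One corpus entry, mirroring the three schema fields."""
    text: str         # Marathi text content
    source_type: str  # "book", "magazine", or "newspaper"
    title: str        # source title


def texts_of_type(records: Iterable[CorpusRecord], source_type: str) -> Iterator[str]:
    """Yield the text of every record whose source_type matches."""
    wanted = source_type.strip().lower()
    for record in records:
        if record.source_type.strip().lower() == wanted:
            yield record.text


# Made-up placeholder records, not real corpus entries.
records = [
    CorpusRecord("पुण्यात आज पाऊस पडला.", "newspaper", "example-newspaper"),
    CorpusRecord("एक होता राजा.", "book", "example-book"),
]
print(list(texts_of_type(records, "newspaper")))
```

Filtering on source_type is what lets the newspaper subset be pulled out for newspaper-style OCR rendering while the full corpus remains available for language modeling.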

Build With This

Create a synthetic Marathi newspaper OCR dataset by rendering LDC-IL newspaper text in newspaper-style layouts with column formatting
Develop a Marathi OCR vocabulary and language model for beam search decoding using LDC-IL's curated 2.1M-word corpus
Build a Marathi conjunct character frequency analyzer using LDC-IL text to identify rare jodakshara needing special OCR attention
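The conjunct frequency idea can be sketched with a regular expression over Unicode code points. This is a simplified illustration, assuming a conjunct is a run of consonants joined by virama (U+094D), with an optional nukta (U+093C) after each consonant. It does not handle ZWJ or ZWNJ after the virama, so forms such as the Marathi eyelash ra are missed. The function names are hypothetical.

```python
import re
from collections import Counter

VIRAMA = "\u094D"
# Devanagari consonants KA..HA plus the precomposed nukta consonants,
# each optionally followed by a combining nukta sign.
CONSONANT = "[\u0915-\u0939\u0958-\u095F]\u093C?"
# One or more consonant+virama pairs followed by a final consonant.
CONJUNCT_RE = re.compile(f"(?:{CONSONANT}{VIRAMA})+{CONSONANT}")


def conjunct_counts(text: str) -> Counter:
    """Count every consonant cluster (jodakshara) found in the text."""
    return Counter(CONJUNCT_RE.findall(text))


def rarest_conjuncts(counts: Counter, n: int = 10) -> list[tuple[str, int]]:
    """Return the n least frequent conjuncts, rarest first."""
    return sorted(counts.items(), key=lambda item: item[1])[:n]


# Illustrative sample, not taken from the corpus.
counts = conjunct_counts("महाराष्ट्र राज्य महाराष्ट्र")
for conjunct, frequency in rarest_conjuncts(counts):
    print(conjunct, frequency)
```

Run over the full corpus, the tail of this list points at clusters that a synthetic data generator may need to oversample.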

AI Use Cases

Ground-truth text source for synthetic OCR data generation
Language model training for post-OCR correction
Marathi text normalization and spell-checking
Vocabulary coverage analysis for OCR lexicon building
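As a rough illustration of the language-model use case, the sketch below trains a character bigram model with add-one smoothing and uses it to pick the most plausible of several OCR hypotheses. It is a deliberately small stand-in for a real beam search decoder or correction model: the function names are hypothetical, the training string is a toy example rather than corpus text, and the scores are only comparable between hypotheses of similar length.

```python
import math
from collections import Counter


def train_char_bigrams(text: str) -> tuple[Counter, Counter, int]:
    """Collect context counts, bigram counts, and the character inventory size."""
    contexts = Counter(text[:-1])
    bigrams = Counter(zip(text, text[1:]))
    return contexts, bigrams, len(set(text))


def log_score(candidate: str, model: tuple[Counter, Counter, int]) -> float:
    """Sum of add-one smoothed log P(next char | previous char)."""
    contexts, bigrams, inventory = model
    total = 0.0
    for prev, nxt in zip(candidate, candidate[1:]):
        # Unseen bigrams get a small non-zero probability instead of zero.
        total += math.log((bigrams[(prev, nxt)] + 1) / (contexts[prev] + inventory))
    return total


def best_hypothesis(hypotheses: list[str], model: tuple[Counter, Counter, int]) -> str:
    """Pick the hypothesis the language model finds most plausible."""
    return max(hypotheses, key=lambda hyp: log_score(hyp, model))


# Toy training text; in practice this would be the corpus text.
model = train_char_bigrams("मराठी भाषा मराठी भाषा मराठी भाषा")
print(best_hypothesis(["मराढी", "मराठी"], model))
```

In a real decoder the language model score would be combined with the recognizer's own confidence rather than used alone.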
Last verified: 2026-03-12