LDC-IL Marathi Gold Standard Text Corpus

MH Specific

Curated Marathi text corpus of 2,157,109 words across 678 titles, collected from books, magazines, and newspapers by the Linguistic Data Consortium for Indian Languages (LDC-IL), a Government of India initiative. The newspaper-sourced portion can serve as ground-truth text for synthetic OCR training data: render the text in Devanagari fonts, apply degradation effects, and keep each rendered image paired with its source text. The corpus is also useful for training a language model for post-OCR beam search decoding and error correction.

Generate millions of synthetic Marathi OCR training images by rendering LDC-IL text in diverse Devanagari fonts with degradation.


Quick Start

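# Download from https://data.ldcil.org/ (registration is required for LDC-IL resources)
import xml.etree.ElementTree as ET
from pathlib import Path

# Placeholder file name; point this at the corpus file you downloaded
corpus_path = Path("ldcil_marathi_corpus.xml")

if corpus_path.exists():
    # Parse the corpus and count whitespace-separated words across all elements
    root = ET.parse(corpus_path).getroot()
    n_words = sum(len(chunk.split()) for chunk in root.itertext())
    print(f"LDC-IL Marathi: {n_words:,} words loaded from {corpus_path}")
else:
    print("LDC-IL Marathi: about 2.16M words from books, magazines, newspapers")
    print("Download the corpus first, then use it as ground-truth text for synthetic OCR data generation")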
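
Synthetic OCR data generation starts with deciding which text to render and with what settings. The following is a minimal sketch of that sampling stage only. `RenderJob`, `make_render_jobs`, and the parameter ranges are hypothetical names and values chosen for illustration, the font names are examples rather than a recommendation, and the sample sentence is made up rather than taken from the corpus. Rasterization itself is left out because it needs a renderer with complex-script shaping for Devanagari.

```python
import random
import textwrap
from dataclasses import dataclass

# Example font names only (assumption); use the Devanagari fonts installed locally.
FONTS = ["Noto Sans Devanagari", "Lohit Devanagari", "Mukta"]


@dataclass(frozen=True)
class RenderJob:
    text: str           # ground-truth label for the rendered image
    font: str           # font family to render with
    font_size: int      # point size
    blur_radius: float  # blur strength, 0 means no blur
    noise_level: float  # fraction of pixels to corrupt


def make_render_jobs(corpus_text, max_chars=40, seed=0):
    """Split text into line-sized samples and attach random render settings."""
    rng = random.Random(seed)  # seeded so the job list is reproducible
    jobs = []
    # break_long_words=False avoids cutting a word in the middle of a cluster
    for line in textwrap.wrap(corpus_text, width=max_chars, break_long_words=False):
        jobs.append(RenderJob(
            text=line,
            font=rng.choice(FONTS),
            font_size=rng.randint(18, 36),
            blur_radius=round(rng.uniform(0.0, 1.5), 2),
            noise_level=round(rng.uniform(0.0, 0.05), 3),
        ))
    return jobs


# Made-up sample sentence, not corpus content
sample = "महाराष्ट्र राज्यातील वृत्तपत्रे आणि मासिके यांमधून गोळा केलेला मजकूर"
for job in make_render_jobs(sample, max_chars=30):
    print(job)
```

Each job carries the exact string to draw, so the rendered image and its label stay paired no matter which degradation settings are sampled.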

Modality

Text (Marathi)

Size

2,157,109 words; 678 titles

License

Gov Open (LDC-IL terms)

Format

XML / Text

Language

Marathi

Update Frequency

static

Organization

LDC-IL, Central Institute of Indian Languages (CIIL), Ministry of Education, Government of India

Schema

Field	Type	Description
text	string	Marathi text content
source_type	string	Source type (book, magazine, newspaper)
title	string	Source title
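
The three schema fields can be loaded into simple records as sketched below. The XML layout shown (`corpus`, `doc`, and child elements named after the schema fields) is an assumption made for illustration; the real LDC-IL files may use different element names, so adjust the tag names after inspecting a downloaded file. `Record`, `load_records`, and the sample content are hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Record:
    text: str         # Marathi text content
    source_type: str  # book, magazine, or newspaper
    title: str        # source title


# Hypothetical layout for illustration; the real corpus files may differ.
SAMPLE_XML = """
<corpus>
  <doc>
    <title>नमुना शीर्षक</title>
    <source_type>newspaper</source_type>
    <text>हा एक नमुना मजकूर आहे.</text>
  </doc>
  <doc>
    <title>दुसरे शीर्षक</title>
    <source_type>book</source_type>
    <text>हे दुसरे उदाहरण आहे.</text>
  </doc>
</corpus>
"""


def load_records(xml_string):
    """Turn each <doc> element into a Record, tolerating missing children."""
    root = ET.fromstring(xml_string)
    records = []
    for doc in root.iter("doc"):
        records.append(Record(
            text=(doc.findtext("text") or "").strip(),
            source_type=(doc.findtext("source_type") or "").strip(),
            title=(doc.findtext("title") or "").strip(),
        ))
    return records


records = load_records(SAMPLE_XML)
newspaper_only = [r for r in records if r.source_type == "newspaper"]
print(len(records), "records,", len(newspaper_only), "from newspapers")
```

Filtering on `source_type` is the step that isolates newspaper text for newspaper-style OCR data.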

Build With This

Create a synthetic Marathi newspaper OCR dataset by rendering LDC-IL newspaper text in newspaper-style layouts with column formatting

Develop a Marathi OCR vocabulary and language model for beam search decoding from LDC-IL's curated corpus of about 2.16M words

Build a Marathi conjunct character frequency analyzer using LDC-IL text to identify rare jodakshara needing special OCR attention
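
The conjunct frequency idea can be sketched with the standard library alone. This version treats a conjunct as two or more Devanagari consonants chained by the virama (U+094D), which is a simplifying assumption: it ignores vowel signs attached to the cluster and does not know which clusters a given font draws as a ligature. `conjunct_frequencies` and `rare_conjuncts` are hypothetical helper names, and the sample words are illustrative rather than corpus content.

```python
import re
from collections import Counter

VIRAMA = "\u094D"  # Devanagari virama (halant), which joins consonants
# One consonant: U+0915-U+0939 or the nukta letters U+0958-U+095F,
# optionally followed by the combining nukta sign U+093C.
CONSONANT = "[\u0915-\u0939\u0958-\u095F]\u093C?"
# Two or more consonants chained by viramas; ZWJ/ZWNJ after a virama is tolerated.
CONJUNCT_RE = re.compile(f"{CONSONANT}(?:{VIRAMA}[\u200C\u200D]?{CONSONANT})+")


def conjunct_frequencies(text):
    """Count every consonant cluster (conjunct) found in the text."""
    return Counter(CONJUNCT_RE.findall(text))


def rare_conjuncts(freqs, max_count=1):
    """Return conjuncts seen at most max_count times, in sorted order."""
    return sorted(c for c, n in freqs.items() if n <= max_count)


# Illustrative words only
sample = "प्रश्न प्रेम क्षत्रिय विद्यार्थी"
freqs = conjunct_frequencies(sample)
for conjunct, count in freqs.most_common():
    print(conjunct, count)
print("rare:", rare_conjuncts(freqs))
```

On the full corpus, the `max_count` threshold would need tuning so that the rare list stays small enough to oversample in synthetic training data.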

AI Use Cases

Ground-truth text source for synthetic OCR data generation

Language model training for post-OCR correction

Marathi text normalization and spell-checking

Vocabulary coverage analysis for OCR lexicon building
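
As one way to picture language model training for post-OCR correction, the sketch below fits a character bigram model with add-one smoothing and uses it to pick the more plausible of two OCR hypotheses. This is a deliberately small stand-in, not the model a production decoder would use; `CharBigramLM` and `rescore` are hypothetical names, and the training lines are illustrative phrases rather than corpus content.

```python
import math
from collections import Counter


class CharBigramLM:
    """Character bigram model with add-one smoothing."""

    def __init__(self, corpus_lines):
        self.unigrams = Counter()  # counts of context characters
        self.bigrams = Counter()   # counts of adjacent character pairs
        self.vocab = set()
        for line in corpus_lines:
            chars = ["<s>"] + list(line) + ["</s>"]
            self.vocab.update(chars)
            self.unigrams.update(chars[:-1])
            self.bigrams.update(zip(chars, chars[1:]))

    def log_prob(self, text):
        """Smoothed log-probability of the text, including end-of-line."""
        chars = ["<s>"] + list(text) + ["</s>"]
        v = len(self.vocab) + 1  # +1 reserves mass for unseen characters
        return sum(
            math.log((self.bigrams[(a, b)] + 1) / (self.unigrams[a] + v))
            for a, b in zip(chars, chars[1:])
        )


def rescore(lm, hypotheses):
    """Pick the hypothesis the language model finds most likely."""
    return max(hypotheses, key=lm.log_prob)


# Illustrative training lines; in practice these would come from the corpus
lm = CharBigramLM(["मराठी भाषा", "मराठी वृत्तपत्र", "भाषा आणि साहित्य"])
# Second hypothesis swaps the visually similar letters भ and म
hypotheses = ["भाषा", "माषा"]
best = rescore(lm, hypotheses)
print(best)
```

In a beam search decoder the same score would typically be combined with the recognizer's own confidence rather than used alone.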

Related Datasets

AI4Bharat BPCC (mr)

parallel-text

AI4Bharat IndicCorp v1 (mr)

text

AI4Bharat IndicCorp v2 (Marathi)

text

AI4Bharat IndicGLUE (mr)

text

Last verified: 2026-03-12