RoundTripOCR is a large-scale post-OCR error correction dataset of 1.58 million Marathi sentence pairs (noisy OCR output paired with corrected ground truth) for training OCR post-processing models. The pairs were generated with a round-trip translation approach through Hindi/Nepali to create realistic OCR-like errors. The dataset enables training mBART, mT5, and other sequence-to-sequence models to automatically correct Devanagari OCR errors, including character substitutions, missing matras, broken conjuncts, and segmentation artifacts.
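The error categories above can be illustrated with a minimal, self-contained sketch. The confusion pairs and corruption rules below are hypothetical illustrations of what substitution, matra-deletion, and segmentation errors look like in Devanagari text; they are not drawn from the dataset's actual generation pipeline:

```python
# Sketch of the OCR error types the dataset covers, applied to Devanagari text.
# Confusion pairs here are illustrative, not the dataset's actual error model.
import random

MATRAS = set("\u093e\u093f\u0940\u0941\u0942\u0947\u0948\u094b\u094c")  # vowel signs
CONFUSIONS = {"\u092c": "\u0935", "\u0918": "\u092d"}  # e.g. visually similar pairs

def corrupt(text: str, error_type: str, rng: random.Random) -> str:
    """Apply one OCR-like corruption of the given type."""
    chars = list(text)
    if error_type == "substitution":  # swap visually confusable consonants
        return "".join(CONFUSIONS.get(c, c) for c in chars)
    if error_type == "deletion":      # drop a vowel sign (missing matra)
        idx = [i for i, c in enumerate(chars) if c in MATRAS]
        if idx:
            del chars[rng.choice(idx)]
        return "".join(chars)
    if error_type == "segmentation":  # spurious mid-word split
        if len(chars) > 2:
            chars.insert(len(chars) // 2, " ")
        return "".join(chars)
    return text

rng = random.Random(0)
word = "\u092c\u093e\u0932"  # a short Devanagari word
print(corrupt(word, "substitution", rng))
print(corrupt(word, "deletion", rng))
print(corrupt(word, "segmentation", rng))
```

Pairing each corrupted string with its original yields (noisy, clean) training pairs of the same shape as the dataset's rows.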
```python
# Clone from https://github.com/harshvivek14/RoundTripOCR
import pandas as pd
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load noisy-clean sentence pairs
df = pd.read_csv('roundtripocr_marathi.tsv', sep='\t')
print(f"Marathi sentence pairs: {len(df)}")
print(f"Sample noisy: {df['noisy_text'].iloc[0]}")
print(f"Sample clean: {df['clean_text'].iloc[0]}")

# Fine-tune mBART or mT5 for post-OCR correction
# model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50")
```

| Field | Type | Description |
|---|---|---|
| noisy_text | string | Simulated OCR output with realistic errors |
| clean_text | string | Corrected ground truth text |
| error_type | string | Type of OCR error (substitution, deletion, insertion, segmentation) |
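As a sketch of how these three fields might be consumed downstream, the snippet below tallies the error-type distribution and extracts source/target pairs for a seq2seq trainer. The rows are illustrative placeholders, not actual dataset entries:

```python
# Sketch: consume the (noisy_text, clean_text, error_type) schema with pandas.
# The two rows below are made-up placeholders standing in for real entries.
import pandas as pd

df = pd.DataFrame({
    "noisy_text": ["\u0935\u093e\u0932", "\u092c\u0932"],
    "clean_text": ["\u092c\u093e\u0932", "\u092c\u093e\u0932"],
    "error_type": ["substitution", "deletion"],
})

# Error-type distribution, useful for stratified sampling or error analysis
counts = df["error_type"].value_counts().to_dict()
print(counts)

# Source/target pairs in the form a seq2seq fine-tuning loop expects
pairs = list(zip(df["noisy_text"], df["clean_text"]))
print(len(pairs))
```

Stratifying by `error_type` lets a model's correction quality be evaluated separately per error category rather than only in aggregate.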