RoundTripOCR - Post-OCR Error Correction Dataset

A large-scale post-OCR error correction dataset of 1.58 million Marathi sentence pairs, each pairing noisy OCR output with its corrected ground truth, for training OCR post-processing models. The pairs are generated with a round-trip approach: clean text is rendered to images and read back by an OCR engine, so the noise resembles the errors OCR makes in practice. The data supports fine-tuning mBART, mT5, and other sequence-to-sequence models to correct Devanagari OCR errors, including character substitutions, missing matras, broken conjuncts, and segmentation artifacts.

Train a Marathi OCR post-processor that automatically corrects common recognition errors in Devanagari text.
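
The error classes named above are easier to reason about at the code point level. The sketch below is only an illustration using Python's standard `unicodedata` module; the helper name `codepoints` is hypothetical and not part of the dataset or its repository. It shows that a matra is stored as a separate code point after its consonant, and that a conjunct is stored as consonant + virama + consonant, which is why a dropped matra or virama appears as a single-character deletion.

```python
import unicodedata


def codepoints(text: str) -> list[str]:
    """Return the Unicode name of every code point in `text`."""
    return [unicodedata.name(ch) for ch in text]


# Consonant KA followed by the dependent vowel sign (matra) I.
with_matra = "\u0915\u093f"
# Conjunct written as KA + VIRAMA + SSA.
conjunct = "\u0915\u094d\u0937"

print(codepoints(with_matra))
print(codepoints(conjunct))

# Losing the matra removes exactly one code point from the string.
missing_matra = with_matra[:1]
print(len(with_matra) - len(missing_matra))
```

Edit distances computed on Python strings therefore count code points, not visible syllables, which is worth keeping in mind when reading error statistics.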

Quick Start

# Clone the data from https://github.com/harshvivek14/RoundTripOCR first
import pandas as pd
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the noisy/clean sentence pairs (tab-separated, one pair per row)
df = pd.read_csv('roundtripocr_marathi.tsv', sep='\t')
print(f"Marathi sentence pairs: {len(df):,}")
print(f"Sample noisy: {df['noisy_text'].iloc[0]}")
print(f"Sample clean: {df['clean_text'].iloc[0]}")

# Starting point for fine-tuning mBART (or mT5) on post-OCR correction
# tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50")
# model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50")

Modality: Text (parallel noisy-clean sentence pairs)
Size: 1.58M Marathi sentence pairs (also includes Hindi 3.1M, Nepali 2.97M)
License: (not specified)
Format: TSV/CSV
Language: mr, hi, ne
Update Frequency: static
Organization: IIT Bombay

Schema

Field        Type     Description
noisy_text   string   Simulated OCR output with realistic errors
clean_text   string   Corrected ground truth text
error_type   string   Type of OCR error (substitution, deletion, insertion, segmentation)
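
Before training, it can help to confirm that a file actually has the three columns listed above and to see how the error types are distributed. The following is a minimal sketch using only the standard library; the function name `tally_error_types` is hypothetical, and the in-memory rows are invented placeholders rather than records from the dataset.

```python
import csv
import io
from collections import Counter

EXPECTED_FIELDS = ["noisy_text", "clean_text", "error_type"]


def tally_error_types(handle) -> Counter:
    """Check the header against the schema and count rows per error_type."""
    # QUOTE_NONE treats quote characters inside sentences as ordinary text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = [f for f in EXPECTED_FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return Counter(row["error_type"] for row in reader)


# Invented placeholder rows standing in for a real TSV file.
sample = io.StringIO(
    "noisy_text\tclean_text\terror_type\n"
    "\u0915\t\u0915\u093f\tdeletion\n"
    "\u0916\u093f\t\u0915\u093f\tsubstitution\n"
    "\u0915\u093f\u093f\t\u0915\u093f\tinsertion\n"
    "\u0915\t\u0915\u093e\tdeletion\n"
)
counts = tally_error_types(sample)
print(counts.most_common())
```

For a file on disk, the same function could be called with `open(path, encoding="utf-8", newline="")` in place of the `StringIO` object.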

Build With This

Create an end-to-end Marathi OCR pipeline with an integrated post-correction stage, aiming for near-human accuracy
Develop a Devanagari OCR error taxonomy that identifies the most common character confusions, then build targeted corrections for them
Build a quality scoring system that rates confidence in OCR output and flags passages needing human review
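
The taxonomy and quality-scoring ideas above can both be prototyped with `difflib` from the standard library. This is a rough sketch under stated assumptions, not the dataset authors' method: `confusion_counts` and `needs_review` are hypothetical helper names, the alignment works on code points rather than visible syllables, and the 0.9 threshold is an arbitrary illustrative value. The review flag assumes a trained corrector already exists and treats a heavily rewritten passage as one a person should check.

```python
from collections import Counter
from difflib import SequenceMatcher


def confusion_counts(pairs) -> Counter:
    """Count (clean_span, noisy_span) differences over (noisy, clean) pairs."""
    counts = Counter()
    for noisy, clean in pairs:
        matcher = SequenceMatcher(None, clean, noisy, autojunk=False)
        for tag, c0, c1, n0, n1 in matcher.get_opcodes():
            if tag != "equal":
                # An empty noisy span is a deletion, an empty clean span an insertion.
                counts[(clean[c0:c1], noisy[n0:n1])] += 1
    return counts


def needs_review(ocr_text: str, corrected_text: str, threshold: float = 0.9) -> bool:
    """Flag a passage when the corrector changed a large share of it."""
    matcher = SequenceMatcher(None, ocr_text, corrected_text, autojunk=False)
    return matcher.ratio() < threshold


# Invented example: the OCR output lost the matra after KA.
pairs = [("\u0915", "\u0915\u093f")]
table = confusion_counts(pairs)
print(table.most_common(5))
print(needs_review("abcd", "abcd"), needs_review("abcd", "wxyz"))
```

Sorting the resulting counter by frequency gives a first view of which confusions dominate; grouping the keys by the dataset's `error_type` column would be a natural next step.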

AI Use Cases

Post-OCR error correction for Marathi text
OCR output quality improvement
Devanagari text normalization
Noisy text correction for downstream NLP tasks
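
For the normalization use case, one plausible preprocessing step is to bring both sides of every pair into the same Unicode normalization form, so that visually identical strings are not counted as errors. The sketch below is an assumption about reasonable preprocessing, not something the dataset prescribes; `normalize_devanagari` is a hypothetical helper. It deliberately leaves zero-width joiners untouched, since they can affect how Marathi conjuncts render.

```python
import re
import unicodedata


def normalize_devanagari(text: str) -> str:
    """Apply NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


# The precomposed letter QA and the sequence KA + NUKTA are canonically
# equivalent, so both spellings normalize to the same string.
precomposed = "\u0958"
decomposed = "\u0915\u093c"
print(normalize_devanagari(precomposed) == normalize_devanagari(decomposed))
print(normalize_devanagari("  \u0915\u093f \t \u0915\u093e \n"))
```

Applying the same function to `noisy_text` and `clean_text` keeps the two columns comparable code point by code point.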
Last verified: 2026-03-12