RoundTripOCR is a large-scale post-OCR error correction dataset of 1.58 million Marathi sentence pairs (noisy OCR output paired with corrected ground truth) for training OCR post-processing models. The pairs were generated with a round-trip translation approach through Hindi/Nepali to create realistic OCR-like errors. The dataset enables training mBART, mT5, and other sequence-to-sequence models to automatically correct Devanagari OCR errors, including character substitutions, missing matras, broken conjuncts, and segmentation artifacts.
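The error categories above can be illustrated with a minimal, self-contained sketch. The confusion pairs and corruption rules below are hypothetical illustrations of what substitution, matra-deletion, and segmentation errors look like in Devanagari text; they are not drawn from the dataset's actual generation pipeline:

```python
# Sketch of the OCR error types the dataset covers, applied to Devanagari text.
# Confusion pairs here are illustrative, not the dataset's actual error model.
import random

MATRAS = set("\u093e\u093f\u0940\u0941\u0942\u0947\u0948\u094b\u094c")  # vowel signs
CONFUSIONS = {"\u092c": "\u0935", "\u0918": "\u092d"}  # e.g. visually similar pairs

def corrupt(text: str, error_type: str, rng: random.Random) -> str:
    """Apply one OCR-like corruption of the given type."""
    chars = list(text)
    if error_type == "substitution":  # swap visually confusable consonants
        return "".join(CONFUSIONS.get(c, c) for c in chars)
    if error_type == "deletion":      # drop a vowel sign (missing matra)
        idx = [i for i, c in enumerate(chars) if c in MATRAS]
        if idx:
            del chars[rng.choice(idx)]
        return "".join(chars)
    if error_type == "segmentation":  # spurious mid-word split
        if len(chars) > 2:
            chars.insert(len(chars) // 2, " ")
        return "".join(chars)
    return text

rng = random.Random(0)
word = "\u092c\u093e\u0932"  # a short Devanagari word
print(corrupt(word, "substitution", rng))
print(corrupt(word, "deletion", rng))
print(corrupt(word, "segmentation", rng))
```

Pairing each corrupted string with its original yields (noisy, clean) training pairs of the same shape as the dataset's rows.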
```python
# Clone from https://github.com/harshvivek14/RoundTripOCR
import pandas as pd
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load noisy-clean sentence pairs
df = pd.read_csv('roundtripocr_marathi.tsv', sep='\t')
print(f"Marathi sentence pairs: {len(df)}")
print(f"Sample noisy: {df['noisy_text'].iloc[0]}")
print(f"Sample clean: {df['clean_text'].iloc[0]}")

# Fine-tune mBART or mT5 for post-OCR correction
# model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50")
```

| Field | Type | Description |
|---|---|---|
| noisy_text | string | Simulated OCR output with realistic errors |
| clean_text | string | Corrected ground truth text |
| error_type | string | Type of OCR error (substitution, deletion, insertion, segmentation) |
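As a sketch of how these three fields might be consumed downstream, the snippet below tallies the error-type distribution and extracts source/target pairs for a seq2seq trainer. The rows are illustrative placeholders, not actual dataset entries:

```python
# Sketch: consume the (noisy_text, clean_text, error_type) schema with pandas.
# The two rows below are made-up placeholders standing in for real entries.
import pandas as pd

df = pd.DataFrame({
    "noisy_text": ["\u0935\u093e\u0932", "\u092c\u0932"],
    "clean_text": ["\u092c\u093e\u0932", "\u092c\u093e\u0932"],
    "error_type": ["substitution", "deletion"],
})

# Error-type distribution, useful for stratified sampling or error analysis
counts = df["error_type"].value_counts().to_dict()
print(counts)

# Source/target pairs in the form a seq2seq fine-tuning loop expects
pairs = list(zip(df["noisy_text"], df["clean_text"]))
print(len(pairs))
```

Stratifying by `error_type` lets a model's correction quality be evaluated separately per error category rather than only in aggregate.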