TrOCR Marathi Printed Text Dataset

MH Specific

A curated dataset of printed Marathi text images for training TrOCR (Transformer-based OCR) models. It contains 2,671 line-level and 8,077 word-level PNG images (10,748 in total) extracted from printed Marathi documents, each paired with a Unicode ground-truth transcription. The dataset is designed for fine-tuning pre-trained vision-language models on printed Marathi text recognition and covers a range of font styles and document types.

Fine-tune a TrOCR model on this Marathi dataset for high-accuracy printed Devanagari text recognition.

Quick Start

# Clone from https://github.com/MubashirTanwar/TrOCR
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image  # used to open the PNG line/word crops

# Load the pre-trained printed-text TrOCR checkpoint as the starting point
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

# Fine-tuning on the Marathi word/line images starts from these two objects
print("TrOCR Marathi: 10K+ printed text images for fine-tuning")
Modality: Image (printed text line/word crops with transcriptions)
Size: 2,671 line images + 8,077 word images
License:
Format: PNG with text labels
Language: mr (Marathi)
Update Frequency: static
Organization: Community

Schema

Field | Type | Description
image | image | Cropped line or word image from printed Marathi text
text | string | Unicode ground-truth transcription in Marathi
level | string | Annotation level (line or word)
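
The three fields above map naturally onto a small record type. The following is a minimal sketch, assuming each sample is available as a dict with the keys image, text and level. The Sample class and the parse_record and group_by_level helpers are hypothetical names, and the image value is treated as a file path purely for illustration, since the dataset's storage layout is not described here. Transcriptions are normalized to Unicode NFC so that equivalent Devanagari sequences compare equal.

```python
import unicodedata
from collections import defaultdict
from dataclasses import dataclass

VALID_LEVELS = {"line", "word"}  # the two annotation levels named in the schema


@dataclass(frozen=True)
class Sample:
    image: str  # assumption: a path to the PNG crop
    text: str   # Unicode ground-truth transcription
    level: str  # "line" or "word"


def parse_record(record: dict) -> Sample:
    """Validate one raw record and normalize its transcription to NFC."""
    level = record["level"].strip().lower()
    if level not in VALID_LEVELS:
        raise ValueError(f"unexpected level: {record['level']!r}")
    text = unicodedata.normalize("NFC", record["text"].strip())
    return Sample(image=record["image"], text=text, level=level)


def group_by_level(records) -> dict:
    """Group parsed samples by annotation level, e.g. {"line": [...], "word": [...]}."""
    groups = defaultdict(list)
    for record in records:
        sample = parse_record(record)
        groups[sample.level].append(sample)
    return dict(groups)
```

Keeping line and word samples in separate groups makes it easy to fine-tune or evaluate on one level at a time.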

Build With This

Create a comparative study of TrOCR vs. CRNN vs. PaddleOCR performance on Marathi printed text
Develop a data augmentation pipeline expanding this dataset with synthetic variations for more robust training
Build an active learning system that identifies the hardest Marathi words for OCR and prioritizes them for annotation
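
For the comparative study, one common metric is character error rate (CER): the edit distance between prediction and ground truth, divided by the ground-truth length. The sketch below assumes each OCR system has been wrapped as a callable that maps an image to a string. compare_systems and those callables are hypothetical, not part of TrOCR, CRNN or PaddleOCR. Distance is counted over Unicode code points, so a Devanagari consonant and its vowel sign count as two characters.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over Unicode code points (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses) -> float:
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return total_edits / total_chars


def compare_systems(samples, systems) -> dict:
    """samples: list of (image, text) pairs; systems: {name: callable(image) -> str}."""
    references = [text for _, text in samples]
    return {
        name: cer(references, [predict(image) for image, _ in samples])
        for name, predict in systems.items()
    }
```

The same per-sample edit distance could also be used to rank words by how often they are misrecognized, which is one possible starting point for the active learning idea.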

AI Use Cases

Fine-tuning TrOCR and similar vision-language models for Marathi
Printed Marathi text line recognition
Benchmarking OCR systems on Marathi printed text
Transfer learning baseline for Devanagari OCR
Last verified: 2026-03-12