TrOCR Marathi Printed Text Dataset

MH Specific

A curated dataset of printed Marathi text images for training TrOCR (Transformer-based OCR) models. It contains 2,671 line-level and 8,077 word-level PNG images (10,748 in total) extracted from printed Marathi documents, each paired with a Unicode ground-truth transcription. The dataset is intended for fine-tuning pre-trained vision-language models on printed Marathi text recognition and covers a range of font styles and document types.

Fine-tune a TrOCR model on this Marathi dataset for high-accuracy printed Devanagari text recognition.

Quick Start

# Clone from https://github.com/MubashirTanwar/TrOCR
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

# Load the pre-trained printed-text checkpoint as the starting point for fine-tuning
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")
# Fine-tuning on the Marathi word and line images (10,748 in total) starts from these weights
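
The Quick Start stops after loading the checkpoint. As a minimal sketch of the next step, the helper below shows one common way to turn a single image and its transcription into training inputs. The function names, the max_target_length default of 64, and the choice to return labels as a plain list are assumptions for illustration, not part of the dataset or its repository. Masking padded label positions with -100 follows the usual PyTorch convention for positions the loss should ignore.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy loss skips by default


def mask_padding(token_ids, pad_token_id):
    """Replace padding ids with IGNORE_INDEX so padded positions add no loss."""
    return [IGNORE_INDEX if t == pad_token_id else t for t in token_ids]


def encode_example(processor, image, text, max_target_length=64):
    """Turn one (image, transcription) pair into model inputs.

    `processor` is expected to be a TrOCRProcessor and `image` an RGB PIL image.
    max_target_length=64 is an illustrative default, not a dataset property.
    """
    # The processor resizes and normalizes the image; [0] drops the batch axis.
    pixel_values = processor(images=image, return_tensors="pt").pixel_values[0]
    # The tokenizer pads or truncates the transcription to a fixed length.
    token_ids = processor.tokenizer(
        text, padding="max_length", max_length=max_target_length, truncation=True
    ).input_ids
    return {
        "pixel_values": pixel_values,
        "labels": mask_padding(token_ids, processor.tokenizer.pad_token_id),
    }
```

With the processor loaded as in the Quick Start, a call would look like encode_example(processor, Image.open(path).convert("RGB"), transcription), where path and transcription come from one dataset record. The labels list would still need converting to a tensor before being batched for training.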

Modality

Image (printed text line/word crops with transcriptions)

Size

2,671 line images + 8,077 word images

License

Not specified

Format

PNG with text labels

Language

Marathi

Update Frequency

static

Organization

Community

Schema

Field	Type	Description
image	image	Cropped line or word image from printed Marathi text
text	string	Unicode ground-truth transcription in Marathi
level	string	Annotation level (line or word)
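
The schema above can be mirrored in code as a small record type with basic validation. This is only a sketch: the tab-separated manifest, the file paths, and the names Sample and read_manifest are hypothetical, since the on-disk layout of the dataset is not described here.

```python
import csv
import io
from collections import Counter
from dataclasses import dataclass

VALID_LEVELS = {"line", "word"}


@dataclass(frozen=True)
class Sample:
    image: str  # path to the PNG crop (assumed; the schema only says "image")
    text: str   # Unicode ground-truth transcription in Marathi
    level: str  # "line" or "word"


def read_manifest(handle):
    """Parse a hypothetical tab-separated manifest with image, text, level columns."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    samples = []
    for row in reader:
        level = row["level"].strip().lower()
        text = row["text"].strip()
        if level not in VALID_LEVELS:
            raise ValueError(f"unexpected level {level!r} for {row['image']}")
        if not text:
            raise ValueError(f"empty transcription for {row['image']}")
        samples.append(Sample(image=row["image"], text=text, level=level))
    return samples


# In-memory demo manifest; the paths are made up for illustration.
demo = io.StringIO(
    "image\ttext\tlevel\n"
    "lines/0001.png\tमराठी भाषा\tline\n"
    "words/0001.png\tमराठी\tword\n"
)
samples = read_manifest(demo)
level_counts = Counter(s.level for s in samples)
```

Counting records per level, as in the last line, is a quick check against the published sizes of 2,671 line images and 8,077 word images.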

Build With This

Create a comparative study of TrOCR vs. CRNN vs. PaddleOCR performance on Marathi printed text

Develop a data augmentation pipeline expanding this dataset with synthetic variations for more robust training

Build an active learning system that identifies the hardest Marathi words for OCR and prioritizes them for annotation
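
Both the comparative study and the active learning idea need a per-sample error measure. A common choice for OCR is character error rate (CER): the edit distance between the prediction and the reference, divided by the reference length. The sketch below is one possible implementation, assuming CER is computed over NFC-normalized Unicode code points. Counting grapheme clusters instead would be an equally reasonable choice for Devanagari and would give different numbers. The function names are illustrative.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, using two rolling rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate over NFC-normalized code points."""
    ref = unicodedata.normalize("NFC", ref)
    hyp = unicodedata.normalize("NFC", hyp)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def rank_hardest(pairs):
    """Sort (reference, prediction) pairs from highest to lowest CER."""
    scored = [(cer(ref, hyp), ref, hyp) for ref, hyp in pairs]
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

Running rank_hardest over the word-level predictions of any OCR system puts the words it struggles with most at the top, which could seed an annotation queue. Averaging cer over the same pairs gives a single score for comparing systems.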

AI Use Cases

Fine-tuning TrOCR and similar vision-language models for Marathi

Printed Marathi text line recognition

Benchmarking OCR systems on Marathi printed text

Transfer learning baseline for Devanagari OCR

Related Datasets

AIKOSH IIT Bombay Indic Datasets (IndiaAI)

multimodal

Bharat Scene Text Dataset (BSTD)

Image (scene text)

CHIPS - Corpus of Handwritten Indic Scripts (Page-Level OCR)

Image (full-page handwritten documents with detection + recognition annotations)

CMATERdb - Devanagari-Roman Mixed-Script Handwritten Documents

Image (handwritten mixed-script document pages with word-level annotations)

Last verified: 2026-03-12