Curated dataset of printed Marathi text images for training TrOCR (Transformer-based OCR) models. It contains 2,671 line-level and 8,077 word-level PNG images (10,748 in total) extracted from printed Marathi documents, each paired with a Unicode ground-truth transcription. The dataset is intended for fine-tuning pre-trained TrOCR-style vision encoder-decoder models on printed Marathi text recognition, and it covers a range of font styles and document types.

```python
# Clone from https://github.com/MubashirTanwar/TrOCR
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

# Load the pre-trained printed-text TrOCR checkpoint as the starting point for fine-tuning
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

# Fine-tune on the Marathi word/line images
print("TrOCR Marathi: 10K+ printed text images for fine-tuning")
```

| Field | Type | Description |
|---|---|---|
| image | image | Cropped line or word image from printed Marathi text |
| text | string | Unicode ground-truth transcription in Marathi |
| level | string | Annotation level (line or word) |
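
As a rough illustration of how the three fields fit together, the sketch below groups records by `level` and checks that each `text` value is written in Devanagari before it is used for training. It relies only on the Python standard library. The sample records, the placeholder file names, and the helper names `is_devanagari_text` and `split_by_level` are assumptions made for this example and are not part of the dataset or its repository.

```python
import unicodedata
from collections import defaultdict

# Hypothetical sample records that follow the field table above. The image
# field is a placeholder file name here, not real pixel data.
SAMPLE_RECORDS = [
    {"image": "line_0001.png", "text": "मराठी भाषा सुंदर आहे", "level": "line"},
    {"image": "word_0001.png", "text": "मराठी", "level": "word"},
    {"image": "word_0002.png", "text": "भाषा", "level": "word"},
]

# The two annotation levels named in the field table.
VALID_LEVELS = {"line", "word"}


def is_devanagari_text(text):
    """Return True if text has at least one letter or combining mark and
    every letter or combining mark in it is a Devanagari character."""
    checked = 0
    for ch in text:
        # Skip spaces, digits, punctuation and format characters.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            continue
        if "DEVANAGARI" not in unicodedata.name(ch, ""):
            return False
        checked += 1
    return checked > 0


def split_by_level(records):
    """Group records into a dict keyed by their level value."""
    groups = defaultdict(list)
    for record in records:
        level = record["level"]
        if level not in VALID_LEVELS:
            raise ValueError(f"unexpected level: {level!r}")
        groups[level].append(record)
    return dict(groups)


if __name__ == "__main__":
    groups = split_by_level(SAMPLE_RECORDS)
    for level, items in sorted(groups.items()):
        clean = [r for r in items if is_devanagari_text(r["text"])]
        print(f"{level}: {len(clean)} of {len(items)} records pass the script check")
```

Keeping line and word records in separate groups makes it straightforward to train or evaluate on one annotation level at a time, and the script check is a cheap way to catch transcriptions that were accidentally romanized.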