AIKOSH IIT Bombay Indic Datasets (IndiaAI)

A collection of 16 Indic-language datasets from IIT Bombay, hosted on IndiaAI's AIKosh platform as part of the BharatGen initiative. It includes handwritten and printed Devanagari script images, scanned table recognition data, 78+ hours of multilingual audio, QA pairs, and math word problems. The collection covers Marathi and 9 other Indian languages.

Build a multimodal Marathi document understanding system using AIKosh vision-language datasets.

Quick Start

# Access from https://aikosh.indiaai.gov.in/
print('AIKosh - India AI Indic Datasets')
print('Register at aikosh.indiaai.gov.in for access')

Modality: multimodal
Size: 16 datasets; handwritten/printed script images, 78+ hrs audio, QA pairs
License: (not listed)
Format: Various (images, audio, text)
Language: mr, hi, en
Update Frequency: static
Organization: IIT Bombay / IndiaAI Mission

Schema

Field     Type    Description
image     image   Image file for vision tasks
label     string  Classification label or annotation
language  string  Language code for text components
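
The three schema fields can be modeled as a small record type, which makes it easier to filter samples by language before training. The sketch below is illustrative only: the `Sample` class, the helper functions, the file paths, and the label values `printed` and `handwritten` are all hypothetical, since the actual on-disk layout of the AIKosh downloads is not described here.

```python
from collections import Counter
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Sample:
    """One record following the schema above (hypothetical container)."""
    image: Path     # path to the image file for vision tasks
    label: str      # classification label or annotation
    language: str   # language code for text components, e.g. "mr"


def filter_by_language(samples, language):
    """Return only the samples whose language code matches."""
    return [s for s in samples if s.language == language]


def label_counts(samples):
    """Count how many samples carry each label."""
    return Counter(s.label for s in samples)


# Made-up example records; real paths and labels will differ.
samples = [
    Sample(Path("images/form_001.png"), "printed", "mr"),
    Sample(Path("images/note_002.png"), "handwritten", "mr"),
    Sample(Path("images/sign_003.png"), "printed", "hi"),
]

marathi = filter_by_language(samples, "mr")
print(len(marathi), "Marathi samples")
print(dict(label_counts(marathi)))
```

Keeping the image as a path rather than loaded pixel data lets the same record type work with whichever image library is chosen later.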

Build With This

Create a Marathi visual question answering model for document images like government forms and certificates
Develop an Indian scene text reader that handles Devanagari signage in Maharashtra urban environments
Build a Marathi image captioning model trained on Indian-context visual data from AIKosh

AI Use Cases

Marathi LLM fine-tuning for government chatbots
OCR of scanned land records and government documents
Handwritten Devanagari recognition
Speech recognition for rural users
Marathi math word problem solving

Last verified: 2026-03-09