Data Gaps Analysis

This analysis compares the Marathi and Maharashtra data ecosystem against well-resourced languages and regions. The gaps represent opportunities: datasets that could be created, gathered, or digitized to unlock new AI capabilities for Marathi speakers.

Categories: Language & NLP; Speech & Audio; Vision, OCR & Multimodal; Geospatial & GIS; Agriculture & Rural; Health & Nutrition; Education & Skills; Economy, Labour & Finance; Environment, Climate & Disaster; Transport & Urban Infrastructure; Governance, Census & Legal; Culture, Media & Heritage; Real-Time Streams & APIs; Agentic, Instruction & RAG; Benchmarks, Tools & Dialects
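
Every entry in this analysis has the same shape: a title, a status (missing or sparse), a category, a description of the gap, a global benchmark, and an opportunity. The following is a minimal Python sketch of that record and of tallying entries by status. The names `Gap` and `tally_by_status` are hypothetical, and only two entries are loaded for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Gap:
    """One entry of the gap analysis (field names are illustrative)."""
    title: str
    status: str       # "missing" or "sparse"
    category: str
    description: str
    benchmark: str
    opportunity: str


def tally_by_status(gaps):
    """Count how many entries carry each status label."""
    return Counter(g.status for g in gaps)


# Two entries copied from this analysis; the full list is not reproduced here.
GAPS = [
    Gap(
        title="Conversational / Dialogue datasets",
        status="missing",
        category="Language & NLP",
        description="No dedicated multi-turn Marathi dialogue datasets exist.",
        benchmark="English has DailyDialog, PersonaChat, MultiWOZ.",
        opportunity="Marathi chatbots, virtual assistants, customer service AI.",
    ),
    Gap(
        title="Text-to-Speech (multi-speaker)",
        status="sparse",
        category="Speech & Audio",
        description="No multi-speaker, multi-style TTS corpus.",
        benchmark="English has LibriTTS, VCTK.",
        opportunity="Marathi voice assistants and audiobook generation.",
    ),
]

counts = tally_by_status(GAPS)
print(counts["missing"], counts["sparse"], sum(counts.values()))
```

Keeping the status and category as plain fields makes it straightforward to filter or group the entries by either one.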

Conversational / Dialogue datasets

Missing | Language & NLP

No dedicated multi-turn Marathi dialogue or conversational datasets exist.

Global Benchmark

English has DailyDialog, PersonaChat, MultiWOZ; Hindi has limited options.

Opportunity

Critical for building Marathi chatbots, virtual assistants, and customer service AI.

Commonsense reasoning

Missing | Language & NLP

No Marathi commonsense knowledge graphs or reasoning datasets.

Global Benchmark

English has ConceptNet, ATOMIC, WinoGrande, HellaSwag.

Opportunity

Needed for Marathi LLMs to understand cultural context and implicit knowledge.

Hate speech / toxicity (large-scale)

Sparse | Language & NLP

L3Cube-MahaHate exists but is small. No large-scale, multi-platform toxicity dataset.

Global Benchmark

English has Jigsaw (1.8M comments), HateXplain with rationale annotations.

Opportunity

Essential for content moderation on Marathi social media platforms.

Summarization corpus

Sparse | Language & NLP

IndicSentenceSummarization includes Marathi, but no long-document summarization dataset exists.

Global Benchmark

English has CNN/DailyMail, XSum; Chinese has LCSTS.

Opportunity

Required for Marathi news summarization, document understanding, and report generation.

Emotional speech

Missing | Speech & Audio

No Marathi speech emotion recognition (SER) dataset with emotion labels.

Global Benchmark

English has IEMOCAP, RAVDESS, CREMA-D; Hindi has limited options.

Opportunity

Needed for call center sentiment analysis, mental health monitoring, and accessibility.

Noisy / real-world speech

Missing | Speech & Audio

Existing Marathi ASR datasets are mostly read speech. No noisy, spontaneous, or code-mixed speech data.

Global Benchmark

English has CHiME, VoxCeleb, AMI Meeting Corpus.

Opportunity

Real-world Marathi speech recognition requires training data recorded in market, street, and farm environments.

Text-to-Speech (multi-speaker)

Sparse | Speech & Audio

IndicTTS has 1 Marathi speaker. No multi-speaker, multi-style TTS corpus.

Global Benchmark

English has LibriTTS (2,456 speakers), VCTK (110 speakers).

Opportunity

Needed for natural-sounding Marathi voice assistants and audiobook generation.

Devanagari scene text in the wild

Sparse | Vision, OCR & Multimodal

BSTD (5K+ Marathi words), IndicSTR12 (27K+), and ICDAR MLT-2019 exist but are modest in size. No large-scale Marathi-only street sign or billboard dataset exists.

Global Benchmark

Chinese has CTW, RCTW with 100K+ street images.

Opportunity

Required for Marathi navigation apps, automated sign reading, and smart city infrastructure.

Medical imaging with Marathi reports

Missing | Vision, OCR & Multimodal

No paired medical image + Marathi radiology report dataset exists.

Global Benchmark

English has MIMIC-CXR (377K images), CheXpert.

Opportunity

Would enable AI-assisted radiology reporting in Marathi for rural hospital networks.

Document layout analysis (Marathi)

Sparse | Vision, OCR & Multimodal

IndicDLP covers 119K pages across 12 Indic languages including Marathi with 42 layout classes, but no Marathi-specific dataset with government form field-level (key-value) annotations exists. No FUNSD/XFUND equivalent for Indian languages.

Global Benchmark

English has PubLayNet, DocBank, FUNSD. Chinese has CDLA. XFUND covers 7 languages but no Indian languages.

Opportunity

Key for digitizing Maharashtra government records (7/12 extracts, certificates) and automating form processing.
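
To make "field-level (key-value) annotations" concrete, here is a minimal Python sketch modeled loosely on the FUNSD layout, where each entity has a bounding box, a label (`question`, `answer`, `header`, or `other`), and links to related entities. The class `FormEntity`, the helper `key_value_pairs`, and the sample field values are hypothetical illustrations, not taken from any existing Marathi dataset.

```python
from dataclasses import dataclass, field


@dataclass
class FormEntity:
    """One annotated text region on a scanned form (FUNSD-style sketch)."""
    id: int
    text: str
    box: tuple          # (x0, y0, x1, y1) in pixels
    label: str          # "question", "answer", "header", or "other"
    linking: list = field(default_factory=list)  # [(from_id, to_id), ...]


def key_value_pairs(entities):
    """Return (key text, value text) pairs by following question -> answer links."""
    by_id = {e.id: e for e in entities}
    pairs = []
    for e in entities:
        if e.label != "question":
            continue
        for src, dst in e.linking:
            target = by_id.get(dst)
            if src == e.id and target is not None and target.label == "answer":
                pairs.append((e.text, target.text))
    return pairs


# Made-up example: a "village" field and its filled-in value.
entities = [
    FormEntity(0, "गाव", (40, 120, 110, 150), "question", [(0, 1)]),
    FormEntity(1, "शिरूर", (130, 120, 260, 150), "answer", [(0, 1)]),
    FormEntity(2, "महाराष्ट्र शासन", (40, 30, 400, 70), "header"),
]

pairs = key_value_pairs(entities)
print(pairs)
```

A dataset annotated at this level lets a model learn which printed label each handwritten or typed value belongs to, which layout boxes alone do not capture.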

Annotated Marathi newspaper scan corpus

Missing | Vision, OCR & Multimodal

No annotated dataset of scanned Marathi newspaper pages with OCR ground-truth transcriptions. IndicDLP has ~540 Marathi newspaper pages with layout boxes but no text labels. Raw scans exist in Digital Library of India archives.

Global Benchmark

Hindi has FIRE-RISOT (100K newspaper articles). Chinese has extensive newspaper OCR corpora.

Opportunity

Newspapers are a primary source of printed Marathi text. Would massively boost printed OCR model quality.

Marathi-English bilingual document OCR

Missing | Vision, OCR & Multimodal

No dedicated dataset for OCR on bilingual Marathi-English documents (government forms, exam papers, bilingual signage). CMATERdb has 150 Devanagari-Roman handwritten pages but no printed bilingual document data.

Global Benchmark

AksharaOCR provides Sinhala-English mixed OCR (24K+ lines). No equivalent for Marathi-English.

Opportunity

Most Maharashtra government forms are bilingual. Critical for real-world document digitization.

Comprehensive Devanagari conjunct character dataset

Sparse | Vision, OCR & Multimodal

MKI-26 covers 20 conjunct classes; Sanskrit Letter Dataset has 602 classes but only ~13 images each; DevChar has 4M characters with conjuncts. No single dataset comprehensively covers all ~360 common Devanagari conjuncts with sufficient samples per class for deep learning.

Global Benchmark

Latin OCR has complete glyph coverage. Chinese has stroke-level datasets for all common characters.

Opportunity

Conjuncts are the #1 error source in Devanagari OCR. A focused dataset would directly improve Marathi OCR accuracy.
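
A first step toward such a dataset is measuring which conjuncts a text source actually covers. The sketch below is a simplified Python illustration: it treats a conjunct as a run of consonants joined by the virama (U+094D), ignores nukta forms and zero-width joiners, and flags classes below a sample threshold. The function names and the threshold are assumptions made for the example.

```python
import re
from collections import Counter

VIRAMA = "\u094D"               # Devanagari sign virama (halant)
CONSONANT = "[\u0915-\u0939]"   # ka .. ha; nukta consonants are ignored here

# A conjunct: consonant followed by one or more (virama + consonant) units.
CONJUNCT_RE = re.compile(f"{CONSONANT}(?:{VIRAMA}{CONSONANT})+")


def conjunct_counts(text):
    """Count occurrences of each conjunct cluster found in the text."""
    return Counter(CONJUNCT_RE.findall(text))


def under_sampled(counts, min_samples):
    """Return conjuncts seen fewer than min_samples times, sorted."""
    return sorted(c for c, n in counts.items() if n < min_samples)


# Two words written with escapes: "kshatriya" and "kshan".
sample = (
    "\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F "
    "\u0915\u094D\u0937\u0923"
)

counts = conjunct_counts(sample)
rare = under_sampled(counts, min_samples=2)
print(counts, rare)
```

Run over a large corpus or over the labels of an image dataset, the same tally shows which conjunct classes need more samples before training.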

High-res building footprints for Maharashtra

Sparse | Geospatial & GIS

OpenStreetMap has partial coverage. No comprehensive, official building footprint dataset.

Global Benchmark

US has Microsoft Building Footprints (125M). Google Open Buildings covers Africa.

Opportunity

Needed for urban planning, disaster response, and property tax assessment in Maharashtra.

Agricultural land parcel boundaries

Missing | Geospatial & GIS

No digitized, geo-referenced field boundary dataset for Maharashtra farmland.

Global Benchmark

EU has LPIS with field-level boundaries. US has CLU (Common Land Unit).

Opportunity

Would transform precision agriculture, crop insurance, and land record modernization.

Electronic health records (Marathi)

Missing | Health & Nutrition

No anonymized Marathi EHR or clinical notes dataset.

Global Benchmark

English has MIMIC-III/IV. Some de-identified Chinese EHR datasets exist.

Opportunity

Essential for clinical NLP, drug interaction detection, and Marathi health chatbots.

Nutrition / dietary survey (Maharashtra-specific)

Sparse | Health & Nutrition

NFHS has state-level aggregates but no individual-level dietary intake data for Maharashtra.

Global Benchmark

US has NHANES with individual-level dietary recall. UK has NDNS.

Opportunity

Needed for personalized nutrition apps, malnutrition early warning, and school meal planning.

Marathi educational question banks

Missing | Education & Skills

No structured dataset of Marathi exam questions mapped to curriculum topics.

Global Benchmark

English has ARC, SciQ, MMLU. Chinese has GAOKAO benchmark.

Opportunity

Would power adaptive learning platforms and automated assessment in Marathi medium schools.

Student learning outcome data

Sparse | Education & Skills

ASER and NAS provide samples but no continuous, longitudinal learning data.

Global Benchmark

Many OECD nations have PISA longitudinal follow-ups and national learning management data.

Opportunity

Critical for evidence-based education policy and personalized learning interventions.

Marathi financial news corpus

Missing | Economy, Labour & Finance

No labeled Marathi financial sentiment or event dataset.

Global Benchmark

English has Financial PhraseBank, FiQA. Chinese has FinNL.

Opportunity

Needed for Marathi stock market analysis tools and financial news summarization.

MSME / startup registry with structured data

Sparse | Economy, Labour & Finance

Udyam has registration data and Startup India lists 25K+ Maharashtra startups, mentors, and incubators, but no structured dataset with financials, outcomes, or survival data.

Global Benchmark

US has SEC filings + Crunchbase. UK has Companies House full data.

Opportunity

Would enable MSME lending models, market analysis, and entrepreneurship research.

Crop disease images (Maharashtra varieties)

Missing | Agriculture & Rural

No labeled image dataset of diseases on Maharashtra-specific crops (sugarcane, jowar, bajra, grapes).

Global Benchmark

PlantVillage has 50K images but mainly Western crops. China has custom rice/wheat datasets.

Opportunity

Would enable phone-based crop disease detection apps for Maharashtra farmers.

Farm-level input/output economics

Sparse | Agriculture & Rural

ICRISAT VDSA covers some villages. No broad, recent farm-level cost-of-cultivation microdata exists for Maharashtra.

Global Benchmark

US has ARMS (Agricultural Resource Management Survey) at field level.

Opportunity

Essential for input subsidy optimization, credit risk modeling, and procurement planning.

Court judgments in Marathi (structured)

Sparse | Governance, Census & Legal

Indian Kanoon has some Marathi judgments but no structured, NER-tagged legal dataset.

Global Benchmark

English has Caselaw Access Project (6.7M decisions). EU has ECHR-CASES.

Opportunity

Would enable legal search, case outcome prediction, and access to justice in Marathi.

RTI response data (structured)

Missing | Governance, Census & Legal

RTI requests and responses exist but are not aggregated or structured as a dataset.

Global Benchmark

UK has WhatDoTheyKnow with 800K+ structured FOI requests and responses.

Opportunity

Could power government transparency tools, automated RTI assistance, and civic engagement.

Real-time traffic flow data

Missing | Transport & Urban Infrastructure

No open, continuous traffic speed/flow dataset for Mumbai, Pune, or Nagpur.

Global Benchmark

UK has Highways England (15-min loop data). US has INRIX.

Opportunity

Required for traffic prediction, route optimization, and urban planning models.

Marathi literary corpus (annotated)

Missing | Culture, Media & Heritage

No POS-tagged or semantically annotated Marathi literary text corpus (novels, poetry, drama).

Global Benchmark

English has BNC, Project Gutenberg with annotations. Japanese has BCCWJ.

Opportunity

Needed for digital humanities research, literary analysis, and cultural preservation.

Tourism POI reviews in Marathi

Missing | Culture, Media & Heritage

No scraped or curated Marathi tourism review dataset exists for Maharashtra points of interest.

Global Benchmark

English has Yelp (6.9M reviews), TripAdvisor datasets.

Opportunity

Would power Marathi tourism recommendation engines and sentiment-based destination rankings.

Air quality station-level historical data (MH)

Sparse | Environment, Climate & Disaster

CPCB has real-time data but no cleaned, aggregated historical time-series for all Maharashtra stations.

Global Benchmark

US EPA has AQS with decades of cleaned, station-level data. EU has EEA AirBase.

Opportunity

Needed for pollution forecasting, health impact studies, and environmental policy.

Marathi LLM evaluation benchmark

Sparse | Benchmarks, Tools & Dialects

IndicGLUE and IndicXTREME exist but no comprehensive Marathi-specific benchmark like MMLU or C-Eval.

Global Benchmark

English has MMLU, HellaSwag, ARC. Chinese has C-Eval, CMMLU.

Opportunity

Critical for fairly evaluating and comparing Marathi language models.