Data Gaps Analysis

This analysis compares the Marathi and Maharashtra data ecosystem against well-resourced languages and regions. The gaps listed here represent opportunities: datasets that could be created, gathered, or digitized to unlock new AI capabilities for Marathi speakers.

Missing: 15 | Sparse: 12 | Total Gaps: 27
Categories: Language & NLP | Speech & Audio | Vision, OCR & Multimodal | Geospatial & GIS | Agriculture & Rural | Health & Nutrition | Education & Skills | Economy, Labour & Finance | Environment, Climate & Disaster | Transport & Urban Infrastructure | Governance, Census & Legal | Culture, Media & Heritage | Real-Time Streams & APIs | Agentic, Instruction & RAG | Benchmarks, Tools & Dialects

Conversational / Dialogue datasets

No dedicated multi-turn Marathi dialogue or conversational datasets exist.

Global Benchmark

English has DailyDialog, PersonaChat, MultiWOZ; Hindi has limited options.

Opportunity

Critical for building Marathi chatbots, virtual assistants, and customer service AI.

Commonsense reasoning

No Marathi commonsense knowledge graphs or reasoning datasets.

Global Benchmark

English has ConceptNet, ATOMIC, WinoGrande, HellaSwag.

Opportunity

Needed for Marathi LLMs to understand cultural context and implicit knowledge.

Hate speech / toxicity (large-scale)

L3Cube-MahaHate exists but is small. No large-scale, multi-platform toxicity dataset.

Global Benchmark

English has Jigsaw (1.8M comments), HateXplain with rationale annotations.

Opportunity

Essential for content moderation on Marathi social media platforms.

Summarization corpus

IndicSentenceSummarization covers Marathi, but no long-document summarization dataset exists.

Global Benchmark

English has CNN/DailyMail, XSum; Chinese has LCSTS.

Opportunity

Required for Marathi news summarization, document understanding, and report generation.

Emotional speech

No Marathi speech emotion recognition (SER) dataset with emotion labels.

Global Benchmark

English has IEMOCAP, RAVDESS, CREMA-D; Hindi has limited options.

Opportunity

Needed for call center sentiment analysis, mental health monitoring, and accessibility.

Noisy / real-world speech

Existing Marathi ASR datasets are mostly read speech. No noisy, spontaneous, or code-mixed speech data.

Global Benchmark

English has CHiME, VoxCeleb, AMI Meeting Corpus.

Opportunity

Real-world Marathi speech recognition requires training data recorded in market, street, and farm environments.

Text-to-Speech (multi-speaker)

IndicTTS has 1 Marathi speaker. No multi-speaker, multi-style TTS corpus.

Global Benchmark

English has LibriTTS (2,456 speakers), VCTK (110 speakers).

Opportunity

Needed for natural-sounding Marathi voice assistants and audiobook generation.

Devanagari scene text in the wild

BSTD and IIIT-HW exist but are small. No large-scale Marathi street sign / billboard dataset.

Global Benchmark

Chinese has CTW (about 32K street-view images with roughly 1M annotated characters) and RCTW-17 (about 12K images).

Opportunity

Required for Marathi navigation apps, automated sign reading, and smart city infrastructure.

Medical imaging with Marathi reports

No paired medical image + Marathi radiology report dataset exists.

Global Benchmark

English has MIMIC-CXR (377K images), CheXpert.

Opportunity

Would enable AI-assisted radiology reporting in Marathi for rural hospital networks.

Document layout analysis (Marathi)

No dataset for parsing Marathi document layouts (government forms, certificates, gazettes).

Global Benchmark

English has PubLayNet, DocBank. Chinese has CDLA.

Opportunity

Key for digitizing Maharashtra government records and automating form processing.

High-res building footprints for Maharashtra

OpenStreetMap has partial coverage. No comprehensive, official building footprint dataset.

Global Benchmark

The US has Microsoft Building Footprints (125M); Google Open Buildings covers Africa.

Opportunity

Needed for urban planning, disaster response, and property tax assessment in Maharashtra (MH).

Agricultural land parcel boundaries

No digitized, geo-referenced field boundary dataset for Maharashtra farmland.

Global Benchmark

EU has LPIS with field-level boundaries. US has CLU (Common Land Unit).

Opportunity

Would transform precision agriculture, crop insurance, and land record modernization.

Electronic health records (Marathi)

No anonymized Marathi EHR or clinical notes dataset.

Global Benchmark

English has MIMIC-III/IV. Some de-identified Chinese EHR datasets exist.

Opportunity

Essential for clinical NLP, drug interaction detection, and Marathi health chatbots.

Nutrition / dietary survey (Maharashtra-specific)

NFHS has state-level aggregates but no individual-level dietary intake data for MH.

Global Benchmark

US has NHANES with individual-level dietary recall. UK has NDNS.

Opportunity

Needed for personalized nutrition apps, malnutrition early warning, and school meal planning.

Marathi educational question banks

No structured dataset of Marathi exam questions mapped to curriculum topics.

Global Benchmark

English has ARC, SciQ, MMLU. Chinese has GAOKAO benchmark.

Opportunity

Would power adaptive learning platforms and automated assessment in Marathi medium schools.

Student learning outcome data

ASER and NAS provide samples but no continuous, longitudinal learning data.

Global Benchmark

Many OECD nations have PISA longitudinal follow-ups and national learning management data.

Opportunity

Critical for evidence-based education policy and personalized learning interventions.

Marathi financial news corpus

No labeled Marathi financial sentiment or event dataset.

Global Benchmark

English has Financial PhraseBank, FiQA. Chinese has FinNL.

Opportunity

Needed for Marathi stock market analysis tools and financial news summarization.

MSME / startup registry with structured data

Udyam has registration data but no structured dataset with financials, sectors, or outcomes.

Global Benchmark

US has SEC filings + Crunchbase. UK has Companies House full data.

Opportunity

Would enable MSME lending models, market analysis, and entrepreneurship research.

Crop disease images (Maharashtra varieties)

No labeled image dataset of diseases on Maharashtra-specific crops (sugarcane, jowar, bajra, grapes).

Global Benchmark

PlantVillage has 50K images, but mainly of Western crops. China has custom rice/wheat datasets.

Opportunity

Would enable phone-based crop disease detection apps for MH farmers.

Farm-level input/output economics

ICRISAT VDSA covers some villages. No broad, recent farm-level cost-of-cultivation microdata for MH.

Global Benchmark

US has ARMS (Agricultural Resource Management Survey) at field level.

Opportunity

Essential for input subsidy optimization, credit risk modeling, and procurement planning.

Court judgments in Marathi (structured)

Indian Kanoon has some Marathi judgments but no structured, NER-tagged legal dataset.

Global Benchmark

English has the Caselaw Access Project (6.7M decisions). Europe has ECHR-CASES.

Opportunity

Would enable legal search, case outcome prediction, and access to justice in Marathi.

RTI response data (structured)

RTI requests and responses exist but are not aggregated or structured as a dataset.

Global Benchmark

UK has WhatDoTheyKnow with 800K+ structured FOI requests and responses.

Opportunity

Could power government transparency tools, automated RTI assistance, and civic engagement.

Real-time traffic flow data

No open, continuous traffic speed/flow dataset for Mumbai, Pune, or Nagpur.

Global Benchmark

UK has Highways England (15-min loop data). US has INRIX.

Opportunity

Required for traffic prediction, route optimization, and urban planning models.

Marathi literary corpus (annotated)

No POS-tagged or semantically annotated Marathi literary text corpus (novels, poetry, drama).

Global Benchmark

English has BNC, Project Gutenberg with annotations. Japanese has BCCWJ.

Opportunity

Needed for digital humanities research, literary analysis, and cultural preservation.

Tourism POI reviews in Marathi

No scraped/curated Marathi tourism review dataset for MH points of interest.

Global Benchmark

English has Yelp (6.9M reviews), TripAdvisor datasets.

Opportunity

Would power Marathi tourism recommendation engines and sentiment-based destination rankings.

Air quality station-level historical data (MH)

CPCB has real-time data but no cleaned, aggregated historical time-series for all MH stations.

Global Benchmark

US EPA has AQS with decades of cleaned, station-level data. EU has EEA AirBase.

Opportunity

Needed for pollution forecasting, health impact studies, and environmental policy.

Marathi LLM evaluation benchmark

IndicGLUE and IndicXTREME exist, but there is no comprehensive Marathi-specific benchmark comparable to MMLU or C-Eval.

Global Benchmark

English has MMLU, HellaSwag, ARC. Chinese has C-Eval, CMMLU.

Opportunity

Critical for fairly evaluating and comparing Marathi language models.
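
Every entry above follows the same four-part layout: a title, a statement of the current state, a "Global Benchmark" line, and an "Opportunity" line. That regularity means the list can be loaded into structured records for filtering or counting. The following is a minimal sketch in Python, assuming the plain-text layout shown on this page; the names `GapCard` and `parse_gap_cards` and the field names are hypothetical and are not part of any published tool.

```python
from dataclasses import dataclass


@dataclass
class GapCard:
    title: str
    current_state: str
    global_benchmark: str
    opportunity: str


def parse_gap_cards(text: str) -> list[GapCard]:
    """Parse cards laid out as six non-blank lines each:
    title, current state, 'Global Benchmark', benchmark text,
    'Opportunity', opportunity text. Blank lines are ignored."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 6 != 0:
        raise ValueError("text does not divide into complete six-line cards")
    cards = []
    for i in range(0, len(lines), 6):
        title, state, bench_label, bench, opp_label, opp = lines[i:i + 6]
        # The two fixed labels act as a sanity check on the layout.
        if bench_label != "Global Benchmark" or opp_label != "Opportunity":
            raise ValueError(f"unexpected card layout near: {title!r}")
        cards.append(GapCard(title, state, bench, opp))
    return cards


# Two entries copied from the list above, used as sample input.
sample = """\
Commonsense reasoning

No Marathi commonsense knowledge graphs or reasoning datasets.

Global Benchmark

English has ConceptNet, ATOMIC, WinoGrande, HellaSwag.

Opportunity

Needed for Marathi LLMs to understand cultural context and implicit knowledge.

Text-to-Speech (multi-speaker)

IndicTTS has 1 Marathi speaker. No multi-speaker, multi-style TTS corpus.

Global Benchmark

English has LibriTTS (2,456 speakers), VCTK (110 speakers).

Opportunity

Needed for natural-sounding Marathi voice assistants and audiobook generation.
"""

cards = parse_gap_cards(sample)
for card in cards:
    print(f"{card.title}: {card.opportunity}")
```

The parser raises an error rather than guessing when a card is incomplete, so a malformed entry is caught before it reaches any downstream count or filter.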