Agentic, Instruction & RAG

Instruction-tuning datasets, RAG knowledge bases, and QA corpora for building Marathi AI agents.

8 datasets

Marathi — Expert-generated reading comprehension dataset for 11 Indic languages including Marathi, with context-question-answer triples for extractive QA tasks

Build a Marathi government scheme FAQ bot using RAG

Text (Marathi)CC BY 4.0

AI4Bharat, IIT Madras

Government Scheme Documents for RAG

Official documents for central government schemes (PM-KISAN, PMFBY, MGNREGA) with Maharashtra implementation guidelines, suitable for building scheme information retrieval agents

Build a RAG-powered Marathi chatbot that answers citizen questions about government schemes using official documents.

Text (PDF, web)Open Government

Various Maharashtra/India Government Departments

Maharashtra Government Resolutions (mahGRs)

RAG Corpus — ~47,000 Government Resolutions in Marathi and English from 33 departments, structured by department and updated weekly, ideal for RAG pipelines over policy documents

Build a searchable Maharashtra GR database with semantic search in both Marathi and English for government officers.

Text (Marathi + English)CC BY 4.0

General Administration Department, Government of Maharashtra

Marathi Alpaca Instruction Dataset

Marathi translation of the Stanford Alpaca instruction-tuning dataset for fine-tuning instruction-following capabilities in Marathi language models

Fine-tune a Marathi instruction-following LLM using this Alpaca-format dataset for building a Marathi AI assistant.

Text (Marathi)Open Research

Open-Source Community (Translated from Stanford Alpaca)

Marathi Wikipedia

Knowledge Base for RAG — Full Marathi Wikipedia dump providing broad encyclopaedic coverage across diverse topics, suitable as a knowledge base for retrieval-augmented generation systems

Build a Marathi question-answering system trained on Wikipedia articles for general knowledge queries.

Text (Marathi)CC BY-SA 3.0

Wikimedia Foundation

Microsoft Updesh (Marathi LLM Post-Training)

Large-scale synthetic dataset for LLM post-training in 13 Indic languages including Marathi. Integrates translated reasoning data and synthesized open-domain generative content with multi-stage quality checks (language ID, word repetition filtering). Designed for LLM alignment and instruction-following.

Build a conversational Marathi ASR system for voice assistants that handles natural, unscripted speech patterns.

textMicrosoft Research License

Microsoft Research

OpenAssistant OASST1/OASST2

Multilingual (incl. Marathi) — Human-generated, human-annotated assistant-style conversation corpus in 35 languages including Marathi conversation trees with quality ratings

Fine-tune a Marathi conversational AI model using OpenAssistant multilingual data including Indic languages.

Text (multilingual)Apache 2.0

LAION / Open Assistant Community

Wikidata Maharashtra Entities (Knowledge Graph)

MH subset needed

Structured knowledge graph with multilingual labels (including Marathi) for Maharashtra entities — people, places, organizations, cultural artifacts, administrative divisions. Extractable via SPARQL for RAG systems, entity linking, and question answering.

Build a Marathi knowledge graph from Wikidata entities related to Maharashtra for entity linking and QA systems.

knowledge-graphCC0-1.0

Wikimedia Foundation