Agentic, Instruction & RAG

Instruction-tuning datasets, RAG knowledge bases, and QA corpora for building Marathi AI agents.

8 datasets

Marathi — Expert-generated reading comprehension dataset for 11 Indic languages including Marathi, with context-question-answer triples for extractive QA tasks

Build a Marathi government scheme FAQ bot using RAG
Text (Marathi)
AI4Bharat, IIT Madras
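
A minimal sketch of how a context-question-answer triple for extractive QA might be represented and sanity-checked. The field names (`context`, `question`, `answer`, `answer_start`) follow a common SQuAD-style layout and are assumptions, not the dataset's confirmed schema; the Marathi sentence is a made-up illustration.

```python
from dataclasses import dataclass


@dataclass
class QATriple:
    context: str
    question: str
    answer: str
    answer_start: int  # character offset of the answer inside the context (assumed field)


def locate_answer(context: str, answer: str) -> int:
    """Return the first character offset of the answer in the context, or -1 if absent."""
    return context.find(answer)


def is_valid_span(triple: QATriple) -> bool:
    """True when the answer text really sits at answer_start in the context."""
    if triple.answer_start < 0:
        return False
    end = triple.answer_start + len(triple.answer)
    return triple.context[triple.answer_start:end] == triple.answer


# Hypothetical example record, not taken from the dataset.
context = "पुणे हे महाराष्ट्रातील एक शहर आहे."
answer = "महाराष्ट्रातील"
example = QATriple(
    context=context,
    question="पुणे कोणत्या राज्यात आहे?",
    answer=answer,
    answer_start=locate_answer(context, answer),
)
```

Checking spans this way before training helps catch records whose offsets were shifted by text normalization, which is a common issue with Devanagari text.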

Official documents for central government schemes (PM-KISAN, PMFBY, MGNREGA) with Maharashtra implementation guidelines, suitable for building scheme information retrieval agents

Build a RAG-powered Marathi chatbot that answers citizen questions about government schemes using official documents.
Text (PDF, web)
Various Maharashtra/India Government Departments

RAG Corpus — ~47,000 Government Resolutions in Marathi and English from 33 departments, structured by department and updated weekly, ideal for RAG pipelines over policy documents

Build a searchable Maharashtra GR database with semantic search in both Marathi and English for government officers.
Text (Marathi + English)
General Administration Department, Government of Maharashtra
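
The retrieval step of such a pipeline could look roughly like the sketch below. It filters by department and ranks documents by character-trigram overlap, which is a purely lexical stand-in: a real system would likely swap in a multilingual embedding model for semantic and cross-language search. The `Resolution` record, its fields, and the sample texts are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Resolution:
    department: str
    language: str  # "mr" or "en" (assumed codes)
    text: str


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Lowercase, collapse whitespace, and return the set of character n-grams."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two n-gram sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def search(query: str, docs: list[Resolution],
           department: str | None = None, top_k: int = 3) -> list[Resolution]:
    """Rank documents by trigram overlap with the query, optionally within one department."""
    q = char_ngrams(query)
    pool = [d for d in docs if department is None or d.department == department]
    ranked = sorted(pool, key=lambda d: jaccard(q, char_ngrams(d.text)), reverse=True)
    return ranked[:top_k]


# Made-up sample documents for illustration only.
docs = [
    Resolution("Agriculture", "en", "Guidelines for crop insurance claims"),
    Resolution("Agriculture", "mr", "पीक विमा दाव्यांसाठी मार्गदर्शक सूचना"),
    Resolution("Education", "en", "Revised school calendar for the academic year"),
]
hits = search("crop insurance", docs, top_k=1)
```

Keeping the department as a filter rather than folding it into the text mirrors the corpus being structured by department, and narrows the candidate pool before any scoring happens.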

Marathi translation of the Stanford Alpaca instruction-tuning dataset for fine-tuning instruction-following capabilities in Marathi language models

Fine-tune a Marathi instruction-following LLM using this Alpaca-format dataset for building a Marathi AI assistant.
Text (Marathi)
Open-Source Community (Translated from Stanford Alpaca)
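
As a rough illustration of what "Alpaca-format" means for fine-tuning, the sketch below turns one record into a training string using the prompt template published with the original Stanford Alpaca release. That the Marathi translation keeps the `instruction`, `input`, and `output` field names, and whether the English template or a translated one should wrap the Marathi text, are assumptions; the sample record is invented.

```python
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_prompt(record: dict) -> str:
    """Pick the template based on whether the record carries a non-empty input."""
    if record.get("input", "").strip():
        return PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])


def build_training_text(record: dict) -> str:
    """Concatenate the prompt and the target response into one training string."""
    return build_prompt(record) + record["output"]


# Hypothetical record, not taken from the dataset.
record = {
    "instruction": "खालील वाक्याचे इंग्रजीत भाषांतर करा.",
    "input": "मला पुस्तके वाचायला आवडतात.",
    "output": "I like reading books.",
}
training_text = build_training_text(record)
```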

Knowledge Base for RAG — Full Marathi Wikipedia dump providing broad encyclopaedic coverage across diverse topics, suitable as a knowledge base for retrieval-augmented generation systems

Build a Marathi question-answering system trained on Wikipedia articles for general knowledge queries.
Text (Marathi)
Wikimedia Foundation

Large-scale synthetic dataset for LLM post-training in 13 Indic languages including Marathi. Integrates translated reasoning data and synthesized open-domain generative content with multi-stage quality checks (language ID, word repetition filtering). Designed for LLM alignment and instruction-following.

Fine-tune and align a Marathi instruction-following LLM using this synthetic post-training data.
Text
Microsoft Research

Multilingual (incl. Marathi) — Human-generated, human-annotated assistant-style conversation corpus in 35 languages including Marathi, organized as conversation trees with quality ratings

Fine-tune a Marathi conversational AI model using OpenAssistant multilingual data including Indic languages.
Text (multilingual)
LAION / Open Assistant Community

Structured knowledge graph with multilingual labels (including Marathi) for Maharashtra entities — people, places, organizations, cultural artifacts, administrative divisions. Extractable via SPARQL for RAG systems, entity linking, and question answering.

Build a Marathi knowledge graph from Wikidata entities related to Maharashtra for entity linking and QA systems.
Knowledge graph
Wikimedia Foundation
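
The SPARQL extraction could be sketched as below. The code only builds the query and request URL and makes no network call. `P131` is Wikidata's "located in the administrative territorial entity" property and `mr` is the Marathi language code; the QID `Q1191` for Maharashtra is an assumption worth verifying on wikidata.org, and the helper names are hypothetical.

```python
import re
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
MAHARASHTRA_QID = "Q1191"  # assumed QID for Maharashtra; verify before use


def build_query(region_qid: str, limit: int = 100) -> str:
    """Build a SPARQL query for items directly located in a region, with Marathi labels."""
    if not re.fullmatch(r"Q\d+", region_qid):
        raise ValueError(f"not a Wikidata QID: {region_qid!r}")
    return f"""
SELECT ?item ?labelMr ?labelEn WHERE {{
  ?item wdt:P131 wd:{region_qid} .
  ?item rdfs:label ?labelMr .
  FILTER(LANG(?labelMr) = "mr")
  OPTIONAL {{ ?item rdfs:label ?labelEn . FILTER(LANG(?labelEn) = "en") }}
}}
LIMIT {int(limit)}
""".strip()


def build_request_url(query: str) -> str:
    """Encode the query as a GET URL that asks the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


query = build_query(MAHARASHTRA_QID, limit=10)
url = build_request_url(query)
```

The query matches only direct `P131` links; using the property path `wdt:P131+` instead would also pull in entities nested deeper in the administrative hierarchy, at the cost of a heavier query.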