Instruction-tuning datasets, RAG knowledge bases, and QA corpora for building Marathi AI agents.
8 datasets
Marathi — Expert-generated reading comprehension dataset for 11 Indic languages including Marathi, with context-question-answer triples for extractive QA tasks
Official documents for central government schemes (PM-KISAN, PMFBY, MGNREGA) with Maharashtra implementation guidelines, suitable for building scheme information retrieval agents
RAG Corpus — ~47,000 Government Resolutions in Marathi and English from 33 departments, structured by department and updated weekly, ideal for RAG pipelines over policy documents
Marathi translation of the Stanford Alpaca instruction-tuning dataset for fine-tuning instruction-following capabilities in Marathi language models
Knowledge Base for RAG — Full Marathi Wikipedia dump providing broad encyclopaedic coverage across diverse topics, suitable as a knowledge base for retrieval-augmented generation systems
Large-scale synthetic dataset for LLM post-training in 13 Indic languages including Marathi. Integrates translated reasoning data and synthesized open-domain generative content with multi-stage quality checks (language ID, word repetition filtering). Designed for LLM alignment and instruction-following.
Multilingual (incl. Marathi) — Human-generated, human-annotated assistant-style conversation corpus in 35 languages including Marathi conversation trees with quality ratings
Structured knowledge graph with multilingual labels (including Marathi) for Maharashtra entities — people, places, organizations, cultural artifacts, administrative divisions. Extractable via SPARQL for RAG systems, entity linking, and question answering.