Foundation datasets for Marathi language models, NER, sentiment analysis, machine translation, and text processing.
40 datasets
Marathi portion of AI4Bharat's BPCC (Bharat Parallel Corpus Collection), a large-scale English-Indic parallel corpus used to train the IndicTrans2 translation models.
Marathi portion of AI4Bharat's IndicCorp v1, a large monolingual corpus of Indian-language news and web text used to pretrain models such as IndicBERT.
Marathi subset of AI4Bharat's IndicCorp v2, a 20.9-billion-token multilingual corpus covering 24 Indic languages, collected from diverse web sources for language model pretraining.
Marathi tasks from AI4Bharat's IndicGLUE, a natural language understanding benchmark for Indian languages.
Marathi subset of AI4Bharat's IndicHeadlineGeneration task from the IndicNLG suite, pairing news articles with their headlines for generation models.
Marathi subset of AI4Bharat's IndicParaphrase task from the IndicNLG suite, providing sentence pairs for paraphrase generation.
Marathi subset of AI4Bharat's IndicSentenceSummarization task from the IndicNLG suite, pairing input sentences with compressed summaries.
Marathi subset of AI4Bharat's IndicSentiment benchmark of product reviews labeled for sentiment.
Marathi portion of AI4Bharat's Naamapadam with 455,200 training sentences annotated across 3 entity types (PER, LOC, ORG); Naamapadam is the largest publicly available Indic NER dataset.
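NER corpora like this one typically distribute tokens with CoNLL-style BIO tags. As a minimal sketch (the tag inventory with B-/I- prefixes and the example sentence are illustrative, not taken from the corpus), extracting entity spans from such annotations looks like:

```python
def bio_to_spans(tokens, tags):
    """Collect (entity_type, text) spans from parallel token/tag lists
    using the BIO convention: B- opens a span, I- continues it, O ends it."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = (tag[2:], [token])
            spans.append(current)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:
            # O tag, or a stray I- without a matching B-: close any open span
            current = None
    return [(etype, " ".join(toks)) for etype, toks in spans]
```

A stray I- tag with no preceding B- is simply dropped here; stricter pipelines may instead repair it to B-.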
AI4Bharat's Samanantar English-Marathi parallel corpus, the largest publicly available, with 3.32 million sentence pairs for machine translation.
Marathi portion of CC-100, the Common Crawl-derived monolingual corpus created to train XLM-R, providing large-scale web text for pretraining.
Large-scale multilingual web corpus with a substantial Marathi subset, combining mC4 and OSCAR data with extensive cleaning and deduplication. Provides high-quality pre-training data for Marathi language models with better filtering than raw web crawls.
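Cleaning pipelines like the one described rely heavily on deduplication. A minimal stdlib sketch of exact, whitespace-normalized dedup (real corpus pipelines also apply fuzzy methods such as MinHash, which this toy version omits):

```python
import hashlib

def dedup(docs):
    """Keep the first occurrence of each document, comparing
    whitespace-normalized MD5 fingerprints rather than raw strings."""
    seen, unique = set(), []
    for doc in docs:
        key = hashlib.md5(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```

Hashing normalized text keeps memory bounded on large corpora, since only fingerprints are retained, not the documents themselves.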
Marathi portion of Google's Dakshina dataset, which pairs native Devanagari text from Wikipedia with romanized (Latin-script) renderings for transliteration research.
Shared task datasets for Hate Speech and Offensive Content identification in Marathi from HASOC 2021 and 2022. Uses OLID taxonomy with ~4,970 annotated tweets. Complementary to L3Cube MahaHate using different annotation scheme and data sources, with published competitive baselines from multiple research teams.
First Marathi grammar error correction dataset, part of a multilingual GEC benchmark covering Hindi, Bengali, Marathi, and Tamil. Provides source-target corrected sentence pairs for training spelling and grammar checkers. Fills a critical gap since no GEC resources for Marathi previously existed.
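Source-target GEC pairs are often inspected as token-level edit operations. A sketch using the stdlib difflib (the sentence pair below is an invented English placeholder, not a corpus sample):

```python
import difflib

def token_edits(source: str, target: str):
    """Return (op, source_tokens, target_tokens) for each non-equal
    region between a source sentence and its corrected target."""
    src, tgt = source.split(), target.split()
    sm = difflib.SequenceMatcher(a=src, b=tgt)
    return [(op, src[i1:i2], tgt[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]
```

The opcodes ("replace", "insert", "delete") give a crude error typology; dedicated GEC evaluation tools compute alignments more carefully.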
Marathi news classification dataset containing ~12,000 news article headlines collected from a Marathi news website, labeled across 3 categories (state 62%, entertainment 27%, sports 10%). Part of the iNLTK (Indic Natural Language Toolkit) project.
Large-scale Marathi monolingual text corpus (L3Cube-MahaCorpus) with 24.8 million sentences and 289 million tokens, curated for language model pretraining.
Marathi emotion recognition dataset from L3Cube (MahaEmotions), with sentences annotated for fine-grained emotion labels.
L3Cube's Marathi hate speech detection dataset (MahaHate) of roughly 25,000 tweets, released with both binary and 4-class (hate, offensive, profane, none) labeling schemes.
Manually annotated Marathi named entity recognition dataset (L3Cube-MahaNER) with 25,000 sentences tagged across 8 entity classes.
Marathi news classification dataset from L3Cube (MahaNews) with news headlines and article text labeled by category.
Marathi paraphrase dataset from L3Cube (MahaParaphrase) with sentence pairs labeled as paraphrase or non-paraphrase.
Multi-domain Marathi sentiment analysis dataset (L3Cube-MahaSent-MD) with 60,000 samples across 4 domains, labeled for positive, negative, and neutral sentiment.
Social media-based Marathi Named Entity Recognition dataset with annotations for entities in informal, code-mixed social media text. Addresses the gap between formal NER (like Naamapadam) and real-world social media Marathi usage with noisy, informal text patterns.
Large-scale Marathi question answering dataset with 118,516 training, 11,873 validation, and 11,803 test QA samples, modeled after SQuAD.
Human-annotated Marathi Sentence Textual Similarity dataset with 16,860 sentence pairs scored 0-5. Uniformly distributed across score buckets to reduce label bias. Essential for training sentence embeddings, semantic search, and retrieval systems in Marathi.
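STS datasets with 0-5 gold scores are usually evaluated by correlating model similarity scores with the human ratings. A stdlib-only Pearson correlation sketch (the score lists below are invented, purely to show the shape of the computation):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

gold = [0.0, 2.5, 5.0, 1.0]    # invented human 0-5 similarity ratings
pred = [0.1, 0.45, 0.95, 0.3]  # invented model cosine similarities in [0, 1]
```

Because Pearson correlation is scale-invariant, the 0-5 gold scores need no rescaling to be compared against cosine similarities in [0, 1].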
Largest Marathi news summarization dataset containing 25,374 news articles from Lokmat and Loksatta with manually verified abstractive summaries. Covers politics, economics, culture, sports, and more. First large-scale abstractive summarization dataset for Marathi.
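Abstractive summarization corpora like this are typically scored with ROUGE. A rough unigram-overlap (ROUGE-1-style) F1 in plain Python, for intuition only; real evaluations use a proper ROUGE implementation with its tokenization and stemming rules:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate summary and a reference."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())  # clipped counts, as in ROUGE
    if overlap == 0:
        return 0.0
    prec = overlap / sum(c.values())
    rec = overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)
```

Clipping the counts (via the Counter intersection) prevents a candidate from gaming recall by repeating reference words.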
First comprehensive Marathi-English code-mixed NLP ecosystem. MeCorpus provides a 10-million-sentence corpus for unsupervised pretraining. Includes the supervised benchmarks MeSent (~12k tweets for sentiment), MeHate (~12k for hate speech), and MeLID (~12k for language identification). Covers mixed text in both Devanagari and Roman scripts.
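Code-mixed text in two scripts invites a first-pass script check per token. A toy heuristic using the Devanagari Unicode block (U+0900-U+097F); this is not the MeLID annotation scheme, just an illustration of the distinction the dataset covers:

```python
def script_of(token: str) -> str:
    """Classify a token as Devanagari, Roman, or other by counting
    characters in the Devanagari block vs ASCII letters."""
    deva = sum(1 for ch in token if "\u0900" <= ch <= "\u097f")
    latin = sum(1 for ch in token if ch.isascii() and ch.isalpha())
    if deva > latin:
        return "devanagari"
    if latin > deva:
        return "roman"
    return "other"
```

Real language identification for code-mixed text needs more than script, since romanized Marathi and English share the Latin alphabet; that is exactly the ambiguity MeLID-style labels resolve.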
Google's MADLAD-400: a manually audited, 3-trillion-token monolingual dataset built from Common Crawl and covering 419 languages. Document-level, with language identification and quality auditing; the manual audit yields higher Marathi data quality than fully automated pipelines.
Full text dump of Marathi Wikipedia, a standard source of clean encyclopedic Marathi text for pretraining and evaluation.
Marathi subset of the multilingual Colossal Clean Crawled Corpus (mC4) with approximately 7.8 million documents and 14 billion raw tokens, used for large-scale language model pretraining.
Marathi portion of Meta's FLORES-200, a benchmark of professionally translated sentences used as the standard dev/test set for machine translation across 200+ languages.
Collection of parallel corpora from the OPUS project containing English-Marathi aligned text from multiple sources including JW300, GNOME, KDE4, Ubuntu, Tanzil (Quran), Bible, WikiMatrix, and CCAligned. Provides diverse domain coverage for machine translation training.
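Mixed-source parallel collections such as these OPUS corpora are commonly cleaned with simple length-based filters before MT training, since extreme length ratios usually signal misalignment. A sketch with arbitrary thresholds (the pair below is an invented placeholder):

```python
def filter_pairs(pairs, max_ratio=3.0, max_len=200):
    """Drop empty, overlong, or length-ratio-skewed (src, tgt) pairs."""
    kept = []
    for src, tgt in pairs:
        ns, nt = len(src.split()), len(tgt.split())
        if ns == 0 or nt == 0 or ns > max_len or nt > max_len:
            continue
        if max(ns, nt) / min(ns, nt) > max_ratio:
            continue
        kept.append((src, tgt))
    return kept
```

Token-count ratios are a crude proxy for Marathi, whose agglutinative morphology yields fewer tokens than English; production filters tune the ratio per language pair.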
Marathi subset of the OSCAR 23.01 web-crawled corpus containing 729,578 documents and 252 million words (4.5 GB), derived from Common Crawl with language filtering.
English-Marathi parallel corpus extracted from the Prime Minister of India website (pmindia.gov.in) containing up to 56,000 aligned sentence pairs. Covers speeches, press releases, and official communications, providing a domain-specific parallel corpus for machine translation.
Large-scale Indic web corpus (Sangraha) by AI4Bharat with a curated Marathi subset from Common Crawl, featuring language identification, quality filtering, and deduplication. Includes verified, unverified, and synthetic (back-translated) splits for comprehensive Marathi language model training.
Universal Dependencies treebank for Marathi developed at UFAL, Charles University: a small manually annotated corpus with lemmas, POS tags, morphological features, and dependency relations.
10,903 article-summary pairs from BBC Marathi website with professionally written, highly abstractive summaries. Part of the 45-language XL-Sum benchmark. Gold-standard editorial quality summaries that crowdsourced datasets cannot match.