Language & NLP

Foundation datasets for Marathi language models, NER, sentiment analysis, machine translation, and text processing.

40 datasets

AI4Bharat BPCC (mr): Marathi dataset for language and NLP tasks.

Build a domain-specific English-to-Marathi translation model fine-tuned on government and legal text for Maharashtra administrative use.
parallel-text
AI4Bharat, IIT Madras

AI4Bharat IndicCorp v1 (mr): Marathi dataset for language and NLP tasks.

Pre-train Marathi word embeddings on IndicCorp v1 to create static and contextual embeddings for downstream NLP tasks.
text
AI4Bharat, IIT Madras

Marathi subset of a 20.9-billion-token multilingual corpus covering 24 Indic languages, collected from diverse web sources for language model pretraining.

Train a large Marathi language model on this corpus and fine-tune it for downstream NLP tasks like summarization and QA.
text
AI4Bharat

AI4Bharat IndicGLUE (mr): Marathi dataset for language and NLP tasks.

Benchmark Marathi language models on the IndicGLUE suite to establish performance baselines across NLU tasks.
text
AI4Bharat, IIT Madras

AI4Bharat IndicHeadlineGeneration (mr): Marathi dataset for language and NLP tasks.

Build an automatic Marathi headline generator for news aggregation platforms to create concise, accurate headlines from article text.
text
AI4Bharat, IIT Madras

AI4Bharat IndicParaphrase (mr): Marathi dataset for language and NLP tasks.

Build a Marathi duplicate content detector for news agencies to identify when multiple outlets cover the same story differently.
text
AI4Bharat, IIT Madras

AI4Bharat IndicSentenceSummarization (mr): Marathi dataset for language and NLP tasks.

Build a Marathi news digest app that summarizes long articles into single sentences for quick mobile reading.
text
AI4Bharat, IIT Madras

AI4Bharat IndicSentiment (mr): Marathi dataset for language and NLP tasks.

Build a Marathi product review sentiment analyzer for regional e-commerce platforms to understand customer satisfaction.
text
AI4Bharat, IIT Madras

Large-scale Marathi NER dataset with 455,200 training sentences annotated across 3 entity types (PER, LOC, ORG), part of the largest publicly available Indic NER dataset.

Build a Marathi document entity extractor that automatically identifies people, places, and organizations in government circulars and news articles.
text
AI4Bharat
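
For the entity extractor use case above, token-level tags usually have to be grouped into entity spans. The following is a minimal sketch that assumes the labels follow the common BIO scheme (B-PER, I-PER, O, and so on); the catalog entry does not state the tagging scheme, and the example sentence is hand-written for illustration rather than taken from the dataset.

```python
from typing import List, Optional, Tuple


def extract_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs.

    Assumption: tags look like "B-PER", "I-PER", "O". An I- tag that does not
    continue an open entity of the same type is treated like "O".
    """
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that is still open.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the currently open entity.
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- tag: close the open entity, if any.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Hand-written illustrative sentence, not a dataset sample.
example_tokens = ["शरद", "पवार", "पुणे", "येथे", "आले"]
example_tags = ["B-PER", "I-PER", "B-LOC", "O", "O"]
example_entities = extract_entities(example_tokens, example_tags)
```

The same function works with any tagger that emits one tag per token, so it can sit between a fine-tuned NER model and the downstream document index.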

Largest publicly available English-Marathi parallel corpus with 3.32 million sentence pairs for machine translation.

Build an English-to-Marathi translation API for government schemes.
parallel-text
AI4Bharat

CC-100 (mr): Marathi dataset for language and NLP tasks.

Use CC-100 Marathi data to train sentence-level embeddings for building a Marathi semantic search system.
text
Meta AI / Johns Hopkins University

Large-scale multilingual web corpus with a substantial Marathi subset, combining mC4 and OSCAR data with extensive cleaning and deduplication. Provides high-quality pre-training data for Marathi language models with better filtering than raw web crawls.

Build a high-quality Marathi pre-training corpus by combining CulturaX with quality filtering for LLM training.
text
UoNLP (University of Oregon NLP)

Google Dakshina (mr): Marathi dataset for language and NLP tasks.

Build a Marathi transliteration engine that converts Roman-script Marathi (commonly typed on phones) to proper Devanagari.
text, transliteration
Google Research

Shared-task datasets for Hate Speech and Offensive Content identification in Marathi from HASOC 2021 and 2022. Uses the OLID taxonomy with ~4,970 annotated tweets. Complements L3Cube MahaHate with a different annotation scheme and different data sources, and comes with published competitive baselines from multiple research teams.

Build an automated content moderation system for Marathi social media platforms to flag offensive language in real-time.
text
HASOC / FIRE

First Marathi grammar error correction dataset, part of a multilingual GEC benchmark covering Hindi, Bengali, Marathi, and Tamil. Provides source-target corrected sentence pairs for training spelling and grammar checkers. Fills a critical gap since no GEC resources for Marathi previously existed.

Build a Marathi grammar correction API for word processors and messaging apps used in Maharashtra.
text
EMNLP 2025

Marathi news classification dataset containing ~12,000 news article headlines collected from a Marathi news website, labeled across 3 categories (state 62%, entertainment 27%, sports 10%). Part of the iNLTK (Indic Natural Language Toolkit) project.

Build a Marathi news topic classifier that automatically categorizes incoming news articles for a Marathi news aggregator app.
text
iNLTK / DISISBIG
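
The label distribution quoted above (state 62%, entertainment 27%, sports 10%) is imbalanced, so a classifier trained on it may benefit from class weighting. The following is a minimal sketch of inverse-frequency weights computed from those percentages; the weighting formula is one common choice and is an assumption here, not something prescribed by the dataset authors.

```python
from typing import Dict

# Class proportions as quoted in the catalog entry (they sum to 0.99 after rounding).
CLASS_PROPORTIONS: Dict[str, float] = {
    "state": 0.62,
    "entertainment": 0.27,
    "sports": 0.10,
}


def inverse_frequency_weights(proportions: Dict[str, float]) -> Dict[str, float]:
    """Return weight = total / (n_classes * proportion) for each class.

    Rare classes get weights above 1 and frequent classes get weights below 1.
    """
    total = sum(proportions.values())
    n_classes = len(proportions)
    return {label: total / (n_classes * p) for label, p in proportions.items()}


weights = inverse_frequency_weights(CLASS_PROPORTIONS)
for label, weight in weights.items():
    print(f"{label}: {weight:.2f}")
```

The resulting dictionary can be passed to any training loop that accepts per-class loss weights.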

Large-scale Marathi monolingual text corpus with 24.8 million sentences and 289 million tokens, curated for language model pretraining.

Fine-tune a Marathi language model for your domain.
text
L3Cube, Pune

L3Cube-MahaEmotions: Marathi dataset for language and NLP tasks.

Build an emotion-aware Marathi chatbot for mental health support that adapts its responses based on detected user emotions.
text
L3Cube, Pune

L3Cube-MahaHate: Marathi dataset for language and NLP tasks.

Build a Marathi content moderation API that flags hate speech in real-time for community forum platforms.
text
L3Cube, Pune

Manually annotated Marathi named entity recognition dataset with 25,000 sentences tagged across 8 entity classes.

Build a named entity recognition service to extract person, organization, and location names from Marathi government documents.
text
L3Cube, Pune

L3Cube-MahaNews: Marathi dataset for language and NLP tasks.

Build a Marathi news aggregation and auto-tagging service that categorizes articles by topic for personalized feeds.
text
L3Cube, Pune

L3Cube-MahaParaphrase: Marathi dataset for language and NLP tasks.

Build a semantic search engine for Marathi educational content that finds similar questions and answers across textbooks.
text
L3Cube, Pune

Multi-domain Marathi sentiment analysis dataset with 60,000 samples across 4 domains, labeled for positive, negative, and neutral sentiment.

Build a Marathi product review analyzer for e-commerce.
text
L3Cube, Pune

L3Cube-MahaSent-MD: Marathi dataset for language and NLP tasks.

Build a Marathi product review sentiment analyzer for e-commerce platforms selling in Maharashtra.
text
L3Cube, Pune

Marathi named entity recognition dataset built from informal, code-mixed social media text. Addresses the gap between formal NER resources (like Naamapadam) and real-world social media Marathi, with its noisy, informal text patterns.

Build a social media entity tracker for Marathi Twitter/X that identifies mentions of people, organizations, and locations in real-time.
text
L3Cube, Pune

Large-scale Marathi question answering dataset with 118,516 training, 11,873 validation, and 11,803 test QA samples, modeled after SQuAD.

Build a Marathi question-answering system for agricultural advisory queries from farmers.
text
L3Cube, Pune

Human-annotated Marathi Sentence Textual Similarity dataset with 16,860 sentence pairs scored 0-5. Uniformly distributed across score buckets to reduce label bias. Essential for training sentence embeddings, semantic search, and retrieval systems in Marathi.

Build a Marathi semantic search engine that finds similar documents using sentence embeddings trained on this similarity data.
text
L3Cube, Pune
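
Similarity scores on a 0-5 scale are often rescaled to 0-1 before being used as regression targets for sentence embeddings, and the claimed uniform distribution can be checked with a simple histogram. The following is a minimal sketch; the unit-width bucket boundaries are an assumption (the entry does not define the buckets), and the sample scores are made up for illustration.

```python
from collections import Counter
from typing import Iterable

MAX_SCORE = 5.0  # upper end of the 0-5 scale quoted in the entry


def normalize_score(score: float) -> float:
    """Rescale a 0-5 similarity score to the 0-1 range."""
    if not 0.0 <= score <= MAX_SCORE:
        raise ValueError(f"score {score} is outside the 0-{MAX_SCORE} range")
    return score / MAX_SCORE


def bucket_counts(scores: Iterable[float]) -> Counter:
    """Count scores per bucket.

    Assumption: unit-width buckets [0,1), [1,2), [2,3), [3,4), [4,5],
    with a score of exactly 5 falling into the last bucket.
    """
    return Counter(min(int(score), 4) for score in scores)


# Made-up scores for illustration, not dataset values.
sample_scores = [0.0, 1.5, 2.5, 3.2, 4.8, 5.0]
normalized = [normalize_score(s) for s in sample_scores]
histogram = bucket_counts(sample_scores)
```

Running `bucket_counts` over the real score column would show whether the buckets are roughly equal in size.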

Largest Marathi news summarization dataset containing 25,374 news articles from Lokmat and Loksatta with manually verified abstractive summaries. Covers politics, economics, culture, sports, and more. First large-scale abstractive summarization dataset for Marathi.

Build a Marathi news summarization API that generates concise summaries from full-length news articles for mobile news apps.
text
L3Cube, Pune

First comprehensive Marathi-English code-mixed NLP ecosystem. MeCorpus provides a 10M-sentence unsupervised pre-training corpus. Includes the supervised benchmarks MeSent (~12k tweets for sentiment), MeHate (~12k for hate speech), and MeLID (~12k for language identification). Covers mixed text in both Devanagari and Roman script.

Build a code-mixed Marathi-English sentiment analyzer for social media where users frequently mix both languages.
text
L3Cube, Pune

Google's manually audited 3 trillion token monolingual dataset from CommonCrawl covering 419 languages. Document-level with language identification and quality auditing. Manual auditing ensures higher Marathi data quality than fully automated pipelines.

Use MADLAD-400 Marathi data for large-scale language model pre-training with quality-filtered web text.
text
Google / Allen AI

Marathi Wikipedia Dump: dataset for language and NLP tasks.

Use the raw Marathi Wikipedia dump to build a structured knowledge graph of Maharashtra-related entities and relationships.
text
Wikimedia Foundation

Marathi subset of the multilingual Colossal Clean Crawled Corpus (mC4) with approximately 7.8 million documents and 14 billion raw tokens, used for large-scale language model pretraining.

Pre-train a Marathi-specific language model on the mC4 corpus to create a foundation model for downstream NLP tasks like classification and QA.
text
Google / Allen AI

Meta FLORES-200 (mr): Marathi dataset for language and NLP tasks.

Evaluate and benchmark Marathi machine translation models using FLORES-200 as a standardized test set.
parallel-text
Meta AI

Collection of parallel corpora from the OPUS project containing English-Marathi aligned text from multiple sources including JW300, GNOME, KDE4, Ubuntu, Tanzil (Quran), Bible, WikiMatrix, and CCAligned. Provides diverse domain coverage for machine translation training.

Build a multi-domain English-Marathi translation model by combining parallel corpora from different OPUS sources.
parallel-text
OPUS / NLPL

OSCAR 23.01 (mr): Marathi dataset for language and NLP tasks.

Build a quality-filtered Marathi pre-training corpus from OSCAR with document-level quality scoring.
text
OSCAR Project / Inria

Marathi subset of the OSCAR 23.01 web-crawled corpus containing 729,578 documents and 252 million words (4.5 GB), derived from Common Crawl with language filtering.

Build a Marathi text corpus quality pipeline that deduplicates and filters OSCAR data to create a clean pre-training dataset.
text
OSCAR Project (Inria / DFKI)
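
The deduplicate-and-filter pipeline mentioned in the use case above can be sketched as follows. This is a minimal illustration, not the OSCAR project's own tooling: the thresholds (`min_chars`, `min_ratio`) and the function names are hypothetical, and only exact duplicates after whitespace normalization are removed. The script check relies on the Devanagari Unicode block (U+0900 to U+097F).

```python
import hashlib
import re
from typing import Iterable, Iterator


def devanagari_ratio(text: str) -> float:
    """Share of non-whitespace characters in the Devanagari block (U+0900-U+097F)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    devanagari = sum(1 for ch in chars if "\u0900" <= ch <= "\u097F")
    return devanagari / len(chars)


def clean_corpus(
    docs: Iterable[str],
    min_chars: int = 20,     # hypothetical threshold: drop very short documents
    min_ratio: float = 0.5,  # hypothetical threshold: drop mostly non-Devanagari text
) -> Iterator[str]:
    """Yield whitespace-normalized documents that pass the filters, without exact duplicates."""
    seen_hashes = set()
    for doc in docs:
        normalized = re.sub(r"\s+", " ", doc).strip()
        if len(normalized) < min_chars:
            continue
        if devanagari_ratio(normalized) < min_ratio:
            continue
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        seen_hashes.add(digest)
        yield normalized


# Hand-written illustrative documents, not OSCAR samples.
raw_docs = [
    "महाराष्ट्र हे भारतातील एक राज्य आहे.",
    "महाराष्ट्र  हे भारतातील   एक राज्य आहे.",  # duplicate with extra spaces
    "This is an English sentence only.",          # wrong script
    "नमस्कार",                                    # too short
]
cleaned = list(clean_corpus(raw_docs))
```

A production pipeline would likely add near-duplicate detection on top of this, since exact hashing misses documents that differ by a few characters.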

English-Marathi parallel corpus extracted from the Prime Minister of India website (pmindia.gov.in) containing up to 56,000 aligned sentence pairs. Covers speeches, press releases, and official communications, providing a domain-specific parallel corpus for machine translation.

Build a government domain-specific English-Marathi translator trained on official PM India communications.
parallel-text
University of Edinburgh

Large-scale Indic web corpus by AI4Bharat with a curated Marathi subset from Common Crawl, featuring language identification, quality filtering, and deduplication. Includes verified, unverified, and synthetic (back-translated) splits for comprehensive Marathi language model training.

Use Sangraha's curated Marathi web corpus for training a high-quality Marathi language model with better data quality than raw crawls.
text
AI4Bharat

UD Marathi-UFAL Treebank: dataset for language and NLP tasks.

Build a Marathi grammar checker using dependency parsing from the Universal Dependencies treebank.
text (CoNLL-U)
UFAL, Charles University
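
Since this treebank is distributed as CoNLL-U, a dependency-based tool first needs to read the ten tab-separated columns of each token line. The following is a minimal sketch of a sentence parser written against the CoNLL-U column layout; the two-token example is hand-written for illustration and is not taken from the treebank.

```python
from typing import Dict, List, Union

# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]

Token = Dict[str, Union[str, int, None]]


def parse_conllu_sentence(block: str) -> List[Token]:
    """Parse one CoNLL-U sentence block into a list of token dictionaries.

    Comment lines (starting with "#") are skipped, as are multiword token
    ranges (IDs like "1-2") and empty nodes (IDs like "1.1").
    """
    tokens: List[Token] = []
    for line in block.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        if "-" in columns[0] or "." in columns[0]:
            continue
        token: Token = dict(zip(FIELDS, columns))
        token["id"] = int(columns[0])
        # HEAD is an integer for regular words; keep None if it is unspecified ("_").
        token["head"] = int(columns[6]) if columns[6].isdigit() else None
        tokens.append(token)
    return tokens


# Hand-written illustrative sentence ("he goes"), not a treebank sample.
example_block = "\n".join([
    "# text = तो जातो",
    "\t".join(["1", "तो", "तो", "PRON", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "जातो", "जा", "VERB", "_", "_", "0", "root", "_", "_"]),
])
example_tokens = parse_conllu_sentence(example_block)
root_forms = [t["form"] for t in example_tokens if t["head"] == 0]
```

A token whose `head` is 0 is the sentence root, which is the usual starting point for rule-based checks over the dependency tree.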

10,903 article-summary pairs from the BBC Marathi website with professionally written, highly abstractive summaries. Part of the 45-language XL-Sum benchmark. Gold-standard editorial quality summaries that crowdsourced datasets cannot match.

Build a Marathi news summarization model trained on professional BBC Marathi summaries for high-quality output.
text
BUET CSE NLP Group