Foundation datasets for Marathi language models, NER, sentiment analysis, machine translation, and text processing.
40 datasets
Marathi portion of AI4Bharat's BPCC (Bharat Parallel Corpus Collection), a large-scale English-Indic parallel corpus used to train the IndicTrans2 translation models.
Marathi portion of AI4Bharat's IndicCorp v1, a large monolingual corpus of Indian-language news and web text used to pretrain models such as IndicBERT.
Marathi subset of AI4Bharat's IndicCorp v2, a 20.9-billion-token multilingual corpus covering 24 Indic languages, collected from diverse web sources for language model pretraining.
Marathi tasks from AI4Bharat's IndicGLUE, a natural language understanding benchmark for Indian languages.
Marathi subset of AI4Bharat's IndicHeadlineGeneration task from the IndicNLG suite, pairing news articles with their headlines for generation models.
Marathi subset of AI4Bharat's IndicParaphrase task from the IndicNLG suite, providing sentence pairs for paraphrase generation.
Marathi subset of AI4Bharat's IndicSentenceSummarization task from the IndicNLG suite, pairing input sentences with compressed summaries.
Marathi subset of AI4Bharat's IndicSentiment benchmark of product reviews labeled for sentiment.
Marathi portion of AI4Bharat's Naamapadam with 455,200 training sentences annotated across 3 entity types (PER, LOC, ORG); Naamapadam is the largest publicly available Indic NER dataset.
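NER corpora like this one typically distribute tokens with CoNLL-style BIO tags. As a minimal sketch (the tag inventory with B-/I- prefixes and the example sentence are illustrative, not taken from the corpus), extracting entity spans from such annotations looks like:

```python
def bio_to_spans(tokens, tags):
    """Collect (entity_type, text) spans from parallel token/tag lists
    using the BIO convention: B- opens a span, I- continues it, O ends it."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = (tag[2:], [token])
            spans.append(current)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:
            # O tag, or a stray I- without a matching B-: close any open span
            current = None
    return [(etype, " ".join(toks)) for etype, toks in spans]
```

A stray I- tag with no preceding B- is simply dropped here; stricter pipelines may instead repair it to B-.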
AI4Bharat's Samanantar English-Marathi parallel corpus, the largest publicly available, with 3.32 million sentence pairs for machine translation.
Marathi portion of CC-100, the Common Crawl-derived monolingual corpus created to train XLM-R, providing large-scale web text for pretraining.
Large-scale multilingual web corpus with a substantial Marathi subset, combining mC4 and OSCAR data with extensive cleaning and deduplication. Provides high-quality pre-training data for Marathi language models with better filtering than raw web crawls.
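Cleaning pipelines like the one described rely heavily on deduplication. A minimal stdlib sketch of exact, whitespace-normalized dedup (real corpus pipelines also apply fuzzy methods such as MinHash, which this toy version omits):

```python
import hashlib

def dedup(docs):
    """Keep the first occurrence of each document, comparing
    whitespace-normalized MD5 fingerprints rather than raw strings."""
    seen, unique = set(), []
    for doc in docs:
        key = hashlib.md5(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```

Hashing normalized text keeps memory bounded on large corpora, since only fingerprints are retained, not the documents themselves.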
Marathi portion of Google's Dakshina dataset, which pairs native Devanagari text from Wikipedia with romanized (Latin-script) renderings for transliteration research.
Shared task datasets for Hate Speech and Offensive Content identification in Marathi from HASOC 2021 and 2022. Uses OLID taxonomy with ~4,970 annotated tweets. Complementary to L3Cube MahaHate using different annotation scheme and data sources, with published competitive baselines from multiple research teams.
First Marathi grammar error correction dataset, part of a multilingual GEC benchmark covering Hindi, Bengali, Marathi, and Tamil. Provides source-target corrected sentence pairs for training spelling and grammar checkers. Fills a critical gap since no GEC resources for Marathi previously existed.
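Source-target GEC pairs are often inspected as token-level edit operations. A sketch using the stdlib difflib (the sentence pair below is an invented English placeholder, not a corpus sample):

```python
import difflib

def token_edits(source: str, target: str):
    """Return (op, source_tokens, target_tokens) for each non-equal
    region between a source sentence and its corrected target."""
    src, tgt = source.split(), target.split()
    sm = difflib.SequenceMatcher(a=src, b=tgt)
    return [(op, src[i1:i2], tgt[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]
```

The opcodes ("replace", "insert", "delete") give a crude error typology; dedicated GEC evaluation tools compute alignments more carefully.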
Marathi news classification dataset containing ~12,000 news article headlines collected from a Marathi news website, labeled across 3 categories (state 62%, entertainment 27%, sports 10%). Part of the iNLTK (Indic Natural Language Toolkit) project.
Large-scale Marathi monolingual text corpus (L3Cube-MahaCorpus) with 24.8 million sentences and 289 million tokens, curated for language model pretraining.
Marathi emotion recognition dataset from L3Cube (MahaEmotions), with sentences annotated for fine-grained emotion labels.
L3Cube's Marathi hate speech detection dataset (MahaHate) of roughly 25,000 tweets, released with both binary and 4-class (hate, offensive, profane, none) labeling schemes.
Manually annotated Marathi named entity recognition dataset (L3Cube-MahaNER) with 25,000 sentences tagged across 8 entity classes.
Marathi news classification dataset from L3Cube (MahaNews) with news headlines and article text labeled by category.
Marathi paraphrase dataset from L3Cube (MahaParaphrase) with sentence pairs labeled as paraphrase or non-paraphrase.
Multi-domain Marathi sentiment analysis dataset (L3Cube-MahaSent-MD) with 60,000 samples across 4 domains, labeled for positive, negative, and neutral sentiment.
Social media-based Marathi Named Entity Recognition dataset with annotations for entities in informal, code-mixed social media text. Addresses the gap between formal NER (like Naamapadam) and real-world social media Marathi usage with noisy, informal text patterns.
Large-scale Marathi question answering dataset with 118,516 training, 11,873 validation, and 11,803 test QA samples, modeled after SQuAD.
Human-annotated Marathi Sentence Textual Similarity dataset with 16,860 sentence pairs scored 0-5. Uniformly distributed across score buckets to reduce label bias. Essential for training sentence embeddings, semantic search, and retrieval systems in Marathi.
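STS datasets with 0-5 gold scores are usually evaluated by correlating model similarity scores with the human ratings. A stdlib-only Pearson correlation sketch (the score lists below are invented, purely to show the shape of the computation):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

gold = [0.0, 2.5, 5.0, 1.0]    # invented human 0-5 similarity ratings
pred = [0.1, 0.45, 0.95, 0.3]  # invented model cosine similarities in [0, 1]
```

Because Pearson correlation is scale-invariant, the 0-5 gold scores need no rescaling to be compared against cosine similarities in [0, 1].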
Largest Marathi news summarization dataset containing 25,374 news articles from Lokmat and Loksatta with manually verified abstractive summaries. Covers politics, economics, culture, sports, and more. First large-scale abstractive summarization dataset for Marathi.
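Abstractive summarization corpora like this are typically scored with ROUGE. A rough unigram-overlap (ROUGE-1-style) F1 in plain Python, for intuition only; real evaluations use a proper ROUGE implementation with its tokenization and stemming rules:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate summary and a reference."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())  # clipped counts, as in ROUGE
    if overlap == 0:
        return 0.0
    prec = overlap / sum(c.values())
    rec = overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)
```

Clipping the counts (via the Counter intersection) prevents a candidate from gaming recall by repeating reference words.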
First comprehensive Marathi-English code-mixed NLP ecosystem. MeCorpus provides a 10-million-sentence corpus for unsupervised pretraining. Includes the supervised benchmarks MeSent (~12k tweets for sentiment), MeHate (~12k for hate speech), and MeLID (~12k for language identification). Covers mixed text in both Devanagari and Roman scripts.
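Code-mixed text in two scripts invites a first-pass script check per token. A toy heuristic using the Devanagari Unicode block (U+0900-U+097F); this is not the MeLID annotation scheme, just an illustration of the distinction the dataset covers:

```python
def script_of(token: str) -> str:
    """Classify a token as Devanagari, Roman, or other by counting
    characters in the Devanagari block vs ASCII letters."""
    deva = sum(1 for ch in token if "\u0900" <= ch <= "\u097f")
    latin = sum(1 for ch in token if ch.isascii() and ch.isalpha())
    if deva > latin:
        return "devanagari"
    if latin > deva:
        return "roman"
    return "other"
```

Real language identification for code-mixed text needs more than script, since romanized Marathi and English share the Latin alphabet; that is exactly the ambiguity MeLID-style labels resolve.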
Google's MADLAD-400: a manually audited, 3-trillion-token monolingual dataset built from Common Crawl and covering 419 languages. Document-level, with language identification and quality auditing; the manual audit yields higher Marathi data quality than fully automated pipelines.
Full text dump of Marathi Wikipedia, a standard source of clean encyclopedic Marathi text for pretraining and evaluation.
Marathi subset of the multilingual Colossal Clean Crawled Corpus (mC4) with approximately 7.8 million documents and 14 billion raw tokens, used for large-scale language model pretraining.
Marathi portion of Meta's FLORES-200, a benchmark of professionally translated sentences used as the standard dev/test set for machine translation across 200+ languages.
Collection of parallel corpora from the OPUS project containing English-Marathi aligned text from multiple sources including JW300, GNOME, KDE4, Ubuntu, Tanzil (Quran), Bible, WikiMatrix, and CCAligned. Provides diverse domain coverage for machine translation training.
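Mixed-source parallel collections such as these OPUS corpora are commonly cleaned with simple length-based filters before MT training, since extreme length ratios usually signal misalignment. A sketch with arbitrary thresholds (the pair below is an invented placeholder):

```python
def filter_pairs(pairs, max_ratio=3.0, max_len=200):
    """Drop empty, overlong, or length-ratio-skewed (src, tgt) pairs."""
    kept = []
    for src, tgt in pairs:
        ns, nt = len(src.split()), len(tgt.split())
        if ns == 0 or nt == 0 or ns > max_len or nt > max_len:
            continue
        if max(ns, nt) / min(ns, nt) > max_ratio:
            continue
        kept.append((src, tgt))
    return kept
```

Token-count ratios are a crude proxy for Marathi, whose agglutinative morphology yields fewer tokens than English; production filters tune the ratio per language pair.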
Marathi subset of the OSCAR 23.01 web-crawled corpus containing 729,578 documents and 252 million words (4.5 GB), derived from Common Crawl with language filtering.
English-Marathi parallel corpus extracted from the Prime Minister of India website (pmindia.gov.in) containing up to 56,000 aligned sentence pairs. Covers speeches, press releases, and official communications, providing a domain-specific parallel corpus for machine translation.
Large-scale Indic web corpus (Sangraha) by AI4Bharat with a curated Marathi subset from Common Crawl, featuring language identification, quality filtering, and deduplication. Includes verified, unverified, and synthetic (back-translated) splits for comprehensive Marathi language model training.
Universal Dependencies treebank for Marathi developed at UFAL, Charles University: a small manually annotated corpus with lemmas, POS tags, morphological features, and dependency relations.
10,903 article-summary pairs from BBC Marathi website with professionally written, highly abstractive summaries. Part of the 45-language XL-Sum benchmark. Gold-standard editorial quality summaries that crowdsourced datasets cannot match.