A manually annotated Marathi named entity recognition (NER) dataset of 25,000 sentences tagged across 8 entity classes.
```python
from datasets import load_dataset

ds = load_dataset('l3cube-pune/marathi-ner')
example = ds['train'][0]
print(f"Tokens: {example['tokens']}")
print(f"Tags: {example['ner_tags']}")
```

| Field | Type | Description |
|---|---|---|
| tokens | list[string] | List of tokenized words in the sentence |
| ner_tags | list[int] | BIO-scheme entity tags per token (8 entity classes including PER, ORG, LOC) |
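Since `ner_tags` uses the BIO scheme, downstream code typically needs to group `B-`/`I-` tags back into entity spans. Below is a minimal sketch of that decoding step; the tokens and label strings in the example are illustrative, not taken from the dataset (with the `datasets` library, the authoritative id-to-label mapping is available via `ds['train'].features['ner_tags'].feature.names`).

```python
def bio_to_spans(tokens, tags):
    """Group BIO string tags into (entity_type, entity_text) spans.

    A span starts at a B- tag and extends over following I- tags of the
    same entity type; an O tag or a new B- tag closes the open span.
    """
    spans = []
    etype, start = None, None
    for i, tag in enumerate(tags):
        if tag.startswith('B-'):
            if etype is not None:  # close the previous span
                spans.append((etype, ' '.join(tokens[start:i])))
            etype, start = tag[2:], i
        elif tag.startswith('I-') and etype == tag[2:]:
            continue  # span continues
        else:
            if etype is not None:
                spans.append((etype, ' '.join(tokens[start:i])))
            etype, start = None, None
    if etype is not None:  # flush a span that runs to sentence end
        spans.append((etype, ' '.join(tokens[start:])))
    return spans

# Illustrative English tokens (the dataset itself is Marathi):
tokens = ['John', 'works', 'at', 'Acme', 'Corp']
tags = ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG']
print(bio_to_spans(tokens, tags))  # [('PER', 'John'), ('ORG', 'Acme Corp')]
```

The same helper works on real examples once the integer `ner_tags` are mapped to their label strings via the dataset's `ClassLabel` feature.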