AI4Bharat Naamapadam (Marathi)

MH Specific

The Marathi portion of Naamapadam, the largest publicly available Indic NER dataset: 455,200 training sentences annotated with three entity types (PER, LOC, ORG).

Build a Marathi document entity extractor that automatically identifies people, places, and organizations in government circulars and news articles.

Quick Start

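from datasets import load_dataset
# Stream the Marathi ('mr') training split instead of downloading it in full.
ds = load_dataset('ai4bharat/naamapadam', 'mr', split='train', streaming=True)
# Integer tag IDs in 'ner_tags' index into this list.
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
# Show up to five non-'O' (token, label) pairs for each of the first five sentences.
for i, ex in enumerate(ds):
    entities = [(t, label_names[n]) for t, n in zip(ex['tokens'], ex['ner_tags']) if n != 0]
    print(f"Entities: {entities[:5]}")
    if i >= 4:
        break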

Modality

text

Size

455.2K training sentences

License

CC0-1.0

Format

Parquet

Language

Marathi

Update Frequency

static

Organization

AI4Bharat

Schema

Field	Type	Description
tokens	list[string]	List of word tokens in the sentence
ner_tags	list[int]	BIO-format NER tags, one integer ID per token, in the label order used in Quick Start (O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC)
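
Because ner_tags labels individual tokens, multi-token names usually need to be stitched back together before use. The following is a minimal sketch of one way to do that, not an official utility: the function name bio_to_spans is hypothetical, the label order is copied from the Quick Start snippet and should be verified against the dataset's features, and the example sentence is hand-written rather than taken from the dataset.

```python
from typing import List, Tuple

# Assumption: label order copied from the Quick Start snippet; verify it
# against the dataset's own feature metadata before relying on it.
LABEL_NAMES = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']


def bio_to_spans(tokens: List[str], tag_ids: List[int]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, entity_text) pairs."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        label = LABEL_NAMES[tag_id]
        if label.startswith('B-'):
            # A B- tag always opens a new span; close any open one first.
            if current_tokens:
                spans.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith('I-') and current_type == label[2:]:
            # An I- tag of the same type extends the open span.
            current_tokens.append(token)
        else:
            # 'O', or an I- tag that does not continue the open span:
            # close the open span; the stray I- token itself is dropped.
            if current_tokens:
                spans.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, ' '.join(current_tokens)))
    return spans


# Hand-written illustrative sentence (not from the dataset).
example_tokens = ['राम', 'जोशी', 'पुणे', 'येथे', 'राहतात']
example_tags = [1, 2, 5, 0, 0]
example_spans = bio_to_spans(example_tokens, example_tags)
print(example_spans)
```

Dropping stray I- tags is a design choice made here for simplicity; a stricter pipeline might treat them as the start of a new span instead.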

Build With This

Create a knowledge graph of Maharashtra politicians, districts, and organizations by extracting entities from Marathi news archives

Develop an automated contact directory builder that extracts names and locations from Marathi business correspondence

Build a Marathi-language resume parser that extracts candidate names, companies, and locations for recruitment platforms
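
For the knowledge graph idea above, one simple starting point is to link entities that appear in the same sentence. The sketch below assumes entities have already been extracted as (type, text) pairs per sentence; the function name cooccurrence_edges and the sample input are hypothetical and hand-written, and sentence-level co-occurrence is only a rough proxy for a real relation.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(sentences):
    """Count how often two distinct entities appear in the same sentence.

    `sentences` is an iterable of lists of (entity_type, entity_text) pairs.
    Returns a Counter keyed by sorted entity pairs.
    """
    edges = Counter()
    for entities in sentences:
        # Deduplicate within a sentence and sort so each pair has one key.
        unique = sorted(set(entities))
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical, hand-written input (not taken from the dataset).
sample = [
    [('PER', 'राम जोशी'), ('LOC', 'पुणे')],
    [('PER', 'राम जोशी'), ('LOC', 'पुणे'), ('ORG', 'पुणे महानगरपालिका')],
]
edges = cooccurrence_edges(sample)
for (a, b), count in edges.most_common():
    print(a, b, count)
```

The resulting counts can be loaded into any graph library as weighted edges; filtering out low-count pairs is a common way to reduce noise.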

AI Use Cases

Named entity recognition

Information extraction

Knowledge graph construction

Document understanding

Related Datasets

AI4Bharat BPCC (mr)

parallel-text

AI4Bharat IndicCorp v1 (mr)

text

AI4Bharat IndicCorp v2 (Marathi)

text

AI4Bharat IndicGLUE (mr)

text

Last verified: 2026-03-07