AI4Bharat Naamapadam (Marathi)
MH Specific
Large-scale Marathi NER dataset with 455,200 training sentences annotated with 3 entity types (PER, LOC, ORG), part of the largest publicly available Indic NER dataset.
Build a Marathi document entity extractor that automatically identifies people, places, and organizations in government circulars and news articles.
Quick Start
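from datasets import load_dataset

# Stream the Marathi ('mr') training split instead of downloading it in full.
ds = load_dataset('ai4bharat/naamapadam', 'mr', split='train', streaming=True)
# Integer tag IDs index into this list.
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
for i, ex in enumerate(ds):
    # Keep only tokens that belong to an entity (ID 0 is 'O').
    entities = [(t, label_names[n]) for t, n in zip(ex['tokens'], ex['ner_tags']) if n != 0]
    print(f"Entities: {entities[:5]}")
    if i >= 4:  # stop after five sentences
        break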
Size
455.2K training sentences
Schema
| Field | Type | Description |
|---|---|---|
| tokens | list[string] | List of word tokens in the sentence |
| ner_tags | list[int] | Integer-encoded BIO tags, one per token (O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC) |
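
Because the tags are token-level BIO labels, most applications first merge them into whole entity spans. The following is a minimal sketch of that step, not part of the dataset's own tooling. It assumes the ID-to-label order used in the Quick Start snippet; `bio_to_spans` is a hypothetical helper name, and the example sentence is hand-made rather than taken from the dataset.

```python
from typing import List, Tuple

# Assumed ID-to-label order, copied from the Quick Start snippet.
LABEL_NAMES = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']


def bio_to_spans(tokens: List[str], tag_ids: List[int]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) spans."""
    spans = []
    current_tokens = []
    current_type = None
    for token, tag_id in zip(tokens, tag_ids):
        label = LABEL_NAMES[tag_id]
        prefix, _, ent_type = label.partition('-')
        # An I- tag continues the open span only if the type matches.
        if prefix == 'I' and current_type == ent_type:
            current_tokens.append(token)
            continue
        # Anything else closes the open span, if there is one.
        if current_tokens:
            spans.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [], None
        # B- opens a new span; a stray I- is treated like B- (a lenient choice).
        if prefix in ('B', 'I'):
            current_tokens, current_type = [token], ent_type
    # Flush a span that runs to the end of the sentence.
    if current_tokens:
        spans.append((' '.join(current_tokens), current_type))
    return spans


# Hand-made example sentence (not taken from the dataset).
tokens = ['सुनील', 'पाटील', 'पुणे', 'येथे', 'आले']
tags = [1, 2, 5, 0, 0]
print(bio_to_spans(tokens, tags))
```

Treating a stray I- tag as the start of a new span is one of several reasonable conventions; a stricter decoder could discard such tokens instead.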
Build With This
Create a knowledge graph of Maharashtra politicians, districts, and organizations by extracting entities from Marathi news archives
Develop an automated contact directory builder that extracts names and locations from Marathi business correspondence
Build a Marathi-language resume parser that extracts candidate names, companies, and locations for recruitment platforms
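
For the knowledge graph idea, one simple starting point is to link entities that appear in the same sentence and count how often each pair co-occurs. The sketch below assumes entities have already been extracted as (text, type) pairs per sentence; `cooccurrence_edges` is a hypothetical helper, and the sample input is hand-made for illustration. Sentence-level co-occurrence is a rough proxy for a real relation, so treat the counts as candidate edges only.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(sentences):
    """Count how often two distinct entities appear in the same sentence.

    `sentences` is an iterable of lists of (entity_text, entity_type) pairs.
    """
    edges = Counter()
    for entities in sentences:
        # Deduplicate within a sentence and sort so each pair has one canonical order.
        unique = sorted(set(entities))
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges


# Hand-made input (not taken from the dataset).
sample = [
    [('सुनील पाटील', 'PER'), ('पुणे', 'LOC')],
    [('सुनील पाटील', 'PER'), ('पुणे', 'LOC'), ('महानगरपालिका', 'ORG')],
]
for (a, b), count in cooccurrence_edges(sample).most_common():
    print(a, b, count)
```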
AI Use Cases
Named entity recognition
Information extraction
Knowledge graph construction
Document understanding
Last verified: 2026-03-07