AI4Bharat Naamapadam (Marathi)

MH Specific

Large-scale Marathi NER dataset with 455,200 training sentences annotated with three entity types (PER, LOC, ORG). It is the Marathi portion of Naamapadam, the largest publicly available Indic NER dataset.

Build a Marathi document entity extractor that automatically identifies people, places, and organizations in government circulars and news articles.

Quick Start

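from datasets import load_dataset

# Stream the Marathi ('mr') training split instead of downloading it all up front.
ds = load_dataset('ai4bharat/naamapadam', 'mr', split='train', streaming=True)
# Integer tag IDs in 'ner_tags' index into this list.
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
for i, ex in enumerate(ds):
    # Keep only tokens that belong to an entity (tag 0 is 'O').
    entities = [(t, label_names[n]) for t, n in zip(ex['tokens'], ex['ner_tags']) if n != 0]
    print(f"Entities: {entities[:5]}")
    if i >= 4:  # stop after five sentences
        break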

Modality: text
Size: 455.2K training sentences
License: (not specified)
Format: Parquet
Language: mr
Update Frequency: static
Organization: AI4Bharat

Schema

Field | Type | Description
tokens | list[string] | List of word tokens in the sentence
ner_tags | list[int] | BIO-format NER tags for each token (B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG, O)
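
Because ner_tags uses the BIO scheme, a multi-word entity is spread over several tokens (one B- tag followed by I- tags). The sketch below shows one way to merge those tokens back into whole entity spans. The function name decode_entities and the sample sentence are illustrative assumptions, and the ID-to-label order is taken from the Quick Start snippet rather than verified against the dataset's own metadata.

```python
from typing import List, Tuple

# Assumed ID-to-label order, copied from the Quick Start snippet.
LABEL_NAMES = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']


def decode_entities(tokens: List[str], ner_tags: List[int]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) spans."""
    spans = []
    current_tokens = []
    current_type = None
    for token, tag_id in zip(tokens, ner_tags):
        label = LABEL_NAMES[tag_id]
        if label == 'O':
            prefix, ent_type = 'O', None
        else:
            prefix, ent_type = label.split('-', 1)
        # An I- tag continues the open span only if the entity type matches.
        if prefix == 'I' and current_type == ent_type:
            current_tokens.append(token)
            continue
        # Anything else closes the span that is currently open, if any.
        if current_tokens:
            spans.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [], None
        # A B- tag starts a new span; a stray I- tag is treated as a start too.
        if prefix in ('B', 'I'):
            current_tokens, current_type = [token], ent_type
    if current_tokens:
        spans.append((' '.join(current_tokens), current_type))
    return spans


# Hypothetical example sentence, not taken from the dataset.
sample_tokens = ['शरद', 'पवार', 'पुणे', 'येथे', 'आले']
sample_tags = [1, 2, 5, 0, 0]
print(decode_entities(sample_tokens, sample_tags))
# → [('शरद पवार', 'PER'), ('पुणे', 'LOC')]
```

Treating a stray I- tag as the start of a new span is a lenient design choice; a stricter decoder could discard such tokens instead.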

Build With This

Create a knowledge graph of Maharashtra politicians, districts, and organizations by extracting entities from Marathi news archives
Develop an automated contact directory builder that extracts names and locations from Marathi business correspondence
Build a Marathi-language resume parser that extracts candidate names, companies, and locations for recruitment platforms
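
As a starting point for the knowledge graph idea above, one simple approach is to link entities that appear in the same sentence and count how often each pair co-occurs. This is a minimal sketch under that assumption, not a full graph pipeline; the function name cooccurrence_edges and the placeholder entities are hypothetical, and the input is assumed to be per-sentence lists of (entity_text, entity_type) pairs produced by whatever extractor you use.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(sentences):
    """Count how often two distinct entities appear in the same sentence.

    `sentences` is an iterable of entity lists, one per sentence, where each
    entity is an (entity_text, entity_type) pair.
    """
    edges = Counter()
    for entities in sentences:
        # Deduplicate within the sentence and sort so each pair has one canonical order.
        unique = sorted(set(entities))
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges


# Placeholder entities for illustration only.
extracted = [
    [('A', 'PER'), ('X', 'LOC')],
    [('X', 'LOC'), ('A', 'PER'), ('B', 'ORG')],
]
for (source, target), weight in cooccurrence_edges(extracted).most_common():
    print(source, target, weight)
```

The resulting weighted pairs can be loaded into any graph library or database as edges; sentence-level co-occurrence is a coarse signal, so a real system would likely add relation extraction or entity linking on top.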

AI Use Cases

Named entity recognition, Information extraction, Knowledge graph construction, Document understanding
Last verified: 2026-03-07