L3Cube-MahaSTS (Sentence Similarity)

MH Specific

Human-annotated Marathi Sentence Textual Similarity (STS) dataset with 16,860 sentence pairs, each scored on a 0-5 scale. The pairs are uniformly distributed across score buckets to reduce label bias. The dataset is suited to training sentence embeddings and to building semantic search and retrieval systems in Marathi.

Build a Marathi semantic search engine that finds similar documents using sentence embeddings trained on this similarity data.
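
The retrieval step of such a search engine can be sketched as a cosine-similarity ranking over sentence embeddings. This is a minimal sketch, not a prescribed pipeline: it assumes the embeddings have already been produced by some model trained on this data, and the function names `cosine_similarity` and `search`, as well as the toy vectors, are hypothetical placeholders.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def search(query_vec, doc_vecs, top_k=3):
    """Return (doc_index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 3-dimensional vectors standing in for real sentence embeddings (assumption).
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
query_vec = [1.0, 0.0, 0.0]
results = search(query_vec, doc_vecs, top_k=2)
print(results)
```

In a real system the vectors would come from an embedding model and the brute-force scan would usually be replaced by an approximate nearest-neighbour index once the document collection grows.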

Quick Start

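# Download the dataset from https://github.com/l3cube-pune/MarathiNLP
import pandas as pd

# Adjust the path to wherever the downloaded CSV was saved.
df = pd.read_csv('MahaSTS.csv')
print(f"Pairs: {len(df)}")

# Preview the first five pairs, truncating each sentence to 40 characters.
for _, row in df.head(5).iterrows():
    print(f"Score: {row['score']:.1f} | {row['sentence1'][:40]}... <-> {row['sentence2'][:40]}...")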
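
The description states that pairs are spread uniformly across score buckets. One way to check this on a loaded copy is to count scores per bucket. The sketch below is only an illustration: the bucket boundaries (five equal-width buckets over 0-5, with 5.0 placed in the last bucket) are an assumption, since this page does not define them, and `bucket_counts` is a hypothetical helper.

```python
from collections import Counter

def bucket_counts(scores, n_buckets=5, max_score=5.0):
    """Count scores per equal-width bucket; max_score falls in the last bucket."""
    width = max_score / n_buckets
    counts = Counter()
    for s in scores:
        if not 0.0 <= s <= max_score:
            raise ValueError(f"score out of range: {s}")
        # Clamp so that a score equal to max_score lands in the final bucket.
        idx = min(int(s // width), n_buckets - 1)
        counts[idx] += 1
    return [counts[i] for i in range(n_buckets)]

# Toy scores for illustration only; not taken from the dataset.
toy_scores = [0.0, 0.4, 1.2, 1.9, 2.5, 2.6, 3.1, 3.8, 4.0, 5.0]
counts = bucket_counts(toy_scores)
print(counts)
```

With the DataFrame from the Quick Start, the same check would be `bucket_counts(df['score'].tolist())`; roughly equal counts would be consistent with the uniform distribution described above.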

Modality: text
Size: 16,860 sentence pairs with 0-5 similarity scores
License:
Format: CSV / JSON
Language: mr
Update Frequency: static
Organization: L3Cube, Pune

Schema

Field      Type    Description
sentence1  string  First Marathi sentence
sentence2  string  Second Marathi sentence
score      float   Semantic similarity score (0-5 scale)
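
The schema above can be turned into a quick sanity check before training. The following is a minimal sketch using only the standard library; `validate_rows` is a hypothetical helper, and the inline sample rows are made up for illustration rather than taken from the dataset.

```python
import csv
import io

REQUIRED_FIELDS = ("sentence1", "sentence2", "score")

def validate_rows(rows):
    """Return a list of (row_number, problem) tuples; empty means all rows passed."""
    problems = []
    for n, row in enumerate(rows, start=1):
        missing = [f for f in REQUIRED_FIELDS if not (row.get(f) or "").strip()]
        if missing:
            problems.append((n, f"missing or empty: {', '.join(missing)}"))
            continue
        try:
            score = float(row["score"])
        except ValueError:
            problems.append((n, "score is not a number"))
            continue
        if not 0.0 <= score <= 5.0:
            problems.append((n, "score outside 0-5"))
    return problems

# Made-up rows: one valid, one with an empty sentence2, one with a bad score.
sample = io.StringIO(
    "sentence1,sentence2,score\n"
    "वाक्य एक,वाक्य दोन,4.5\n"
    "वाक्य तीन,,2.0\n"
    "वाक्य चार,वाक्य पाच,7.0\n"
)
problems = validate_rows(csv.DictReader(sample))
print(problems)
```

To check a downloaded file, the same function can be applied to `csv.DictReader(open(path, encoding="utf-8", newline=""))`, assuming the CSV uses the column names listed in the schema.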

Build With This

Create a Marathi FAQ chatbot that matches user questions to pre-written answers using semantic similarity scoring
Develop a duplicate detection system for Marathi government complaint portals that groups similar citizen grievances
Build a Marathi content recommendation engine that suggests similar articles based on sentence-level semantic similarity
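
The duplicate-detection idea above amounts to grouping texts whose pairwise similarity reaches a threshold. The sketch below shows that grouping step with a small union-find. Everything in it is illustrative: `group_duplicates` is a hypothetical helper, the threshold is arbitrary, the sample strings are English placeholders, and `token_overlap` is a crude lexical stand-in for a similarity model trained on this dataset.

```python
def group_duplicates(items, similarity, threshold):
    """Group items whose pairwise similarity reaches the threshold (transitively)."""
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the group root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if similarity(items[i], items[j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item)
    return list(groups.values())

def token_overlap(a, b):
    """Jaccard overlap of whitespace tokens; a stand-in for a trained model."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# English placeholders for readability; real input would be Marathi text.
complaints = [
    "road repair pending in ward 5",
    "road repair still pending in ward 5",
    "water supply cut since monday",
]
groups = group_duplicates(complaints, token_overlap, threshold=0.6)
print(groups)
```

The pairwise loop is quadratic in the number of items, so a large portal would likely need a candidate-retrieval step first and would swap `token_overlap` for embedding-based similarity.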

AI Use Cases

Marathi sentence embedding training
Semantic search and retrieval
Duplicate detection in Marathi content
Document similarity scoring
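
For embedding training and similarity scoring, a common practice with STS data is to rescale the gold scores to the 0-1 range and to report the correlation between model similarities and gold scores. The sketch below illustrates both steps with the standard library only. It is not claimed to be the evaluation protocol used by the dataset authors, and the `predicted` values are hypothetical model outputs.

```python
import math

def normalize_scores(scores, max_score=5.0):
    """Map 0-5 gold scores to the 0-1 range."""
    return [s / max_score for s in scores]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

gold = normalize_scores([0.0, 1.0, 2.5, 4.0, 5.0])
predicted = [0.05, 0.30, 0.45, 0.85, 0.90]  # hypothetical model similarities
r = pearson(gold, predicted)
print(f"Pearson r: {r:.3f}")
```

A value of r close to 1 means the model ranks and spaces pairs much as the annotators did; rank-based measures such as Spearman correlation are often reported alongside it.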
Last verified: 2026-03-09