MahaSTS is a human-annotated Marathi Sentence Textual Similarity (STS) dataset of 16,860 sentence pairs, each scored on a 0-5 scale. The pairs are uniformly distributed across score buckets to reduce label bias. The dataset is useful for training sentence embeddings, semantic search, and retrieval systems in Marathi.

    # Download from https://github.com/l3cube-pune/MarathiNLP
    import pandas as pd

    # Load the CSV; columns are sentence1, sentence2, score
    df = pd.read_csv('MahaSTS.csv')
    print(f"Pairs: {len(df)}")

    # Preview the first five pairs, truncating each sentence to 40 characters
    for _, row in df.head(5).iterrows():
        print(f"Score: {row['score']:.1f} | {row['sentence1'][:40]}... <-> {row['sentence2'][:40]}...")

| Field | Type | Description |
|---|---|---|
| sentence1 | string | First Marathi sentence |
| sentence2 | string | Second Marathi sentence |
| score | float | Semantic similarity score (0-5 scale) |
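
As a quick sanity check on the fields above, the sketch below counts pairs per unit-wide score bucket and rescales scores to the 0-1 range that cosine-similarity training objectives commonly expect. It is a minimal illustration, not part of the dataset's tooling: the helper names `bucket_counts` and `normalize_scores`, the bucket edges, and the toy rows with placeholder strings are all assumptions made here. On the real data, pass the DataFrame loaded from `MahaSTS.csv` instead of `toy`.

```python
import pandas as pd


def bucket_counts(df: pd.DataFrame, score_col: str = "score") -> pd.Series:
    """Count pairs per score bucket: [0, 1], (1, 2], (2, 3], (3, 4], (4, 5]."""
    edges = [0, 1, 2, 3, 4, 5]  # assumed bucket edges, one bucket per score unit
    buckets = pd.cut(df[score_col], bins=edges, include_lowest=True)
    # sort_index() orders the counts from the lowest bucket to the highest
    return buckets.value_counts().sort_index()


def normalize_scores(df: pd.DataFrame, score_col: str = "score") -> pd.Series:
    """Rescale 0-5 similarity scores to the 0-1 range."""
    return df[score_col] / 5.0


# Toy stand-in with placeholder strings (not real MahaSTS rows)
toy = pd.DataFrame({
    "sentence1": ["a1", "a2", "a3", "a4", "a5", "a6"],
    "sentence2": ["b1", "b2", "b3", "b4", "b5", "b6"],
    "score": [0.0, 1.5, 2.5, 3.5, 4.5, 5.0],
})

print(bucket_counts(toy))
print(normalize_scores(toy).tolist())
```

On the full dataset, roughly equal counts across the five buckets would be consistent with the uniform distribution described above.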