XL-Sum Marathi (BBC)

10,903 article-summary pairs from the BBC Marathi website, with professionally written, highly abstractive summaries. Part of the 45-language XL-Sum benchmark. The summaries are editorially written rather than crowdsourced.

Build a Marathi news summarization model trained on professional BBC Marathi summaries for high-quality output.

Quick Start

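from datasets import load_dataset

# Load the Marathi training split of XL-Sum from the Hugging Face Hub.
ds = load_dataset('csebuetnlp/xlsum', 'marathi', split='train')
print(f"Articles: {len(ds)}")

# ds[:3] returns a dict of columns, so select rows before iterating.
for ex in ds.select(range(3)):
    print(f"Title: {ex['title'][:60]}")
    print(f"Summary: {ex['summary'][:80]}...\n")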

Modality

text

Size

10,903 Marathi samples, with train/validation/test splits

License

CC-BY-NC-SA-4.0

Format

JSON

Language

Marathi

Update Frequency

static

Organization

BUET CSE NLP Group

Schema

Field	Type	Description
text	string	Full BBC Marathi news article text
summary	string	Professional summary of the article
title	string	Article headline
url	string	Source BBC URL
id	string	Article identifier
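A quick sanity check against the schema can catch malformed records before training. The following is a minimal sketch, assuming each record is a plain dict with the text, summary, title, and url fields from the table; `validate_record` and the sample record are hypothetical names and placeholder data, not part of the dataset or of any library.

```python
# Fields checked here, taken from the schema table above.
REQUIRED_FIELDS = ("text", "summary", "title", "url")

def validate_record(record):
    """Return a list of problems found in one record (empty list means OK)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str):
            problems.append(f"{field}: expected str, got {type(value).__name__}")
        elif not value.strip():
            problems.append(f"{field}: empty")
    return problems

# Made-up placeholder record, not real dataset content.
sample = {
    "text": "placeholder article body",
    "summary": "placeholder summary",
    "title": "placeholder title",
    "url": "https://example.com/placeholder",
}
print(validate_record(sample))  # []
```

Returning a list of problems rather than raising makes it easy to count or log bad records across a whole split.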

Build With This

Create a daily Marathi news brief generator that produces BBC-quality summaries of top Maharashtra stories

Develop a headline generation model trained on BBC Marathi articles for automated news headline creation

Build a cross-lingual news summarizer that generates Marathi summaries from English BBC articles using this as training data
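The summarization and headline-generation ideas above differ mainly in which field serves as the target. The sketch below shows one possible way to turn a record into a (source, target) pair for a sequence-to-sequence model. `to_training_pair` and its `max_source_chars` budget are hypothetical, chosen for illustration; a real pipeline would truncate by tokenizer length instead. The cross-lingual idea is not covered, since it needs paired English articles that this subset does not provide on its own.

```python
def to_training_pair(example, task="summary", max_source_chars=4000):
    """Map one record to a (source, target) pair.

    task="summary":  article text -> summary
    task="headline": article text -> title
    max_source_chars is an illustrative character budget, not a tokenizer limit.
    """
    targets = {"summary": "summary", "headline": "title"}
    if task not in targets:
        raise ValueError(f"unknown task: {task!r}")
    # Collapse runs of whitespace, then truncate the source.
    source = " ".join(example["text"].split())[:max_source_chars]
    target = " ".join(example[targets[task]].split())
    return source, target
```

With the `ds` object from Quick Start, pairs could be built with a list comprehension such as `[to_training_pair(ex) for ex in ds]`.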

AI Use Cases

Marathi abstractive summarization benchmarking

News content summarization

Cross-lingual summarization evaluation

Fine-tuning LLMs for Marathi text generation
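For the benchmarking use case, a rough first check is unigram overlap between a generated summary and the reference. The sketch below is a simplified ROUGE-1 F1 that assumes whitespace tokenization, plus a trivial lead-words baseline. Both function names are hypothetical, and this is not the scoring setup behind any published XL-Sum numbers, so its scores should only be compared with each other.

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between two strings, using whitespace tokens."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

def lead_baseline(text, n_words=30):
    """Trivial baseline: the first n_words whitespace tokens of the article."""
    return " ".join(text.split()[:n_words])
```

Scoring `lead_baseline(ex["text"])` against `ex["summary"]` gives a floor that a trained model would be expected to beat.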

Related Datasets

AI4Bharat BPCC (mr)

parallel-text

AI4Bharat IndicCorp v1 (mr)

text

AI4Bharat IndicCorp v2 (Marathi)

text

AI4Bharat IndicGLUE (mr)

text

Last verified: 2026-03-09