Version: 9.2.11

Vietnamese News Dataset

A dataset of Vietnamese news articles collected from 6 major Vietnamese newspapers for NLP research.

Dataset Description

Dataset Summary

This dataset contains 3,268 Vietnamese news articles covering various topics including politics, business, sports, entertainment, education, health, and technology. It is designed for Vietnamese NLP research tasks such as:

  • Text classification (news categorization)
  • Language modeling
  • Text generation
  • Named entity recognition
  • Keyword extraction

Languages

Vietnamese (vi)

Dataset Structure

Data Instance

{
  "id": "article-00001",
  "source": "vnexpress",
  "url": "https://vnexpress.net/example-123.html",
  "category": "Thời sự",
  "title": "Tiêu đề bài báo",
  "content": "Nội dung bài báo...",
  "publish_date": "2026-01-28"
}

Data Fields

Field         Type    Description
id            string  Unique identifier (article-XXXXX)
source        string  News source
url           string  Original article URL
category      string  News category
title         string  Article title
content       string  Full article content
publish_date  string  Publication date (yyyy-mm-dd)
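
The field table above can be turned into a small sanity check for individual records. This is a minimal sketch, not part of the dataset's tooling: the helper name `validate_record` is hypothetical, and the five-digit `id` pattern is an assumption inferred from the sample value `article-00001`.

```python
import re
from datetime import datetime

# Field names come from the Data Fields table; every field is documented as a string.
EXPECTED_FIELDS = ("id", "source", "url", "category", "title", "content", "publish_date")

# Assumption: exactly five digits after "article-", inferred from "article-00001".
ID_PATTERN = re.compile(r"article-\d{5}")
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def validate_record(record):
    """Return a list of problems found in one record (empty list if none)."""
    problems = []
    for field in EXPECTED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], str):
            problems.append(f"field is not a string: {field}")

    record_id = record.get("id")
    if isinstance(record_id, str) and not ID_PATTERN.fullmatch(record_id):
        problems.append("id does not match article-XXXXX")

    date = record.get("publish_date")
    if isinstance(date, str):
        try:
            # The regex enforces zero padding; strptime rejects impossible dates.
            if not DATE_PATTERN.fullmatch(date):
                raise ValueError(date)
            datetime.strptime(date, "%Y-%m-%d")
        except ValueError:
            problems.append("publish_date is not yyyy-mm-dd")
    return problems


# The record shown under "Data Instance".
sample = {
    "id": "article-00001",
    "source": "vnexpress",
    "url": "https://vnexpress.net/example-123.html",
    "category": "Thời sự",
    "title": "Tiêu đề bài báo",
    "content": "Nội dung bài báo...",
    "publish_date": "2026-01-28",
}
print(validate_record(sample))
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a record at once.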

Data Splits

Split  Count
train  3,268
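
Because only a train split is provided, anyone who needs a held-out set has to carve one out themselves. The sketch below shows one possible approach, a deterministic split keyed on the article `id`. The function name `assign_split`, the split name `eval`, and the 10% fraction are all hypothetical choices, not something the dataset defines.

```python
import hashlib


def assign_split(article_id, eval_fraction=0.1):
    """Deterministically map an article id to "train" or "eval".

    Hashing the id (instead of shuffling) keeps the assignment stable
    across runs and machines, and independent of row order.
    """
    digest = hashlib.sha256(article_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "eval" if bucket < eval_fraction else "train"


# Assumption: ids follow the article-XXXXX pattern from the Data Fields table.
print(assign_split("article-00001"))
```

With the `datasets` usage shown later in this card, the same function could be passed to `filter`, for example `dataset["train"].filter(lambda x: assign_split(x["id"]) == "eval")`.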

Sources

Source          Website        Count
VnExpress       vnexpress.net  1,000
Dân Trí         dantri.com.vn  976
Thanh Niên      thanhnien.vn   579
Tuổi Trẻ        tuoitre.vn     210
Tiền Phong      tienphong.vn   240
Người Lao Động  nld.com.vn     263
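
To see how uneven the sources are, the counts in the table above can be converted to shares of the whole. This is a small illustrative sketch: the dictionary is keyed by the display names from the table, which are not necessarily the values stored in the `source` field (only `vnexpress` is shown in the sample record), and `source_shares` is a hypothetical helper.

```python
# Counts copied from the Sources table.
SOURCE_COUNTS = {
    "VnExpress": 1000,
    "Dân Trí": 976,
    "Thanh Niên": 579,
    "Tuổi Trẻ": 210,
    "Tiền Phong": 240,
    "Người Lao Động": 263,
}


def source_shares(counts):
    """Return each source's fraction of the total article count."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


total = sum(SOURCE_COUNTS.values())
print(f"total articles: {total}")  # matches the 3,268 train rows
for name, share in sorted(source_shares(SOURCE_COUNTS).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.1%}")
```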

Categories

  • Thời sự (Current affairs)
  • Thế giới (World)
  • Kinh doanh (Business)
  • Giải trí (Entertainment)
  • Thể thao (Sports)
  • Pháp luật (Law)
  • Giáo dục (Education)
  • Sức khỏe (Health)
  • Đời sống (Lifestyle)
  • Khoa học (Science/Technology)

Usage

from datasets import load_dataset

dataset = load_dataset("undertheseanlp/UVN-1")

# Access articles
for article in dataset["train"]:
    print(article["id"], article["title"])
    break

# Filter by source
vnexpress = dataset["train"].filter(lambda x: x["source"] == "vnexpress")

# Filter by category
sports = dataset["train"].filter(lambda x: x["category"] == "Thể thao")
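
Before training a classifier it is often useful to check how articles are distributed across categories or sources. The sketch below counts values of one field over any iterable of record dictionaries. `count_by` is a hypothetical helper, and the three records are made-up placeholders so the example runs without downloading anything.

```python
from collections import Counter


def count_by(records, field):
    """Count how many records share each value of `field`."""
    return Counter(record[field] for record in records)


# Placeholder records for illustration only; they are not taken from the dataset.
records = [
    {"id": "article-00001", "source": "vnexpress", "category": "Thời sự"},
    {"id": "article-00002", "source": "vnexpress", "category": "Thể thao"},
    {"id": "article-00003", "source": "vnexpress", "category": "Thể thao"},
]

for category, count in count_by(records, "category").most_common():
    print(category, count)
```

Since iterating over `dataset["train"]` yields dictionaries, as in the loop above, `count_by(dataset["train"], "category")` should work the same way on the real data.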

Considerations

Biases

  • Reflects editorial perspectives of mainstream Vietnamese news outlets
  • May be biased toward urban and national news coverage

Limitations

  • Covers only articles published in 2025-2026
  • Article counts are uneven across sources (from 210 to 1,000 per source) because of differences in website structure

License

CC-BY-NC-4.0. For research and educational purposes only.

Citation

@misc{vietnamese_news_dataset,
  title={Vietnamese News Dataset},
  author={Underthesea NLP},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/undertheseanlp/UVN-1}
}