Version: Next 🚧

Vietnamese News Dataset

A dataset of Vietnamese news articles collected from 6 major Vietnamese newspapers for NLP research.

Dataset Description

Homepage: https://github.com/undertheseanlp/underthesea
Repository: https://huggingface.co/datasets/undertheseanlp/UVN-1
Point of Contact: Vu Anh (anhv.ict91@gmail.com)

Dataset Summary

This dataset contains 3,268 Vietnamese news articles covering various topics including politics, business, sports, entertainment, education, health, and technology. It is designed for Vietnamese NLP research tasks such as:

Text classification (news categorization)
Language modeling
Text generation
Named entity recognition
Keyword extraction

Languages

Vietnamese (vi)

Dataset Structure

Data Instance

{
  "id": "article-00001",
  "source": "vnexpress",
  "url": "https://vnexpress.net/example-123.html",
  "category": "Thời sự",
  "title": "Tiêu đề bài báo",
  "content": "Nội dung bài báo...",
  "publish_date": "2026-01-28"
}

Data Fields

Field	Type	Description
`id`	string	Unique identifier (article-XXXXX)
`source`	string	News source
`url`	string	Original article URL
`category`	string	News category
`title`	string	Article title
`content`	string	Full article content
`publish_date`	string	Publication date (yyyy-mm-dd)

Data Splits

Split	Count
train	3,268

Sources

Source	Website	Count
VnExpress	vnexpress.net	1,000
Dân Trí	dantri.com.vn	976
Thanh Niên	thanhnien.vn	579
Tuổi Trẻ	tuoitre.vn	210
Tiền Phong	tienphong.vn	240
Người Lao Động	nld.com.vn	263

Usage

from datasets import load_dataset

dataset = load_dataset("undertheseanlp/UVN-1")

# Access articles
for article in dataset["train"]:
    print(article["id"], article["title"])
    break

# Filter by source
vnexpress = dataset["train"].filter(lambda x: x["source"] == "vnexpress")

# Filter by category
sports = dataset["train"].filter(lambda x: x["category"] == "Thể thao")

Considerations

Biases

Reflects editorial perspectives of mainstream Vietnamese news outlets
May be biased toward urban and national news coverage

Limitations

Articles from 2025-2026 time period
Uneven distribution across sources due to website structure differences

License

CC-BY-NC-4.0. For research and educational purposes only.

Citation

@misc{vietnamese_news_dataset,
  title={Vietnamese News Dataset},
  author={Underthesea NLP},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/undertheseanlp/UVN-1}
}

Dataset Description​

Dataset Summary​

Languages​

Dataset Structure​

Data Instance​

Data Fields​

Data Splits​

Sources​

Categories​

Usage​

Considerations​

Biases​

Limitations​

License​

Citation​