
UVW

Underthesea Vietnamese Wikipedia Dataset (2026 Edition)

A high-quality, cleaned dataset of 1.1M Vietnamese Wikipedia articles enriched with Wikidata metadata for NLP research.


Dataset: undertheseanlp/UVW-2026

Features

Feature         Type     Description
id              string   Unique identifier (URL-safe title)
title           string   Article title
content         string   Cleaned article text
num_chars       int32    Character count
num_sentences   int32    Sentence count
quality_score   int32    Quality score (1-10)
wikidata_id     string   Wikidata Q-identifier
main_category   string   Primary category from Wikidata P31

Usage

Load from HuggingFace

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("undertheseanlp/UVW-2026")

# Access the data
train = dataset["train"]
print(train[0])

# Filter high-quality articles (score >= 7)
high_quality = train.filter(lambda x: x["quality_score"] >= 7)

# Filter by category
people = train.filter(lambda x: x["main_category"] == "người")

Statistics

Metric              Value
Total articles      1,118,224
Train split         894,579 (80%)
Validation split    111,822 (10%)
Test split          111,823 (10%)
Wikidata coverage   99.4%
Category coverage   97.0%
Unique categories   11,549
Avg. characters     1,190
Avg. sentences      10

Quality Score Distribution

Score   Count     Percentage
1       134       0.0%
2       376       0.0%
3       28,267    2.5%
4       607,081   54.3%
5       208,304   18.6%
6       134,385   12.0%
7       70,345    6.3%
8       57,054    5.1%
9       9,649     0.9%
10      2,629     0.2%
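A distribution like the one above can be recomputed from the quality_score column with a simple counter. A minimal sketch over a toy sample (the real column has 1,118,224 values):

```python
from collections import Counter

# Toy stand-in for the real quality_score column
scores = [4, 4, 5, 7, 4, 6, 8, 5, 4, 9]

dist = Counter(scores)
total = len(scores)
for score in sorted(dist):
    print(f"score {score}: {dist[score]} ({dist[score] / total:.1%})")
```

On the full dataset, `scores` would be `dataset["train"]["quality_score"]`.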

Top Categories

Category (Vietnamese)                                        Count     Percentage
đơn vị phân loại (taxon)                                     618,281   55.3%
người (human)                                                78,191    7.0%
xã của Pháp (commune of France)                              35,635    3.2%
khu định cư (settlement)                                     20,276    1.8%
tiểu hành tinh (asteroid)                                    17,891    1.6%
xã của Việt Nam (commune of Vietnam)                         7,088     0.6%
đô thị của Ý (municipality of Italy)                         6,700     0.6%
trang định hướng Wikimedia (Wikimedia disambiguation page)   6,202     0.6%

Quality Scoring

Articles are scored 1-10 based on:

Component        Weight     Criteria
Length           40%        Character count (200-100,000 optimal)
Sentences        30%        Sentence count (3-1,000 optimal)
Density          30%        Avg. sentence length (80-150 chars optimal)
Wikidata bonus   +0.5       Has wikidata_id
Category bonus   +0.5       Has main_category
Markup penalty   -1 to -3   Remaining Wikipedia markup
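One plausible reading of the table above, sketched in Python. The exact per-component curves are not published here, so scoring each component 0-10 with full marks inside its optimal range and a linear fall-off outside it is an assumption, as is the final rounding and clamping:

```python
def component_score(value, lo, hi):
    # Assumption: 10 inside the optimal range, scaled linearly toward 0 outside it
    if lo <= value <= hi:
        return 10.0
    if value < lo:
        return 10.0 * value / lo
    return max(0.0, 10.0 * hi / value)

def quality_score(num_chars, num_sentences, has_wikidata, has_category,
                  markup_penalty=0):
    """Hypothetical reconstruction of the 1-10 quality score."""
    density = num_chars / max(num_sentences, 1)
    score = (0.4 * component_score(num_chars, 200, 100_000)      # Length, 40%
             + 0.3 * component_score(num_sentences, 3, 1_000)    # Sentences, 30%
             + 0.3 * component_score(density, 80, 150))          # Density, 30%
    score += 0.5 * has_wikidata + 0.5 * has_category             # bonuses
    score -= markup_penalty                                      # -1 to -3 if markup remains
    return int(min(10, max(1, round(score))))

# An average article (1,190 chars, 10 sentences, both metadata fields present)
print(quality_score(1190, 10, True, True))  # → 10
```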

Data Processing Pipeline

  1. Download - Vietnamese Wikipedia XML dump from Wikimedia
  2. Extract - Parse XML and extract article content
  3. Clean - Remove Wikipedia markup (templates, refs, links, tables)
  4. Normalize - Apply Unicode NFC normalization
  5. Score - Calculate quality metrics for each article
  6. Enrich - Add Wikidata IDs and categories via Wikidata API
  7. Filter - Remove special pages, redirects, disambiguation, short articles
  8. Split - Create train/validation/test splits (80/10/10, seed=42)
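Step 8 can be sketched with a seeded shuffle and slicing. The real pipeline's split mechanism is not specified beyond the ratios and seed, so this pure-Python version over a 1,000-id stand-in is an illustration, not the actual implementation:

```python
import random

ids = list(range(1000))         # stand-in for the 1,118,224 article ids
random.Random(42).shuffle(ids)  # seed=42, as in the pipeline

n = len(ids)
train = ids[:int(0.8 * n)]                  # 80%
validation = ids[int(0.8 * n):int(0.9 * n)]  # 10%
test = ids[int(0.9 * n):]                   # 10%
print(len(train), len(validation), len(test))  # → 800 100 100
```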

Removed Content

  • Wikipedia templates ({{...}})
  • References and citations (<ref>...</ref>)
  • HTML tags and comments
  • Category links ([[Thể loại:...]])
  • File/image links ([[Tập tin:...]], [[File:...]])
  • Interwiki links
  • Tables ({| ... |})
  • Infoboxes and navigation templates
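The removals above can be approximated with regular expressions. A minimal sketch combining steps 3 and 4 of the pipeline; the actual cleaning rules are not published, rule order here is an assumption, and deeply nested templates would need repeated passes:

```python
import re
import unicodedata

# Hypothetical rules mirroring the "Removed Content" list; order matters.
RULES = [
    (r"<!--.*?-->", ""),                               # HTML comments
    (r"<ref[^>]*>.*?</ref>", ""),                      # references and citations
    (r"\{\{[^{}]*\}\}", ""),                           # innermost templates {{...}}
    (r"\{\|.*?\|\}", ""),                              # tables {| ... |}
    (r"\[\[(?:Thể loại|Tập tin|File)[^\]]*\]\]", ""),  # category and file links
    (r"\[\[[^\]|]*\|([^\]]*)\]\]", r"\1"),             # piped wiki links -> label
    (r"\[\[([^\]]*)\]\]", r"\1"),                      # plain wiki links -> text
    (r"'{2,}", ""),                                    # bold/italic quote markup
]

def clean(text: str) -> str:
    text = unicodedata.normalize("NFC", text)  # step 4: Unicode NFC
    for pattern, repl in RULES:
        text = re.sub(pattern, repl, text, flags=re.DOTALL)
    return text.strip()

wikitext = "{{Infobox}}'''Phở''' là một [[món ăn]] nổi tiếng của [[Việt Nam|VN]].<ref>nguồn</ref>"
print(clean(wikitext))  # → Phở là một món ăn nổi tiếng của VN.
```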

Sample Articles

Title          Category                                  Quality   Wikidata
Việt Nam       quốc gia có chủ quyền (sovereign state)   9         Q881
Hà Nội         thủ đô (capital)                          9         Q1858
Nguyễn Du      người (human)                             8         Q332972
Sông Mê Kông   sông (river)                              8         Q3056359
Phở            món ăn (dish)                             7         Q217666

Citation

@dataset{uvw2026,
  title     = {UVW 2026: Underthesea Vietnamese Wikipedia Dataset},
  author    = {Underthesea NLP},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/undertheseanlp/UVW-2026},
  note      = {Vietnamese Wikipedia articles enriched with Wikidata metadata}
}
