UVW
Underthesea Vietnamese Wikipedia Dataset (2026 Edition)
A high-quality, cleaned dataset of 1.1M Vietnamese Wikipedia articles enriched with Wikidata metadata for NLP research.
Dataset: undertheseanlp/UVW-2026
Features
| Feature | Type | Description |
|---|---|---|
| id | string | Unique identifier (URL-safe title) |
| title | string | Article title |
| content | string | Cleaned article text |
| num_chars | int32 | Character count |
| num_sentences | int32 | Sentence count |
| quality_score | int32 | Quality score (1-10) |
| wikidata_id | string | Wikidata Q-identifier |
| main_category | string | Primary category from Wikidata P31 (instance of), as a Vietnamese label |
Usage
Load from HuggingFace
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("undertheseanlp/UVW-2026")
# Access the data
train = dataset["train"]
print(train[0])
# Filter high-quality articles (score >= 7)
high_quality = train.filter(lambda x: x["quality_score"] >= 7)
# Filter by category
people = train.filter(lambda x: x["main_category"] == "người")
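Because Wikidata and category coverage are below 100%, some rows have no `main_category`. The following is a minimal sketch of a reusable filter predicate that combines the two filters above and tolerates missing values. The helper name `keep_row` is hypothetical, and how missing categories are stored (for example `None` or an empty string) is an assumption; neither is part of the dataset or of the `datasets` library.

```python
from typing import Any, Mapping, Optional


def keep_row(
    row: Mapping[str, Any],
    min_score: int = 7,
    category: Optional[str] = None,
) -> bool:
    """Return True if the row meets the score threshold and, when given, the category.

    Hypothetical helper: a missing main_category simply fails the category
    check, whether it is stored as None, an empty string, or is absent.
    """
    if row["quality_score"] < min_score:
        return False
    if category is not None and row.get("main_category") != category:
        return False
    return True


# Usage with the `train` split loaded above (requires network access):
#   people_hq = train.filter(lambda row: keep_row(row, min_score=7, category="người"))
```

Passing a single predicate to `filter` makes one pass over the split instead of two, which matters at this dataset size.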
Statistics
| Metric | Value |
|---|---|
| Total articles | 1,118,224 |
| Train split | 894,579 (80%) |
| Validation split | 111,822 (10%) |
| Test split | 111,823 (10%) |
| Wikidata coverage | 99.4% |
| Category coverage | 97.0% |
| Unique categories | 11,549 |
| Avg. characters | 1,190 |
| Avg. sentences | 10 |
Quality Score Distribution
| Score | Count | Percentage |
|---|---|---|
| 1 | 134 | <0.1% |
| 2 | 376 | <0.1% |
| 3 | 28,267 | 2.5% |
| 4 | 607,081 | 54.3% |
| 5 | 208,304 | 18.6% |
| 6 | 134,385 | 12.0% |
| 7 | 70,345 | 6.3% |
| 8 | 57,054 | 5.1% |
| 9 | 9,649 | 0.9% |
| 10 | 2,629 | 0.2% |
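A distribution like the one above can be recomputed from the `quality_score` column. The sketch below is one way to do it; the function name `score_distribution` is hypothetical. Note that the table covers all 1,118,224 articles, so running this on a single split gives that split's distribution only.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def score_distribution(scores: Iterable[int]) -> Dict[int, Tuple[int, float]]:
    """Map each score to (count, percentage rounded to one decimal place)."""
    counts = Counter(scores)
    total = sum(counts.values())
    return {
        score: (counts[score], round(100 * counts[score] / total, 1))
        for score in sorted(counts)
    }


# Usage with a loaded split (requires network access):
#   dist = score_distribution(train["quality_score"])
#   for score, (count, pct) in dist.items():
#       print(score, count, f"{pct}%")
```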
Top Categories
| Category (Vietnamese) | Count | Percentage |
|---|---|---|
| đơn vị phân loại (taxon) | 618,281 | 55.3% |
| người (human) | 78,191 | 7.0% |
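| xã của Pháp (commune of France) | 35,635 | 3.2% |
| khu định cư (human settlement) | 20,276 | 1.8% |
| tiểu hành tinh (asteroid) | 17,891 | 1.6% |
| xã của Việt Nam (commune of Vietnam) | 7,088 | 0.6% |
| đô thị của Ý (municipality of Italy) | 6,700 | 0.6% |
| trang định hướng Wikimedia (Wikimedia disambiguation page) | 6,202 | 0.6% |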
Quality Scoring
Articles are scored from 1 to 10 using three weighted components, two bonuses, and a markup penalty:
| Component | Weight | Criteria |
|---|---|---|
| Length | 40% | Character count (200 - 100,000 optimal) |
| Sentences | 30% | Sentence count (3 - 1,000 optimal) |
| Density | 30% | Avg sentence length (80-150 chars optimal) |
| Wikidata bonus | +0.5 | Has wikidata_id |
| Category bonus | +0.5 | Has main_category |
| Markup penalty | -1 to -3 | Remaining Wikipedia markup |
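The table gives weights and optimal ranges but not the exact formula. The sketch below shows one plausible way the pieces could combine; it is an illustration under stated assumptions, not the scoring code used to build the dataset. The assumptions are: each component is a value in [0, 1] that equals 1.0 inside its optimal range and falls off proportionally outside it, the weighted sum is scaled to 10, and the result is rounded and clamped to 1-10. The names `range_score` and `quality_score` are hypothetical.

```python
def range_score(value: float, low: float, high: float) -> float:
    """Return 1.0 inside [low, high], falling off proportionally outside.

    The fall-off shape is an assumption; the source only states the
    optimal ranges.
    """
    if value <= 0:
        return 0.0
    if value < low:
        return value / low
    if value > high:
        return high / value
    return 1.0


def quality_score(
    num_chars: int,
    num_sentences: int,
    has_wikidata: bool,
    has_category: bool,
    markup_penalty: float = 0.0,
) -> int:
    """Combine the components from the table into an integer score in 1..10."""
    # Density is the average sentence length in characters.
    density = num_chars / num_sentences if num_sentences else 0.0
    base = 10 * (
        0.4 * range_score(num_chars, 200, 100_000)      # Length, 40%
        + 0.3 * range_score(num_sentences, 3, 1_000)    # Sentences, 30%
        + 0.3 * range_score(density, 80, 150)           # Density, 30%
    )
    bonus = 0.5 * has_wikidata + 0.5 * has_category     # +0.5 each
    raw = base + bonus - markup_penalty                 # penalty is 1 to 3 when markup remains
    # Rounding and clamping to the 1..10 range are assumptions.
    return int(min(10, max(1, round(raw))))
```

Under these assumptions, an article with 1,200 characters and 10 sentences sits inside all three optimal ranges and is capped at 10 when no markup remains. The published distribution, with most articles at 4, suggests the real component functions are stricter than this sketch.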
Data Processing Pipeline
- Download - Vietnamese Wikipedia XML dump from Wikimedia
- Extract - Parse XML and extract article content
- Clean - Remove Wikipedia markup (templates, refs, links, tables)
- Normalize - Apply Unicode NFC normalization
- Score - Calculate quality metrics for each article
- Enrich - Add Wikidata IDs and categories via Wikidata API
- Filter - Remove special pages, redirects, disambiguation, short articles
- Split - Create train/validation/test splits (80/10/10, seed=42)
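Two of the steps above, Normalize and Split, can be sketched with the Python standard library alone. This is a minimal illustration, not the project's pipeline code: the function names are hypothetical, and the shuffle-then-slice procedure with integer rounding is an assumption. It reproduces the published split sizes, not the actual split membership.

```python
import random
import unicodedata
from typing import List, Tuple


def normalize_text(text: str) -> str:
    """Apply Unicode NFC normalization, composing base letters with their marks."""
    return unicodedata.normalize("NFC", text)


def split_indices(n: int, seed: int = 42) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle indices 0..n-1 with a fixed seed and slice them 80/10/10.

    Integer arithmetic avoids floating-point rounding; the test split
    receives whatever remains after train and validation.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = n * 80 // 100
    n_val = n * 10 // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test
```

NFC matters for Vietnamese because a letter such as "ế" can be encoded either as one code point or as "e" followed by two combining marks; normalizing makes the two forms compare equal. With `n = 1_118_224`, the slicing above yields 894,579 / 111,822 / 111,823 indices, matching the Statistics table.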
Removed Content
- Wikipedia templates ({{...}})
- References and citations (<ref>...</ref>)
- HTML tags and comments
- Category links ([[Thể loại:...]])
- File/image links ([[Tập tin:...]], [[File:...]])
- Interwiki links
- Tables ({| ... |})
- Infoboxes and navigation templates
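Several of the removals listed above can be approximated with regular expressions. The sketch below is a simplified illustration, not the cleaner used to build the dataset: the pattern names and `strip_markup` are hypothetical, interwiki links are not handled, and file links whose captions contain nested links, as well as nested tables, would need a real wikitext parser.

```python
import re

# Innermost templates only; applied repeatedly to peel nested ones.
TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")
# HTML comments.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Self-closing <ref ... /> and paired <ref ...>...</ref>.
REF = re.compile(r"<ref\b[^>/]*/>|<ref\b[^>]*>.*?</ref>", re.DOTALL | re.IGNORECASE)
# Wikitext tables (non-nested).
TABLE = re.compile(r"\{\|.*?\|\}", re.DOTALL)
# Category and file links without nested brackets.
NS_LINK = re.compile(r"\[\[(?:Thể loại|Tập tin|File):[^\[\]]*\]\]", re.IGNORECASE)
# Any remaining HTML tag.
HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")


def strip_markup(text: str) -> str:
    """Remove templates, comments, refs, tables, namespace links, and HTML tags."""
    previous = None
    while previous != text:  # loop until no innermost template is left
        previous = text
        text = TEMPLATE.sub("", text)
    for pattern in (COMMENT, REF, TABLE, NS_LINK, HTML_TAG):
        text = pattern.sub("", text)
    # Collapse the runs of spaces left behind by removed markup.
    return re.sub(r"[ \t]+", " ", text).strip()
```

Templates are removed before tables because a template ending in `|}}` would otherwise be mistaken for a table terminator.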
Sample Articles
| Title | Category | Quality | Wikidata |
|---|---|---|---|
| Việt Nam | quốc gia có chủ quyền (sovereign state) | 9 | Q881 |
| Hà Nội | thủ đô (capital city) | 9 | Q1858 |
| Nguyễn Du | người (human) | 8 | Q332972 |
| Sông Mê Kông | sông (river) | 8 | Q3056359 |
| Phở | món ăn (dish) | 7 | Q217666 |
Citation
@dataset{uvw2026,
title = {UVW 2026: Underthesea Vietnamese Wikipedia Dataset},
author = {Underthesea NLP},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/undertheseanlp/UVW-2026},
note = {Vietnamese Wikipedia articles enriched with Wikidata metadata}
}