Version: 9.2.11

UDD-v0.1

Universal Dependency Dataset for Vietnamese

UDD-v0.1 is a Vietnamese dependency dataset that follows the Universal Dependencies (UD) annotation guidelines. The annotations are machine-generated with the Underthesea NLP toolkit.

HuggingFace Dataset

Dataset: undertheseanlp/UDD-v0.1

Summary

| Metric | Value |
| --- | --- |
| Sentences | 3,000 |
| Tokens | 64,814 |
| Avg sentence length | 21.60 |
| Max sentence length | 65 |
| Avg tree depth | 6.77 |
| Max tree depth | 21 |
| Source | Vietnamese Legal Corpus (UTS_VLC) |
| Validation | 0 errors (passes all UD checks) |
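The tree depth figures above depend on how depth is counted. A minimal sketch follows, assuming depth is the largest number of arcs between the root and any token (so a one-token sentence has depth 0). The function name `tree_depth` is hypothetical, and `scripts/statistics.py` may use a different convention.

```python
from typing import Sequence, Union


def tree_depth(heads: Sequence[Union[int, str]]) -> int:
    """Depth of one dependency tree.

    `heads[i]` is the 1-based index of the head of token i+1, with 0 for the root.
    Assumption: depth = maximum number of arcs from the root to any token.
    """
    parsed = [int(h) for h in heads]  # the dataset stores heads as strings

    def arcs_to_root(token: int) -> int:
        steps = 0
        seen = set()
        while parsed[token - 1] != 0:
            if token in seen:
                raise ValueError("cycle detected in head indices")
            seen.add(token)
            token = parsed[token - 1]
            steps += 1
        return steps

    return max((arcs_to_root(t) for t in range(1, len(parsed) + 1)), default=0)


# Token 2 is the root; token 4 attaches to token 3, which attaches to the root.
example_depth = tree_depth(["2", "0", "2", "3"])
```

Averaging `tree_depth(row["head"])` over all sentences would give a corpus-level figure comparable to the table only if the counting convention matches.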

Features

| Field | Type | Description |
| --- | --- | --- |
| sent_id | string | Sentence identifier |
| text | string | Original sentence text |
| tokens | list[string] | Tokenized words |
| lemmas | list[string] | Lemmatized forms |
| upos | list[string] | Universal POS tags |
| xpos | list[string] | Language-specific POS tags |
| feats | list[string] | Morphological features |
| head | list[string] | Head token indices |
| deprel | list[string] | Dependency relations |
| deps | list[string] | Enhanced dependencies |
| misc | list[string] | Miscellaneous annotations |
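Because the fields mirror the ten CoNLL-U columns, a record can be written back out as CoNLL-U text. The sketch below is illustrative only: `to_conllu` and the `sample` record are hypothetical, the sample values are made up, and it assumes empty values are already stored as `_`.

```python
def to_conllu(record: dict) -> str:
    """Serialize one record (fields as in the table above) to a CoNLL-U block."""
    lines = [
        f"# sent_id = {record['sent_id']}",
        f"# text = {record['text']}",
    ]
    # Column order: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    columns = zip(
        record["tokens"], record["lemmas"], record["upos"], record["xpos"],
        record["feats"], record["head"], record["deprel"], record["deps"],
        record["misc"],
    )
    for idx, fields in enumerate(columns, start=1):
        lines.append("\t".join([str(idx), *map(str, fields)]))
    # CoNLL-U separates sentences with a blank line.
    return "\n".join(lines) + "\n\n"


# Made-up record for illustration; not taken from the dataset.
sample = {
    "sent_id": "demo-1",
    "text": "Tôi học .",
    "tokens": ["Tôi", "học", "."],
    "lemmas": ["Tôi", "học", "."],
    "upos": ["PRON", "VERB", "PUNCT"],
    "xpos": ["_", "_", "_"],
    "feats": ["_", "_", "_"],
    "head": ["2", "0", "2"],
    "deprel": ["nsubj", "root", "punct"],
    "deps": ["_", "_", "_"],
    "misc": ["_", "_", "_"],
}
conllu_text = to_conllu(sample)
```

The same function could be applied to each row returned by `load_dataset`, provided the loaded rows really expose these field names.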

Usage

Load from HuggingFace

```python
from datasets import load_dataset

# Download the dataset from the HuggingFace Hub and print the first training example
dataset = load_dataset("undertheseanlp/UDD-v0.1")
print(dataset["train"][0])
```

Clone and Run Scripts

```bash
# Clone the dataset repository
git clone https://huggingface.co/datasets/undertheseanlp/UDD-v0.1
cd UDD-v0.1

# Install dependencies with uv
uv sync

# Fetch sentences from UTS_VLC
uv run python scripts/fetch_data.py

# Convert to UD format
uv run python scripts/convert_to_ud.py

# Run statistics
uv run python scripts/statistics.py

# Upload to HuggingFace
uv run python scripts/upload_to_hf.py
```

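UPOS Distribution

| Tag | Count | Percent |
| --- | --- | --- |
| NOUN | 21,599 | 33.32% |
| VERB | 15,793 | 24.37% |
| PUNCT | 6,391 | 9.86% |
| ADP | 6,309 | 9.73% |
| CCONJ | 2,942 | 4.54% |
| AUX | 2,665 | 4.11% |
| ADV | 2,518 | 3.88% |
| ADJ | 2,254 | 3.48% |
| NUM | 1,444 | 2.23% |
| DET | 1,350 | 2.08% |
| PRON | 1,128 | 1.74% |
| PROPN | 318 | 0.49% |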
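A distribution like the one above can be recomputed from the `upos` field. This is a minimal sketch rather than the code in `scripts/statistics.py`; the helper name `tag_distribution` is hypothetical, and percentages are taken over all tokens.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def tag_distribution(sentences: Iterable[Sequence[str]]) -> List[Tuple[str, int, float]]:
    """Return (tag, count, percent) rows sorted by descending count."""
    counts = Counter(tag for tags in sentences for tag in tags)
    total = sum(counts.values())
    return [(tag, n, round(100 * n / total, 2)) for tag, n in counts.most_common()]


# Two toy sentences, given as lists of UPOS tags.
rows = tag_distribution([["NOUN", "VERB"], ["NOUN", "PUNCT"]])
```

With the loaded dataset, `tag_distribution(dataset["train"]["upos"])` would be the natural call, assuming the split is named `train` as in the usage example. The same helper works for the `deprel` field.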

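Top Dependency Relations

| Relation | Count | Percent |
| --- | --- | --- |
| obj | 6,448 | 9.95% |
| punct | 6,391 | 9.86% |
| nmod | 5,870 | 9.06% |
| case | 5,853 | 9.03% |
| conj | 4,920 | 7.59% |
| compound | 3,314 | 5.11% |
| root | 3,000 | 4.63% |
| acl:subj | 2,889 | 4.46% |
| nsubj | 2,869 | 4.43% |
| nmod:poss | 1,656 | 2.56% |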

Root UPOS Distribution

| UPOS | Count | Percent |
| --- | --- | --- |
| VERB | 2,220 | 74.00% |
| NOUN | 639 | 21.30% |
| ADJ | 63 | 2.10% |
| ADP | 41 | 1.37% |
| AUX | 17 | 0.57% |
| PROPN | 14 | 0.47% |
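The root distribution counts only the token whose head is 0 in each sentence, so its percentages are relative to the 3,000 sentences rather than to all tokens. A small sketch, with the hypothetical helper `root_upos_counts` and toy records standing in for dataset rows:

```python
from collections import Counter
from typing import Iterable, Mapping


def root_upos_counts(records: Iterable[Mapping]) -> Counter:
    """Count the UPOS tag of every root token (head index 0)."""
    counts = Counter()
    for record in records:
        for tag, head in zip(record["upos"], record["head"]):
            if int(head) == 0:  # heads are stored as strings in the dataset
                counts[tag] += 1
    return counts


# Toy records for illustration; not taken from the dataset.
toy_records = [
    {"upos": ["PRON", "VERB", "PUNCT"], "head": ["2", "0", "2"]},
    {"upos": ["NOUN", "ADP", "NOUN"], "head": ["0", "3", "1"]},
]
root_counts = root_upos_counts(toy_records)
```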

Scripts

The dataset repository includes scripts for data processing:

| Script | Description |
| --- | --- |
| scripts/fetch_data.py | Fetch sentences from UTS_VLC corpus |
| scripts/convert_to_ud.py | Convert to UD format with syntax fixes |
| scripts/statistics.py | Compute dataset statistics |
| scripts/upload_to_hf.py | Upload to HuggingFace Hub |

Other Vietnamese dependency treebanks include:

- UD_Vietnamese-VTB: the official Vietnamese treebank in Universal Dependencies, converted from the VietTreebank constituent treebank created by the VLSP project (UD v1.4+).
- VnDT: the first Vietnamese dependency treebank, with 10,200 sentences automatically converted from VietTreebank and manually edited (2013, revised 2016).
- BKTreebank: a dependency treebank with 6,900 sentences, featuring a custom POS tagset and dependency relations designed specifically for Vietnamese linguistic characteristics (LREC 2018).
- VLSP shared task data: training and test data from the VLSP dependency parsing shared tasks, with 8,152 sentences following the Universal Dependencies v2 annotation scheme (2019-2020). Top models achieved 76.27% LAS and 84.65% UAS using a PhoBERT+ELMO/Biaffine architecture.

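| Dataset | Sentences | Tokens | Domain | Annotation | Format | Available |
| --- | --- | --- | --- | --- | --- | --- |
| UDD-v0.1 | 3,000 | 64,814 | Legal | Machine-generated | CoNLL-U | HuggingFace |
| UD_Vietnamese-VTB | 3,323 | 58,069 | News (Tuoi Tre) | Manual | CoNLL-U | UD, GitHub, HuggingFace |
| VnDT | 10,200 | ~170K | News (Tuoi Tre) | Semi-automatic | CoNLL | GitHub |
| BKTreebank | 6,900 | ~115K | Mixed | Manual | CoNLL | ACL |
| VLSP 2020 | 8,152 | ~140K | Mixed | Manual | CoNLL-U | VLSP |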
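The LAS and UAS figures quoted for the VLSP shared task are attachment scores. As a rough illustration of the textbook definitions (UAS: share of tokens with the correct head; LAS: share with the correct head and relation), here is a sketch with the hypothetical helper `attachment_scores`. Shared-task evaluation scripts may differ in details such as punctuation handling, so this is not a reproduction of the official scorer.

```python
from typing import Sequence, Tuple


def attachment_scores(
    gold_heads: Sequence[int],
    gold_rels: Sequence[str],
    pred_heads: Sequence[int],
    pred_rels: Sequence[str],
) -> Tuple[float, float]:
    """Return (UAS, LAS) in percent over aligned token sequences."""
    total = len(gold_heads)
    if total == 0 or not (total == len(gold_rels) == len(pred_heads) == len(pred_rels)):
        raise ValueError("inputs must be non-empty and of equal length")
    # UAS: the predicted head matches the gold head.
    uas_hits = sum(g == p for g, p in zip(gold_heads, pred_heads))
    # LAS: both the head and the relation label match.
    las_hits = sum(
        gh == ph and gr == pr
        for gh, gr, ph, pr in zip(gold_heads, gold_rels, pred_heads, pred_rels)
    )
    return 100 * uas_hits / total, 100 * las_hits / total


# Toy example: token 3 has the wrong label, token 4 has the wrong head.
uas, las = attachment_scores(
    [2, 0, 2, 3], ["nsubj", "root", "obj", "nmod"],
    [2, 0, 2, 2], ["nsubj", "root", "iobj", "nmod"],
)
```

Since the UDD-v0.1 annotations are machine-generated, scores computed against them measure agreement with the generating pipeline rather than accuracy against manual annotation.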

References