UVB

Underthesea Vietnamese Books Dataset (2026 Edition)

A collection of 447 Vietnamese books with full text content and Goodreads metadata for NLP research.

HuggingFace Dataset

Dataset: undertheseanlp/UVB-v0.1

Features

| Feature | Type | Description |
| --- | --- | --- |
| id | string | Unique identifier (e.g., vn_000001) |
| title | string | Book title |
| author | string | Author name |
| content | string | Full text content of the book |
| genres | list[string] | Book genres from Goodreads |
| first_publish | string | First publication year |
| goodreads_id | string | Goodreads book ID |
| goodreads_url | string | Goodreads URL |
| goodreads_rating | float | Goodreads rating (1-5) |
| goodreads_num_ratings | int | Number of ratings |

Usage

Load from HuggingFace

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("undertheseanlp/UVB-v0.1")

# Access the data
for item in dataset["train"]:
    print(f"Title: {item['title']}")
    print(f"Author: {item['author']}")
    print(f"Content: {item['content'][:200]}...")
    print(f"Genres: {item['genres']}")
    print(f"First publish: {item['first_publish']}")
    break

# Filter by genre
fiction = dataset["train"].filter(lambda x: "Fiction" in (x.get("genres") or []))
non_fiction = dataset["train"].filter(lambda x: "Non Fiction" in (x.get("genres") or []))
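
The same `filter` call accepts any predicate over a record. The following is a minimal sketch of a rating-based predicate, assuming that `goodreads_rating` and `goodreads_num_ratings` may be `None` for books without a Goodreads match. The name `well_rated`, the thresholds, and the sample records are illustrative and not part of the dataset.

```python
def well_rated(book, min_rating=4.0, min_votes=100):
    """Return True if the book has enough ratings and a high enough average."""
    rating = book.get("goodreads_rating")
    votes = book.get("goodreads_num_ratings")
    # Assumption: books without a Goodreads match carry None in these fields.
    if rating is None or votes is None:
        return False
    return rating >= min_rating and votes >= min_votes


# Made-up records that only mimic the dataset's field names.
sample = [
    {"title": "Book A", "goodreads_rating": 4.2, "goodreads_num_ratings": 500},
    {"title": "Book B", "goodreads_rating": 4.8, "goodreads_num_ratings": 12},
    {"title": "Book C", "goodreads_rating": 3.6, "goodreads_num_ratings": 900},
    {"title": "Book D", "goodreads_rating": None, "goodreads_num_ratings": None},
]

picked = [book["title"] for book in sample if well_rated(book)]
print(picked)
```

With the loaded dataset, the predicate could be passed directly, for example `dataset["train"].filter(well_rated)`.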

Statistics

| Metric | Value |
| --- | --- |
| Total books | 447 |
| Books with genres | 230 (51.5%) |
| Books with publication year | 421 (94.2%) |
| Total size | ~209 MB |
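
Coverage figures like the ones above could be recomputed with a small helper. This is a sketch only: it assumes a missing value is stored as `None`, an empty list, or an empty string, which the dataset card does not state explicitly. The `coverage` function and the sample records are hypothetical.

```python
def coverage(records, field):
    """Return (count, percent) of records where `field` is non-empty."""
    records = list(records)
    # Assumption: None, "" and [] all mean "value missing".
    present = sum(1 for record in records if record.get(field))
    percent = 100.0 * present / len(records) if records else 0.0
    return present, round(percent, 1)


# Made-up records that only mimic the dataset's field names.
sample = [
    {"id": "vn_000001", "genres": ["Fiction"], "first_publish": "2008"},
    {"id": "vn_000002", "genres": [], "first_publish": "1972"},
    {"id": "vn_000003", "genres": None, "first_publish": None},
    {"id": "vn_000004", "genres": ["History"], "first_publish": ""},
]

print(coverage(sample, "genres"))
print(coverage(sample, "first_publish"))
```

On the real data, `coverage(dataset["train"], "genres")` would be the corresponding call, although selecting only the needed column first avoids loading every book's full text.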

Top Genres

| Genre | Count |
| --- | --- |
| Non Fiction | 76 |
| Fiction | 62 |
| Romance | 37 |
| Classics | 30 |
| Novels | 27 |
| Philosophy | 25 |
| Self Help | 25 |
| Literature | 24 |
| History | 22 |
| Childrens | 20 |
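
A genre ranking of this kind can be derived by counting each genre once per book. The sketch below assumes `genres` is a list of strings or `None`; the `top_genres` helper and the sample records are illustrative, and ties are broken alphabetically here, which may differ from how the table above was produced.

```python
from collections import Counter


def top_genres(records, n=10):
    """Return the n most frequent genres as (genre, count) pairs."""
    counts = Counter()
    for record in records:
        # Count each genre at most once per book; treat None as "no genres".
        counts.update(set(record.get("genres") or []))
    # Sort by descending count, then by name, so the output is deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


# Made-up records that only mimic the dataset's field names.
sample = [
    {"genres": ["Fiction", "Classics"]},
    {"genres": ["Fiction", "Romance"]},
    {"genres": ["Non Fiction", "History"]},
    {"genres": ["Fiction", "Fiction"]},
    {"genres": None},
]

for genre, count in top_genres(sample, n=3):
    print(f"{genre}: {count}")
```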

Publication Year Distribution

| Period | Count |
| --- | --- |
| Before 1900 | 6 |
| 1900-1950 | 12 |
| 1951-1980 | 38 |
| 1981-2000 | 82 |
| 2001-2010 | 134 |
| 2011+ | 149 |
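
Because `first_publish` is stored as a string, bucketing by period needs a conversion step. The following sketch assumes the field holds a plain year such as "2008" and that anything else (a missing value, an empty string, or a placeholder) should be skipped. The `year_bucket` helper and the sample values are hypothetical.

```python
from collections import Counter


def year_bucket(first_publish):
    """Map a year string to one of the periods used above, or None if unusable."""
    try:
        year = int(first_publish)
    except (TypeError, ValueError):
        # Assumption: non-numeric or missing values are left out of the count.
        return None
    if year < 1900:
        return "Before 1900"
    if year <= 1950:
        return "1900-1950"
    if year <= 1980:
        return "1951-1980"
    if year <= 2000:
        return "1981-2000"
    if year <= 2010:
        return "2001-2010"
    return "2011+"


# Illustrative values only.
years = ["1492", "1972", "2008", "2011", None, "-"]
buckets = [year_bucket(value) for value in years]
distribution = Counter(bucket for bucket in buckets if bucket is not None)
print(dict(distribution))
```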

Source Data

Processing Scripts

Scripts included in the dataset repository:

  • scripts/map_goodreads.py - Map Vietnamese books to Goodreads entries using fuzzy matching
  • scripts/add_genres.py - Fetch genres from Goodreads pages
  • scripts/add_publish_date.py - Fetch first publication year from Goodreads
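
The scripts themselves are not reproduced here. As a rough illustration of what fuzzy title matching can look like, the sketch below uses Python's standard `difflib`; it is not the logic of `scripts/map_goodreads.py`, and the `best_match` helper, the normalization steps, and the 0.8 threshold are all assumptions.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(title):
    """Normalize Unicode form, case, and whitespace before comparing titles."""
    text = unicodedata.normalize("NFC", title).casefold()
    return " ".join(text.split())


def best_match(title, candidates, threshold=0.8):
    """Return (candidate, score) for the closest title, or (None, score) if below threshold."""
    target = normalize(title)
    best, best_score = None, 0.0
    for candidate in candidates:
        score = SequenceMatcher(None, target, normalize(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score


# Hypothetical candidate titles, as a search result might return them.
candidates = ["Sống mòn", "1Q84", "David Copperfield"]
print(best_match("SỐNG MÒN", candidates))
print(best_match("  david   copperfield ", candidates))
```

A real pipeline would likely also compare author names and handle titles that differ between the Vietnamese edition and the Goodreads entry, which plain string similarity does not cover.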

Sample Books

Fiction

| Title | Author | Year | Rating |
| --- | --- | --- | --- |
| THE COMPLETE SHERLOCK HOLMES | Arthur Conan Doyle | 1983 | 4.01 |
| LĨNH NAM CHÍCH QUÁI | Trần Thế Pháp | 1492 | 3.80 |
| 1Q84 | Haruki Murakami | 2009 | 4.10 |
| David Copperfield | Charles Dickens | 2009 | 4.17 |
| SỐNG MÒN | Nam Cao | 2008 | 4.23 |

Non Fiction

| Title | Author | Year | Rating |
| --- | --- | --- | --- |
| THE TIBETAN BOOK OF LIVING AND DYING | Sogyal Rinpoche | 1992 | 4.21 |
| VIỆT NAM PHONG TỤC | Phan Kế Bính | 1972 | 4.09 |
| TỰ HỌC MỘT NHU CẦU CỦA THỜI ĐẠI | Nguyễn Hiến Lê | 2007 | 4.27 |
| THE INFORMATION | James Gleick | 2011 | 4.03 |
| SỬ KÝ TƯ MÃ THIÊN | Tư Mã Thiên | - | 4.21 |

Citation

@misc{uvb_dataset,
  title={UVB: Underthesea Vietnamese Books Dataset},
  author={Underthesea NLP},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/undertheseanlp/UVB-v0.1}
}
