Version: 9.2.11

TextPreprocessor

A configurable Vietnamese text preprocessing pipeline. It is serialized together with the TextClassifier model, so the preprocessing configuration always travels with the model.

Pipeline Steps

Applied in order:

  1. Unicode NFC normalization
  2. Lowercase
  3. URL removal
  4. Repeated character normalization ("đẹppp" → "đẹpp")
  5. Punctuation normalization ("!!!" → "!", "????" → "?")
  6. Teencode expansion ("ko" → "không", "dc" → "được")
  7. Negation marking ("không tốt" → "không NEG_tốt")
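The first five steps can be approximated in plain Python. The following is only an illustrative sketch, not the library's implementation: the function name `basic_normalize` and the regular expressions are assumptions chosen to reproduce the examples above, and the exact URL and punctuation patterns used by TextPreprocessor may differ.

```python
import re
import unicodedata

# Hypothetical patterns; the library's actual rules are not documented here.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")  # http/https/www URLs
REPEAT_RE = re.compile(r"(\w)\1{2,}")           # 3+ copies of one word character
PUNCT_RE = re.compile(r"([!?])\1+")             # runs of "!" or "?"


def basic_normalize(text: str) -> str:
    """Apply steps 1-5 in the documented order."""
    text = unicodedata.normalize("NFC", text)   # 1. Unicode NFC
    text = text.lower()                         # 2. lowercase
    text = URL_RE.sub(" ", text)                # 3. URL removal
    text = REPEAT_RE.sub(r"\1\1", text)         # 4. "đẹppp" -> "đẹpp"
    text = PUNCT_RE.sub(r"\1", text)            # 5. "!!!" -> "!"
    return " ".join(text.split())               # tidy leftover whitespace


print(basic_normalize("Đẹppp quá!!! http://example.com"))
```

The order matters in this sketch: removing URLs before collapsing repeated characters keeps a "www." prefix intact long enough to be recognized as a URL.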

Usage

from underthesea_core import TextPreprocessor

# Default Vietnamese preprocessing
pp = TextPreprocessor()
pp.transform("Sản phẩm ko đẹp lắm!!!")
# "sản phẩm không NEG_đẹp NEG_lắm!"

# Batch processing
results = pp.transform_batch(["Ko đẹp", "SP tốt lắm!!!"])
# ["không NEG_đẹp", "sản phẩm tốt lắm!"]

# Custom teencode dictionary
pp = TextPreprocessor(teencode={"ko": "không", "dc": "được"})

# Custom negation words and window
pp = TextPreprocessor(
    negation_words=["không", "chưa", "chẳng"],
    negation_window=3,
)

# Disable specific steps
pp = TextPreprocessor(lowercase=False, remove_urls=False)

# Disable teencode and negation entirely
pp = TextPreprocessor(teencode=None, negation_words=None, use_defaults=False)

Constructor

TextPreprocessor(
    lowercase=True,
    unicode_normalize=True,
    remove_urls=True,
    normalize_repeated_chars=True,
    normalize_punctuation=True,
    teencode=None,
    negation_words=None,
    negation_window=2,
    use_defaults=True,
)

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| lowercase | bool | True | Convert text to lowercase |
| unicode_normalize | bool | True | Apply Unicode NFC normalization |
| remove_urls | bool | True | Remove URLs (http/https/www) |
| normalize_repeated_chars | bool | True | Reduce runs of 3 or more repeated characters to 2 |
| normalize_punctuation | bool | True | Reduce repeated punctuation to a single mark |
| teencode | dict \| None | None | Custom teencode dictionary. When None and use_defaults=True, the built-in Vietnamese teencode dictionary is used |
| negation_words | list[str] \| None | None | Custom negation words. When None and use_defaults=True, the built-in Vietnamese negation words are used |
| negation_window | int | 2 | Number of words after a negation word to mark with the NEG_ prefix |
| use_defaults | bool | True | When True, use the Vietnamese defaults for teencode/negation if they are not provided. When False, None means the step is disabled |
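The interaction between `use_defaults` and a `None` argument can be summarized as a small decision rule. The sketch below is an assumption-labeled illustration of that rule as the table describes it; `resolve_option` and `BUILTIN_NEGATION` are hypothetical names, not part of the library.

```python
from typing import Optional, TypeVar

T = TypeVar("T")

# Hypothetical stand-in for a built-in default (abbreviated for illustration).
BUILTIN_NEGATION = ["không", "chưa"]


def resolve_option(value: Optional[T], default: T, use_defaults: bool) -> Optional[T]:
    """Return the effective setting for teencode or negation_words."""
    if value is not None:
        return value          # an explicit value always wins
    if use_defaults:
        return default        # None + use_defaults=True -> built-in default
    return None               # None + use_defaults=False -> step disabled


print(resolve_option(None, BUILTIN_NEGATION, use_defaults=True))
print(resolve_option(None, BUILTIN_NEGATION, use_defaults=False))
print(resolve_option(["chẳng"], BUILTIN_NEGATION, use_defaults=True))
```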

Properties

| Property | Type | Description |
| --- | --- | --- |
| teencode | dict \| None | Current teencode dictionary |
| negation_words | list[str] \| None | Current negation words |
| negation_window | int | Current negation window size |

Methods

| Method | Returns | Description |
| --- | --- | --- |
| transform(text) | str | Preprocess a single text string |
| transform_batch(texts) | list[str] | Preprocess a list of texts |

Default Teencode Dictionary

| Teencode | Expansion |
| --- | --- |
| ko, k, hok, hem | không |
| dc, đc, dk | được |
| sp | sản phẩm |
| bt, bth | bình thường |
| ok, oke | tốt |
| tks, thanks, thank | cảm ơn |
| ntn | như thế nào |
| mn | mọi người |
| cx, cg | cũng |
| vs | với |
| ... | (30+ rules total) |
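Conceptually, teencode expansion is a dictionary lookup over tokens. The sketch below assumes simple whitespace tokenization and uses a three-entry excerpt of the table; `expand_teencode` is a hypothetical helper, and the library's real tokenization (for example, how it treats punctuation attached to a word) may differ.

```python
# Excerpt of the default table above, for illustration only.
TEENCODE = {"ko": "không", "dc": "được", "sp": "sản phẩm"}


def expand_teencode(text: str, teencode: dict) -> str:
    """Replace each whitespace-separated token that has a teencode entry."""
    return " ".join(teencode.get(token, token) for token in text.split())


print(expand_teencode("sp ko đẹp", TEENCODE))  # sản phẩm không đẹp
```

Because lowercasing runs earlier in the pipeline, a lookup of this kind only needs lowercase keys.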

Default Negation Words

không, chẳng, chả, chưa, đừng, ko, hok, hem, chăng
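Negation marking prefixes the words that follow a negation word with NEG_, up to `negation_window` words. A minimal sketch of that idea follows, assuming whitespace tokenization; `mark_negation` is a hypothetical helper, and details such as sentence boundaries or back-to-back negation words are guesses rather than documented behavior.

```python
def mark_negation(text: str, negation_words: list, window: int = 2) -> str:
    """Prefix up to `window` tokens after each negation word with NEG_."""
    negators = set(negation_words)
    marked = []
    remaining = 0  # how many upcoming tokens still need the prefix
    for token in text.split():
        if token in negators:
            marked.append(token)
            remaining = window  # (re)start the window after a negator
        elif remaining > 0:
            marked.append("NEG_" + token)
            remaining -= 1
        else:
            marked.append(token)
    return " ".join(marked)


print(mark_negation("sản phẩm không đẹp lắm!", ["không", "chưa", "chẳng"]))
# sản phẩm không NEG_đẹp NEG_lắm!
```

With the default window of 2, this reproduces the output shown in the Usage section for "Sản phẩm ko đẹp lắm!!!" once the earlier steps have run.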

With TextClassifier

from underthesea_core import TextClassifier, TextPreprocessor

pp = TextPreprocessor()
clf = TextClassifier(preprocessor=pp)
# texts and labels are your own training data (a list of strings and their labels)
clf.fit(texts, labels)

# Preprocessor is saved together with the model
clf.save("model.bin")
clf = TextClassifier.load("model.bin") # preprocessor is restored