Version: 9.2.11

Text Normalization

Overview

The text normalization module in Underthesea fixes common Vietnamese encoding and diacritic errors and converts text to Unicode Normalization Form C (NFC). It targets problems inherited from legacy Vietnamese text systems, such as look-alike characters (Ð in place of Đ) and misplaced tone marks (lựơng in place of lượng).
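To make the NFC step concrete, the following minimal sketch uses only Python's standard `unicodedata` module, not Underthesea itself. It shows that the same visible letter can be stored as one precomposed code point or as a base letter plus combining marks, and that NFC maps both to the same form. The variable names are illustrative.

```python
import unicodedata

# The letter "ệ" written two ways.
composed = "\u1ec7"           # single code point: e with circumflex and dot below
decomposed = "e\u0323\u0302"  # "e" + COMBINING DOT BELOW + COMBINING CIRCUMFLEX ACCENT

# The two strings render identically but do not compare equal.
print(composed == decomposed)  # False

# NFC composes the sequence into the single precomposed code point.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)  # True
print(len(decomposed), len(normalized))  # 3 1
```

Comparing or indexing Vietnamese text without this step can treat visually identical words as different strings.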

Architecture

Text Normalization Pipeline
├── Text Input
│   └── Raw Vietnamese text (potentially with encoding issues)
├── Word Tokenization
│   └── Split text into tokens
├── Token Normalization
│   ├── Character Normalization
│   │   ├── Đ/Ð confusion (Ðại → Đại)
│   │   ├── Old-style diacritics (hoá → hóa)
│   │   └── Incorrect vowel composition (lựơng → lượng)
│   └── Unicode NFC Normalization
└── Output
    └── Normalized text string
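The stages in the tree can be sketched roughly as follows. This is not Underthesea's implementation: the `sketch_*` functions and both lookup tables are hypothetical, the tables cover only the examples given in this document, and whitespace splitting stands in for the library's word tokenizer.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


# Assumption: a tiny character map covering only the Ð/Đ example.
CHAR_FIXES = {"\u00d0": "\u0110"}  # Ð (Eth) -> Đ (D with stroke)

# Assumption: a tiny token map covering only the examples in this document.
# Keys and values are stored in NFC so lookups match NFC-normalized tokens.
TOKEN_FIXES = {nfc(k): nfc(v) for k, v in {
    "hoá": "hóa",
    "lựơng": "lượng",
    "baỏ": "bảo",
}.items()}


def sketch_character_normalize(token: str) -> str:
    """Replace look-alike characters one by one."""
    return "".join(CHAR_FIXES.get(ch, ch) for ch in token)


def sketch_token_normalize(token: str) -> str:
    """Apply character fixes, then the token table, and return NFC text."""
    token = sketch_character_normalize(nfc(token))
    return nfc(TOKEN_FIXES.get(token, token))


def sketch_text_normalize(text: str) -> str:
    """Tokenize, normalize each token, and join the result."""
    tokens = text.split()  # stand-in for word tokenization
    return " ".join(sketch_token_normalize(t) for t in tokens)


print(sketch_text_normalize("\u00d0ảm baỏ chất lựơng"))  # Đảm bảo chất lượng
```

A table-driven design keeps each stage easy to test in isolation; a real normalizer would need far larger tables or rule-based handling of tone-mark placement.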

Components

| Component | Description |
| --- | --- |
| text_normalizer | Main normalization entry point |
| token_normalize | Token-level normalization |
| character_normalize | Character-level encoding fixes |

Common Issues Fixed

| Issue | Example | Corrected |
| --- | --- | --- |
| Đ/Ð confusion | Ðại học | Đại học |
| Old-style diacritics | hoá học | hóa học |
| Incorrect vowel composition | lựơng | lượng |
| Mixed encoding | baỏ đảm | bảo đảm |
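The Đ/Ð row is hard to spot by eye because the two letters render almost identically. A short sketch with the standard `unicodedata` module (independent of Underthesea; `describe` is a hypothetical helper) lists the code points involved and shows why NFC alone does not repair this case.

```python
import unicodedata


def describe(text: str):
    """Return (character, code point, Unicode name) for each character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]


wrong = "\u00d0"  # Ð, the Icelandic/Old English letter Eth
right = "\u0110"  # Đ, the Vietnamese letter

for ch, code_point, name in describe(wrong + right):
    print(ch, code_point, name)
# Ð U+00D0 LATIN CAPITAL LETTER ETH
# Đ U+0110 LATIN CAPITAL LETTER D WITH STROKE

# Eth has no canonical decomposition, so NFC leaves it unchanged.
# An explicit character mapping is needed to turn it into Đ.
print(unicodedata.normalize("NFC", wrong) == wrong)  # True
```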

Usage

Basic Usage

from underthesea import text_normalize

text = "Ðảm baỏ chất lựơng"
result = text_normalize(text)
print(result) # "Đảm bảo chất lượng"

Function Signature

def text_normalize(text: str, tokenizer: str = 'underthesea') -> str

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| text | str | required | Input text to normalize |
| tokenizer | str | 'underthesea' | Tokenizer to use |
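Because the function takes one string and returns one string, it can be applied line by line to a larger corpus. The sketch below assumes a hypothetical helper, `normalize_lines`, that accepts any text-to-text callable. To stay runnable without Underthesea installed, it uses an NFC-only stand-in; the comment shows how the documented signature could be bound instead.

```python
import unicodedata
from functools import partial
from typing import Callable, Iterable, List


def normalize_lines(lines: Iterable[str], normalize: Callable[[str], str]) -> List[str]:
    """Apply a normalizer to every non-blank line (hypothetical helper)."""
    return [normalize(line) for line in lines if line.strip()]


# Stand-in normalizer so the sketch runs without Underthesea installed.
# With the library available, the documented call could be bound instead:
#   from underthesea import text_normalize
#   normalize = partial(text_normalize, tokenizer="underthesea")
nfc_only = partial(unicodedata.normalize, "NFC")

cleaned = normalize_lines(["e\u0323\u0302", "   ", "abc"], nfc_only)
print(cleaned)  # ['ệ', 'abc']
```

Passing the normalizer as an argument keeps the helper testable and makes it easy to swap in a different tokenizer setting later.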
