Machine Translation

Overview

The machine translation module in Underthesea provides bidirectional Vietnamese-English translation using the EnviT5 transformer model from VietAI.

Model: VietAI/envit5-translation

Requirements:

pip install "underthesea[deep]"

Architecture

EnviT5 Translator

Translation Pipeline
├── Text Input
│   └── Source language text
├── Preprocessing
│   └── Language prefix: "{lang}: {text}"
├── EnviT5 Model
│   ├── AutoTokenizer
│   └── AutoModelForSeq2SeqLM
│       └── Beam search (num_beams=5)
└── Output
    └── Translated text

Model Configuration

Parameter	Value
Model	VietAI/envit5-translation
Architecture	T5 (Seq2Seq)
Beam search	num_beams=5
Max length	512 tokens
Languages	Vietnamese (vi), English (en)
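The pipeline and configuration above can be sketched directly with the Hugging Face `transformers` API. This is a minimal illustration, not Underthesea's actual implementation: the helper names (`add_language_prefix`, `strip_language_prefix`, `translate_with_envit5`) are hypothetical, and stripping a target-language prefix from the decoded output is an assumption based on how the model's published examples appear to format their output.

```python
MODEL_NAME = "VietAI/envit5-translation"
MAX_LENGTH = 512   # "Max length" from the configuration table
NUM_BEAMS = 5      # "Beam search" from the configuration table


def add_language_prefix(text: str, source_lang: str) -> str:
    """Build the model input in the "{lang}: {text}" form."""
    return f"{source_lang}: {text.strip()}"


def strip_language_prefix(text: str, target_lang: str) -> str:
    """Drop a leading "{lang}: " marker if present (assumed output format)."""
    prefix = f"{target_lang}: "
    return text[len(prefix):] if text.startswith(prefix) else text


def translate_with_envit5(text: str, source_lang: str = "vi",
                          target_lang: str = "en") -> str:
    """Hypothetical end-to-end sketch; downloads the model on first use."""
    # Imported lazily so the pure helpers above work without transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

    # Preprocessing: prepend the source-language prefix, then tokenize.
    inputs = tokenizer(
        add_language_prefix(text, source_lang),
        return_tensors="pt",
        truncation=True,
        max_length=MAX_LENGTH,
    )
    # Decoding: beam search with the configured beam count and length cap.
    output_ids = model.generate(**inputs, num_beams=NUM_BEAMS,
                                max_length=MAX_LENGTH)
    decoded = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return strip_language_prefix(decoded, target_lang)
```

Loading the tokenizer and model inside the function keeps the sketch short; a real caller would likely load them once and reuse them across calls.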

Usage

Vietnamese to English (Default)

from underthesea import translate

result = translate("Hà Nội là thủ đô của Việt Nam")
# 'Hanoi is the capital of Vietnam'

English to Vietnamese

from underthesea import translate

translate("I love Vietnamese food", source_lang='en', target_lang='vi')
# 'Tôi yêu ẩm thực Việt Nam'

Document Translation

For long texts, split the input into sentences first and translate each sentence separately:

from underthesea import sent_tokenize, translate

text = "Hà Nội là thủ đô. Thành phố rất đẹp."
# Split the document into sentences, then translate each one on its own.
sentences = sent_tokenize(text)
translated = [translate(s) for s in sentences]
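The sentence-by-sentence approach can be wrapped in a small reusable helper. The sketch below is illustrative only: `translate_document` is a hypothetical name, not part of Underthesea, and it takes the splitter and translator as arguments so that it can be tried without loading the model. Joining results with a single space is also an assumption; paragraph breaks in the source are not preserved.

```python
from typing import Callable, Iterable


def translate_document(
    text: str,
    split_sentences: Callable[[str], Iterable[str]],
    translate_sentence: Callable[[str], str],
    separator: str = " ",
) -> str:
    """Translate a long text one sentence at a time and rejoin the results."""
    # Skip empty or whitespace-only segments so the model is never
    # called on blank input.
    sentences = [s.strip() for s in split_sentences(text) if s.strip()]
    return separator.join(translate_sentence(s) for s in sentences)
```

With Underthesea installed, it could be called as `translate_document(text, sent_tokenize, translate)`, using the two functions imported in the example above.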

Function Signature

def translate(
    text: str,
    source_lang: str = 'vi',
    target_lang: str = 'en'
) -> str

Parameters

Parameter	Type	Default	Description
`text`	str	required	Text to translate
`source_lang`	str	`'vi'`	Source language code
`target_lang`	str	`'en'`	Target language code

Limitations

Works best with well-formed sentences
Long texts should be split into sentences for better results (the model's maximum length is 512 tokens)
Only the Vietnamese-English language pair is supported
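Because only the Vietnamese-English pair is supported, callers may want to check language codes before invoking the model. The guard below is a hypothetical helper, not part of Underthesea, and this page does not say how `translate` itself reacts to unsupported codes. Rejecting identical source and target codes is an additional assumption.

```python
SUPPORTED_LANGS = {"vi", "en"}


def validate_language_pair(source_lang: str, target_lang: str) -> None:
    """Raise ValueError unless the pair is vi->en or en->vi."""
    for name, code in (("source_lang", source_lang),
                       ("target_lang", target_lang)):
        if code not in SUPPORTED_LANGS:
            raise ValueError(
                f"{name}={code!r} is not supported; "
                f"expected one of {sorted(SUPPORTED_LANGS)}"
            )
    # Assumption: translating a language into itself is treated as an error.
    if source_lang == target_lang:
        raise ValueError("source_lang and target_lang must differ")
```

Calling `validate_language_pair(source_lang, target_lang)` right before `translate(...)` turns a typo such as `'vn'` into an immediate, readable error.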
