# Voice
This document provides a technical overview of the AI models used in the underthesea voice (text-to-speech) module.
## Overview
The voice module implements a neural text-to-speech (TTS) system for Vietnamese. It is based on VietTTS by NTT123 and uses a two-stage architecture:
- Text-to-Mel: Converts text/phonemes to mel-spectrogram
- Mel-to-Wave (Vocoder): Converts mel-spectrogram to audio waveform
```
Text → [Text Normalization] → [Duration Model] → [Acoustic Model] → Mel → [HiFi-GAN] → Audio
```
## Installation

```bash
pip install "underthesea[voice]"
underthesea download-model VIET_TTS_V0_4_1
```
## Model Architecture
### 1. Duration Model
The Duration Model predicts the duration (in seconds) for each phoneme in the input sequence.
Architecture:
| Component | Description |
|---|---|
| Token Encoder | Embedding + 3× Conv1D + Bidirectional LSTM |
| Projection | Linear → GELU → Linear → Softplus |
Parameters:
| Parameter | Value |
|---|---|
| Vocabulary Size | 256 |
| LSTM Dimension | 256 |
| Dropout Rate | 0.5 |
- Input: Phoneme sequence with lengths
- Output: Duration for each phoneme (in seconds)
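
As a rough illustration, the projection head above can be sketched in dm-haiku (the library the module is built on). The module name, the hidden size, and the 512-dim input (2× 256 from the bidirectional LSTM) are assumptions, not the exact VietTTS code:

```python
import haiku as hk
import jax
import jax.numpy as jnp

class DurationProjection(hk.Module):
    """Projection head: Linear -> GELU -> Linear -> Softplus (sketch)."""

    def __call__(self, x):
        x = hk.Linear(256)(x)      # hidden size assumed equal to the LSTM dim
        x = jax.nn.gelu(x)
        x = hk.Linear(1)(x)        # one scalar per phoneme
        return jax.nn.softplus(x)  # softplus keeps durations positive

model = hk.transform(lambda x: DurationProjection()(x))
# Encoder output: batch of 1, 10 phonemes, 512 = 2 x 256 (bidirectional LSTM).
x = jnp.zeros((1, 10, 512))
params = model.init(jax.random.PRNGKey(0), x)
durations = model.apply(params, None, x)  # shape (1, 10, 1), in seconds
```

The softplus at the end is what makes this head suitable for durations: it guarantees strictly positive outputs without the hard zero gradient of a ReLU.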
### 2. Acoustic Model
The Acoustic Model generates mel-spectrograms from phonemes and their predicted durations.
Architecture:
| Component | Description |
|---|---|
| Token Encoder | Embedding + 3× Conv1D + Bidirectional LSTM |
| Upsampling | Gaussian attention-based upsampling |
| PreNet | 2× Linear (256 dim) with dropout |
| Decoder | 2× LSTM with skip connections |
| PostNet | 5× Conv1D with batch normalization |
| Projection | Linear to mel dimension |
Parameters:
| Parameter | Value |
|---|---|
| Encoder Dimension | 256 |
| Decoder Dimension | 512 |
| PostNet Dimension | 512 |
| Mel Dimension | 80 |
Key Features:
- Gaussian Upsampling: Uses soft attention to upsample encoder outputs to match the target frame length (see the sketch after this list)
- Autoregressive Decoder: Generates mel frames sequentially with teacher forcing during training
- Zoneout Regularization: Applies zoneout to LSTM states during training for better generalization
- PostNet Refinement: Residual convolutional network refines the predicted mel-spectrogram
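
Of these, Gaussian upsampling is the least standard piece, so here is a minimal sketch of the idea (following Non-Attentive Tacotron, cited in the references). The constant `sigma` and durations measured in frames are simplifying assumptions; the real model may predict a range per token:

```python
import jax
import jax.numpy as jnp

def gaussian_upsample(encoder_out, durations, n_frames, sigma=1.0):
    """Soft upsampling of phoneme-level features to mel-frame length.

    encoder_out: (T_enc, D) encoder states, one row per phoneme
    durations:   (T_enc,) phoneme durations, here assumed in frames
    n_frames:    number of mel frames to produce
    """
    # Each phoneme's center position on the frame axis.
    ends = jnp.cumsum(durations)
    centers = ends - 0.5 * durations                    # (T_enc,)
    frames = jnp.arange(n_frames)[:, None]              # (n_frames, 1)
    # Gaussian weight of each frame over each phoneme, normalized per frame.
    logits = -((frames - centers[None, :]) ** 2) / (2.0 * sigma**2)
    weights = jax.nn.softmax(logits, axis=-1)           # (n_frames, T_enc)
    return weights @ encoder_out                        # (n_frames, D)
```

Because every frame attends softly to every phoneme, the operation stays differentiable, which is what lets the durations be trained jointly rather than treated as hard alignment boundaries.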
### 3. HiFi-GAN Vocoder
The vocoder converts mel-spectrograms to raw audio waveforms using the HiFi-GAN architecture.
Architecture:
| Component | Description |
|---|---|
| Conv Pre | Conv1D (kernel size 7) |
| Upsampling | Multiple Conv1DTranspose layers |
| Multi-Receptive Field Fusion (MRF) | ResBlocks with varying kernel sizes and dilations |
| Conv Post | Conv1D (kernel size 7) + Tanh |
Key Features:
- Multi-Scale Upsampling: Progressive upsampling from mel frame rate to audio sample rate
- Multi-Receptive Field Fusion: Combines outputs from residual blocks with different receptive fields
- Leaky ReLU Activation: Uses leaky ReLU with slope 0.1 throughout
ResBlock Types:
- ResBlock1: 3 dilated convolutions (dilation: 1, 3, 5) with residual connections
- ResBlock2: 2 dilated convolutions (dilation: 1, 3) with residual connections
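
A minimal dm-haiku sketch of a dilated residual block in the ResBlock1 style follows. The real HiFi-GAN blocks pair each dilated convolution with a second non-dilated one and apply weight normalization, both of which this sketch omits:

```python
import haiku as hk
import jax

class ResBlock(hk.Module):
    """Dilated residual block in the style of HiFi-GAN's ResBlock1 (sketch)."""

    def __init__(self, channels, kernel_size=3, dilations=(1, 3, 5)):
        super().__init__()
        self.channels = channels
        self.kernel_size = kernel_size
        self.dilations = dilations

    def __call__(self, x):
        for rate in self.dilations:
            y = jax.nn.leaky_relu(x, 0.1)  # leaky ReLU, slope 0.1
            y = hk.Conv1D(self.channels, self.kernel_size,
                          rate=rate, padding="SAME")(y)
            x = x + y                      # residual connection
        return x
```

Growing dilations (1, 3, 5) widen the receptive field without extra parameters; MRF then sums the outputs of blocks with different kernel sizes so the vocoder sees several time scales at once.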
## Audio Configuration
| Parameter | Value |
|---|---|
| Sample Rate | 16,000 Hz |
| FFT Size | 1,024 |
| Mel Channels | 80 |
| Frequency Range | 0 - 8,000 Hz |
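
For illustration, these values plug directly into a mel filterbank. The sketch below uses librosa conventions (librosa is not a dependency of the module, and the hop length is omitted because the table does not specify it):

```python
import librosa

# Mel filterbank built from the configuration table above.
mel_basis = librosa.filters.mel(
    sr=16_000,    # Sample Rate
    n_fft=1_024,  # FFT Size
    n_mels=80,    # Mel Channels
    fmin=0,       # Frequency Range, lower bound
    fmax=8_000,   # Frequency Range, upper bound
)
print(mel_basis.shape)  # (80, 513) = (n_mels, n_fft // 2 + 1)
```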
## Text Processing Pipeline
### 1. Text Normalization
The input text is normalized before synthesis:
```
# Normalization steps:
1. Unicode NFKC normalization
2. Lowercase conversion
3. Punctuation → silence markers
4. Multiple spaces → single space
```
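
A minimal Python sketch of these four steps; the exact punctuation set and the `sp` pause marker are assumptions based on the special tokens listed in the next section:

```python
import re
import unicodedata

def normalize_text(text):
    text = unicodedata.normalize("NFKC", text)  # 1. Unicode NFKC
    text = text.lower()                         # 2. lowercase
    text = re.sub(r"[.,!?;:]", " sp ", text)    # 3. punctuation -> pause marker
    text = re.sub(r"\s+", " ", text).strip()    # 4. collapse whitespace
    return text

print(normalize_text("Xin chào, Việt Nam!"))  # "xin chào sp việt nam sp"
```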
### 2. Phoneme Conversion
Text is converted to phonemes using a lexicon lookup:
- Vietnamese characters are mapped to phoneme sequences
- Special tokens: `sil` (silence), `sp` (short pause), and a word-boundary marker
- Unknown words are processed character-by-character
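
A small sketch of the lookup-with-fallback logic described above; the toy lexicon entries are made up and do not reflect the real `lexicon.txt` format:

```python
def words_to_phonemes(words, lexicon):
    """Lexicon lookup with character-by-character fallback."""
    phonemes = []
    for word in words:
        if word in lexicon:
            phonemes.extend(lexicon[word])
        else:
            # Unknown word: process it character by character.
            for ch in word:
                phonemes.extend(lexicon.get(ch, []))
    return phonemes

# Toy lexicon entries, purely for illustration.
lexicon = {"xin": ["x", "i", "n"], "a": ["a"], "b": ["b"]}
print(words_to_phonemes(["xin", "ba"], lexicon))  # ['x', 'i', 'n', 'b', 'a']
```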
## Model Files
The VIET_TTS_V0_4_1 model package includes:
| File | Description |
|---|---|
| `lexicon.txt` | Word-to-phoneme mapping |
| `duration_latest_ckpt.pickle` | Duration model weights |
| `acoustic_latest_ckpt.pickle` | Acoustic model weights |
| `hk_hifi.pickle` | HiFi-GAN vocoder weights |
| `config.json` | HiFi-GAN configuration |
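
The checkpoint files use Python's pickle format, so inspecting them is straightforward. The model directory below is a hypothetical path; the actual location is managed by `underthesea download-model`:

```python
import json
import pickle
from pathlib import Path

# Hypothetical location; the real path is managed by underthesea.
model_dir = Path.home() / ".underthesea" / "models" / "VIET_TTS_V0_4_1"

with open(model_dir / "config.json") as f:
    hifigan_config = json.load(f)  # HiFi-GAN configuration

with open(model_dir / "duration_latest_ckpt.pickle", "rb") as f:
    duration_ckpt = pickle.load(f)  # duration model weights
```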
## Framework Dependencies
The voice module uses the JAX ecosystem:
| Library | Purpose |
|---|---|
| JAX | Numerical computation and automatic differentiation |
| JAXlib | JAX backend (CPU/GPU/TPU support) |
| dm-haiku | Neural network library for JAX |
| Optax | Gradient processing and optimization |
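
These libraries follow a common pattern: haiku turns a network definition into pure `init`/`apply` functions, and optax supplies the optimizer. The generic illustration below shows that idiom; it is not VietTTS code:

```python
import haiku as hk
import jax
import jax.numpy as jnp
import optax

# Generic illustration of the haiku/optax idiom, not VietTTS code.
def net(x):
    return hk.Linear(80)(x)  # e.g. project features to 80 mel channels

model = hk.transform(net)
x = jnp.zeros((1, 256))
params = model.init(jax.random.PRNGKey(0), x)

optimizer = optax.adam(1e-4)
opt_state = optimizer.init(params)

loss = lambda p: jnp.mean(model.apply(p, None, x) ** 2)
grads = jax.grad(loss)(params)
updates, opt_state = optimizer.update(grads, opt_state)
params = optax.apply_updates(params, updates)
```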
## Usage Example

```python
from underthesea.pipeline.tts import tts

# Basic usage
tts("Xin chào Việt Nam")  # Creates sound.wav

# Custom output file
tts("Hà Nội là thủ đô", outfile="output.wav")

# With playback
tts("Đây là một ví dụ", play=True)
```
## Performance Considerations
- First Call Latency: Model loading on first call may take several seconds
- JAX Compilation: JIT compilation occurs on the first inference; subsequent calls are faster
- Text Length: Maximum recommended text length is 500 characters
- Memory Usage: GPU memory usage depends on input text length
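
A simple way to account for the first two points is to warm the model up with a short utterance before latency matters; this sketch assumes the `tts` signature from the usage example above:

```python
import time
from underthesea.pipeline.tts import tts

# First call pays for model loading and JIT compilation.
start = time.time()
tts("Xin chào")
print(f"first call:  {time.time() - start:.1f}s")

# Subsequent calls reuse the loaded, compiled model.
start = time.time()
tts("Xin chào")
print(f"second call: {time.time() - start:.1f}s")
```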
## Limitations
- Single Speaker: Current model supports only one voice
- Vietnamese Only: Designed specifically for Vietnamese language
- Prosody: Limited control over prosody and emotion
- Real-time: Not optimized for real-time streaming
## References
- VietTTS - Original implementation by NTT123
- HiFi-GAN - Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
- Non-Attentive Tacotron - Robust and Controllable Neural TTS Synthesis
- dm-haiku - JAX neural network library by DeepMind