
Voice

This document provides a technical overview of the AI models used in the underthesea voice (text-to-speech) module.

Overview

The voice module implements a neural text-to-speech (TTS) system for Vietnamese. It is based on VietTTS by NTT123 and uses a two-stage architecture:

  1. Text-to-Mel: Converts text/phonemes to mel-spectrogram
  2. Mel-to-Wave (Vocoder): Converts mel-spectrogram to audio waveform
Text → [Text Normalization] → [Duration Model] → [Acoustic Model] → Mel → [HiFi-GAN] → Audio
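The data flow above can be sketched end-to-end with stub stages. Everything here is illustrative, not underthesea's internal API: the function names are made up, and the 62.5 frames-per-second mel rate (a 256-sample hop at 16 kHz) is an assumption not stated in this document.

```python
import numpy as np

# Hypothetical stand-ins for each pipeline stage, showing only the data flow.
# Shapes and constants are illustrative, not the real models'.

def duration_model(phonemes):
    # one predicted duration (in seconds) per phoneme
    return np.full(len(phonemes), 0.1)

def acoustic_model(phonemes, durations, frame_rate=62.5):
    # total mel frames = total duration * frame rate; 80 mel channels
    n_frames = int(round(durations.sum() * frame_rate))
    return np.zeros((n_frames, 80))

def vocoder(mel, samples_per_frame=256):
    # HiFi-GAN upsamples each mel frame to a block of audio samples
    return np.zeros(mel.shape[0] * samples_per_frame)

phonemes = ["s", "i", "n", "sp", "ch", "a", "o"]
durs = duration_model(phonemes)
mel = acoustic_model(phonemes, durs)
audio = vocoder(mel)
print(mel.shape, audio.shape)
```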

Installation

pip install "underthesea[voice]"
underthesea download-model VIET_TTS_V0_4_1

Model Architecture

1. Duration Model

The Duration Model predicts the duration (in seconds) for each phoneme in the input sequence.

Architecture:

| Component | Description |
| --- | --- |
| Token Encoder | Embedding + 3× Conv1D + Bidirectional LSTM |
| Projection | Linear → GELU → Linear → Softplus |

Parameters:

| Parameter | Value |
| --- | --- |
| Vocabulary Size | 256 |
| LSTM Dimension | 256 |
| Dropout Rate | 0.5 |

Input: Phoneme sequence with lengths
Output: Duration for each phoneme (in seconds)
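The Softplus at the end of the projection guarantees that predicted durations are strictly positive, since log(1 + exp(x)) > 0 for any real x. A minimal illustration:

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)): smooth, always > 0, so unconstrained network
    # outputs map to valid (positive) durations in seconds
    return np.log1p(np.exp(x))

raw = np.array([-3.0, 0.0, 2.0])  # unconstrained projection outputs
durations = softplus(raw)
print(durations)                   # all values are positive
```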

2. Acoustic Model

The Acoustic Model generates mel-spectrograms from phonemes and their predicted durations.

Architecture:

| Component | Description |
| --- | --- |
| Token Encoder | Embedding + 3× Conv1D + Bidirectional LSTM |
| Upsampling | Gaussian attention-based upsampling |
| PreNet | 2× Linear (256 dim) with dropout |
| Decoder | 2× LSTM with skip connections |
| PostNet | 5× Conv1D with batch normalization |
| Projection | Linear to mel dimension |

Parameters:

| Parameter | Value |
| --- | --- |
| Encoder Dimension | 256 |
| Decoder Dimension | 512 |
| PostNet Dimension | 512 |
| Mel Dimension | 80 |

Key Features:

  • Gaussian Upsampling: Uses soft attention to upsample encoder outputs to match target frame length
  • Autoregressive Decoder: Generates mel frames sequentially with teacher forcing during training
  • Zoneout Regularization: Applies zoneout to LSTM states during training for better generalization
  • PostNet Refinement: Residual convolutional network refines the predicted mel-spectrogram
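Gaussian upsampling can be sketched as follows: each phoneme's encoding is spread over output frames using soft attention weights from a Gaussian centered at that phoneme's cumulative-duration midpoint. This is a simplified version of the Non-Attentive Tacotron formulation; the fixed variance here is illustrative (the real model predicts it).

```python
import numpy as np

def gaussian_upsample(encodings, durations_frames, sigma=1.0):
    """encodings: (N, D) phoneme encodings; durations_frames: (N,) frames per phoneme."""
    ends = np.cumsum(durations_frames)        # cumulative end frame of each phoneme
    centers = ends - durations_frames / 2.0   # midpoint of each phoneme's span
    total = int(ends[-1])
    t = np.arange(total) + 0.5                # frame-center time axis
    # unnormalized Gaussian weight of phoneme i at frame t
    w = np.exp(-0.5 * ((t[:, None] - centers[None, :]) / sigma) ** 2)
    w = w / w.sum(axis=1, keepdims=True)      # soft attention over phonemes
    return w @ encodings                      # (T, D) upsampled frame sequence

enc = np.eye(3)                               # 3 phonemes with one-hot "encodings"
frames = gaussian_upsample(enc, np.array([2.0, 4.0, 2.0]))
print(frames.shape)                           # (8, 3): 8 frames, soft phoneme mixture each
```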

3. HiFi-GAN Vocoder

The vocoder converts mel-spectrograms to raw audio waveforms using the HiFi-GAN architecture.

Architecture:

| Component | Description |
| --- | --- |
| Conv Pre | Conv1D (kernel size 7) |
| Upsampling | Multiple Conv1DTranspose layers |
| Multi-Receptive Field Fusion (MRF) | ResBlocks with varying kernel sizes and dilations |
| Conv Post | Conv1D (kernel size 7) + Tanh |

Key Features:

  • Multi-Scale Upsampling: Progressive upsampling from mel frame rate to audio sample rate
  • Multi-Receptive Field Fusion: Combines outputs from residual blocks with different receptive fields
  • Leaky ReLU Activation: Uses leaky ReLU with slope 0.1 throughout

ResBlock Types:

  • ResBlock1: 3 dilated convolutions (dilation: 1, 3, 5) with residual connections
  • ResBlock2: 2 dilated convolutions (dilation: 1, 3) with residual connections
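Each dilated convolution keeps the sequence length constant (with padding of (kernel_size - 1) * dilation / 2 per side) while adding (kernel_size - 1) * dilation samples of context, so stacking dilations 1, 3, 5 widens the receptive field cheaply. A quick check of that arithmetic for the kernel-size-3 case:

```python
# Receptive field of a stack of dilated Conv1D layers (kernel size 3),
# matching the dilation patterns described for ResBlock1 and ResBlock2.

def receptive_field(kernel_size, dilations):
    # each layer adds (kernel_size - 1) * dilation samples of context
    return 1 + sum((kernel_size - 1) * d for d in dilations)

print(receptive_field(3, [1, 3, 5]))  # ResBlock1 stack: 19 samples
print(receptive_field(3, [1, 3]))     # ResBlock2 stack: 9 samples
```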

Audio Configuration

| Parameter | Value |
| --- | --- |
| Sample Rate | 16,000 Hz |
| FFT Size | 1,024 |
| Mel Channels | 80 |
| Frequency Range | 0–8,000 Hz |
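The 8,000 Hz upper bound is the Nyquist frequency for 16 kHz audio, and the 80 mel channels are spaced evenly on the mel scale across this range. A sketch of that spacing using the common HTK mel formula (an assumption; the exact filterbank construction used by the model may differ):

```python
import numpy as np

SAMPLE_RATE = 16_000
N_MELS = 80
F_MIN, F_MAX = 0.0, 8_000.0

assert F_MAX == SAMPLE_RATE / 2  # upper bound is exactly the Nyquist frequency

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)  # HTK mel formula

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# 80 triangular mel bands need 82 equally spaced edge points on the mel scale
edges_mel = np.linspace(hz_to_mel(F_MIN), hz_to_mel(F_MAX), N_MELS + 2)
edges_hz = mel_to_hz(edges_mel)
print(edges_hz[0], edges_hz[-1])  # spans 0.0 ... ~8000.0 Hz
```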

Text Processing Pipeline

1. Text Normalization

The input text is normalized before synthesis:

# Normalization steps:
1. Unicode NFKC normalization
2. Lowercase conversion
3. Punctuation → silence markers
4. Multiple spaces → single space
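A minimal sketch of these four steps. The punctuation set and the silence-marker name here are illustrative, not the module's exact rules:

```python
import re
import unicodedata

def normalize_text(text, silence_token=" sil "):
    text = unicodedata.normalize("NFKC", text)        # 1. Unicode NFKC normalization
    text = text.lower()                               # 2. lowercase conversion
    text = re.sub(r"[.,;:!?]+", silence_token, text)  # 3. punctuation -> silence marker
    text = re.sub(r"\s+", " ", text).strip()          # 4. collapse whitespace
    return text

print(normalize_text("Xin  chào,   Việt Nam!"))
```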

2. Phoneme Conversion

Text is converted to phonemes using a lexicon lookup:

  • Vietnamese characters are mapped to phoneme sequences
  • Special tokens: sil (silence), sp (short pause), and a word-boundary marker
  • Unknown words are processed character-by-character
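The lookup can be sketched with a plain dictionary. The entries and the sil/sp placement below are made up for illustration; the real mapping lives in lexicon.txt:

```python
# Toy lexicon; the real entries come from the model's lexicon.txt file.
LEXICON = {
    "xin": ["s", "i", "n"],
    "chào": ["ch", "a", "o"],
}

def to_phonemes(words):
    phonemes = ["sil"]                  # leading silence
    for word in words:
        if word in LEXICON:
            phonemes.extend(LEXICON[word])
        else:
            phonemes.extend(list(word))  # unknown word: character-by-character
        phonemes.append("sp")            # short pause between words
    phonemes[-1] = "sil"                 # trailing silence instead of a final pause
    return phonemes

print(to_phonemes(["xin", "chào", "ok"]))  # "ok" is out-of-vocabulary here
```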

Model Files

The VIET_TTS_V0_4_1 model package includes:

| File | Description |
| --- | --- |
| lexicon.txt | Word-to-phoneme mapping |
| duration_latest_ckpt.pickle | Duration model weights |
| acoustic_latest_ckpt.pickle | Acoustic model weights |
| hk_hifi.pickle | HiFi-GAN vocoder weights |
| config.json | HiFi-GAN configuration |

Framework Dependencies

The voice module is built on the JAX ecosystem:

| Library | Purpose |
| --- | --- |
| JAX | Numerical computation and automatic differentiation |
| jaxlib | JAX backend (CPU/GPU/TPU support) |
| dm-haiku | Neural network library for JAX |
| Optax | Gradient processing and optimization |

Usage Example

from underthesea.pipeline.tts import tts

# Basic usage
tts("Xin chào Việt Nam")  # Creates sound.wav

# Custom output file
tts("Hà Nội là thủ đô", outfile="output.wav")

# With playback
tts("Đây là một ví dụ", play=True)

Performance Considerations

  • First Call Latency: Model loading on first call may take several seconds
  • JAX Compilation: JIT compilation happens on the first inference; subsequent calls are faster
  • Text Length: Maximum recommended text length is 500 characters
  • Memory Usage: GPU memory usage depends on input text length

Limitations

  • Single Speaker: Current model supports only one voice
  • Vietnamese Only: Designed specifically for Vietnamese language
  • Prosody: Limited control over prosody and emotion
  • Real-time: Not optimized for real-time streaming

References

  • VietTTS - Original implementation by NTT123
  • HiFi-GAN - Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
  • Non-Attentive Tacotron - Robust and Controllable Neural TTS Synthesis
  • dm-haiku - JAX neural network library by DeepMind