Skip to main content
Version: 9.2.11

Language Identification

Overview

The language identification module in Underthesea provides automatic language detection for text input using the Radar-1 model. Given a text string, the module identifies which of 11 supported languages the text is written in.

Model: undertheseanlp/radar-1 License: Apache 2.0

Supported Languages

CodeLanguage
viVietnamese
enEnglish
zhChinese
jaJapanese
koKorean
frFrench
deGerman
esSpanish
thThai
loLao
kmKhmer

Architecture

Radar-1 Model

The Radar-1 model is a text classification model designed for language detection, with a focus on Vietnamese and Southeast Asian languages.

Radar-1 Pipeline
├── Text Input
│ └── Raw text string
├── Feature Extraction
│ └── Character and token-level features
├── Classification
│ └── Language prediction with confidence score
└── Output
├── Language code (e.g., "vi", "en")
└── Confidence score (0.0 - 1.0)

Usage

Basic Usage

from underthesea import lang_detect

text = "Xin chào, tôi là người Việt Nam"
language = lang_detect(text)
print(language) # vi

Advanced API with Confidence Scores

from radar import RadarLangDetector, detect

# Quick detection
lang = detect("Hello world")
print(lang) # en

# With confidence scores
detector = RadarLangDetector.load("models/radar-1")
result = detector.predict("Xin chào Việt Nam")
print(result.lang) # vi
print(result.score) # 0.98

Multi-language Examples

from underthesea import lang_detect

# Vietnamese
lang_detect("Xin chào, tôi là người Việt Nam") # vi

# English
lang_detect("Hello, how are you?") # en

# Japanese
lang_detect("こんにちは世界") # ja

# Chinese
lang_detect("你好世界") # zh

# Korean
lang_detect("안녕하세요") # ko

# Thai
lang_detect("สวัสดีครับ") # th

Training

python src/train.py

References

  1. Radar-1 on Hugging Face
  2. Underthesea GitHub Repository