Language Identification
Overview
The language identification module in Underthesea detects the language of input text using the Radar-1 model. Given a text string, it returns the code of the language the text is written in, chosen from the 11 supported languages listed below.
Model: undertheseanlp/radar-1
License: Apache 2.0
Supported Languages
| Code | Language |
|---|---|
| vi | Vietnamese |
| en | English |
| zh | Chinese |
| ja | Japanese |
| ko | Korean |
| fr | French |
| de | German |
| es | Spanish |
| th | Thai |
| lo | Lao |
| km | Khmer |
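For downstream code it can be convenient to mirror this table as a lookup. The sketch below is only an illustration: `SUPPORTED_LANGUAGES` and `language_name` are hypothetical names introduced here, not part of the Underthesea API.

```python
# Illustrative lookup mirroring the table above.
# These names are assumptions for this example, not Underthesea identifiers.
SUPPORTED_LANGUAGES = {
    "vi": "Vietnamese",
    "en": "English",
    "zh": "Chinese",
    "ja": "Japanese",
    "ko": "Korean",
    "fr": "French",
    "de": "German",
    "es": "Spanish",
    "th": "Thai",
    "lo": "Lao",
    "km": "Khmer",
}


def language_name(code: str) -> str:
    """Return the English name for a supported code, or raise ValueError."""
    try:
        return SUPPORTED_LANGUAGES[code.lower()]
    except KeyError:
        raise ValueError(f"unsupported language code: {code!r}") from None
```

Raising on unknown codes, rather than returning a placeholder, keeps a typo in a language code from passing silently through later processing.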
Architecture
Radar-1 Model
Radar-1 is a text classification model designed for language detection, with a focus on Vietnamese and other Southeast Asian languages.
Radar-1 Pipeline
├── Text Input
│   └── Raw text string
├── Feature Extraction
│   └── Character and token-level features
├── Classification
│   └── Language prediction with confidence score
└── Output
    ├── Language code (e.g., "vi", "en")
    └── Confidence score (0.0 - 1.0)
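To make the pipeline stages concrete, here is a toy sketch of the same flow: character n-gram features, a nearest-profile classifier, and a score between 0.0 and 1.0. It is not the Radar-1 implementation. The sample sentences, the cosine-similarity classifier, and the way the score is normalized are all assumptions chosen to keep the example small.

```python
import unicodedata
from collections import Counter
from math import sqrt
from typing import Optional, Tuple


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Feature extraction: count overlapping character n-grams."""
    # NFC normalization keeps accented characters comparable across inputs.
    normalized = unicodedata.normalize("NFC", text).lower().strip()
    padded = f" {normalized} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Tiny made-up "training" samples; a real model would use far more data.
SAMPLES = {
    "vi": "xin chào tôi là người việt nam và tôi yêu tiếng việt",
    "en": "hello how are you today this is the english language",
    "fr": "bonjour comment allez vous aujourd'hui je parle français",
}
PROFILES = {lang: char_ngrams(sample) for lang, sample in SAMPLES.items()}


def predict(text: str) -> Tuple[Optional[str], float]:
    """Classification: return (language code, score in [0.0, 1.0])."""
    features = char_ngrams(text)
    sims = {lang: cosine(features, profile) for lang, profile in PROFILES.items()}
    total = sum(sims.values())
    if total == 0:
        return None, 0.0  # nothing in the input matched any profile
    best = max(sims, key=sims.get)
    # The score is the best language's share of the total similarity.
    return best, sims[best] / total


lang, score = predict("Xin chào Việt Nam")
print(lang, round(score, 2))
```

The score here is only a relative share of similarity across three toy profiles, so it should not be compared with the confidence values that Radar-1 reports.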
Usage
Basic Usage
from underthesea import lang_detect
text = "Xin chào, tôi là người Việt Nam"
language = lang_detect(text)
print(language) # vi
Advanced API with Confidence Scores
from radar import RadarLangDetector, detect
# Quick detection
lang = detect("Hello world")
print(lang) # en
# With confidence scores
detector = RadarLangDetector.load("models/radar-1")
result = detector.predict("Xin chào Việt Nam")
print(result.lang) # vi
print(result.score)  # confidence between 0.0 and 1.0, e.g. 0.98
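A common use of the confidence score is to fall back to a default when the prediction is weak. The sketch below assumes only what the example above shows: a detector with a `predict` method whose result exposes `.lang` and `.score`. `StubDetector`, `StubResult`, `detect_or_default`, the 0.8 threshold, and the `"und"` default are hypothetical and exist only to keep the example runnable without the model.

```python
from dataclasses import dataclass


@dataclass
class StubResult:
    """Stand-in for a prediction result exposing .lang and .score."""
    lang: str
    score: float


class StubDetector:
    """Hypothetical detector used only to make this sketch runnable."""

    def predict(self, text: str) -> StubResult:
        # Pretend that very short inputs yield low confidence.
        return StubResult(lang="en", score=0.95 if len(text) > 10 else 0.40)


def detect_or_default(detector, text: str, threshold: float = 0.8,
                      default: str = "und") -> str:
    """Return the predicted code, or `default` when confidence is below threshold."""
    result = detector.predict(text)
    return result.lang if result.score >= threshold else default


detector = StubDetector()
print(detect_or_default(detector, "Hello, how are you?"))  # en
print(detect_or_default(detector, "Hi"))                   # und
```

In practice the stub would be replaced by a loaded detector, and the threshold would be tuned on representative text, since short or mixed-language inputs tend to score lower.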
Multi-language Examples
from underthesea import lang_detect
# Vietnamese
print(lang_detect("Xin chào, tôi là người Việt Nam"))  # vi
# English
print(lang_detect("Hello, how are you?"))  # en
# Japanese
print(lang_detect("こんにちは世界"))  # ja
# Chinese
print(lang_detect("你好世界"))  # zh
# Korean
print(lang_detect("안녕하세요"))  # ko
# Thai
print(lang_detect("สวัสดีครับ"))  # th
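When processing a mixed collection, the per-text calls above can be wrapped in a small grouping helper. This is a sketch under stated assumptions: `group_by_language` is a hypothetical helper, and `fake_detect` is a stand-in so the block runs without the model installed. Passing `lang_detect` from `underthesea` as the `detect` argument should work the same way, provided it returns a language code string as in the examples above.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_language(texts: Iterable[str],
                      detect: Callable[[str], str]) -> Dict[str, List[str]]:
    """Bucket texts by the language code returned by `detect`."""
    groups = defaultdict(list)
    for text in texts:
        groups[detect(text)].append(text)
    return dict(groups)


def fake_detect(text: str) -> str:
    """Stand-in detector for the demo: ASCII-only text is "en", anything else "vi"."""
    return "en" if text.isascii() else "vi"


texts = ["Hello, how are you?", "Xin chào, tôi là người Việt Nam", "Goodbye"]
print(group_by_language(texts, fake_detect))
```

Taking the detector as a parameter keeps the helper easy to test and lets the same code work with either the basic or the advanced API.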
Training
python src/train.py