Version: 9.2.11

lang_detect

Identify the language of text.

Note (Requires Language Detection): This function requires the langdetect dependencies:

pip install "underthesea[langdetect]"

Usage

from underthesea import lang_detect

text = "Cựu binh Mỹ trả nhật ký nhẹ lòng khi thấy cuộc sống hòa bình tại Việt Nam"
lang = lang_detect(text)
print(lang)
# 'vi'

Function Signature

def lang_detect(text: str) -> str

Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| text | str | The input text to analyze |

Returns

| Type | Description |
|------|-------------|
| str | ISO 639-1 language code |

Supported Languages

The function can detect 176 languages. Common codes:

| Code | Language |
|------|----------|
| vi | Vietnamese |
| en | English |
| zh | Chinese |
| ja | Japanese |
| ko | Korean |
| fr | French |
| de | German |
| es | Spanish |
| ru | Russian |
| th | Thai |
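
When the detected code is shown to users, a small lookup can turn it into a readable name. The following is a minimal sketch that covers only the ten codes listed above; `LANGUAGE_NAMES` and `language_name` are hypothetical helpers, not part of underthesea.

```python
# Code-to-name lookup covering only the ten common codes listed above.
# LANGUAGE_NAMES and language_name are illustrative names, not library API.
LANGUAGE_NAMES = {
    "vi": "Vietnamese",
    "en": "English",
    "zh": "Chinese",
    "ja": "Japanese",
    "ko": "Korean",
    "fr": "French",
    "de": "German",
    "es": "Spanish",
    "ru": "Russian",
    "th": "Thai",
}


def language_name(code: str) -> str:
    """Return a readable name for a code, or a fallback for unlisted codes."""
    return LANGUAGE_NAMES.get(code, f"Unknown ({code})")


print(language_name("vi"))  # Vietnamese
print(language_name("xx"))  # Unknown (xx)
```

The result of `lang_detect(text)` can be passed straight to `language_name`. The fallback matters because the model can return codes that are not in this short list.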

Examples

Basic Usage

from underthesea import lang_detect

# Vietnamese
lang_detect("Cựu binh Mỹ trả nhật ký nhẹ lòng")
# 'vi'

# English
lang_detect("Hello, how are you today?")
# 'en'

# Chinese
lang_detect("你好,今天怎么样?")
# 'zh'

# Japanese
lang_detect("こんにちは、元気ですか?")
# 'ja'

Detecting Multiple Texts

from underthesea import lang_detect

texts = [
    "Xin chào Việt Nam",
    "Hello World",
    "Bonjour le monde",
    "Hallo Welt",
]

for text in texts:
    lang = lang_detect(text)
    print(f"{text} -> {lang}")
# Xin chào Việt Nam -> vi
# Hello World -> en
# Bonjour le monde -> fr
# Hallo Welt -> de

Filtering by Language

from underthesea import lang_detect

documents = [
    "Việt Nam là một đất nước xinh đẹp",
    "This is an English sentence",
    "Hôm nay trời đẹp quá",
    "The weather is nice today",
]

# Filter Vietnamese documents
vietnamese_docs = [doc for doc in documents if lang_detect(doc) == 'vi']
print(vietnamese_docs)
# ['Việt Nam là một đất nước xinh đẹp', 'Hôm nay trời đẹp quá']

Language Statistics

from collections import Counter
from underthesea import lang_detect

documents = [
    "Xin chào",
    "Hello",
    "Tạm biệt",
    "Goodbye",
    "Cảm ơn",
    "Merci",
]

langs = [lang_detect(doc) for doc in documents]
distribution = Counter(langs)
print(distribution)
# Counter({'vi': 3, 'en': 2, 'fr': 1})
# (one- and two-word inputs like these may be misclassified; see Notes)
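
Counting codes discards the documents themselves. If the texts are needed afterwards, they can be grouped by detected language instead. This is a minimal sketch: `group_by_language` is a hypothetical helper, and the detector is passed in as an argument so the function does not depend on how `lang_detect` is imported.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_language(
    documents: Iterable[str],
    detect: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Map each detected language code to the documents assigned to it.

    `detect` is any function taking a string and returning a language code.
    Input order is preserved inside each group.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for doc in documents:
        groups[detect(doc)].append(doc)
    return dict(groups)
```

Assuming underthesea is installed, `group_by_language(documents, lang_detect)` returns a dictionary whose keys are the detected codes and whose values are the matching documents.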

Notes

  • Uses FastText's language identification model
  • Works best with longer text (at least a few words); very short text may be classified less accurately
  • First call may take longer due to model loading
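
The notes above suggest two defensive habits: skip inputs that are too short to classify reliably, and expect the first call to be slower than later ones. The sketch below illustrates both under stated assumptions. `safe_detect` and `timed_call` are hypothetical helpers, and the `min_chars=10` threshold is an arbitrary choice that needs tuning per corpus, especially for scripts written without spaces. The whitespace normalization is also an assumption: fastText's own Python `predict` rejects strings containing newlines, and this page does not say whether `lang_detect` strips them.

```python
import time
from typing import Callable, Optional, Tuple


def safe_detect(
    text: str,
    detect: Callable[[str], str],
    min_chars: int = 10,  # arbitrary threshold, an assumption to tune
) -> Optional[str]:
    """Return a language code, or None if the text is too short to trust."""
    # Collapse newlines, tabs and repeated spaces into single spaces.
    cleaned = " ".join(text.split())
    if len(cleaned) < min_chars:
        return None
    return detect(cleaned)


def timed_call(detect: Callable[[str], str], text: str) -> Tuple[str, float]:
    """Run the detector once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = detect(text)
    return result, time.perf_counter() - start
```

Passing `lang_detect` as the `detect` argument applies the guard to real detections. Calling `timed_call(lang_detect, text)` twice on the same text is one way to see how much of the first call is spent loading the model.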