Version: Next 🚧

lang_detect

Detect the language of a piece of text.

!!! note "Requires Language Detection"
    This function requires the langdetect dependencies:

pip install "underthesea[langdetect]"

Usage

from underthesea import lang_detect

text = "Cựu binh Mỹ trả nhật ký nhẹ lòng khi thấy cuộc sống hòa bình tại Việt Nam"
lang = lang_detect(text)
print(lang)
# 'vi'

Function Signature

def lang_detect(text: str) -> str

Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `text` | `str` | The input text to analyze |
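
Since `text` is a single string, multi-line input may need flattening first. fastText's Python `predict` raises a `ValueError` when the input contains a newline, and this page does not say whether `lang_detect` strips newlines itself. The helper below is a defensive sketch under that assumption; `normalize_for_detection` is a hypothetical name, not part of underthesea.

```python
def normalize_for_detection(text: str) -> str:
    """Collapse every run of whitespace, including newlines, into one space."""
    # str.split() with no arguments splits on any whitespace and drops empties.
    return " ".join(text.split())


raw = "Xin chào\nViệt Nam\t!"
print(normalize_for_detection(raw))
# Xin chào Việt Nam !
```

The cleaned string can then be passed on, for example `lang_detect(normalize_for_detection(raw))`.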

Returns

| Type | Description |
|------|-------------|
| `str` | ISO 639-1 language code |

Supported Languages

The function can detect 176 languages. Common codes:

| Code | Language |
|------|----------|
| `vi` | Vietnamese |
| `en` | English |
| `zh` | Chinese |
| `ja` | Japanese |
| `ko` | Korean |
| `fr` | French |
| `de` | German |
| `es` | Spanish |
| `ru` | Russian |
| `th` | Thai |
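
When a readable name is more useful than the raw code, the common codes listed here can be kept in a small lookup. This is only a sketch: `COMMON_LANGUAGES` and `language_name` are hypothetical names, and the mapping covers just the ten codes listed, so any other code is returned unchanged.

```python
# Codes and names copied from the table of common codes; not exhaustive.
COMMON_LANGUAGES = {
    "vi": "Vietnamese",
    "en": "English",
    "zh": "Chinese",
    "ja": "Japanese",
    "ko": "Korean",
    "fr": "French",
    "de": "German",
    "es": "Spanish",
    "ru": "Russian",
    "th": "Thai",
}


def language_name(code: str) -> str:
    """Return a readable name for a code, or the code itself if it is unlisted."""
    return COMMON_LANGUAGES.get(code, code)


print(language_name("vi"))
# Vietnamese
```

A typical call would be `language_name(lang_detect(text))`.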

Examples

Basic Usage

from underthesea import lang_detect

# Vietnamese
lang_detect("Cựu binh Mỹ trả nhật ký nhẹ lòng")
# 'vi'

# English
lang_detect("Hello, how are you today?")
# 'en'

# Chinese
lang_detect("你好，今天怎么样？")
# 'zh'

# Japanese
lang_detect("こんにちは、元気ですか？")
# 'ja'

Detecting Multiple Texts

from underthesea import lang_detect

texts = [
    "Xin chào Việt Nam",
    "Hello World",
    "Bonjour le monde",
    "Hallo Welt"
]

for text in texts:
    lang = lang_detect(text)
    print(f"{text} -> {lang}")
# Xin chào Việt Nam -> vi
# Hello World -> en
# Bonjour le monde -> fr
# Hallo Welt -> de
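
If the same strings recur often in a batch, one possible optimization is to memoize the detector so each distinct text is only detected once. The sketch below wraps any `str -> str` callable with `functools.lru_cache`; `cached_detector` is a hypothetical helper, and `fake_detect` is a stand-in so the example runs without the model installed.

```python
from functools import lru_cache
from typing import Callable, List


def cached_detector(detect: Callable[[str], str], maxsize: int = 1024) -> Callable[[str], str]:
    """Wrap a detector so repeated identical texts are only detected once."""

    @lru_cache(maxsize=maxsize)
    def _cached(text: str) -> str:
        return detect(text)

    return _cached


# Stand-in detector that records its calls; replace it with lang_detect in real use.
calls: List[str] = []


def fake_detect(text: str) -> str:
    calls.append(text)
    return "vi" if "chào" in text else "en"


detect = cached_detector(fake_detect)
labels = [detect(t) for t in ["Xin chào", "Hello", "Xin chào"]]
print(labels, len(calls))
# ['vi', 'en', 'vi'] 2
```

In real use this would be `detect = cached_detector(lang_detect)`. Whether caching pays off depends on how many duplicate texts the data contains.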

Filtering by Language

from underthesea import lang_detect

documents = [
    "Việt Nam là một đất nước xinh đẹp",
    "This is an English sentence",
    "Hôm nay trời đẹp quá",
    "The weather is nice today"
]

# Filter Vietnamese documents
vietnamese_docs = [doc for doc in documents if lang_detect(doc) == 'vi']
print(vietnamese_docs)
# ['Việt Nam là một đất nước xinh đẹp', 'Hôm nay trời đẹp quá']
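
The same idea extends from keeping one language to splitting a collection into one bucket per detected language. The sketch below takes the detector as a parameter; `group_by_language` is a hypothetical helper, and `fake_detect` is a crude stand-in (ASCII-only text counts as English) used only so the example runs on its own.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_language(docs: Iterable[str], detect: Callable[[str], str]) -> Dict[str, List[str]]:
    """Bucket documents by the code the detector returns, preserving input order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for doc in docs:
        groups[detect(doc)].append(doc)
    return dict(groups)


# Stand-in detector for illustration only; pass lang_detect in real use.
def fake_detect(text: str) -> str:
    return "en" if text.isascii() else "vi"


docs = ["Hôm nay trời đẹp quá", "The weather is nice today"]
print(group_by_language(docs, fake_detect))
# {'vi': ['Hôm nay trời đẹp quá'], 'en': ['The weather is nice today']}
```

With the real detector the call would be `group_by_language(documents, lang_detect)`.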

Language Statistics

from collections import Counter
from underthesea import lang_detect

documents = [
    "Xin chào",
    "Hello",
    "Tạm biệt",
    "Goodbye",
    "Cảm ơn",
    "Merci"
]

langs = [lang_detect(doc) for doc in documents]
distribution = Counter(langs)
print(distribution)
# Counter({'vi': 3, 'en': 2, 'fr': 1})

Notes

- Uses fastText's language identification model.
- Works best on longer text (at least a few words); very short text may be detected less accurately.
- The first call may take longer because the model has to be loaded.
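
Because very short input is less reliable, a caller may prefer to skip detection below some length rather than trust the result. The sketch below does that with a character threshold; `detect_if_long_enough` is a hypothetical helper, and the default of 10 characters is an arbitrary assumption, not a value recommended by underthesea.

```python
from typing import Callable, Optional


def detect_if_long_enough(
    text: str,
    detect: Callable[[str], str],
    min_chars: int = 10,  # arbitrary cut-off; tune it on your own data
) -> Optional[str]:
    """Return the detected code, or None when the text is too short to trust."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return None
    return detect(stripped)


# Stand-in detector so the sketch runs without the model; pass lang_detect in real use.
def fake_detect(text: str) -> str:
    return "vi"


print(detect_if_long_enough("Xin chào", fake_detect))
# None
print(detect_if_long_enough("Hôm nay trời đẹp quá", fake_detect))
# vi
```

A character count is used instead of a word count because languages such as Chinese and Japanese do not separate words with spaces.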
