
Named Entity Recognition

Overview

The NER module in Underthesea identifies and classifies named entities in Vietnamese text, supporting both a lightweight CRF model and a deep learning Transformers model. Entities are classified into persons (PER), locations (LOC), and organizations (ORG).

Architecture

Dual Model Support

NER Pipeline
├── Text Input
│   └── Raw Vietnamese text
├── Mode Selection
│   ├── Shallow (default)
│   │   ├── word_tokenize()
│   │   ├── pos_tag()
│   │   ├── chunk()
│   │   └── CRF NER Model
│   └── Deep (deep=True)
│       └── HuggingFace Transformers
│           └── undertheseanlp/vietnamese-ner-v1.4.0a2
└── Output
    └── Entity annotations (BIO format)

CRF Model (Default)

The shallow NER model builds on the full preprocessing pipeline (word tokenization → POS tagging → chunking) and applies a CRF sequence labeler for entity classification.
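To illustrate how a CRF labeler can build on the earlier stages, the sketch below assembles per-token features from (word, POS tag, chunk tag) triples. The function name `token_features` and every feature name are hypothetical; this is not Underthesea's actual feature template, only an example of the general technique.

```python
# Hypothetical feature template for illustration; NOT Underthesea's own.
def token_features(sentence, i):
    """Build a feature dict for token i of a (word, pos, chunk) sequence."""
    word, pos, chunk = sentence[i]
    features = {
        "word.lower": word.lower(),
        "word.is_capitalized": word[:1].isupper(),
        "pos": pos,      # output of the POS tagging stage
        "chunk": chunk,  # output of the chunking stage
    }
    # Context features from the neighbouring tokens, with boundary markers.
    if i > 0:
        prev_word, prev_pos, _ = sentence[i - 1]
        features["-1:word.lower"] = prev_word.lower()
        features["-1:pos"] = prev_pos
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sentence) - 1:
        next_word, next_pos, _ = sentence[i + 1]
        features["+1:word.lower"] = next_word.lower()
        features["+1:pos"] = next_pos
    else:
        features["EOS"] = True  # end of sentence
    return features


# (word, POS tag, chunk tag) triples taken from the usage example in this page.
sentence = [
    ("Tổng thống", "N", "B-NP"),
    ("Mỹ", "Np", "B-NP"),
    ("Donald", "Np", "B-NP"),
    ("Trump", "Np", "B-NP"),
]
features = [token_features(sentence, i) for i in range(len(sentence))]
```

A CRF then scores tag sequences over such feature dicts, which is why the NER stage depends on the tokenizer, POS tagger, and chunker running first.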

Transformers Model (Deep)

The deep learning model uses HuggingFace's AutoModelForTokenClassification with the pretrained model undertheseanlp/vietnamese-ner-v1.4.0a2. It merges subword tokens back into whole words internally, so the output contains complete entity strings.
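For readers curious what loading such a model looks like with the `transformers` library directly, here is a minimal sketch. It assumes the checkpoint can be loaded by the standard token-classification pipeline, which this page does not confirm, and it is not necessarily how Underthesea wires the model internally. The helper name `build_ner_pipeline` is hypothetical.

```python
MODEL_ID = "undertheseanlp/vietnamese-ner-v1.4.0a2"


def build_ner_pipeline(model_id: str = MODEL_ID):
    """Return a transformers token-classification pipeline for model_id.

    Assumption: the checkpoint is compatible with the generic pipeline.
    """
    # Imported lazily so that this module can be loaded without transformers.
    from transformers import (
        AutoModelForTokenClassification,
        AutoTokenizer,
        pipeline,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(model_id)
    # aggregation_strategy="simple" asks the pipeline to merge subword
    # pieces that belong to the same entity into a single result.
    return pipeline(
        "token-classification",
        model=model,
        tokenizer=tokenizer,
        aggregation_strategy="simple",
    )


# Example usage (downloads the model on first call):
# nlp = build_ner_pipeline()
# results = nlp("Tổng thống Mỹ Donald Trump")
```

With an aggregation strategy set, each result from the pipeline is a dict with `entity_group`, `score`, `word`, `start`, and `end` keys, which differs from the simplified dicts returned by `ner(text, deep=True)`.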

Requirements:

pip install "underthesea[deep]"
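Because the deep mode needs extra packages, calling code may want to check for them before passing `deep=True`. The sketch below is one way to do that. It assumes the `deep` extra installs `torch` and `transformers`, which this page does not state, and the helper names `deep_mode_available` and `tag_entities` are hypothetical.

```python
import importlib.util

# Assumption: these are the packages pulled in by "underthesea[deep]".
DEEP_DEPENDENCIES = ("torch", "transformers")


def deep_mode_available(packages=DEEP_DEPENDENCIES):
    """Return True if every listed package can be found on the import path."""
    return all(importlib.util.find_spec(name) is not None for name in packages)


def tag_entities(text):
    """Run ner(), using deep=True only when its dependencies are installed."""
    # Imported lazily so this module loads even without underthesea.
    from underthesea import ner

    return ner(text, deep=deep_mode_available())
```

Note that the two modes return differently shaped results (tuples for CRF, dicts for deep), so a caller of `tag_entities` still has to handle both.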

Entity Types

| Tag | Entity Type | Example |
| --- | --- | --- |
| B-PER / I-PER | Person | Donald Trump |
| B-LOC / I-LOC | Location | Việt Nam, Mỹ |
| B-ORG / I-ORG | Organization | Bộ Giáo dục |
| O | Not an entity | - |
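In the BIO scheme, a `B-` tag begins an entity, an `I-` tag continues the entity of the same type, and `O` marks tokens outside any entity. The following sketch shows how these tags can be parsed and checked for consistency. The helper names `split_tag` and `is_valid_bio` are illustrative and not part of Underthesea.

```python
def split_tag(tag):
    """Split a BIO tag into (prefix, entity_type); 'O' has no type."""
    if tag == "O":
        return "O", None
    prefix, _, entity_type = tag.partition("-")
    return prefix, entity_type


def is_valid_bio(tags):
    """Check that every I-X tag continues a preceding B-X or I-X tag."""
    previous_type = None
    for tag in tags:
        prefix, entity_type = split_tag(tag)
        if prefix == "I" and entity_type != previous_type:
            return False  # I-X without a matching B-X/I-X before it
        previous_type = entity_type
    return True


print(split_tag("B-PER"))                       # ('B', 'PER')
print(is_valid_bio(["O", "B-PER", "I-PER"]))    # True
print(is_valid_bio(["O", "I-PER"]))             # False
```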

Usage

CRF Model (Default)

from underthesea import ner

text = "Chưa tiết lộ lịch trình tới Việt Nam của Tổng thống Mỹ Donald Trump"
entities = ner(text)
# [('Chưa', 'R', 'O', 'O'),
# ('tiết lộ', 'V', 'B-VP', 'O'),
# ('lịch trình', 'V', 'B-VP', 'O'),
# ('tới', 'E', 'B-PP', 'O'),
# ('Việt Nam', 'Np', 'B-NP', 'B-LOC'),
# ('của', 'E', 'B-PP', 'O'),
# ('Tổng thống', 'N', 'B-NP', 'O'),
# ('Mỹ', 'Np', 'B-NP', 'B-LOC'),
# ('Donald', 'Np', 'B-NP', 'B-PER'),
# ('Trump', 'Np', 'B-NP', 'I-PER')]
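Each tuple in the CRF output is (word, POS tag, chunk tag, NER tag), so multi-token entities such as "Donald Trump" arrive as separate rows. The sketch below collapses those rows into entity spans. `group_entities` is a hypothetical helper, not an Underthesea function, and the sample rows are copied from the output above.

```python
def group_entities(tagged):
    """Collapse (word, pos, chunk, ner_tag) tuples into entity spans."""
    entities = []
    current = None
    for word, _pos, _chunk, tag in tagged:
        if tag.startswith("B-"):
            # A B- tag always opens a new entity.
            current = {"entity": tag[2:], "word": word}
            entities.append(current)
        elif tag.startswith("I-") and current and current["entity"] == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current["word"] += " " + word
        else:
            # 'O' tags and stray I- tags close the open entity.
            current = None
    return entities


tagged = [
    ("Việt Nam", "Np", "B-NP", "B-LOC"),
    ("của", "E", "B-PP", "O"),
    ("Tổng thống", "N", "B-NP", "O"),
    ("Mỹ", "Np", "B-NP", "B-LOC"),
    ("Donald", "Np", "B-NP", "B-PER"),
    ("Trump", "Np", "B-NP", "I-PER"),
]
spans = group_entities(tagged)
# [{'entity': 'LOC', 'word': 'Việt Nam'},
#  {'entity': 'LOC', 'word': 'Mỹ'},
#  {'entity': 'PER', 'word': 'Donald Trump'}]
```

In this sketch a stray `I-` tag with no matching `B-` before it is dropped rather than promoted to a new entity; other conventions are possible.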

Deep Learning Model

entities = ner(text, deep=True)
# [{'entity': 'LOC', 'word': 'Việt Nam'},
# {'entity': 'LOC', 'word': 'Mỹ'},
# {'entity': 'PER', 'word': 'Donald Trump'}]
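Since the deep output is already a flat list of dicts, downstream code often only needs to bucket it by entity type. A small sketch, using the result shown above as literal data (`entities_by_type` is a hypothetical helper):

```python
from collections import defaultdict


def entities_by_type(entities):
    """Group {'entity': ..., 'word': ...} dicts into {type: [words]}."""
    grouped = defaultdict(list)
    for item in entities:
        grouped[item["entity"]].append(item["word"])
    return dict(grouped)


deep_output = [
    {"entity": "LOC", "word": "Việt Nam"},
    {"entity": "LOC", "word": "Mỹ"},
    {"entity": "PER", "word": "Donald Trump"},
]
by_type = entities_by_type(deep_output)
# {'LOC': ['Việt Nam', 'Mỹ'], 'PER': ['Donald Trump']}
```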

Function Signature

def ner(
    sentence: str,
    format: str = None,
    deep: bool = False
) -> list[tuple] | list[dict]

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| sentence | str | required | Input text |
| format | str | None | Output format |
| deep | bool | False | Use deep learning model |

Models

| Model | Type | HuggingFace |
| --- | --- | --- |
| CRF (default) | Sequence labeling | - |
| Deep | Token classification | undertheseanlp/vietnamese-ner-v1.4.0a2 |

References

  1. undertheseanlp/vietnamese-ner-v1.4.0a2
  2. Underthesea GitHub Repository