CRF (Conditional Random Fields)

Classes for training, loading, and running CRF sequence labeling models.

CRFTrainer

Train a CRF model with either L-BFGS or structured perceptron optimization.

Usage

from underthesea_core import CRFTrainer

# One training sequence: a list of tokens, each token a list of feature strings
X_train = [
    [["word=Tôi", "is_upper=False"], ["word=yêu", "is_upper=False"],
     ["word=Việt", "is_upper=True"], ["word=Nam", "is_upper=True"]],
]
# One label per token, in the same order as X_train
y_train = [
    ["O", "O", "B-LOC", "I-LOC"],
]

trainer = CRFTrainer(
    loss_function="lbfgs",
    l1_penalty=1.0,
    l2_penalty=0.001,
    max_iterations=100,
    verbose=1,
)
model = trainer.train(X_train, y_train)
model.save("ner_model.bin")

Constructor

CRFTrainer(
    loss_function="lbfgs",
    l1_penalty=0.0,
    l2_penalty=0.01,
    learning_rate=0.1,
    max_iterations=100,
    averaging=True,
    verbose=1,
)

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| loss_function | str | "lbfgs" | "lbfgs" (recommended) or "perceptron" |
| l1_penalty | float | 0.0 | L1 regularization coefficient |
| l2_penalty | float | 0.01 | L2 regularization coefficient |
| learning_rate | float | 0.1 | Learning rate (perceptron only) |
| max_iterations | int | 100 | Maximum training iterations |
| averaging | bool | True | Use averaged perceptron (perceptron only) |
| verbose | int | 1 | Verbosity: 0=quiet, 1=progress, 2=detailed |

Methods

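| Method | Returns | Description |
| --- | --- | --- |
| train(X, y) | CRFModel | Train on X (a list of sequences, each a list of per-token feature lists) and y (a list of label sequences, one label per token) |
| set_l1_penalty(penalty) | None | Set the L1 regularization penalty |
| set_l2_penalty(penalty) | None | Set the L2 regularization penalty |
| set_max_iterations(max_iter) | None | Set the maximum number of iterations |
| get_model() | CRFModel | Get the current model |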
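
The shapes expected by train(X, y) are parallel: one label sequence per feature sequence, and one label per token. This page does not say how train reacts when the shapes disagree, so a small pre-flight check can catch such mistakes before training starts. The sketch below is plain Python and does not call underthesea_core; the helper name check_training_data is hypothetical.

```python
def check_training_data(X, y):
    """Raise ValueError if X and y are not parallel sequences.

    X: list of sequences; each sequence is a list of per-token feature lists.
    y: list of sequences; each sequence is a list of label strings.
    Returns the number of sequences when everything lines up.
    """
    if len(X) != len(y):
        raise ValueError(
            f"{len(X)} feature sequences but {len(y)} label sequences"
        )
    for i, (features, labels) in enumerate(zip(X, y)):
        if len(features) != len(labels):
            raise ValueError(
                f"sequence {i}: {len(features)} tokens but {len(labels)} labels"
            )
    return len(X)


X_train = [
    [["word=Tôi", "is_upper=False"], ["word=yêu", "is_upper=False"],
     ["word=Việt", "is_upper=True"], ["word=Nam", "is_upper=True"]],
]
y_train = [["O", "O", "B-LOC", "I-LOC"]]

# Passes: one sequence, four tokens, four labels
n_sequences = check_training_data(X_train, y_train)
```

Running the check immediately before trainer.train(X_train, y_train) keeps shape errors out of the training step.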

CRFModel

Stores trained CRF model weights, labels, and features. Supports save/load.

Usage

from underthesea_core import CRFModel

# Load a saved model
model = CRFModel.load("ner_model.bin")
print(model.num_labels) # number of labels
print(model.num_attributes) # number of attributes
print(model.get_labels()) # list of label names

# Create with predefined labels
model = CRFModel.with_labels(["O", "B-LOC", "I-LOC"])

Constructor

CRFModel()

Static Methods

| Method | Returns | Description |
| --- | --- | --- |
| load(path) | CRFModel | Load model from file |
| with_labels(labels) | CRFModel | Create model with predefined labels |

Properties

| Property | Type | Description |
| --- | --- | --- |
| num_labels | int | Number of labels |
| num_attributes | int | Number of attributes |

Methods

| Method | Returns | Description |
| --- | --- | --- |
| save(path) | None | Save model to file (CRFsuite format) |
| get_labels() | list[str] | Get all label names |
| num_state_features() | int | Get number of state features |
| num_transition_features() | int | Get number of transition features |
| l2_norm_squared() | float | Get L2 norm squared of all weights |
| l1_norm() | float | Get L1 norm of all weights |
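
l1_norm() and l2_norm_squared() summarize the size of the weight vector, which is the quantity that the l1_penalty and l2_penalty settings of CRFTrainer act on. As a reminder of the two definitions, here is a plain-Python sketch over a made-up weight list. It does not read weights from a CRFModel, and how the library scales these terms inside its training objective is not stated on this page.

```python
def l1_norm(weights):
    """Sum of absolute values of the weights."""
    return sum(abs(w) for w in weights)


def l2_norm_squared(weights):
    """Sum of squared weights (no square root taken)."""
    return sum(w * w for w in weights)


weights = [0.5, -2.0, 0.0, 1.5]  # made-up values for illustration

print(l1_norm(weights))          # 0.5 + 2.0 + 0.0 + 1.5 = 4.0
print(l2_norm_squared(weights))  # 0.25 + 4.0 + 0.0 + 2.25 = 6.5
```

A larger L1 penalty typically drives more weights to exactly zero, while an L2 penalty shrinks all weights without zeroing them.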

CRFTagger

Load a trained CRF model and make predictions on sequences.

Usage

from underthesea_core import CRFTagger, CRFModel

# Load directly
tagger = CRFTagger()
tagger.load("ner_model.bin")

# Or create from model
model = CRFModel.load("ner_model.bin")
tagger = CRFTagger.from_model(model)

# Predict
features = [
    ["word=Tôi", "is_upper=False"],
    ["word=sống", "is_upper=False"],
    ["word=ở", "is_upper=False"],
    ["word=Hà", "is_upper=True"],
    ["word=Nội", "is_upper=True"],
]
labels = tagger.tag(features)
# ['O', 'O', 'O', 'B-LOC', 'I-LOC']

# Get labels with score
labels, score = tagger.tag_with_score(features)

# Get marginal probabilities
marginals = tagger.marginals(features)
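
The labels in the example above follow the BIO scheme (B- opens a span, I- continues it, O is outside any span). Downstream code usually wants whole entities rather than per-token tags, so the sketch below groups a label sequence into spans. It is plain Python, independent of underthesea_core; the helper name bio_to_spans is hypothetical, and an I- tag that does not continue an open span of the same type is simply dropped here, which is one of several possible conventions.

```python
def bio_to_spans(tokens, labels):
    """Group BIO labels into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []

    def flush():
        # Close the span that is currently open, if any
        nonlocal current_type, current_tokens
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span
            flush()
    flush()
    return spans


tokens = ["Tôi", "sống", "ở", "Hà", "Nội"]
labels = ["O", "O", "O", "B-LOC", "I-LOC"]
spans = bio_to_spans(tokens, labels)
print(spans)  # [('LOC', 'Hà Nội')]
```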

Constructor

CRFTagger()

Static Methods

| Method | Returns | Description |
| --- | --- | --- |
| from_model(model) | CRFTagger | Create tagger from a CRFModel |

Methods

| Method | Returns | Description |
| --- | --- | --- |
| load(path) | None | Load model from file |
| tag(features) | list[str] | Predict labels for a sequence |
| tag_with_score(features) | (list[str], float) | Predict labels with sequence score |
| marginals(features) | list[list[float]] | Get marginal probabilities per position and label |
| labels() | list[str] | Get all label names |
| num_labels() | int | Get number of labels |
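
marginals(features) returns one row per position and one probability per label. A common use is to attach a confidence to each predicted tag. The sketch below does that over made-up numbers, without calling the library. It assumes that the columns of each row follow the order of the label list returned by labels(), which this page does not confirm; the helper name best_labels is hypothetical.

```python
def best_labels(marginals, label_names):
    """Return (label, probability) for the most probable label at each position."""
    result = []
    for row in marginals:
        # Index of the largest probability in this row
        best = max(range(len(row)), key=row.__getitem__)
        result.append((label_names[best], row[best]))
    return result


# Made-up values; column order is ASSUMED to match label_names
label_names = ["O", "B-LOC", "I-LOC"]
marginals = [
    [0.90, 0.07, 0.03],
    [0.10, 0.85, 0.05],
    [0.05, 0.15, 0.80],
]

decoded = best_labels(marginals, label_names)
print(decoded)  # [('O', 0.9), ('B-LOC', 0.85), ('I-LOC', 0.8)]
```

Picking the best label independently at each position can yield a different sequence from a decoder that scores the sequence as a whole, so treat these values as per-token confidences rather than a replacement for tag().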

CRFFeaturizer

Extract features from tokenized sentences for CRF models.

Usage

from underthesea_core import CRFFeaturizer

features = ["T[-1]", "T[0]", "T[1]"]
dictionary = {"sinh viên"}
featurizer = CRFFeaturizer(features, dictionary)

sentences = [[["sinh", "X"], ["viên", "X"], ["đi", "X"], ["học", "X"]]]
result = featurizer.process(sentences)
# [[['T[-1]=BOS', 'T[0]=sinh', 'T[1]=viên'],
#   ['T[-1]=sinh', 'T[0]=viên', 'T[1]=đi'],
#   ['T[-1]=viên', 'T[0]=đi', 'T[1]=học'],
#   ['T[-1]=đi', 'T[0]=học', 'T[1]=EOS']]]

Constructor

CRFFeaturizer(feature_configs, dictionary)

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| feature_configs | list[str] | Feature template strings (e.g., ["T[-1]", "T[0]", "T[1]"]) |
| dictionary | set[str] | Dictionary of known words/phrases |

Methods

| Method | Returns | Description |
| --- | --- | --- |
| process(sentences) | list[list[list[str]]] | Extract features from tokenized sentences |
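
The output in the usage example suggests how the simplest templates behave: T[i] is the token at offset i from the current position, with BOS and EOS standing in for positions before the start and after the end of the sentence. The sketch below reproduces that output in plain Python. It is not the library's implementation: it handles only single-offset templates of the form T[i], ignores the dictionary, and reads only the first column of each token. The name expand_templates is hypothetical.

```python
import re

# Matches templates such as "T[-1]", "T[0]", "T[2]"
TEMPLATE = re.compile(r"^T\[(-?\d+)\]$")


def expand_templates(templates, sentence):
    """Expand single-offset templates over one tokenized sentence."""
    words = [token[0] for token in sentence]

    # Parse each template once into (template string, integer offset)
    offsets = []
    for template in templates:
        match = TEMPLATE.match(template)
        if match is None:
            raise ValueError(f"unsupported template: {template}")
        offsets.append((template, int(match.group(1))))

    rows = []
    for i in range(len(words)):
        row = []
        for template, offset in offsets:
            j = i + offset
            if j < 0:
                value = "BOS"  # before the first token
            elif j >= len(words):
                value = "EOS"  # after the last token
            else:
                value = words[j]
            row.append(f"{template}={value}")
        rows.append(row)
    return rows


sentence = [["sinh", "X"], ["viên", "X"], ["đi", "X"], ["học", "X"]]
rows = expand_templates(["T[-1]", "T[0]", "T[1]"], sentence)
for row in rows:
    print(row)
```

Each printed row corresponds to one token, matching the inner lists of the result shown in the usage example.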