Version: Next 🚧

CRF (Conditional Random Fields)

Classes for training, loading, and running CRF sequence labeling models.

CRFTrainer

Train a CRF model with L-BFGS or Structured Perceptron optimization.

Usage

from underthesea_core import CRFTrainer

X_train = [
    [["word=Tôi", "is_upper=False"], ["word=yêu", "is_upper=False"],
     ["word=Việt", "is_upper=True"], ["word=Nam", "is_upper=True"]],
]
y_train = [
    ["O", "O", "B-LOC", "I-LOC"],
]

trainer = CRFTrainer(
    loss_function="lbfgs",
    l1_penalty=1.0,
    l2_penalty=0.001,
    max_iterations=100,
    verbose=1,
)
model = trainer.train(X_train, y_train)
model.save("ner_model.bin")

Constructor

CRFTrainer(
    loss_function="lbfgs",
    l1_penalty=0.0,
    l2_penalty=0.01,
    learning_rate=0.1,
    max_iterations=100,
    averaging=True,
    verbose=1,
)

Parameters

Parameter	Type	Default	Description
`loss_function`	`str`	`"lbfgs"`	`"lbfgs"` (recommended) or `"perceptron"`
`l1_penalty`	`float`	`0.0`	L1 regularization coefficient
`l2_penalty`	`float`	`0.01`	L2 regularization coefficient
`learning_rate`	`float`	`0.1`	Learning rate (perceptron only)
`max_iterations`	`int`	`100`	Maximum training iterations
`averaging`	`bool`	`True`	Use averaged perceptron (perceptron only)
`verbose`	`int`	`1`	Verbosity: 0=quiet, 1=progress, 2=detailed

Methods

Method	Returns	Description
`train(X, y)`	`CRFModel`	Train on sequences X (list of feature lists) and labels y
`set_l1_penalty(penalty)`	`None`	Set L1 regularization penalty
`set_l2_penalty(penalty)`	`None`	Set L2 regularization penalty
`set_max_iterations(max_iter)`	`None`	Set maximum iterations
`get_model()`	`CRFModel`	Get the current model

CRFModel

Stores trained CRF model weights, labels, and features. Supports save/load.

Usage

from underthesea_core import CRFModel

# Load a saved model
model = CRFModel.load("ner_model.bin")
print(model.num_labels)        # number of labels
print(model.num_attributes)    # number of attributes
print(model.get_labels())      # list of label names

# Create with predefined labels
model = CRFModel.with_labels(["O", "B-LOC", "I-LOC"])

Constructor

CRFModel()

Static Methods

Method	Returns	Description
`load(path)`	`CRFModel`	Load model from file
`with_labels(labels)`	`CRFModel`	Create model with predefined labels

Properties

Property	Type	Description
`num_labels`	`int`	Number of labels
`num_attributes`	`int`	Number of attributes

Methods

Method	Returns	Description
`save(path)`	`None`	Save model to file (CRFsuite format)
`get_labels()`	`list[str]`	Get all label names
`num_state_features()`	`int`	Get number of state features
`num_transition_features()`	`int`	Get number of transition features
`l2_norm_squared()`	`float`	Get L2 norm squared of all weights
`l1_norm()`	`float`	Get L1 norm of all weights

CRFTagger

Load a trained CRF model and make predictions on sequences.

Usage

from underthesea_core import CRFTagger, CRFModel

# Load directly
tagger = CRFTagger()
tagger.load("ner_model.bin")

# Or create from model
model = CRFModel.load("ner_model.bin")
tagger = CRFTagger.from_model(model)

# Predict
features = [
    ["word=Tôi", "is_upper=False"],
    ["word=sống", "is_upper=False"],
    ["word=ở", "is_upper=False"],
    ["word=Hà", "is_upper=True"],
    ["word=Nội", "is_upper=True"],
]
labels = tagger.tag(features)
# ['O', 'O', 'O', 'B-LOC', 'I-LOC']

# Get labels with score
labels, score = tagger.tag_with_score(features)

# Get marginal probabilities
marginals = tagger.marginals(features)

Constructor

CRFTagger()

Static Methods

Method	Returns	Description
`from_model(model)`	`CRFTagger`	Create tagger from a `CRFModel`

Methods

Method	Returns	Description
`load(path)`	`None`	Load model from file
`tag(features)`	`list[str]`	Predict labels for a sequence
`tag_with_score(features)`	`(list[str], float)`	Predict labels with sequence score
`marginals(features)`	`list[list[float]]`	Get marginal probabilities per position and label
`labels()`	`list[str]`	Get all label names
`num_labels()`	`int`	Get number of labels

CRFFeaturizer

Extract features from tokenized sentences for CRF models.

Usage

from underthesea_core import CRFFeaturizer

features = ["T[-1]", "T[0]", "T[1]"]
dictionary = set(["sinh viên"])
featurizer = CRFFeaturizer(features, dictionary)

sentences = [[["sinh", "X"], ["viên", "X"], ["đi", "X"], ["học", "X"]]]
result = featurizer.process(sentences)
# [[['T[-1]=BOS', 'T[0]=sinh', 'T[1]=viên'],
#   ['T[-1]=sinh', 'T[0]=viên', 'T[1]=đi'],
#   ['T[-1]=viên', 'T[0]=đi', 'T[1]=học'],
#   ['T[-1]=đi', 'T[0]=học', 'T[1]=EOS']]]

Constructor

CRFFeaturizer(feature_configs, dictionary)

Parameters

Parameter	Type	Description
`feature_configs`	`list[str]`	Feature template strings (e.g., `["T[-1]", "T[0]", "T[1]"]`)
`dictionary`	`set[str]`	Dictionary of known words/phrases

Methods

Method	Returns	Description
`process(sentences)`	`list[list[list[str]]]`	Extract features from tokenized sentences

CRFTrainer​

Usage​

Constructor​

Parameters​

Methods​

CRFModel​

Usage​

Constructor​

Static Methods​

Properties​

Methods​

CRFTagger​

Usage​

Constructor​

Static Methods​

Methods​

CRFFeaturizer​

Usage​

Constructor​

Parameters​

Methods​

CRFTrainer

Usage

Constructor

Parameters

Methods

CRFModel

Usage

Constructor

Static Methods

Properties

Methods

CRFTagger

Usage

Constructor

Static Methods

Methods

CRFFeaturizer

Usage

Constructor

Parameters

Methods