Rewriting CRF Model in Rust
In underthesea v9.2.5, we completed the migration from python-crfsuite to our native Rust implementation, underthesea-core. The change delivers 9-19% faster inference across all CRF-based NLP tasks, and systematic optimization made training up to 10x faster than our initial Rust port.
Background
Underthesea uses Conditional Random Fields (CRF) for several core NLP tasks:
- Word tokenization
- POS tagging
- Named Entity Recognition (NER)
- Chunking
Previously, we relied on python-crfsuite, a Python binding (with a C++ wrapper layer) around the CRFsuite C library. While functional, this introduced overhead at every language boundary.
The Architecture Change
Before (v9.2.1)
┌───────────────────────────────────────────────────────────────┐
│ Python                                                        │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐     │
│  │ Input        │───▶│ CRFFeaturizer│───▶│ Python       │     │
│  │ Tokens       │    │ (Rust)       │    │ List         │     │
│  └──────────────┘    └──────────────┘    └──────┬───────┘     │
│                                                 │             │
│                                                 ▼             │
│                                          ┌──────────────┐     │
│                                          │ pycrfsuite   │     │
│                                          │ (C++)        │     │
│                                          └──────────────┘     │
└───────────────────────────────────────────────────────────────┘
The data flow crossed multiple language boundaries:
- Python → Rust (CRFFeaturizer)
- Rust → Python (feature list)
- Python → C++ (pycrfsuite)
- C++ → Python (tags)
After (v9.2.5)
┌───────────────────────────────────────────────────────────────┐
│ Python                                                        │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐     │
│  │ Input        │───▶│ CRFFeaturizer│───▶│ CRFTagger    │     │
│  │ Tokens       │    │ (Rust)       │    │ (Rust)       │     │
│  └──────────────┘    └──────────────┘    └──────────────┘     │
│                      └──────── underthesea-core ────────┘     │
└───────────────────────────────────────────────────────────────┘
Now both preprocessing and inference are in Rust within the same module, eliminating the C++ dependency entirely.
Code Changes
The change was minimal from the API perspective:
Before:
import pycrfsuite
from underthesea_core import CRFFeaturizer
class FastCRFSequenceTagger:
def load(self, base_path):
estimator = pycrfsuite.Tagger()
estimator.open(model_path)
# ...
After:
from underthesea_core import CRFFeaturizer, CRFTagger
class FastCRFSequenceTagger:
def load(self, base_path):
estimator = CRFTagger()
estimator.load(model_path)
# ...
Inference Benchmark Results
We benchmarked both versions on the same hardware (AMD EPYC 7713, Linux, Python 3.12) with 100 iterations:
| Function | v9.2.1 (pycrfsuite) | v9.2.5 (Rust) | Time Reduction |
|---|---|---|---|
| word_tokenize | 1.45ms | 1.18ms | 19% |
| pos_tag | 3.58ms | 2.93ms | 18% |
| ner | 9.61ms | 8.49ms | 12% |
| chunk | 6.19ms | 5.65ms | 9% |
Why Is Inference Faster?
1. Unified Runtime
Both CRFFeaturizer and CRFTagger are now in the same Rust module. This allows:
- Shared memory management
- No intermediate Python object creation
- Potential for future optimizations (e.g., fusing operations)
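A sketch of what this unified call path looks like; the function and method names here are illustrative, not the exact underthesea-core API:
// Minimal sketch (illustrative names): featurization and decoding now run
// back-to-back in Rust, so no intermediate Python list of features is ever
// materialized between the two steps.
fn tag(featurizer: &CRFFeaturizer, tagger: &CRFTagger, tokens: &[String]) -> Vec<String> {
    let features = featurizer.featurize(tokens); // Vec<Vec<u32>>, stays in Rust
    tagger.tag(&features)                        // Viterbi decode over the same buffers
}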
2. Optimized Viterbi Implementation
Our Rust implementation uses pre-allocated vectors and cache-friendly memory layouts:
fn viterbi(&self, attr_ids: &[Vec<u32>]) -> TaggingResult {
    let n = attr_ids.len();
    let num_labels = self.model.num_labels;

    // Pre-allocated score and backpointer matrices
    // (t = 0 is initialized from the start scores; omitted here)
    let mut score = vec![vec![f64::NEG_INFINITY; num_labels]; n];
    let mut back = vec![vec![0u32; num_labels]; n];

    for t in 1..n {
        // Cache emission scores once per position
        let emission_t = self.model.emission_scores(&attr_ids[t]);

        // Direct memory access in the inner transition loop
        for y in 0..num_labels {
            for y_prev in 0..num_labels {
                let trans = self.model.get_transition(y_prev as u32, y as u32);
                let cand = score[t - 1][y_prev] + trans + emission_t[y];
                if cand > score[t][y] {
                    score[t][y] = cand;
                    back[t][y] = y_prev as u32;
                }
            }
        }
    }
    // ... backtrace over `back` to recover the best label sequence
}
3. Zero-Copy Where Possible
PyO3 bindings allow efficient data transfer between Python and Rust without unnecessary copying.
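For example (a minimal sketch assuming PyO3 0.21+, not the actual underthesea-core binding), a #[pyfunction] taking &str can often borrow the Python string's UTF-8 buffer directly instead of allocating a new String on the Rust side:
use pyo3::prelude::*;

// Illustrative sketch: the &str argument is typically borrowed from the
// Python string rather than copied, so the Rust side reads the caller's data.
#[pyfunction]
fn syllable_count(token: &str) -> usize {
    token.split_whitespace().count()
}

#[pymodule]
fn demo(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(syllable_count, m)?)?;
    Ok(())
}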
4. Removed Dependency
Removing python-crfsuite also means:
- Simpler installation (no C++ compiler needed)
- Smaller package size
- Fewer potential compatibility issues
Training Optimizations
Beyond inference, we also optimized the CRF trainer in underthesea-core. The original Rust trainer was 7.2x slower than python-crfsuite for word segmentation. Through four key optimizations, we made it competitive, and even faster for some tasks.
CRF Training Algorithm
The trainer uses Limited-memory BFGS (L-BFGS) optimization with Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) extension for L1 regularization:
minimize: L(w) = -log P(y|x) + λ₁‖w‖₁ + λ₂‖w‖₂²
where λ₁ = 1.0 (L1 coefficient) and λ₂ = 0.001 (L2 coefficient).
The core computation is the forward-backward algorithm for computing:
- Partition function Z(x) via forward pass
- Marginal probabilities P(y_t | x) via forward-backward
- Gradient ∇L(w) = E_model[f] - E_empirical[f]
Complexity per sequence: O(n × L²), where n is the sequence length and L is the number of labels.
Following CRFsuite's approach, we use scaled probability space instead of log-space:
// Instead of: log_alpha[t][y] = logsumexp(log_alpha[t-1] + log_trans + log_state)
// We use: alpha[t][y] = sum(alpha[t-1] * exp_trans) * exp_state * scale
Benefits:
- No log/exp in inner loops
- Better numerical stability with scaling factors
- Matches CRFsuite's performance characteristics
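Concretely, a single scaled forward step looks like the following sketch (illustrative names, not the trainer's actual code):
// One scaled forward step: probabilities stay in linear space, and each
// timestep is renormalized so alpha never underflows or overflows.
fn forward_step(
    alpha_prev: &[f64], // alpha[t-1], already scaled to sum to 1
    exp_trans: &[f64],  // exp(transition scores), row-major [y_prev][y]
    exp_state: &[f64],  // exp(state scores) at time t
    alpha: &mut [f64],  // output: alpha[t]
    num_labels: usize,
) -> f64 {
    let mut sum = 0.0;
    for y in 0..num_labels {
        let mut s = 0.0;
        for y_prev in 0..num_labels {
            s += alpha_prev[y_prev] * exp_trans[y_prev * num_labels + y];
        }
        let val = s * exp_state[y];
        alpha[y] = val;
        sum += val;
    }
    // Scaling factor; log Z(x) accumulates as the sum of ln(sum) over t
    let scale = 1.0 / sum;
    for y in 0..num_labels {
        alpha[y] *= scale;
    }
    scale
}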
Starting Point
| Task | python-crfsuite | underthesea-core (original) | Slowdown |
|---|---|---|---|
| Word Segmentation | 2m 34s | 18m 33s | 7.2x slower |
| POS Tagging | 4m 50s | 7m 21s | 1.5x slower |
Optimization 1: Flat Data Structure for Feature Lookup
The original implementation used nested vectors (Vec<Vec<(u32, u32)>>) for feature lookup. Each inner Vec is a separate heap allocation, which causes cache misses for large feature sets (562k features).
We flattened into contiguous arrays with offset indexing:
// Contiguous memory, excellent cache locality
attr_offsets: Vec<u32>,              // attr_id -> start index into the flat array
attr_features_flat: Vec<(u32, u32)>, // flattened (label_id, feature_id) pairs

// Lookup: O(1) offset computation, then sequential memory access
let start = attr_offsets[attr_id] as usize;
let end = attr_offsets[attr_id + 1] as usize;
for &(label_id, feature_id) in &attr_features_flat[start..end] {
    // Process feature...
}
Result:
| Task | Before | After | Speedup |
|---|---|---|---|
| Word Segmentation | 18m 33s | 1m 49s | 10.2x |
| POS Tagging | 7m 21s | 5m 9s | 1.4x |
The larger speedup for word segmentation (562k features vs 37k for POS) confirms the feature lookup was the bottleneck.
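Building the flat structure from the nested layout is a one-time pass at model load; a minimal sketch (illustrative, not the trainer's exact code):
// Flatten nested per-attribute feature lists into one contiguous array plus
// an offsets table: offsets[attr_id]..offsets[attr_id + 1] indexes the slice
// of (label_id, feature_id) pairs for that attribute.
fn flatten(nested: Vec<Vec<(u32, u32)>>) -> (Vec<u32>, Vec<(u32, u32)>) {
    let mut offsets = Vec::with_capacity(nested.len() + 1);
    let mut flat = Vec::new();
    offsets.push(0u32);
    for features in nested {
        flat.extend_from_slice(&features);
        offsets.push(flat.len() as u32);
    }
    (offsets, flat)
}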
Optimization 2: Loop Unrolling for Auto-Vectorization
The forward-backward algorithm has O(n × L²) inner loops. For POS tagging (16 labels = 256 transitions per timestep), we applied 4-way manual loop unrolling to enable SIMD auto-vectorization:
// 4-way unrolled for instruction-level parallelism
let chunks = num_labels / 4;
for i in 0..chunks {
let y = i * 4;
let a0 = alpha[curr + y];
let a1 = alpha[curr + y + 1];
let a2 = alpha[curr + y + 2];
let a3 = alpha[curr + y + 3];
let t0 = trans[trans_base + y];
let t1 = trans[trans_base + y + 1];
let t2 = trans[trans_base + y + 2];
let t3 = trans[trans_base + y + 3];
alpha[curr + y] = a0 + alpha_prev * t0;
alpha[curr + y + 1] = a1 + alpha_prev * t1;
alpha[curr + y + 2] = a2 + alpha_prev * t2;
alpha[curr + y + 3] = a3 + alpha_prev * t3;
}
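When num_labels is not a multiple of 4, a scalar tail loop covers the remaining labels; a sketch consistent with the snippet above:
// Tail loop for labels not covered by the 4-way unrolled body
for y in (chunks * 4)..num_labels {
    alpha[curr + y] += alpha_prev * trans[trans_base + y];
}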
Result: POS tagging (10 iterations) went from 25.7s to 17.58s, a 1.46x speedup.
Optimization 3: Unsafe Bounds-Check Elimination
We used unsafe with get_unchecked for hot paths where indices are provably valid, eliminating Rust's bounds checks in tight loops:
// Safe but slow: 2 bounds checks per iteration
for y in 0..num_labels {
gradient[feature_id] += state_mexp[base + y];
}
// Unsafe but fast: 0 bounds checks
unsafe {
for y in 0..num_labels {
*gradient.get_unchecked_mut(feature_id) +=
*state_mexp.get_unchecked(base + y);
}
}
All unsafe blocks are guarded by loop bounds derived from array lengths, assertions, and algorithm invariants.
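In practice, the invariant can be checked once outside the hot loop, in debug builds, so the unchecked accesses stay auditable; a sketch of the pattern (not the exact trainer code) applied to the snippet above:
// Validate the bounds once per sequence; the get_unchecked calls in the
// hot loop rely on exactly these invariants.
debug_assert!(feature_id < gradient.len());
debug_assert!(base + num_labels <= state_mexp.len());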
Optimization 4: Fused Operations
Separate loops for related operations cause redundant memory traversals. We fused them:
// Before: 3 separate loops, 3 memory traversals
for y in 0..L { alpha[y] *= exp_state[y]; }
for y in 0..L { sum += alpha[y]; }
for y in 0..L { alpha[y] *= scale; }
// After: 1 fused loop + 1 normalization pass
let mut sum = 0.0;
for y in 0..L {
let val = alpha[y] * exp_state[y];
alpha[y] = val;
sum += val;
}
let scale = 1.0 / sum;
for y in 0..L { alpha[y] *= scale; }
Cumulative Optimization Impact
| Optimization | Word Seg Speedup | POS Tag Speedup |
|---|---|---|
| Baseline (original) | 1.0x | 1.0x |
| + Flat data structure | 10.2x | 1.4x |
| + Loop unrolling | 10.2x | 2.1x |
| + Unsafe bounds elim | ~10.2x | ~2.3x |
| Total | 10.2x | 2.3x |
Training Benchmark Results (200 iterations)
| Task | Features | Labels | python-crfsuite | underthesea-core | Result |
|---|---|---|---|---|---|
| Word Segmentation | 562,885 | 2 | 2m 2s | 1m 38s | 1.24x faster |
| POS Tagging | 626,723 | 16 | 4m 3s | 4m 14s | ~equal (4% slower) |
Accuracy Verification
| Task | Metric | python-crfsuite | underthesea-core |
|---|---|---|---|
| Word Segmentation | Syllable Accuracy | 98.89% | 98.89% |
| Word Segmentation | Word F1 | 98.00% | 98.00% |
| POS Tagging | Accuracy | 95.98% | 95.97% |
Accuracy is effectively identical (the 0.01-point difference in POS accuracy is within run-to-run noise): the optimizations changed performance, not correctness.
What Didn't Work
We evaluated several additional optimizations that did not provide significant improvements:
- Explicit SIMD Intrinsics (AVX2): The inner loops process only 2-16 labels, too small for explicit SIMD to outperform the compiler's auto-vectorization with loop unrolling.
- Parallel Forward-Backward (Rayon): Thread-local gradient accumulation overhead from buffer allocation per sequence and gradient merging negated the parallelism benefits. Sequential processing with buffer reuse remains faster.
- Memory Pool for Temporary Buffers: Already implemented. The current implementation reuses buffers across sequences within each L-BFGS evaluation; further pooling across evaluations showed minimal improvement.
- Compressed Sparse Features: The flat data structure with offset indexing already provides efficient sparse feature access; additional compression just adds decode overhead.
Key Insight: Different Tasks, Different Bottlenecks
| Feature Set Size | Bottleneck | Best Optimization |
|---|---|---|
| Large (500k+) | Feature lookup (cache misses) | Flat data structure |
| Small (<50k) | Forward-backward O(L²) | Loop unrolling |
Why python-crfsuite Was Initially Faster
CRFsuite (C implementation) already had:
- Hand-optimized sparse feature storage (similar to our flat structure)
- SIMD-vectorized matrix operations (AVX/SSE intrinsics)
- Cache-optimized memory layout (column-major for transitions)
- Years of optimization in a mature codebase
Our flat data structure and loop unrolling effectively replicated these advantages in Rust.
Lessons Learned
- Profile first: the bottleneck was different for each task
- Data structures matter: flat arrays beat nested vectors by 10x
- Cache locality is critical: sequential memory access enables hardware prefetching
- Unsafe Rust is justified when correctness is provable and performance is critical
- Incremental migration reduces risk: migrating one task at a time allowed validation at each step
Migration Path
The migration was done incrementally across 4 releases:
| Version | Changes |
|---|---|
| v9.2.2 | word_tokenize migrated |
| v9.2.3 | pos_tag, ner, chunking migrated |
| v9.2.4 | CRFTrainer migrated, removed unused files |
| v9.2.5 | Removed python-crfsuite dependency |
Model Compatibility
The Rust implementation can load existing .crfsuite model files trained by python-crfsuite. No retraining is required.
# Works with both old and new models
from underthesea import word_tokenize
word_tokenize("HΓ Nα»i lΓ thα»§ ΔΓ΄ cα»§a Viα»t Nam")
# ['HΓ Nα»i', 'lΓ ', 'thα»§ ΔΓ΄', 'cα»§a', 'Viα»t Nam']
Conclusion
By rewriting our CRF implementation in Rust and unifying the pipeline, we achieved:
- 9-19% faster inference across all CRF-based tasks
- 1.24x faster training for word segmentation (10.2x over our initial Rust implementation)
- No loss of accuracy from the migration
- Simpler dependency tree (no python-crfsuite / C++ compiler needed)
- Better maintainability with a single Rust codebase
The full implementation is available in underthesea-core.
Try It Out
pip install underthesea==9.2.5
from underthesea import word_tokenize, pos_tag, ner, chunk
word_tokenize("Viα»t Nam") # 20% faster!
Appendix
Hyperparameters
| Parameter | Value | Description |
|---|---|---|
| c1 (L1) | 1.0 | L1 regularization coefficient |
| c2 (L2) | 0.001 | L2 regularization coefficient |
| max_iterations | 200 | Maximum L-BFGS iterations |
| linesearch | Backtracking | Line search algorithm for OWL-QN |
| max_step_size | 1e20 | Allow large steps (critical for convergence) |
Hardware
| Component | Specification |
|---|---|
| CPU | AMD EPYC 7713 64-Core Processor |
| Platform | Linux |
| Rust | 1.75+ (release mode with LTO) |
| Python | 3.12 |
Code References
| File | Description |
|---|---|
| underthesea_core/src/crf/trainer.rs | Main CRF trainer implementation |
| underthesea_core/src/crf/model.rs | CRF model structure |
| tre-1/scripts/train.py | POS tagger training script |
| tre-1/scripts/train_word_segmentation.py | Word segmentation training script |
Detailed Benchmark Results (2026-01-31)
10-Iteration Tests:
| Task | Trainer | Training Time | Accuracy |
|---|---|---|---|
| POS Tagging | python-crfsuite | 12.96s | 78.37% |
| POS Tagging | underthesea-core | 18.51s | 75.42% |
| Word Segmentation | python-crfsuite | 6.78s | 81.44% F1 |
| Word Segmentation | underthesea-core | 11.99s | 82.81% F1 |
200-Iteration Tests:
| Task | Trainer | Training Time | Accuracy |
|---|---|---|---|
| POS Tagging | python-crfsuite | 243.22s | 95.98% |
| POS Tagging | underthesea-core | 254.07s | 95.97% |
| Word Segmentation | python-crfsuite | 121.69s | 98.89% syllable / 98.00% word F1 |
| Word Segmentation | underthesea-core | 98.34s | 98.89% syllable / 98.00% word F1 |