
Rewriting CRF Model in Rust

9 min read
Vu Anh
Creator of Underthesea

In underthesea v9.2.5, we completed the migration from python-crfsuite to our native Rust implementation, underthesea-core. The change brings up to 20% faster inference across CRF-based NLP tasks, and systematic optimization made training up to 10x faster than our initial Rust trainer.

Background

Underthesea uses Conditional Random Fields (CRF) for several core NLP tasks:

  • Word tokenization
  • POS tagging
  • Named Entity Recognition (NER)
  • Chunking

Previously, we relied on python-crfsuite, a Python wrapper for the CRFsuite C++ library. While functional, this introduced overhead from multiple language boundaries.

The Architecture Change

Before (v9.2.1)

┌──────────────────────────────────────────────────────────────┐
│ Python                                                        │
│  ┌──────────┐    ┌───────────────┐    ┌──────────────┐        │
│  │ Input    │───▶│ CRFFeaturizer │───▶│ Python       │        │
│  │ Tokens   │    │ (Rust)        │    │ List         │        │
│  └──────────┘    └───────────────┘    └──────┬───────┘        │
│                                              │                │
│                                              ▼                │
│                                       ┌──────────────┐        │
│                                       │ pycrfsuite   │        │
│                                       │ (C++)        │        │
│                                       └──────────────┘        │
└──────────────────────────────────────────────────────────────┘

The data flow crossed multiple language boundaries:

  1. Python → Rust (CRFFeaturizer)
  2. Rust → Python (feature list)
  3. Python → C++ (pycrfsuite)
  4. C++ → Python (tags)

After (v9.2.5)

┌──────────────────────────────────────────────────────────────┐
│ Python                                                        │
│  ┌──────────┐    ┌───────────────┐    ┌──────────────┐        │
│  │ Input    │───▶│ CRFFeaturizer │───▶│ CRFTagger    │        │
│  │ Tokens   │    │ (Rust)        │    │ (Rust)       │        │
│  └──────────┘    └───────┬───────┘    └──────┬───────┘        │
│                          └─────────┬─────────┘                │
│                            underthesea-core                   │
└──────────────────────────────────────────────────────────────┘

Now both preprocessing and inference are in Rust within the same module, eliminating the C++ dependency entirely.

Code Changes

The change was minimal from the API perspective:

Before:

import pycrfsuite
from underthesea_core import CRFFeaturizer

class FastCRFSequenceTagger:
    def load(self, base_path):
        estimator = pycrfsuite.Tagger()
        estimator.open(model_path)
        # ...

After:

from underthesea_core import CRFFeaturizer, CRFTagger

class FastCRFSequenceTagger:
    def load(self, base_path):
        estimator = CRFTagger()
        estimator.load(model_path)
        # ...

Inference Benchmark Results

We benchmarked both versions on the same hardware (AMD EPYC 7713, Linux, Python 3.12) with 100 iterations:

| Function | v9.2.1 (pycrfsuite) | v9.2.5 (Rust) | Improvement |
|---|---|---|---|
| word_tokenize | 1.45ms | 1.18ms | -19% |
| pos_tag | 3.58ms | 2.93ms | -18% |
| ner | 9.61ms | 8.49ms | -12% |
| chunk | 6.19ms | 5.65ms | -9% |

Why Is Inference Faster?

1. Unified Runtime

Both CRFFeaturizer and CRFTagger are now in the same Rust module. This allows:

  • Shared memory management
  • No intermediate Python object creation
  • Potential for future optimizations (e.g., fusing operations)

2. Optimized Viterbi Implementation

Our Rust implementation uses pre-allocated vectors and cache-friendly memory layouts:

fn viterbi(&self, attr_ids: &[Vec<u32>]) -> TaggingResult {
    let n = attr_ids.len();
    let num_labels = self.model.num_labels;

    // Pre-allocated score and backpointer matrices
    let mut score = vec![vec![f64::NEG_INFINITY; num_labels]; n];
    let mut back = vec![vec![0u32; num_labels]; n];

    // (t = 0 initialization from emission scores omitted)
    for t in 1..n {
        // Cache emission scores once per position
        let emission_t = self.model.emission_scores(&attr_ids[t]);

        // Direct memory access in the inner transition loop
        for y in 0..num_labels {
            for y_prev in 0..num_labels {
                let trans = self.model.get_transition(y_prev as u32, y as u32);
                // score[t][y] = max(score[t][y],
                //                   score[t - 1][y_prev] + trans + emission_t[y]),
                // with the argmax recorded in back[t][y]
                // ...
            }
        }
    }
    // ...
}

3. Zero-Copy Where Possible

PyO3 bindings allow efficient data transfer between Python and Rust without unnecessary copying.
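As a rough illustration (not the actual underthesea-core bindings; the function and module names here are hypothetical, and the module signature assumes PyO3's pre-0.21 GIL-ref style), a binding can accept &str so the Python string's UTF-8 data is borrowed where possible rather than copied, with conversion back to Python objects happening only once at the return boundary.

use pyo3::prelude::*;

// Hypothetical binding: the &str argument borrows the Python string's data
// where possible; only the returned Vec is converted into a Python list.
#[pyfunction]
fn tag_sentence(text: &str) -> PyResult<Vec<String>> {
    // Placeholder logic standing in for real featurization + CRF decoding.
    Ok(text.split_whitespace().map(|w| w.to_string()).collect())
}

#[pymodule]
fn example_core(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(tag_sentence, m)?)?;
    Ok(())
}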

4. Removed Dependency

Removing python-crfsuite also means:

  • Simpler installation (no C++ compiler needed)
  • Smaller package size
  • Fewer potential compatibility issues

Training Optimizations

Beyond inference, we also optimized the CRF trainer in underthesea-core. The original Rust trainer was 7.2x slower than python-crfsuite for word segmentation. Through four key optimizations, we made it competitive, and even faster for some tasks.

CRF Training Algorithm

The trainer uses Limited-memory BFGS (L-BFGS) optimization with Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) extension for L1 regularization:

minimize: L(w) = -log P(y|x) + λ₁‖w‖₁ + λ₂‖w‖₂²

where λ₁ = 1.0 (L1 coefficient) and λ₂ = 0.001 (L2 coefficient).

The core computation is the forward-backward algorithm for computing:

  1. Partition function Z(x) via forward pass
  2. Marginal probabilities P(y_t | x) via forward-backward
  3. Gradient ∇L(w) = E_model[f] - E_empirical[f]

Complexity per sequence: O(n × L²), where n is the sequence length and L is the number of labels.

Following CRFsuite's approach, we use scaled probability space instead of log-space:

// Instead of: log_alpha[t][y] = logsumexp(log_alpha[t-1] + log_trans + log_state)
// We use: alpha[t][y] = sum(alpha[t-1] * exp_trans) * exp_state * scale

Benefits:

  • No log/exp in inner loops
  • Better numerical stability with scaling factors
  • Matches CRFsuite's performance characteristics

Starting Point

| Task | python-crfsuite | underthesea-core (original) | Slowdown |
|---|---|---|---|
| Word Segmentation | 2m 34s | 18m 33s | 7.2x slower |
| POS Tagging | 4m 50s | 7m 21s | 1.5x slower |

Optimization 1: Flat Data Structure for Feature Lookup

The original implementation used nested vectors (Vec<Vec<(u32, u32)>>) for feature lookup. Each inner Vec is separately heap-allocated, causing cache misses for large feature sets (562k features).

We flattened into contiguous arrays with offset indexing:

// Contiguous memory, excellent cache locality
attr_offsets: Vec<u32>,              // attr_id -> start index into the flat array
attr_features_flat: Vec<(u32, u32)>, // flattened (label_id, feature_id) pairs

// Lookup: O(1) offset computation, then sequential memory access
let start = attr_offsets[attr_id] as usize;
let end = attr_offsets[attr_id + 1] as usize;
for i in start..end {
    let (label_id, feature_id) = attr_features_flat[i];
    // Process feature...
}
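For completeness, the flat layout can be built from the nested representation in a single pass. The flatten helper below is an illustrative sketch under that assumption, not the exact underthesea-core construction code.

// Flatten Vec<Vec<(label_id, feature_id)>> into the offset/flat arrays above.
fn flatten(nested: &[Vec<(u32, u32)>]) -> (Vec<u32>, Vec<(u32, u32)>) {
    let mut attr_offsets = Vec::with_capacity(nested.len() + 1);
    let mut attr_features_flat = Vec::new();
    attr_offsets.push(0u32);
    for features in nested {
        attr_features_flat.extend_from_slice(features);
        attr_offsets.push(attr_features_flat.len() as u32);
    }
    // attr_offsets[attr_id]..attr_offsets[attr_id + 1] delimits that
    // attribute's (label_id, feature_id) pairs in attr_features_flat.
    (attr_offsets, attr_features_flat)
}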

Result:

| Task | Before | After | Speedup |
|---|---|---|---|
| Word Segmentation | 18m 33s | 1m 49s | 10.2x |
| POS Tagging | 7m 21s | 5m 9s | 1.4x |

The larger speedup for word segmentation (562k features vs 37k for POS) confirms the feature lookup was the bottleneck.

Optimization 2: Loop Unrolling for Auto-Vectorization

The forward-backward algorithm has O(n × L²) inner loops. For POS tagging (16 labels = 256 transitions per timestep), we applied 4-way manual loop unrolling to enable SIMD auto-vectorization:

// 4-way unrolled for instruction-level parallelism
let chunks = num_labels / 4;
for i in 0..chunks {
    let y = i * 4;
    let a0 = alpha[curr + y];
    let a1 = alpha[curr + y + 1];
    let a2 = alpha[curr + y + 2];
    let a3 = alpha[curr + y + 3];

    let t0 = trans[trans_base + y];
    let t1 = trans[trans_base + y + 1];
    let t2 = trans[trans_base + y + 2];
    let t3 = trans[trans_base + y + 3];

    alpha[curr + y] = a0 + alpha_prev * t0;
    alpha[curr + y + 1] = a1 + alpha_prev * t1;
    alpha[curr + y + 2] = a2 + alpha_prev * t2;
    alpha[curr + y + 3] = a3 + alpha_prev * t3;
}

Result: POS tagging (10 iterations) went from 25.7s to 17.58s, a 1.46x speedup.

Optimization 3: Unsafe Bounds-Check Elimination

We used unsafe with get_unchecked for hot paths where indices are provably valid, eliminating Rust's bounds checks in tight loops:

// Safe but slow: 2 bounds checks per iteration
for y in 0..num_labels {
    gradient[feature_id] += state_mexp[base + y];
}

// Unsafe but fast: 0 bounds checks
unsafe {
    for y in 0..num_labels {
        *gradient.get_unchecked_mut(feature_id) +=
            *state_mexp.get_unchecked(base + y);
    }
}

All unsafe blocks are guarded by loop bounds derived from array lengths, assertions, and algorithm invariants.
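A guard of the kind described above might look like the sketch below (an illustrative pattern, not the exact trainer code): the invariant is asserted once before the loop, so skipping per-iteration bounds checks cannot read or write out of range.

// Validate the index invariant once, then run the hot loop unchecked.
fn accumulate(gradient: &mut [f64], state_mexp: &[f64],
              feature_id: usize, base: usize, num_labels: usize) {
    assert!(feature_id < gradient.len());
    assert!(base + num_labels <= state_mexp.len());
    unsafe {
        for y in 0..num_labels {
            *gradient.get_unchecked_mut(feature_id) +=
                *state_mexp.get_unchecked(base + y);
        }
    }
}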

Optimization 4: Fused Operations

Separate loops for related operations cause redundant memory traversals. We fused them:

// Before: 3 separate loops, 3 memory traversals
for y in 0..L { alpha[y] *= exp_state[y]; }
for y in 0..L { sum += alpha[y]; }
for y in 0..L { alpha[y] *= scale; }

// After: 1 fused loop + 1 normalization pass
let mut sum = 0.0;
for y in 0..L {
    let val = alpha[y] * exp_state[y];
    alpha[y] = val;
    sum += val;
}
let scale = 1.0 / sum;
for y in 0..L { alpha[y] *= scale; }

Cumulative Optimization Impact

| Optimization | Word Seg Speedup | POS Tag Speedup |
|---|---|---|
| Baseline (original) | 1.0x | 1.0x |
| + Flat data structure | 10.2x | 1.4x |
| + Loop unrolling | 10.2x | 2.1x |
| + Unsafe bounds elimination | ~10.2x | ~2.3x |
| Total | 10.2x | 2.3x |

Training Benchmark Results (200 iterations)

| Task | Features | Labels | python-crfsuite | underthesea-core | Result |
|---|---|---|---|---|---|
| Word Segmentation | 562,885 | 2 | 2m 2s | 1m 38s | 1.24x faster |
| POS Tagging | 626,723 | 16 | 4m 3s | 4m 14s | ~equal (4% slower) |

Accuracy Verification

| Task | Metric | python-crfsuite | underthesea-core |
|---|---|---|---|
| Word Segmentation | Syllable Accuracy | 98.89% | 98.89% |
| Word Segmentation | Word F1 | 98.00% | 98.00% |
| POS Tagging | Accuracy | 95.98% | 95.97% |

Accuracy is essentially identical: the optimizations affected performance, not correctness.

What Didn't Work

We evaluated several additional optimizations that did not provide significant improvements:

  • Explicit SIMD Intrinsics (AVX2): The inner loops process only 2-16 labels, too small for explicit SIMD to outperform the compiler's auto-vectorization with loop unrolling.
  • Parallel Forward-Backward (Rayon): Thread-local gradient accumulation overhead from buffer allocation per sequence and gradient merging negated the parallelism benefits. Sequential processing with buffer reuse remains faster.
  • Memory Pool for Temporary Buffers: Already implemented β€” the current implementation reuses buffers across sequences within each L-BFGS evaluation. Further pooling across evaluations showed minimal improvement.
  • Compressed Sparse Features: The flat data structure with offset indexing already provides efficient sparse feature access β€” additional compression just adds decode overhead.

Key Insight: Different Tasks, Different Bottlenecks

| Feature Set Size | Bottleneck | Best Optimization |
|---|---|---|
| Large (500k+) | Feature lookup (cache misses) | Flat data structure |
| Small (<50k) | Forward-backward O(L²) | Loop unrolling |

Why python-crfsuite Was Initially Faster

CRFsuite (C implementation) already had:

  1. Hand-optimized sparse feature storage, similar to our flat structure
  2. SIMD-vectorized matrix operations using AVX/SSE intrinsics
  3. Cache-optimized memory layout, with column-major storage for transitions
  4. Decades of optimization in a mature codebase

Our flat data structure and loop unrolling effectively replicated these advantages in Rust.

Lessons Learned

  1. Profile first: the bottleneck was different for each task
  2. Data structure matters: flat arrays beat nested vectors by 10x
  3. Cache locality is critical: sequential memory access enables hardware prefetching
  4. Unsafe Rust is justified when correctness is provable and performance is critical
  5. Incremental migration reduces risk: migrating one task at a time allowed validation at each step

Migration Path

The migration was done incrementally across 4 releases:

| Version | Changes |
|---|---|
| v9.2.2 | word_tokenize migrated |
| v9.2.3 | pos_tag, ner, chunking migrated |
| v9.2.4 | CRFTrainer migrated, removed unused files |
| v9.2.5 | Removed python-crfsuite dependency |

Model Compatibility

The Rust implementation can load existing .crfsuite model files trained by python-crfsuite. No retraining is required.

# Works with both old and new models
from underthesea import word_tokenize
word_tokenize("Hà Nội là thủ đô của Việt Nam")
# ['Hà Nội', 'là', 'thủ đô', 'của', 'Việt Nam']

Conclusion

By rewriting our CRF implementation in Rust and unifying the pipeline, we achieved:

  • 12-19% faster inference across all CRF-based tasks
  • 1.24x faster training for word segmentation (10x from original Rust implementation)
  • Identical accuracy β€” no degradation from the migration
  • Simpler dependency tree (no python-crfsuite / C++ compiler needed)
  • Better maintainability with a single Rust codebase

The full implementation is available in underthesea-core.

Try It Out

pip install underthesea==9.2.5

from underthesea import word_tokenize, pos_tag, ner, chunk

word_tokenize("Việt Nam")  # 20% faster!

Appendix

Hyperparameters

| Parameter | Value | Description |
|---|---|---|
| c1 (L1) | 1.0 | L1 regularization coefficient |
| c2 (L2) | 0.001 | L2 regularization coefficient |
| max_iterations | 200 | Maximum L-BFGS iterations |
| linesearch | Backtracking | Line search algorithm for OWL-QN |
| max_step_size | 1e20 | Allow large steps (critical for convergence) |

Hardware

| Component | Specification |
|---|---|
| CPU | AMD EPYC 7713 64-Core Processor |
| Platform | Linux |
| Rust | 1.75+ (release mode with LTO) |
| Python | 3.12 |

Code References

| File | Description |
|---|---|
| underthesea_core/src/crf/trainer.rs | Main CRF trainer implementation |
| underthesea_core/src/crf/model.rs | CRF model structure |
| tre-1/scripts/train.py | POS tagger training script |
| tre-1/scripts/train_word_segmentation.py | Word segmentation training script |

Detailed Benchmark Results (2026-01-31)

10-Iteration Tests:

| Task | Trainer | Training Time | Accuracy |
|---|---|---|---|
| POS Tagging | python-crfsuite | 12.96s | 78.37% |
| POS Tagging | underthesea-core | 18.51s | 75.42% |
| Word Segmentation | python-crfsuite | 6.78s | 81.44% F1 |
| Word Segmentation | underthesea-core | 11.99s | 82.81% F1 |

200-Iteration Tests:

| Task | Trainer | Training Time | Accuracy |
|---|---|---|---|
| POS Tagging | python-crfsuite | 243.22s | 95.98% |
| POS Tagging | underthesea-core | 254.07s | 95.97% |
| Word Segmentation | python-crfsuite | 121.69s | 98.89% syllable / 98.00% word F1 |
| Word Segmentation | underthesea-core | 98.34s | 98.89% syllable / 98.00% word F1 |