Rewriting CRF Model in Rust
In underthesea v9.2.5, we completed the migration from python-crfsuite to our native Rust implementation, underthesea-core. The change delivers 9-19% faster inference across all CRF-based NLP tasks, and systematic optimization made training up to 10x faster than our initial Rust port.
Background
Underthesea uses Conditional Random Fields (CRF) for several core NLP tasks:
- Word tokenization
- POS tagging
- Named Entity Recognition (NER)
- Chunking
Previously, we relied on python-crfsuite, a Python binding (with a C++ wrapper layer) around the CRFsuite C library. While functional, this introduced overhead at every language boundary.
The Architecture Change
Before (v9.2.1)
┌───────────────────────────────────────────────────────────────┐
│ Python                                                        │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐     │
│  │ Input        │───▶│ CRFFeaturizer│───▶│ Python       │     │
│  │ Tokens       │    │ (Rust)       │    │ List         │     │
│  └──────────────┘    └──────────────┘    └──────┬───────┘     │
│                                                 │             │
│                                                 ▼             │
│                                          ┌──────────────┐     │
│                                          │ pycrfsuite   │     │
│                                          │ (C++)        │     │
│                                          └──────────────┘     │
└───────────────────────────────────────────────────────────────┘
The data flow crossed multiple language boundaries:
- Python → Rust (CRFFeaturizer)
- Rust → Python (feature list)
- Python → C++ (pycrfsuite)
- C++ → Python (tags)
After (v9.2.5)
┌───────────────────────────────────────────────────────────────┐
│ Python                                                        │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐     │
│  │ Input        │───▶│ CRFFeaturizer│───▶│ CRFTagger    │     │
│  │ Tokens       │    │ (Rust)       │    │ (Rust)       │     │
│  └──────────────┘    └──────────────┘    └──────────────┘     │
│                      └──────── underthesea-core ────────┘     │
└───────────────────────────────────────────────────────────────┘
Now both preprocessing and inference are in Rust within the same module, eliminating the C++ dependency entirely.
Code Changes
The change was minimal from the API perspective:
Before:
import pycrfsuite
from underthesea_core import CRFFeaturizer
class FastCRFSequenceTagger:
def load(self, base_path):
estimator = pycrfsuite.Tagger()
estimator.open(model_path)
# ...
After:
from underthesea_core import CRFFeaturizer, CRFTagger
class FastCRFSequenceTagger:
def load(self, base_path):
estimator = CRFTagger()
estimator.load(model_path)
# ...
Inference Benchmark Results
We benchmarked both versions on the same hardware (AMD EPYC 7713, Linux, Python 3.12) with 100 iterations:
| Function | v9.2.1 (pycrfsuite) | v9.2.5 (Rust) | Time Reduction |
|---|---|---|---|
| word_tokenize | 1.45ms | 1.18ms | 19% |
| pos_tag | 3.58ms | 2.93ms | 18% |
| ner | 9.61ms | 8.49ms | 12% |
| chunk | 6.19ms | 5.65ms | 9% |
Why Is Inference Faster?
1. Unified Runtime
Both CRFFeaturizer and CRFTagger are now in the same Rust module. This allows:
- Shared memory management
- No intermediate Python object creation
- Potential for future optimizations (e.g., fusing operations)
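A sketch of what this unified call path looks like; the function and method names here are illustrative, not the exact underthesea-core API:
// Minimal sketch (illustrative names): featurization and decoding now run
// back-to-back in Rust, so no intermediate Python list of features is ever
// materialized between the two steps.
fn tag(featurizer: &CRFFeaturizer, tagger: &CRFTagger, tokens: &[String]) -> Vec<String> {
    let features = featurizer.featurize(tokens); // Vec<Vec<u32>>, stays in Rust
    tagger.tag(&features)                        // Viterbi decode over the same buffers
}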
2. Optimized Viterbi Implementation
Our Rust implementation uses pre-allocated vectors and cache-friendly memory layouts:
fn viterbi(&self, attr_ids: &[Vec<u32>]) -> TaggingResult {
    let n = attr_ids.len();
    let num_labels = self.model.num_labels;

    // Pre-allocated score and backpointer matrices
    // (t = 0 is initialized from the start scores; omitted here)
    let mut score = vec![vec![f64::NEG_INFINITY; num_labels]; n];
    let mut back = vec![vec![0u32; num_labels]; n];

    for t in 1..n {
        // Cache emission scores once per position
        let emission_t = self.model.emission_scores(&attr_ids[t]);

        // Direct memory access in the inner transition loop
        for y in 0..num_labels {
            for y_prev in 0..num_labels {
                let trans = self.model.get_transition(y_prev as u32, y as u32);
                let cand = score[t - 1][y_prev] + trans + emission_t[y];
                if cand > score[t][y] {
                    score[t][y] = cand;
                    back[t][y] = y_prev as u32;
                }
            }
        }
    }
    // ... backtrace over `back` to recover the best label sequence
}
3. Zero-Copy Where Possible
PyO3 bindings allow efficient data transfer between Python and Rust without unnecessary copying.
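For example (a minimal sketch assuming PyO3 0.21+, not the actual underthesea-core binding), a #[pyfunction] taking &str can often borrow the Python string's UTF-8 buffer directly instead of allocating a new String on the Rust side:
use pyo3::prelude::*;

// Illustrative sketch: the &str argument is typically borrowed from the
// Python string rather than copied, so the Rust side reads the caller's data.
#[pyfunction]
fn syllable_count(token: &str) -> usize {
    token.split_whitespace().count()
}

#[pymodule]
fn demo(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(syllable_count, m)?)?;
    Ok(())
}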
4. Removed Dependency
Removing python-crfsuite also means:
- Simpler installation (no C++ compiler needed)
- Smaller package size
- Fewer potential compatibility issues
Training Optimizations
Beyond inference, we also optimized the CRF trainer in underthesea-core. The original Rust trainer was 7.2x slower than python-crfsuite for word segmentation. Through four key optimizations, we made it competitive, and even faster for some tasks.
CRF Training Algorithm
The trainer uses Limited-memory BFGS (L-BFGS) optimization with Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) extension for L1 regularization:
minimize: L(w) = -log P(y|x) + λ₁‖w‖₁ + λ₂‖w‖₂²
where λ₁ = 1.0 (L1 coefficient) and λ₂ = 0.001 (L2 coefficient).
The core computation is the forward-backward algorithm for computing:
- Partition function Z(x) via forward pass
- Marginal probabilities P(y_t | x) via forward-backward
- Gradient ∇L(w) = E_model[f] - E_empirical[f]
Complexity per sequence: O(n × L²), where n is the sequence length and L is the number of labels.
Following CRFsuite's approach, we use scaled probability space instead of log-space:
// Instead of: log_alpha[t][y] = logsumexp(log_alpha[t-1] + log_trans + log_state)
// We use: alpha[t][y] = sum(alpha[t-1] * exp_trans) * exp_state * scale
Benefits:
- No log/exp in inner loops
- Better numerical stability with scaling factors
- Matches CRFsuite's performance characteristics
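Concretely, a single scaled forward step looks like the following sketch (illustrative names, not the trainer's actual code):
// One scaled forward step: probabilities stay in linear space, and each
// timestep is renormalized so alpha never underflows or overflows.
fn forward_step(
    alpha_prev: &[f64], // alpha[t-1], already scaled to sum to 1
    exp_trans: &[f64],  // exp(transition scores), row-major [y_prev][y]
    exp_state: &[f64],  // exp(state scores) at time t
    alpha: &mut [f64],  // output: alpha[t]
    num_labels: usize,
) -> f64 {
    let mut sum = 0.0;
    for y in 0..num_labels {
        let mut s = 0.0;
        for y_prev in 0..num_labels {
            s += alpha_prev[y_prev] * exp_trans[y_prev * num_labels + y];
        }
        let val = s * exp_state[y];
        alpha[y] = val;
        sum += val;
    }
    // Scaling factor; log Z(x) accumulates as the sum of ln(sum) over t
    let scale = 1.0 / sum;
    for y in 0..num_labels {
        alpha[y] *= scale;
    }
    scale
}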
Starting Point
| Task | python-crfsuite | underthesea-core (original) | Slowdown |
|---|---|---|---|
| Word Segmentation | 2m 34s | 18m 33s | 7.2x slower |
| POS Tagging | 4m 50s | 7m 21s | 1.5x slower |
Optimization 1: Flat Data Structure for Feature Lookup
The original implementation used nested vectors (Vec<Vec<(u32, u32)>>) for feature lookup. Each inner Vec is a separate heap allocation, which causes cache misses for large feature sets (562k features).
We flattened into contiguous arrays with offset indexing:
// Contiguous memory, excellent cache locality
attr_offsets: Vec<u32>,              // attr_id -> start index into the flat array
attr_features_flat: Vec<(u32, u32)>, // flattened (label_id, feature_id) pairs

// Lookup: O(1) offset computation, then sequential memory access
let start = attr_offsets[attr_id] as usize;
let end = attr_offsets[attr_id + 1] as usize;
for &(label_id, feature_id) in &attr_features_flat[start..end] {
    // Process feature...
}
Result:
| Task | Before | After | Speedup |
|---|---|---|---|
| Word Segmentation | 18m 33s | 1m 49s | 10.2x |
| POS Tagging | 7m 21s | 5m 9s | 1.4x |
The larger speedup for word segmentation (562k features vs 37k for POS) confirms the feature lookup was the bottleneck.
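Building the flat structure from the nested layout is a one-time pass at model load; a minimal sketch (illustrative, not the trainer's exact code):
// Flatten nested per-attribute feature lists into one contiguous array plus
// an offsets table: offsets[attr_id]..offsets[attr_id + 1] indexes the slice
// of (label_id, feature_id) pairs for that attribute.
fn flatten(nested: Vec<Vec<(u32, u32)>>) -> (Vec<u32>, Vec<(u32, u32)>) {
    let mut offsets = Vec::with_capacity(nested.len() + 1);
    let mut flat = Vec::new();
    offsets.push(0u32);
    for features in nested {
        flat.extend_from_slice(&features);
        offsets.push(flat.len() as u32);
    }
    (offsets, flat)
}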
Optimization 2: Loop Unrolling for Auto-Vectorization
The forward-backward algorithm has O(n × L²) inner loops. For POS tagging (16 labels = 256 transitions per timestep), we applied 4-way manual loop unrolling to enable SIMD auto-vectorization:
// 4-way unrolled for instruction-level parallelism
let chunks = num_labels / 4;
for i in 0..chunks {
let y = i * 4;
let a0 = alpha[curr + y];
let a1 = alpha[curr + y + 1];
let a2 = alpha[curr + y + 2];
let a3 = alpha[curr + y + 3];
let t0 = trans[trans_base + y];
let t1 = trans[trans_base + y + 1];
let t2 = trans[trans_base + y + 2];
let t3 = trans[trans_base + y + 3];
alpha[curr + y] = a0 + alpha_prev * t0;
alpha[curr + y + 1] = a1 + alpha_prev * t1;
alpha[curr + y + 2] = a2 + alpha_prev * t2;
alpha[curr + y + 3] = a3 + alpha_prev * t3;
}
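When num_labels is not a multiple of 4, a scalar tail loop covers the remaining labels; a sketch consistent with the snippet above:
// Tail loop for labels not covered by the 4-way unrolled body
for y in (chunks * 4)..num_labels {
    alpha[curr + y] += alpha_prev * trans[trans_base + y];
}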
Result: POS tagging (10 iterations) went from 25.7s to 17.58s, a 1.46x speedup.
Optimization 3: Unsafe Bounds-Check Elimination
We used unsafe with get_unchecked for hot paths where indices are provably valid, eliminating Rust's bounds checks in tight loops:
// Safe but slow: 2 bounds checks per iteration
for y in 0..num_labels {
gradient[feature_id] += state_mexp[base + y];
}
// Unsafe but fast: 0 bounds checks
unsafe {
for y in 0..num_labels {
*gradient.get_unchecked_mut(feature_id) +=
*state_mexp.get_unchecked(base + y);
}
}
All unsafe blocks are guarded by loop bounds derived from array lengths, assertions, and algorithm invariants.
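In practice, the invariant can be checked once outside the hot loop, in debug builds, so the unchecked accesses stay auditable; a sketch of the pattern (not the exact trainer code) applied to the snippet above:
// Validate the bounds once per sequence; the get_unchecked calls in the
// hot loop rely on exactly these invariants.
debug_assert!(feature_id < gradient.len());
debug_assert!(base + num_labels <= state_mexp.len());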
Optimization 4: Fused Operations
Separate loops for related operations cause redundant memory traversals. We fused them:
// Before: 3 separate loops, 3 memory traversals
for y in 0..L { alpha[y] *= exp_state[y]; }
for y in 0..L { sum += alpha[y]; }
for y in 0..L { alpha[y] *= scale; }
// After: 1 fused loop + 1 normalization pass
let mut sum = 0.0;
for y in 0..L {
let val = alpha[y] * exp_state[y];
alpha[y] = val;
sum += val;
}
let scale = 1.0 / sum;
for y in 0..L { alpha[y] *= scale; }
Cumulative Optimization Impact
| Optimization | Word Seg Speedup | POS Tag Speedup |
|---|---|---|
| Baseline (original) | 1.0x | 1.0x |
| + Flat data structure | 10.2x | 1.4x |
| + Loop unrolling | 10.2x | 2.1x |
| + Unsafe bounds elim | ~10.2x | ~2.3x |
| Total | 10.2x | 2.3x |
Training Benchmark Results (200 iterations)
| Task | Features | Labels | python-crfsuite | underthesea-core | Result |
|---|---|---|---|---|---|
| Word Segmentation | 562,885 | 2 | 2m 2s | 1m 38s | 1.24x faster |
| POS Tagging | 626,723 | 16 | 4m 3s | 4m 14s | ~equal (4% slower) |
Accuracy Verification
| Task | Metric | python-crfsuite | underthesea-core |
|---|---|---|---|
| Word Segmentation | Syllable Accuracy | 98.89% | 98.89% |
| Word Segmentation | Word F1 | 98.00% | 98.00% |
| POS Tagging | Accuracy | 95.98% | 95.97% |
Accuracy is effectively identical (the 0.01-point difference in POS accuracy is within run-to-run noise): the optimizations changed performance, not correctness.
What Didn't Work
We evaluated several additional optimizations that did not provide significant improvements:
- Explicit SIMD Intrinsics (AVX2): The inner loops process only 2-16 labels, too small for explicit SIMD to outperform the compiler's auto-vectorization with loop unrolling.
- Parallel Forward-Backward (Rayon): Thread-local gradient accumulation overhead from buffer allocation per sequence and gradient merging negated the parallelism benefits. Sequential processing with buffer reuse remains faster.
- Memory Pool for Temporary Buffers: Already implemented. The current implementation reuses buffers across sequences within each L-BFGS evaluation; further pooling across evaluations showed minimal improvement.
- Compressed Sparse Features: The flat data structure with offset indexing already provides efficient sparse feature access; additional compression just adds decode overhead.
Key Insight: Different Tasks, Different Bottlenecks
| Feature Set Size | Bottleneck | Best Optimization |
|---|---|---|
| Large (500k+) | Feature lookup (cache misses) | Flat data structure |
| Small (<50k) | Forward-backward O(L²) | Loop unrolling |
Why python-crfsuite Was Initially Faster
CRFsuite (C implementation) already had:
- Hand-optimized sparse feature storage (similar to our flat structure)
- SIMD-vectorized matrix operations (AVX/SSE intrinsics)
- Cache-optimized memory layout (column-major for transitions)
- Years of optimization in a mature codebase
Our flat data structure and loop unrolling effectively replicated these advantages in Rust.
Lessons Learned
- Profile first: the bottleneck was different for each task
- Data structures matter: flat arrays beat nested vectors by 10x
- Cache locality is critical: sequential memory access enables hardware prefetching
- Unsafe Rust is justified when correctness is provable and performance is critical
- Incremental migration reduces risk: migrating one task at a time allowed validation at each step
Migration Path
The migration was done incrementally across 4 releases:
| Version | Changes |
|---|---|
| v9.2.2 | word_tokenize migrated |
| v9.2.3 | pos_tag, ner, chunking migrated |
| v9.2.4 | CRFTrainer migrated, removed unused files |
| v9.2.5 | Removed python-crfsuite dependency |
Model Compatibility
The Rust implementation can load existing .crfsuite model files trained by python-crfsuite. No retraining is required.
# Works with both old and new models
from underthesea import word_tokenize
word_tokenize("HΓ Nα»i lΓ thα»§ ΔΓ΄ cα»§a Viα»t Nam")
# ['HΓ Nα»i', 'lΓ ', 'thα»§ ΔΓ΄', 'cα»§a', 'Viα»t Nam']
Conclusion
By rewriting our CRF implementation in Rust and unifying the pipeline, we achieved:
- 9-19% faster inference across all CRF-based tasks
- 1.24x faster training for word segmentation (10.2x over our initial Rust implementation)
- No loss of accuracy from the migration
- Simpler dependency tree (no python-crfsuite / C++ compiler needed)
- Better maintainability with a single Rust codebase
The full implementation is available in underthesea-core.
Try It Out
pip install underthesea==9.2.5
from underthesea import word_tokenize, pos_tag, ner, chunk
word_tokenize("Viα»t Nam") # 20% faster!
Appendix
Hyperparameters
| Parameter | Value | Description |
|---|---|---|
| c1 (L1) | 1.0 | L1 regularization coefficient |
| c2 (L2) | 0.001 | L2 regularization coefficient |
| max_iterations | 200 | Maximum L-BFGS iterations |
| linesearch | Backtracking | Line search algorithm for OWL-QN |
| max_step_size | 1e20 | Allow large steps (critical for convergence) |
Hardware
| Component | Specification |
|---|---|
| CPU | AMD EPYC 7713 64-Core Processor |
| Platform | Linux |
| Rust | 1.75+ (release mode with LTO) |
| Python | 3.12 |
Code References
| File | Description |
|---|---|
| underthesea_core/src/crf/trainer.rs | Main CRF trainer implementation |
| underthesea_core/src/crf/model.rs | CRF model structure |
| tre-1/scripts/train.py | POS tagger training script |
| tre-1/scripts/train_word_segmentation.py | Word segmentation training script |
Detailed Benchmark Results (2026-01-31)
10-Iteration Tests:
| Task | Trainer | Training Time | Accuracy |
|---|---|---|---|
| POS Tagging | python-crfsuite | 12.96s | 78.37% |
| POS Tagging | underthesea-core | 18.51s | 75.42% |
| Word Segmentation | python-crfsuite | 6.78s | 81.44% F1 |
| Word Segmentation | underthesea-core | 11.99s | 82.81% F1 |
200-Iteration Tests:
| Task | Trainer | Training Time | Accuracy |
|---|---|---|---|
| POS Tagging | python-crfsuite | 243.22s | 95.98% |
| POS Tagging | underthesea-core | 254.07s | 95.97% |
| Word Segmentation | python-crfsuite | 121.69s | 98.89% syllable / 98.00% word F1 |
| Word Segmentation | underthesea-core | 98.34s | 98.89% syllable / 98.00% word F1 |