
Rust-Powered Text Classification

· 4 min read
Vu Anh
Creator of Underthesea

In underthesea v9.2.9, we've completely rewritten the text classification pipeline using our Rust-based TextClassifier. This delivers up to 273x faster inference compared to the previous sklearn-based implementation.

Background

Text classification in underthesea supports two domains:

  • General: News categorization (10 categories)
  • Bank: Banking intent classification (14 categories)

Previously, we used scikit-learn's TfidfVectorizer + LinearSVC loaded via joblib. While accurate, this approach had significant overhead.

The Architecture Change

Before (sklearn-based)

┌──────────────────────────────────────────────────────────────┐
│ Python                                                       │
│ ┌──────────────┐   ┌───────────────┐   ┌──────────────┐      │
│ │ Input        │──▶│TfidfVectorizer│──▶│ LinearSVC    │      │
│ │ Text         │   │ (sklearn)     │   │ (sklearn)    │      │
│ └──────────────┘   └───────────────┘   └──────────────┘      │
│                            │                   │             │
│                      joblib.load()       joblib.load()       │
└──────────────────────────────────────────────────────────────┘

This loads two separate pickle files, with vectorization and inference both running in Python.

After (Rust-based)

┌──────────────────────────────────────────────────────────────┐
│ Python                                                       │
│ ┌──────────────┐   ┌────────────────────────────────┐        │
│ │ Input        │──▶│ TextClassifier                 │        │
│ │ Text         │   │ TF-IDF + LinearSVC (Rust)      │        │
│ └──────────────┘   └────────────────────────────────┘        │
│                                    │                         │
│                           single .bin file                   │
│                           underthesea-core                   │
└──────────────────────────────────────────────────────────────┘

A single binary model file, with vectorization and inference fused in Rust.

Code Changes

The API remains unchanged:

from underthesea import classify

# General classification
classify("Việt Nam vô địch AFF Cup")
# "The thao"

# Bank domain
classify("Lãi suất tiết kiệm bao nhiêu?", domain="bank")
# ['INTEREST_RATE']

Internally, the implementation is much simpler:

Before:

import joblib
from underthesea.pipeline.classification import bank

vectorizer = joblib.load("vectorizer.pkl")
classifier = joblib.load("classifier.pkl")
features = vectorizer.transform([text])
prediction = classifier.predict(features)

After:

from underthesea_core import TextClassifier

classifier = TextClassifier.load("model.bin")
prediction = classifier.predict(text)

Benchmark Results

Tested on the same hardware with batch inference:

| Domain  | sklearn           | Rust               | Speedup |
|---------|-------------------|--------------------|---------|
| General | 1,228 samples/sec | 66,678 samples/sec | 54x     |
| Bank    | 244 samples/sec   | 66,678 samples/sec | 273x    |

Single-sample latency: 4 ms → 0.465 ms
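Throughput figures like these are easy to sanity-check with a small harness. The sketch below is ours, not part of underthesea: it times an arbitrary callable over a batch of inputs, with a trivial stand-in since no model is loaded here.

```python
import time

def throughput(predict, samples, repeat=3):
    """Best-of-N samples/sec for a predict callable over a list of inputs."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        for s in samples:
            predict(s)
        best = min(best, time.perf_counter() - start)
    return len(samples) / best

# Trivial stand-in callable; swap in classify (or any model) to benchmark it.
samples = ["Thị trường chứng khoán tăng điểm mạnh"] * 1_000
rate = throughput(lambda s: s.lower(), samples)
print(f"{rate:,.0f} samples/sec")
```

Taking the best of several runs reduces noise from warm-up and scheduling jitter.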

Why Is It Faster?

1. Fused Pipeline

TF-IDF vectorization and SVM inference run in a single Rust function call, eliminating Python overhead between stages.

2. Optimized Sparse Operations

pub fn predict(&self, text: &str) -> String {
    // Tokenize and hash features in one pass
    let features = self.vectorizer.transform(text);

    // Sparse dot product with pre-sorted indices
    let scores = self.svm.decision_function(&features);

    self.classes[scores.argmax()].clone()
}
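In Python terms, that sparse dot product is a two-pointer merge over pre-sorted index arrays. The sketch below is illustrative only (the real implementation is the Rust shown here):

```python
def sparse_dot(idx_a, val_a, idx_b, val_b):
    """Dot product of two sparse vectors given as (sorted indices, values).

    A two-pointer merge over both index lists: O(nnz_a + nnz_b) per class
    score, no hashing, which is why keeping indices pre-sorted pays off.
    """
    i = j = 0
    total = 0.0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            total += val_a[i] * val_b[j]
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:
            i += 1
        else:
            j += 1
    return total

# Feature vector {0: 1.0, 3: 2.0, 7: 0.5} against class weights {3: 4.0, 7: 2.0}
print(sparse_dot([0, 3, 7], [1.0, 2.0, 0.5], [3, 7], [4.0, 2.0]))  # 9.0
```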

3. Single File Model

One .bin file instead of multiple pickle files:

  • Faster loading
  • Atomic deployment
  • Smaller size (JSON-based, not pickle)
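To illustrate why one file simplifies deployment, here is a toy single-file model format. The layout is an assumption made up for this sketch (the actual .bin schema is not documented here): vocabulary, IDF weights, and per-class SVM coefficients serialized and reloaded as one JSON document.

```python
import json
import os
import tempfile

# Toy single-file model: vocabulary, IDF weights, and per-class SVM
# coefficients in one JSON document (assumed layout, for illustration).
model = {
    "classes": ["Kinh doanh", "The thao"],
    "vocab": {"chứng": 0, "khoán": 1, "bóng": 2},
    "idf": [1.2, 1.5, 0.9],
    "coef": [[0.8, 0.7, -0.3], [-0.5, -0.4, 1.1]],
}

path = os.path.join(tempfile.mkdtemp(), "model.bin")
with open(path, "w", encoding="utf-8") as f:
    json.dump(model, f, ensure_ascii=False)  # one atomic artifact to deploy

with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded["classes"])  # ['Kinh doanh', 'The thao']
```

One file means a deploy either fully succeeds or fully fails; there is no window where the vectorizer and classifier come from different versions.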

4. No Python GIL Contention

Rust code releases the GIL during computation, enabling true parallelism.
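The effect is easy to demonstrate with any GIL-releasing call. In this sketch, `time.sleep` stands in for a native extension that releases the GIL around its computation, so four concurrent calls overlap instead of serializing:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def native_like_call(x):
    # time.sleep releases the GIL, standing in for a Rust extension
    # that releases it around its computation.
    time.sleep(0.2)
    return x

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(native_like_call, range(4)))
elapsed = time.perf_counter() - start

print(f"{elapsed:.2f}s for 4 calls")  # ~0.2s, not 0.8s: the calls overlapped
```

Pure-Python compute in the same threads would take the full 0.8 s, since only one thread can hold the GIL at a time.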

The Models

sen-classifier-general

A general Vietnamese news classification model trained on the VNTC dataset.

Training Data: VNTC (Vietnamese News Text Classification)

  • 33,759 training samples
  • 50,373 test samples
  • 10 news categories

Categories:

| Label            | Vietnamese       | English          |
|------------------|------------------|------------------|
| Chinh tri Xa hoi | Chính trị Xã hội | Politics/Society |
| Doi song         | Đời sống         | Lifestyle        |
| Khoa hoc         | Khoa học         | Science          |
| Kinh doanh       | Kinh doanh       | Business         |
| Phap luat        | Pháp luật        | Law              |
| Suc khoe         | Sức khỏe         | Health           |
| The gioi         | Thế giới         | World            |
| The thao         | Thể thao         | Sports           |
| Van hoa          | Văn hóa          | Culture          |
| Vi tinh          | Vi tính          | Technology       |

Performance:

  • Accuracy: 92.49%
  • F1 (weighted): 92.40%
  • Training time: 37.6s

sen-classifier-bank

A Vietnamese banking intent classification model trained on the UTS2017_Bank dataset.

Training Data: UTS2017_Bank

  • 1,581 training samples
  • 396 test samples
  • 14 banking categories

Categories:

| Label            | Description              | Samples |
|------------------|--------------------------|---------|
| CUSTOMER_SUPPORT | Customer support queries | 774     |
| TRADEMARK        | Brand/trademark mentions | 697     |
| LOAN             | Loan services            | 73      |
| INTERNET_BANKING | Internet banking         | 69      |
| CARD             | Card services            | 66      |
| INTEREST_RATE    | Interest rates           | 58      |
| PROMOTION        | Promotions               | 56      |
| DISCOUNT         | Discounts                | 40      |
| MONEY_TRANSFER   | Money transfer           | 37      |
| OTHER            | Other queries            | 70      |
| PAYMENT          | Payment services         | 17      |
| SAVING           | Savings                  | 12      |
| ACCOUNT          | Account services         | 5       |
| SECURITY         | Security                 | 3       |

Performance:

  • Accuracy: 75.76% (+3.29% vs previous sonar_core_1)
  • F1 (weighted): 72.70%
  • Training time: 0.13s

Training Pipeline

Both models use a 3-stage TF-IDF + Linear SVM pipeline:

Input Text
              ↓
┌─────────────────────────────┐
│ CountVectorizer             │
│ - max_features: 20,000      │
│ - ngram_range: (1, 2)       │
└─────────────────────────────┘
              ↓
┌─────────────────────────────┐
│ TfidfTransformer            │
│ - use_idf: True             │
└─────────────────────────────┘
              ↓
┌─────────────────────────────┐
│ LinearSVC                   │
│ - C: 1.0                    │
│ - max_iter: 2000            │
│ - loss: squared_hinge       │
└─────────────────────────────┘
              ↓
Predicted Label + Confidence

Key design decisions:

  • Syllable-level tokenization: No word segmentation for speed
  • Syllable n-grams (1, 2): bigrams capture Vietnamese multi-syllable words
  • 20K vocabulary: Balances accuracy and model size
  • Linear SVM: Fast training, works well with sparse high-dimensional data
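As a sketch of what the tokenization decision means in practice, here is roughly what `ngram_range=(1, 2)` over whitespace-separated syllables produces (illustrative code, not the library's):

```python
def syllable_ngrams(text, ngram_range=(1, 2)):
    """Whitespace tokenization (syllable level, no word segmentation),
    then unigrams and bigrams, mirroring ngram_range=(1, 2)."""
    syllables = text.lower().split()
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for i in range(len(syllables) - n + 1):
            grams.append(" ".join(syllables[i:i + n]))
    return grams

print(syllable_ngrams("Thị trường chứng khoán"))
# ['thị', 'trường', 'chứng', 'khoán',
#  'thị trường', 'trường chứng', 'chứng khoán']
```

Skipping word segmentation keeps tokenization trivially fast, and the bigrams recover most multi-syllable words ("thị trường", "chứng khoán") that segmentation would have found.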

Training code: sen-1/src/scripts/train_vntc.py

Label Format Change

Labels now use Title case with spaces:

| Old        | New        |
|------------|------------|
| the_thao   | The thao   |
| kinh_doanh | Kinh doanh |
| vi_tinh    | Vi tinh    |

Bank domain labels remain uppercase: INTEREST_RATE, MONEY_TRANSFER, etc.
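For downstream code that still compares against the old labels, a small migration shim is enough. This helper is hypothetical, not part of underthesea, and covers only the mappings listed above:

```python
# Hypothetical migration shim, not part of underthesea: map legacy
# snake_case labels to the new "Title case with spaces" form.
OLD_TO_NEW = {
    "the_thao": "The thao",
    "kinh_doanh": "Kinh doanh",
    "vi_tinh": "Vi tinh",
}

def migrate(label):
    # Unknown labels (including bank-domain uppercase ones) pass through.
    return OLD_TO_NEW.get(label, label)

print(migrate("the_thao"))       # The thao
print(migrate("INTEREST_RATE"))  # INTEREST_RATE
```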

Simplified Codebase

We consolidated three separate modules into one:

Before:

classification/
├── bank/
│   └── __init__.py
├── sonar_core_1/
│   └── __init__.py
├── vntc/
│   └── __init__.py
└── __init__.py

After:

classification/
├── __init__.py              # Everything here
└── classification_prompt.py

~190 lines removed, with a single source of truth for model URLs and loading logic.

Try It Out

pip install underthesea==9.2.9

from underthesea import classify

# 273x faster!
classify("Thị trường chứng khoán tăng điểm mạnh")
# "Kinh doanh"

classify.labels
# ['Chinh tri Xa hoi', 'Doi song', 'Khoa hoc', 'Kinh doanh', ...]

classify("Mở thẻ tín dụng", domain="bank")
# ['CARD']

classify.bank.labels
# ['ACCOUNT', 'CARD', 'CUSTOMER_SUPPORT', ...]