# Rust-Powered Text Classification
In underthesea v9.2.9, we've completely rewritten the text classification pipeline using our Rust-based TextClassifier. This delivers up to 273x faster inference compared to the previous sklearn-based implementation.
## Background
Text classification in underthesea supports two domains:
- General: News categorization (10 categories)
- Bank: Banking intent classification (14 categories)
Previously, we used scikit-learn's TfidfVectorizer + LinearSVC loaded via joblib. While accurate, this approach had significant overhead.
## The Architecture Change

### Before (sklearn-based)
```
+----------------------------------------------------------------+
|                            Python                              |
|  +------------+    +-----------------+    +---------------+    |
|  | Input Text | -> | TfidfVectorizer | -> |   LinearSVC   |    |
|  +------------+    |    (sklearn)    |    |   (sklearn)   |    |
|                    +-----------------+    +---------------+    |
|                      joblib.load()         joblib.load()       |
+----------------------------------------------------------------+
```
Loading two separate pickle files, with Python-based vectorization and inference.
### After (Rust-based)
```
+----------------------------------------------------------------+
|                            Python                              |
|  +------------+    +----------------------------------+        |
|  | Input Text | -> |          TextClassifier          |        |
|  +------------+    |    TF-IDF + LinearSVC (Rust)     |        |
|                    +----------------------------------+        |
|                          single .bin file                      |
|                         (underthesea-core)                     |
+----------------------------------------------------------------+
```
Single binary model file, vectorization and inference fused in Rust.
## Code Changes
The API remains unchanged:
```python
from underthesea import classify

# General classification
classify("Việt Nam vô địch AFF Cup")
# "The thao"

# Bank domain
classify("Lãi suất tiết kiệm bao nhiêu?", domain="bank")
# ['INTEREST_RATE']
```
Internally, the implementation is much simpler:
Before:
```python
import joblib

# Two separate pickle files, loaded and run as separate stages
vectorizer = joblib.load("vectorizer.pkl")
classifier = joblib.load("classifier.pkl")

features = vectorizer.transform([text])
prediction = classifier.predict(features)
```
After:
```python
from underthesea_core import TextClassifier

classifier = TextClassifier.load("model.bin")
prediction = classifier.predict(text)
```
## Benchmark Results
Tested on the same hardware with batch inference:
| Domain | sklearn | Rust | Speedup |
|---|---|---|---|
| General | 1,228 samples/sec | 66,678 samples/sec | 54x |
| Bank | 244 samples/sec | 66,678 samples/sec | 273x |
Single-sample latency: 4 ms → 0.465 ms
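For reproducibility, a samples/sec figure like the ones above can be measured with a small timing harness; here `predict` is any callable, with a trivial stand-in rather than the real classifier:

```python
import time

def throughput(predict, samples, repeat=5):
    """Best-of-N throughput in samples per second."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        for s in samples:
            predict(s)
        best = min(best, time.perf_counter() - start)
    return len(samples) / best

# stand-in workload: 1,000 short strings through a trivial function
rate = throughput(len, ["mot cau vi du"] * 1000)
```

Taking the best of several repeats reduces noise from warm-up and scheduling jitter.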
## Why Is It Faster?

### 1. Fused Pipeline
TF-IDF vectorization and SVM inference run in a single Rust function call, eliminating Python overhead between stages.
### 2. Optimized Sparse Operations
```rust
pub fn predict(&self, text: &str) -> String {
    // Tokenize and hash features in one pass
    let features = self.vectorizer.transform(text);
    // Sparse dot product with pre-sorted indices
    let scores = self.svm.decision_function(&features);
    self.classes[scores.argmax()].clone()
}
```
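The "sparse dot product with pre-sorted indices" can be sketched in pure Python: with both index arrays sorted, each class score is a merge-style walk over the two sparse vectors. All names and data below are illustrative, not the library's API.

```python
def sparse_dot(idx_a, val_a, idx_b, val_b):
    """Dot product of two sparse vectors given pre-sorted index arrays."""
    i = j = 0
    total = 0.0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            total += val_a[i] * val_b[j]
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:
            i += 1
        else:
            j += 1
    return total

def decision_function(features, weights, intercepts):
    """One linear score per class: w . x + b."""
    f_idx, f_val = features
    return [sparse_dot(f_idx, f_val, w_idx, w_val) + b
            for (w_idx, w_val), b in zip(weights, intercepts)]

# toy example: 2 classes, a feature vector with nonzeros at indices 1 and 4
features = ([1, 4], [0.5, 2.0])
weights = [([1, 3], [1.0, 1.0]), ([4], [1.0])]
intercepts = [0.0, 0.1]
scores = decision_function(features, weights, intercepts)  # [0.5, 2.1]
best = max(range(len(scores)), key=scores.__getitem__)     # class 1
```

Because indices are sorted, each score costs a single linear pass over the nonzeros, with no hashing or random access.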
### 3. Single File Model
One .bin file instead of multiple pickle files:
- Faster loading
- Atomic deployment
- Smaller size (JSON-based, not pickle)
### 4. No Python GIL Contention
Rust code releases the GIL during computation, enabling true parallelism.
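This means classification calls can be parallelized from ordinary Python threads. A hedged sketch of the usage pattern, where `classify_stub` stands in for the GIL-releasing Rust call:

```python
from concurrent.futures import ThreadPoolExecutor

def classify_stub(text):
    """Stand-in for a CPU-bound call that releases the GIL in Rust."""
    return text.upper()

texts = ["a", "b", "c", "d"]
with ThreadPoolExecutor(max_workers=4) as pool:
    # map preserves input order, so results line up with texts
    results = list(pool.map(classify_stub, texts))
```

With a pure-Python function the threads would serialize on the GIL; with a GIL-releasing extension call they can run truly in parallel.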
## The Models

### sen-classifier-general
General Vietnamese news classification model trained on VNTC dataset.
Training Data: VNTC (Vietnamese News Text Classification)
- 33,759 training samples
- 50,373 test samples
- 10 news categories
Categories:
| Label | Vietnamese | English |
|---|---|---|
| Chinh tri Xa hoi | Chính trị Xã hội | Politics/Society |
| Doi song | Đời sống | Lifestyle |
| Khoa hoc | Khoa học | Science |
| Kinh doanh | Kinh doanh | Business |
| Phap luat | Pháp luật | Law |
| Suc khoe | Sức khỏe | Health |
| The gioi | Thế giới | World |
| The thao | Thể thao | Sports |
| Van hoa | Văn hóa | Culture |
| Vi tinh | Vi tính | Technology |
Performance:
- Accuracy: 92.49%
- F1 (weighted): 92.40%
- Training time: 37.6s
### sen-classifier-bank
Vietnamese banking intent classification model trained on UTS2017_Bank dataset.
Training Data: UTS2017_Bank
- 1,581 training samples
- 396 test samples
- 14 banking categories
Categories:
| Label | Description | Samples |
|---|---|---|
| CUSTOMER_SUPPORT | Customer support queries | 774 |
| TRADEMARK | Brand/trademark mentions | 697 |
| LOAN | Loan services | 73 |
| INTERNET_BANKING | Internet banking | 69 |
| CARD | Card services | 66 |
| INTEREST_RATE | Interest rates | 58 |
| PROMOTION | Promotions | 56 |
| DISCOUNT | Discounts | 40 |
| MONEY_TRANSFER | Money transfer | 37 |
| OTHER | Other queries | 70 |
| PAYMENT | Payment services | 17 |
| SAVING | Savings | 12 |
| ACCOUNT | Account services | 5 |
| SECURITY | Security | 3 |
Performance:
- Accuracy: 75.76% (+3.29% vs previous sonar_core_1)
- F1 (weighted): 72.70%
- Training time: 0.13s
## Training Pipeline
Both models use a 3-stage TF-IDF + Linear SVM pipeline:
```
Input Text
     |
     v
+---------------------------------------+
|            CountVectorizer            |
|  - max_features: 20,000               |
|  - ngram_range: (1, 2)                |
+---------------------------------------+
     |
     v
+---------------------------------------+
|            TfidfTransformer           |
|  - use_idf: True                      |
+---------------------------------------+
     |
     v
+---------------------------------------+
|              LinearSVC                |
|  - C: 1.0                             |
|  - max_iter: 2000                     |
|  - loss: squared_hinge                |
+---------------------------------------+
     |
     v
Predicted Label + Confidence
```
Key design decisions:
- Syllable-level tokenization: No word segmentation for speed
- Syllable n-grams (1-2): Capture multi-syllable Vietnamese words
- 20K vocabulary: Balances accuracy and model size
- Linear SVM: Fast training, works well with sparse high-dimensional data
Training code: sen-1/src/scripts/train_vntc.py
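For reference, the idf weighting that the TfidfTransformer stage applies can be written out in a few lines of plain Python (smoothed idf, matching scikit-learn's `use_idf=True` defaults; the toy corpus is illustrative only):

```python
import math

def smoothed_idf(docs, term):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, as with smooth_idf=True."""
    n = len(docs)
    df = sum(term in doc for doc in docs)
    return math.log((1 + n) / (1 + df)) + 1

docs = [["bong", "da"], ["lai", "suat"], ["bong", "da", "cup"]]
common = smoothed_idf(docs, "bong")  # in 2 of 3 docs -> ln(4/3) + 1
rare = smoothed_idf(docs, "cup")     # in 1 of 3 docs -> ln(4/2) + 1
```

Rarer terms get larger idf weights, which is what lets the linear SVM key on distinctive vocabulary rather than common syllables.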
## Label Format Change
Labels now use Title case with spaces:
| Old | New |
|---|---|
| the_thao | The thao |
| kinh_doanh | Kinh doanh |
| vi_tinh | Vi tinh |
Bank domain labels remain uppercase: INTEREST_RATE, MONEY_TRANSFER, etc.
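Downstream code that still expects the old snake_case labels can normalize with a one-line adapter (an illustrative helper, not part of the library):

```python
def to_old_format(label: str) -> str:
    """Map a new-style label ("The thao") back to the old one ("the_thao")."""
    return label.lower().replace(" ", "_")

to_old_format("The thao")          # "the_thao"
to_old_format("Chinh tri Xa hoi")  # "chinh_tri_xa_hoi"
```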
## Simplified Codebase
We consolidated three separate modules into one:
Before:
```
classification/
├── bank/
│   └── __init__.py
├── sonar_core_1/
│   └── __init__.py
├── vntc/
│   └── __init__.py
└── __init__.py
```
After:
```
classification/
├── __init__.py               # Everything here
└── classification_prompt.py
```
Roughly 190 lines removed, with a single source of truth for model URLs and loading logic.
## Try It Out
```bash
pip install underthesea==9.2.9
```
```python
from underthesea import classify

# 273x faster!
classify("Thị trường chứng khoán tăng điểm mạnh")
# "Kinh doanh"

classify.labels
# ['Chinh tri Xa hoi', 'Doi song', 'Khoa hoc', 'Kinh doanh', ...]

classify("Mở thẻ tín dụng", domain="bank")
# ['CARD']

classify.bank.labels
# ['ACCOUNT', 'CARD', 'CUSTOMER_SUPPORT', ...]
```
## Links
- PR #935 - Classification pipeline refactor
- Sen-1 - Training code and technical report
- underthesea-core - Rust extension on PyPI
