Contributing Guide
Thank you for your interest in contributing to Underthesea! This guide will help you get started.
Types of Contributions
Bug Reports
- Check existing GitHub Issues first
- Include Python version, OS, and Underthesea version
- Provide minimal code to reproduce the issue
- Include full error traceback
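The version details requested above can be gathered with a short script. This is a minimal sketch using only the standard library; the helper name `environment_report` is hypothetical and not part of Underthesea.

```python
import platform
from importlib import metadata


def environment_report(package="underthesea"):
    """Return Python, OS, and package versions as text for a bug report."""
    try:
        package_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package is not installed in the current environment.
        package_version = "not installed"
    lines = [
        f"Python: {platform.python_version()}",
        f"OS: {platform.platform()}",
        f"{package}: {package_version}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(environment_report())
```

Run it inside the same virtual environment where the bug occurs, then paste the output into the issue together with the traceback.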
Bug Fixes
- Reference the issue number in your PR
- Include tests that demonstrate the fix
- Update documentation if needed
New Features
- Open an issue to discuss the feature first
- Follow the existing code style
- Add tests and documentation
Documentation
- Fix typos and improve clarity
- Add examples and tutorials
- Translate documentation
Development Setup
Prerequisites
- Python 3.9 or higher
- Git
- uv (recommended) or pip
Clone and Install
# Clone the repository
git clone https://github.com/undertheseanlp/underthesea.git
cd underthesea
# Create virtual environment with uv
uv venv
source .venv/bin/activate
# Install in development mode
uv pip install -e ".[dev]"
macOS ARM64 (Apple Silicon)
Build the Rust extension:
cd extensions/underthesea_core
uv pip install maturin
maturin develop
cd ../..
Code Style
Linting
We use Ruff for linting:
# Check for issues
ruff check underthesea/
# Auto-fix issues
ruff check underthesea/ --fix
Configuration
Ruff configuration is in pyproject.toml.
Testing
Test Categories
| Command | Description |
|---|---|
| tox -e lint | Linting with Ruff |
| tox -e core | Core module tests |
| tox -e deep | Deep learning tests |
| tox -e prompt | Prompt model tests |
| tox -e langdetect | Language detection tests |
Running Specific Tests
# Word tokenization tests
uv run python -m unittest discover tests.pipeline.word_tokenize
# POS tagging tests
uv run python -m unittest discover tests.pipeline.pos_tag
# NER tests
uv run python -m unittest tests.pipeline.ner.test_ner
# Classification tests
uv run python -m unittest tests.pipeline.classification.test_bank
# Translation tests
uv run python -m unittest discover tests.pipeline.translate
Writing Tests
- Place tests in the tests/ directory
- Mirror the source structure
- Use Python's unittest framework
- Include both positive and edge case tests
import unittest

from underthesea import word_tokenize


class TestWordTokenize(unittest.TestCase):
    def test_basic(self):
        # Positive case: a multi-syllable word stays together as one token
        result = word_tokenize("Xin chào Việt Nam")
        self.assertEqual(result, ['Xin', 'chào', 'Việt Nam'])

    def test_empty_string(self):
        # Edge case: empty input yields an empty token list
        result = word_tokenize("")
        self.assertEqual(result, [])


if __name__ == '__main__':
    unittest.main()
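When a function has several edge cases, a table of inputs checked with `subTest` keeps them in one place and reports each failure separately. The sketch below shows the pattern only: `simple_tokenize` is a hypothetical whitespace splitter standing in for the function under test so that the example runs without any models installed.

```python
import unittest


def simple_tokenize(text):
    """Hypothetical stand-in for the function under test: splits on whitespace."""
    return text.split()


class TestSimpleTokenize(unittest.TestCase):
    def test_cases(self):
        # Each entry: (case name, input text, expected tokens)
        cases = [
            ("positive", "Xin chào", ["Xin", "chào"]),
            ("empty string", "", []),
            ("whitespace only", "   ", []),
            ("repeated spaces", "a  b", ["a", "b"]),
        ]
        for name, text, expected in cases:
            # subTest labels each case, so one failure does not hide the others
            with self.subTest(name):
                self.assertEqual(simple_tokenize(text), expected)
```

Saved as a test module, this can be run with `python -m unittest`. In a real test, import the pipeline function in place of the stand-in and use expected values confirmed against its actual output.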
Pull Request Process
Before Submitting
- Update your branch: Rebase on latest main
- Run linting: ruff check underthesea/
- Run tests: tox -e core
- Update docs: If adding features
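The lint and test steps in the checklist above can be chained in a small helper that stops at the first failure. This is a sketch, not a script shipped with the project: the name `run_checks` and the `runner` parameter are hypothetical, and it assumes `ruff` and `tox` are on your PATH and that you call it from the repository root.

```python
import subprocess

# Commands taken from the checklist; assumed to be run from the repository root.
CHECKS = [
    ["ruff", "check", "underthesea/"],
    ["tox", "-e", "core"],
]


def run_checks(checks=CHECKS, runner=subprocess.run):
    """Run each command in order; return the first failing command, or None."""
    for command in checks:
        print("$ " + " ".join(command))
        completed = runner(command)
        if completed.returncode != 0:
            # Stop early so the first failure is easy to spot.
            return command
    return None
```

Calling `run_checks()` returns `None` when every command exits with status 0, and otherwise returns the command that failed. The `runner` parameter exists only so the helper can be exercised without invoking the real tools.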
PR Guidelines
- Use clear, descriptive titles
- Reference related issues
- Describe changes and motivation
- Include test results
- Add screenshots for UI changes
PR Template
## Description
Brief description of changes
## Related Issues
Fixes #123
## Changes
- Added X feature
- Fixed Y bug
- Updated Z documentation
## Testing
- [ ] Linting passes
- [ ] Unit tests pass
- [ ] Manual testing done
## Documentation
- [ ] Updated relevant docs
- [ ] Added docstrings
Project Structure
underthesea/
├── underthesea/        # Main package
│   ├── pipeline/       # NLP modules
│   ├── models/         # Model implementations
│   ├── datasets/       # Built-in datasets
│   ├── corpus/         # Corpus handling
│   └── cli.py          # CLI commands
├── tests/              # Test files
├── docs/               # Documentation
├── extensions/         # Rust extension, apps
└── pyproject.toml      # Project configuration
CLI Commands
# List available data
underthesea list-data
# List available models
underthesea list-model
# Download data
underthesea download-data VNTC
Getting Help
Code of Conduct
- Be respectful and inclusive
- Focus on constructive feedback
- Help newcomers feel welcome