Version: 9.2.11

Contributing Guide

Thank you for your interest in contributing to Underthesea! This guide will help you get started.

Types of Contributions

Bug Reports

  • Check existing GitHub Issues first
  • Include Python version, OS, and Underthesea version
  • Provide minimal code to reproduce the issue
  • Include full error traceback
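
The version details requested above can be collected in one step. The following is a minimal sketch that uses only the Python standard library. It assumes Underthesea is installed under the distribution name `underthesea`. The helper name `collect_environment` is illustrative and is not part of the project.

```python
import platform
import sys
from importlib import metadata


def collect_environment(package="underthesea"):
    """Return the details a bug report should include (illustrative helper)."""
    try:
        package_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package is not installed in the current environment.
        package_version = "not installed"
    return {
        "python": sys.version.split()[0],
        "os": f"{platform.system()} {platform.release()}",
        "machine": platform.machine(),
        package: package_version,
    }


# Print one "key: value" line per detail, ready to paste into an issue.
for key, value in collect_environment().items():
    print(f"{key}: {value}")
```

Pasting the printed lines into the issue, together with the reproduction code and the traceback, covers the first two items in the list.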

Bug Fixes

  • Reference the issue number in your PR
  • Include tests that demonstrate the fix
  • Update documentation if needed

New Features

  • Open an issue to discuss the feature first
  • Follow the existing code style
  • Add tests and documentation

Documentation

  • Fix typos and improve clarity
  • Add examples and tutorials
  • Translate documentation

Development Setup

Prerequisites

  • Python 3.9 or higher
  • Git
  • uv (recommended) or pip

Clone and Install

# Clone the repository
git clone https://github.com/undertheseanlp/underthesea.git
cd underthesea

# Create virtual environment with uv
uv venv
source .venv/bin/activate

# Install in development mode
uv pip install -e ".[dev]"

macOS ARM64 (Apple Silicon)

On Apple Silicon Macs, build the Rust extension from source:

cd extensions/underthesea_core
uv pip install maturin
maturin develop
cd ../..

Code Style

Linting

We use Ruff for linting:

# Check for issues
ruff check underthesea/

# Auto-fix issues
ruff check underthesea/ --fix

Configuration

Ruff configuration is in pyproject.toml.

Testing

Test Categories

Command              Description
tox -e lint          Linting with Ruff
tox -e core          Core module tests
tox -e deep          Deep learning tests
tox -e prompt        Prompt model tests
tox -e langdetect    Language detection tests

Running Specific Tests

# Word tokenization tests
uv run python -m unittest discover tests.pipeline.word_tokenize

# POS tagging tests
uv run python -m unittest discover tests.pipeline.pos_tag

# NER tests
uv run python -m unittest tests.pipeline.ner.test_ner

# Classification tests
uv run python -m unittest tests.pipeline.classification.test_bank

# Translation tests
uv run python -m unittest discover tests.pipeline.translate

Writing Tests

  • Place tests in the tests/ directory
  • Mirror the source structure
  • Use Python's unittest framework
  • Include both positive and edge case tests

import unittest
from underthesea import word_tokenize


class TestWordTokenize(unittest.TestCase):
    def test_basic(self):
        result = word_tokenize("Xin chào Việt Nam")
        self.assertEqual(result, ['Xin', 'chào', 'Việt Nam'])

    def test_empty_string(self):
        result = word_tokenize("")
        self.assertEqual(result, [])


if __name__ == '__main__':
    unittest.main()

Pull Request Process

Before Submitting

  1. Update your branch: Rebase on latest main
  2. Run linting: ruff check underthesea/
  3. Run tests: tox -e core
  4. Update docs: If adding features
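
The lint and test steps in this checklist can be chained into a single pre-submit check. The following is a minimal sketch, assuming `ruff` and `tox` are on the PATH and the script is run from the repository root. The names `CHECKS` and `run_checks` are illustrative and are not part of the project.

```python
import subprocess

# Commands taken from the checklist above; running lint first is an assumption.
CHECKS = [
    ("lint", ["ruff", "check", "underthesea/"]),
    ("core tests", ["tox", "-e", "core"]),
]


def run_checks(checks=CHECKS, runner=subprocess.run):
    """Run each check in order and return the names of those that failed."""
    failed = []
    for name, command in checks:
        print(f"Running {name}: {' '.join(command)}")
        try:
            completed = runner(command)
        except FileNotFoundError:
            # The tool is not installed or not on PATH; count it as a failure.
            failed.append(name)
            continue
        if completed.returncode != 0:
            failed.append(name)
    return failed
```

Calling `run_checks()` returns an empty list when both steps pass. The `runner` parameter exists only so the function can be exercised without invoking the real tools.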

PR Guidelines

  • Use clear, descriptive titles
  • Reference related issues
  • Describe changes and motivation
  • Include test results
  • Add screenshots for UI changes

PR Template

## Description
Brief description of changes

## Related Issues
Fixes #123

## Changes
- Added X feature
- Fixed Y bug
- Updated Z documentation

## Testing
- [ ] Linting passes
- [ ] Unit tests pass
- [ ] Manual testing done

## Documentation
- [ ] Updated relevant docs
- [ ] Added docstrings

Project Structure

underthesea/
├── underthesea/        # Main package
│   ├── pipeline/       # NLP modules
│   ├── models/         # Model implementations
│   ├── datasets/       # Built-in datasets
│   ├── corpus/         # Corpus handling
│   └── cli.py          # CLI commands
├── tests/              # Test files
├── docs/               # Documentation
├── extensions/         # Rust extension, apps
└── pyproject.toml      # Project configuration

CLI Commands

# List available data
underthesea list-data

# List available models
underthesea list-model

# Download data
underthesea download-data VNTC

Getting Help

Code of Conduct

  • Be respectful and inclusive
  • Focus on constructive feedback
  • Help newcomers feel welcome