High-performance phylogenetic tree inference from DNA/RNA sequences using three complementary methods: UPGMA, BioNJ, and Maximum Likelihood with automatic model selection.
- 3 phylogenetic methods: UPGMA, BioNJ, Maximum Likelihood
- Automatic model selection: Tests 5 substitution models (JC69, K80, F81, HKY85, GTR) + gamma rate variation
- Bootstrap analysis: Confidence values for all branches
- Multiple outputs: Newick trees, ASCII visualizations, comparison summary
- Pre-aligned support: Skip alignment step for faster analysis
- Site pattern compression: 8-10x speedup for ML calculations
- Numba JIT compilation: 9x speedup for likelihood calculations
- Combined optimization: 72x total speedup for ML likelihood calculations
- MUSCLE integration: Automatic sequence alignment with 30-minute timeout
- Test datasets included: 4 curated datasets (birds, mammals, cartilaginous fish, Arcosauria)
- Comprehensive documentation: Complete usage guides and API docs
- Production-ready: Error handling, logging, organized outputs
# Clone repository
git clone https://github.com/yourusername/rrna-phylo.git
cd rrna-phylo
# Install dependencies
pip install -r requirements.txt
# Interactive menu (easiest)
python rrna_phylo_app.py
# CLI - Build all 3 trees
python rrna_phylo_cli.py data/test/birds_test_aligned.fasta --pre-aligned
# CLI - Maximum Likelihood only with bootstrap
python rrna_phylo_cli.py data/test/mammals_test_aligned.fasta --pre-aligned --method ml --bootstrap 100
# Use pre-aligned birds dataset (35 species, ~15 seconds)
python rrna_phylo_cli.py backend/data/test/birds_test_aligned.fasta --pre-aligned --method ml
# Output in: backend/results/birds_test_aligned/
UPGMA:
- Speed: Fastest (<1s for 100 sequences)
- Assumption: Molecular clock (constant evolution rate)
- Best for: Closely related sequences, initial exploration
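To make the molecular-clock assumption concrete, here is a minimal UPGMA sketch. It is not the project's `upgma.py`; the function name and the plain list-of-lists distance matrix are assumptions chosen for illustration. Each merge places the new node at half the distance between the two clusters, so every leaf ends up equally far from the root.

```python
def upgma(names, matrix):
    """Cluster taxa by UPGMA; matrix[i][j] is the distance between names[i] and names[j]."""
    # Each cluster: id -> (newick_label, leaf_count, height_above_leaves)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    dist = {frozenset((i, j)): matrix[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        pair = min(dist, key=dist.get)
        a, b = sorted(pair)
        la, na, ha = clusters.pop(a)
        lb, nb, hb = clusters.pop(b)
        # Molecular clock: both children sit at the same depth below the new node.
        height = dist.pop(pair) / 2.0
        label = f"({la}:{height - ha:.3f},{lb}:{height - hb:.3f})"
        # Distance to the merged cluster is the size-weighted average of the old distances.
        for c in clusters:
            d_a = dist.pop(frozenset((a, c)))
            d_b = dist.pop(frozenset((b, c)))
            dist[frozenset((next_id, c))] = (na * d_a + nb * d_b) / (na + nb)
        clusters[next_id] = (label, na + nb, height)
        next_id += 1
    newick, _, _ = next(iter(clusters.values()))
    return newick + ";"
```

Calling `upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]])` joins A and B first and then attaches C, with all three leaves at the same total depth.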
BioNJ:
- Speed: Fast (<1s for 100 sequences)
- Assumption: No molecular clock
- Best for: General-purpose phylogeny, variance-weighted accuracy
Maximum Likelihood:
- Speed: ~15-30s for 35-50 sequences (with pre-alignment)
- Assumption: Statistical model of sequence evolution
- Best for: Publication-quality trees, rigorous inference
- Features:
  - Automatic model selection (BIC)
  - NNI/SPR tree search
  - Bootstrap support values
  - Gamma rate variation
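BIC-based model selection can be sketched as follows. This is an illustration, not the project's `model_selection.py`: the function names are hypothetical, and the parameter counts are the textbook ones for each model (free rate and base-frequency parameters, plus one branch length per edge of an unrooted binary tree, plus one shape parameter when gamma is used). The model with the lowest BIC wins.

```python
import math

# Free substitution-model parameters (exchange rates + base frequencies).
FREE_PARAMS = {"JC69": 0, "K80": 1, "F81": 3, "HKY85": 4, "GTR": 8}

def bic(log_likelihood, n_params, n_sites):
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_sites) - 2.0 * log_likelihood

def select_model(log_likelihoods, n_taxa, n_sites, gamma=False):
    """Return (best_model, scores) given {model_name: maximized log-likelihood}."""
    branch_params = 2 * n_taxa - 3  # branch lengths of an unrooted binary tree
    scores = {}
    for model, lnl in log_likelihoods.items():
        k = FREE_PARAMS[model] + branch_params + (1 if gamma else 0)
        scores[model] = bic(lnl, k, n_sites)
    best = min(scores, key=scores.get)
    return best, scores
```

Because the penalty grows with `ln(n_sites)`, a richer model such as GTR is only chosen when its likelihood gain outweighs its eight extra parameters.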
Located in backend/data/test/:
| Dataset | Sequences | Alignment | Speed (pre-aligned) | Use Case |
|---|---|---|---|---|
| birds_test | 35 | 1,865 bp | ~15 sec | Recommended - diverse birds + turtle outgroup |
| mammals_test | 33 | 2,132 bp | ~10 sec | Mammalian phylogeny |
| cartilaginous_fish_test | 28 | 1,827 bp | ~8 sec | Sharks and rays |
| Arcosauria_test | 111 | 1,916 bp | ~30 sec | Stress testing only (too diverse) |
Each dataset has 2 versions:
- *_test.fasta - Unaligned (runs MUSCLE, slower)
- *_test_aligned.fasta - Pre-aligned (skips MUSCLE, 5-30x faster!)
See: backend/data/test/README.md for complete dataset documentation
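Before passing `--pre-aligned`, it can help to check that a file is plausibly an alignment. The sketch below is an assumption about how such a check might look, not the tool's own detection logic; `read_fasta` and `looks_aligned` are hypothetical helper names. Equal sequence lengths are necessary for an alignment but do not prove the columns are homologous.

```python
def read_fasta(text):
    """Parse FASTA text into a {header: sequence} dict."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

def looks_aligned(records):
    """True when there is at least one record and all sequences share one length."""
    lengths = {len(seq) for seq in records.values()}
    return len(lengths) == 1
```

A typical use is `looks_aligned(read_fasta(open(path).read()))`, falling back to the MUSCLE step when it returns `False`.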
python rrna_phylo_cli.py sequences.fasta
Output: backend/results/sequences/
- upgma_tree.nwk, bionj_tree.nwk, ml_tree.nwk (Newick format)
- upgma_tree.txt, bionj_tree.txt, ml_tree.txt (ASCII visualization)
- summary.txt (comparison of all methods)
# UPGMA only
python rrna_phylo_cli.py sequences.fasta --method upgma
# BioNJ only
python rrna_phylo_cli.py sequences.fasta --method bionj
# Maximum Likelihood only
python rrna_phylo_cli.py sequences.fasta --method ml
# Skip MUSCLE alignment (5-30x faster!)
python rrna_phylo_cli.py aligned.fasta --pre-aligned
# 100 replicates (recommended for publication)
python rrna_phylo_cli.py sequences.fasta --bootstrap 100
# 10 replicates (quick test)
python rrna_phylo_cli.py sequences.fasta --bootstrap 10 --method upgma
# Newick only (no ASCII trees)
python rrna_phylo_cli.py sequences.fasta --output-format newick
# ASCII only
python rrna_phylo_cli.py sequences.fasta --output-format ascii
# Both (default)
python rrna_phylo_cli.py sequences.fasta --output-format both
# Interactive menu
python rrna_phylo_app.py
Features:
- File browser (shows all files in data/test/)
- Quick build (one-click with defaults)
- Custom build (choose all options)
- Pre-aligned sequence detection
- Results viewer
- Cleanup tool
- Built-in help
No command-line flags to remember!
backend/results/
└── [filename]/
    ├── upgma_tree.nwk    # UPGMA tree (Newick)
    ├── bionj_tree.nwk    # BioNJ tree (Newick)
    ├── ml_tree.nwk       # ML tree (Newick)
    ├── upgma_tree.txt    # UPGMA tree (ASCII)
    ├── bionj_tree.txt    # BioNJ tree (ASCII)
    ├── ml_tree.txt       # ML tree (ASCII)
    └── summary.txt       # Comparison summary
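The `.nwk` files are plain text, so simple checks need no extra tooling. The sketch below pulls leaf names out of a Newick string with a regular expression; `leaf_names` is a hypothetical helper, and it assumes unquoted taxon names without spaces or parentheses. For anything beyond a quick check, a full parser such as Biopython's `Bio.Phylo.read(path, "newick")` is the safer choice.

```python
import re

# A leaf name directly follows "(" or "," and runs until a Newick delimiter.
# Labels that follow ")" (internal node names, support values) are skipped.
_LEAF = re.compile(r"[(,]\s*([^(),:;\s]+)")

def leaf_names(newick):
    """Return leaf names in the order they appear in a Newick string."""
    return _LEAF.findall(newick)
```

Comparing `sorted(leaf_names(...))` across `upgma_tree.nwk`, `bionj_tree.nwk`, and `ml_tree.nwk` is a quick way to confirm that all three trees cover the same taxa.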
                  ┌─ Gallus_gallus (Chicken)
        ┌─────────┤
        │         └─ Meleagris_gallopavo (Turkey)
  ┌─────┤
  │     └─ Anas_platyrhynchos (Mallard)
──┤
  └─ Struthio_camelus (Ostrich)
| Dataset | Sequences | Unaligned (with MUSCLE) | Pre-aligned | Speedup |
|---|---|---|---|---|
| Birds | 35 | ~1.5 min | 15 sec | 6x |
| Mammals | 33 | ~2 min | 10 sec | 12x |
| Arcosauria | 111 | ~19 min | 30 sec | 38x |
| Fish | 28 | ~1 min | 8 sec | 7.5x |
Recommendation: Use pre-aligned datasets for testing and development!
| Method | Time | Bootstrap (100 reps) |
|---|---|---|
| UPGMA | <1 sec | ~2 min |
| BioNJ | <1 sec | ~2 min |
| ML (NNI) | ~15 sec | ~25 min |
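The bootstrap column scales roughly linearly with the number of replicates because each replicate rebuilds a tree from a resampled alignment. A minimal sketch of the resampling and support calculation (Felsenstein, 1985) follows; the function names and the representation of a tree as a set of clades are assumptions for illustration, not the project's API.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}

def clade_support(clade, replicate_clades):
    """Percentage of replicate trees (each given as a set of clades) containing `clade`."""
    hits = sum(1 for clades in replicate_clades if clade in clades)
    return 100.0 * hits / len(replicate_clades)
```

In a full pipeline, each replicate from `bootstrap_replicate(alignment, random.Random(seed))` would go through the chosen tree-building method, and the clades of the resulting trees would be tallied with `clade_support`.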
rrna-phylo/
├── README.md                        # This file
├── requirements.txt                 # Core dependencies
├── requirements-dev.txt             # Development dependencies
├── docs/                            # Documentation
│   ├── CHANGELOG.md                 # Project history
│   └── BIRDS_DATASET_SUMMARY.md     # Birds dataset details
└── backend/
    ├── rrna_phylo_cli.py            # Main CLI entry point
    ├── rrna_phylo_app.py            # Interactive menu
    ├── muscle.exe                   # MUSCLE aligner (Windows)
    ├── data/
    │   ├── test/                    # Test datasets
    │   │   ├── birds_test.fasta
    │   │   ├── birds_test_aligned.fasta
    │   │   ├── mammals_test.fasta
    │   │   ├── mammals_test_aligned.fasta
    │   │   ├── cartilaginous_fish_test.fasta
    │   │   ├── cartilaginous_fish_test_aligned.fasta
    │   │   └── README.md            # Dataset documentation
    │   └── README.md
    └── rrna_phylo/                  # Core package
        ├── alignment/               # MUSCLE integration
        │   └── muscle_aligner.py
        ├── core/                    # Tree structures
        │   ├── builder.py
        │   ├── tree.py
        │   └── sequence_type.py
        ├── distance/                # Distance calculations
        │   └── distance.py
        ├── methods/                 # Tree building methods
        │   ├── upgma.py
        │   └── bionj.py
        ├── models/                  # Maximum Likelihood
        │   ├── ml_tree_level3.py    # ML with optimizations
        │   ├── rate_matrices.py     # GTR, HKY, K80, etc.
        │   └── model_selection.py   # BIC-based selection
        ├── consensus/               # Tree comparison
        │   ├── tree_distance.py     # Robinson-Foulds
        │   └── bipartitions.py
        └── visualization/           # Tree output
            └── tree_drawer.py       # ASCII trees
Python 3.9+
numpy >= 1.21.0
scipy >= 1.7.0
numba >= 0.54.0
- MUSCLE v5.1 (required for alignment)
  - Download: https://drive5.com/muscle/
  - Place in PATH or project root
- Biopython >= 1.79 (for sequence validation)
"No module named 'rrna_phylo'"
# Make sure you're in the project root, not backend/
cd /path/to/rrna-phylo
python backend/rrna_phylo_cli.py ...
MUSCLE timeout
# Default timeout is 30 minutes
# For very large datasets, the timeout may trigger
# Pre-align sequences externally and use the --pre-aligned flag
OpenMP Library Conflict
# Already handled automatically
# If still occurring, set environment variable:
# Linux/macOS
export KMP_DUPLICATE_LIB_OK=TRUE
# Windows (cmd)
set KMP_DUPLICATE_LIB_OK=TRUE
ML is slow
# Use pre-aligned datasets (skip MUSCLE)
# Skip bootstrap for testing
# Use smaller datasets for development
- README.md - This file (getting started)
- backend/data/test/README.md - Test dataset documentation
- docs/CHANGELOG.md - Complete project history and improvements
- docs/BIRDS_DATASET_SUMMARY.md - Birds dataset creation and rationale
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
MIT License - See LICENSE file for details
If you use rRNA-Phylo in your research, please cite:
rRNA-Phylo: High-performance phylogenetic analysis for ribosomal RNA
https://github.com/yourusername/rrna-phylo
Built with Claude Code (Sonnet 4.5) demonstrating systematic LLM-assisted development through skills and specialized agents.
Key optimizations:
- Site pattern compression (8-10x speedup)
- Numba JIT compilation (9x speedup)
- Combined: 72x total speedup for ML likelihood calculations
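Site pattern compression works because the likelihood of an alignment column depends only on which characters it contains, so identical columns need to be evaluated once and weighted by their count. The sketch below illustrates the idea with hypothetical function names; it is not the code in `ml_tree_level3.py`, and `site_log_likelihood` stands in for whatever per-site routine is used.

```python
from collections import Counter

def compress_patterns(alignment):
    """Collapse identical alignment columns into unique patterns and their counts."""
    # zip(*sequences) yields one tuple per column: one character per taxon.
    counts = Counter(zip(*alignment.values()))
    patterns = list(counts)
    weights = [counts[p] for p in patterns]
    return patterns, weights

def total_log_likelihood(patterns, weights, site_log_likelihood):
    """Sum per-pattern log-likelihoods, each weighted by how often the pattern occurs."""
    return sum(w * site_log_likelihood(p) for p, w in zip(patterns, weights))
```

The saving is the ratio of alignment length to the number of unique patterns, which is large for conserved markers such as rRNA where many columns are identical.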
- Felsenstein, J. (1985). Confidence limits on phylogenies: An approach using the bootstrap. Evolution, 39(4), 783-791.
- Gascuel, O. (1997). BIONJ: an improved version of the NJ algorithm. Molecular Biology and Evolution, 14(7), 685-695.
- Posada, D., & Crandall, K. A. (1998). MODELTEST: testing the model of DNA substitution. Bioinformatics, 14(9), 817-818.
- Edgar, R. C. (2022). Muscle5: High-accuracy alignment ensembles enable unbiased assessments of sequence homology and phylogeny. Nature Communications, 13, 6968.
- Numba Documentation: https://numba.pydata.org/