Add MMseqs Colabfold Search GPU Support #497
base: dev
Changes from all commits
bf0cb6d
67ded38
b830528
84604aa
abe18b8
19f023a
5047707
0a1f47d
a84c295
f97c5e6
792db45
2e8b218
1cc5445
585efc3
b0bed32
7c67723
5c0a5e3
eb30ae5
d93255f
a40831a
58eb3fd
29a7325
afe7ae9
952e80c
18d5817
6ff19b1
83ec51c
ce9959c
32453a2
45c09f2
00d421f
d1ee15c
f2b0600
96812db
7e168d5
ab3f11a
4208955
4a823e3
c1e51de
eb5afc1
e41c5d8
6523786
3efd7dc
0e064ff
d76f384
eed63bd
95428af
68ab6d7
b650853
d0ce03d
c2d830e
78164b3
189ae02
bda02f9
363236b
3fdc5ea
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,237 @@ | ||
| # Using padded databases in proteinfold | ||
|
|
||
| Proteinfold can use GPU-accelerated MSA search to speed up searches. This requires creating padded databases for the GPU hardware you wish to use and setting the appropriate flags. | ||
|
|
||
| ## Requirements | ||
|
|
||
| - mmseqs-gpu | ||
| - uniref30 database | ||
| - colabfold envdb database | ||
| - NVIDIA Ampere GPU or newer | ||
| - CUDA 12.4 or newer | ||
|
|
||
| ## Obtaining MMseqs-GPU | ||
|
|
||
| MMseqs provides two x86 builds for Linux. GPU search requires the GPU build, which can be obtained with the following commands: | ||
|
|
||
| ```bash | ||
| wget https://github.com/soedinglab/MMseqs2/releases/download/18-8cc5c/mmseqs-linux-gpu.tar.gz | ||
| tar xvf mmseqs-linux-gpu.tar.gz | ||
| ``` | ||
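After unpacking, it can help to confirm that the GPU build is the one found on your `PATH`. The sketch below assumes the archive unpacks into a `mmseqs/` directory with the binary in `mmseqs/bin`, as in the upstream MMseqs2 install instructions. `missing_tools` is a hypothetical helper written for this guide, not part of proteinfold or MMseqs2.

```bash
# Assumption: the tarball unpacked into ./mmseqs with the binary in ./mmseqs/bin.
export PATH="$(pwd)/mmseqs/bin:$PATH"

# Hypothetical helper: print each named tool that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Prints nothing when both the MMseqs binary and the NVIDIA driver utility are found.
missing_tools mmseqs nvidia-smi
```

If either name is printed, the corresponding requirement is probably not installed or not on `PATH` in the current shell.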
|
|
||
| ## Database structure | ||
|
|
||
| Proteinfold can use the `--db` flag to load all required databases. To load the padded databases, the database root should be structured as shown in the truncated tree below. Note that the padded database files have the same prefix as the CPU files. | ||
|
|
||
| ``` | ||
| . | ||
| ├── colabfold_envdb | ||
| │ ├── colabfold_envdb_202108_sample_h.tsv | ||
| │ ├── colabfold_envdb_202108_sample_seq.tsv | ||
| │ └── colabfold_envdb_202108_sample.tsv | ||
| ├── colabfold_envdb_padded | ||
| │ ├── colabfold_envdb_202108_db_seq_h.index | ||
| │ ├── colabfold_envdb_202108_db_seq.index | ||
| │ └── colabfold_envdb_202108_db_seq.lookup | ||
| ├── colabfold_uniref30 | ||
| │ ├── uniref30_2302_db_seq_h.dbtype | ||
| │ ├── uniref30_2302_db_seq_h.index | ||
| │ ├── uniref30_2302_db_seq.index | ||
| │ └── uniref30_2302_db_taxonomy | ||
| └── colabfold_uniref30_padded | ||
| ├── uniref30_2302_db_seq_h.index | ||
| ├── uniref30_2302_db_seq.index | ||
| └── uniref30_2302_db_taxonomy | ||
| ``` | ||
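To check a database root against this layout before launching the pipeline, a small helper along the following lines may be useful. `check_db_layout` is a hypothetical name, not a proteinfold command. The four directory names are the ones shown in the tree above, and the check looks only at directories, not at their contents.

```bash
# Hypothetical helper: report which of the four expected directories are
# missing under the database root passed as $1. Returns non-zero if any is absent.
check_db_layout() {
    db_root=$1
    status=0
    for dir in colabfold_envdb colabfold_envdb_padded \
               colabfold_uniref30 colabfold_uniref30_padded; do
        if [ ! -d "$db_root/$dir" ]; then
            echo "missing: $db_root/$dir" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, `check_db_layout /path/to/db/root && echo "layout looks complete"` prints the confirmation only when all four directories are present.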
|
|
||
| ## Downloading UniRef30 Database | ||
|
|
||
| First, download the UniRef30 database. The database file is approximately 55 GB. You may be able to get faster downloads by using `aria2c` with the `-x 8` option. | ||
|
|
||
| ```bash | ||
| wget https://opendata.mmseqs.org/colabfold/uniref30_2302.db.tar.gz | ||
| tar xvf uniref30_2302.db.tar.gz | ||
| ``` | ||
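As a sketch of the `aria2c` suggestion, the helper below uses `aria2c -x 8` when it is installed and falls back to `wget` otherwise. `fetch` and `DRY_RUN` are hypothetical names introduced here for illustration. With `DRY_RUN=1` the command is only printed, which makes it easy to see which downloader would be used.

```bash
# Hypothetical helper: download URL $1 with aria2c (-x 8 allows up to 8
# connections per server) if available, otherwise with wget.
# Set DRY_RUN=1 to print the command instead of running it.
fetch() {
    url=$1
    if command -v aria2c >/dev/null 2>&1; then
        set -- aria2c -x 8 "$url"
    else
        set -- wget "$url"
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Show the command that would be used for the UniRef30 archive.
DRY_RUN=1 fetch https://opendata.mmseqs.org/colabfold/uniref30_2302.db.tar.gz
```

Dropping `DRY_RUN=1` starts the actual download; the archive still needs to be unpacked with `tar` afterwards.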
|
|
||
| ## Downloading the Colabfold envdb | ||
|
|
||
| Next, you will need to download the Colabfold envdb. This database is approx. 120GB. | ||
|
|
||
| ```bash | ||
| wget https://opendata.mmseqs.org/colabfold/colabfold_envdb_202108.db.tar.gz | ||
| tar xvf colabfold_envdb_202108.db.tar.gz | ||
| ``` | ||
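Given the download sizes quoted above (about 55 GB and 120 GB), it may be worth checking free disk space before starting. The following is a minimal sketch using POSIX `df`. `has_free_gib` is a hypothetical helper, and the 175 GiB threshold covers only the two archives; the extracted and padded databases need additional space.

```bash
# Hypothetical helper: succeed if the filesystem holding directory $1 has at
# least $2 GiB available. `df -Pk` reports sizes in 1024-byte blocks.
has_free_gib() {
    avail_kib=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kib" -ge $(( $2 * 1024 * 1024 )) ]
}

# Warn if the current directory cannot hold both archives.
has_free_gib . 175 || echo "less than 175 GiB free in $(pwd)" >&2
```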
|
|
||
| ## CPU Database structure | ||
|
|
||
| By now, your directory structure should look something like this: | ||
|
|
||
| ``` | ||
| . | ||
| ├── colabfold_envdb | ||
| │ ├── colabfold_envdb_202108_sample_h.tsv | ||
| │ ├── colabfold_envdb_202108_sample_seq.tsv | ||
| │ └── colabfold_envdb_202108_sample.tsv | ||
| └── colabfold_uniref30 | ||
| ├── uniref30_2302_db_seq_h.dbtype | ||
| ├── uniref30_2302_db_seq_h.index | ||
| ├── uniref30_2302_db_seq.index | ||
| └── uniref30_2302_db_taxonomy | ||
| ``` | ||
| ## Create padded database | ||
|
|
||
| Next, create the padded databases. It is recommended to write them to separate directories so that the original (CPU) databases are kept intact. | ||
|
|
||
| ```bash | ||
| mkdir colabfold_uniref30_padded | ||
| mmseqs makepaddedseqdb ./colabfold_uniref30/uniref30_2302_db_seq ./colabfold_uniref30_padded/uniref30_2302_db_seq | ||
| mmseqs makepaddedseqdb ./colabfold_uniref30/uniref30_2302_db ./colabfold_uniref30_padded/uniref30_2302_db | ||
| mkdir colabfold_envdb_padded | ||
| mmseqs makepaddedseqdb ./colabfold_envdb/colabfold_envdb_202108_db ./colabfold_envdb_padded/colabfold_envdb_202108_db | ||
| mmseqs makepaddedseqdb ./colabfold_envdb/colabfold_envdb_202108_db_seq ./colabfold_envdb_padded/colabfold_envdb_202108_db_seq | ||
| cp ./colabfold_envdb/colabfold_envdb_202108_db_aln.* ./colabfold_envdb_padded/ | ||
| ``` | ||
|
|
||
| You should now have a directory structure similar to this: | ||
|
|
||
| ``` | ||
| . | ||
| ├── colabfold_envdb | ||
| │ ├── colabfold_envdb_202108_db.0 | ||
| │ ├── colabfold_envdb_202108_db.1 | ||
| │ ├── colabfold_envdb_202108_db.10 | ||
| │ ├── colabfold_envdb_202108_db.11 | ||
| │ ├── colabfold_envdb_202108_db.12 | ||
| │ ├── colabfold_envdb_202108_db.13 | ||
| │ ├── colabfold_envdb_202108_db.14 | ||
| │ ├── colabfold_envdb_202108_db.15 | ||
| │ ├── colabfold_envdb_202108_db.2 | ||
| │ ├── colabfold_envdb_202108_db.3 | ||
| │ ├── colabfold_envdb_202108_db.4 | ||
| │ ├── colabfold_envdb_202108_db.5 | ||
| │ ├── colabfold_envdb_202108_db.6 | ||
| │ ├── colabfold_envdb_202108_db.7 | ||
| │ ├── colabfold_envdb_202108_db.8 | ||
| │ ├── colabfold_envdb_202108_db.9 | ||
| │ ├── colabfold_envdb_202108_db_aln.0 | ||
| │ ├── colabfold_envdb_202108_db_aln.1 | ||
| │ ├── colabfold_envdb_202108_db_aln.10 | ||
| │ ├── colabfold_envdb_202108_db_aln.11 | ||
| │ ├── colabfold_envdb_202108_db_aln.12 | ||
| │ ├── colabfold_envdb_202108_db_aln.13 | ||
| │ ├── colabfold_envdb_202108_db_aln.14 | ||
| │ ├── colabfold_envdb_202108_db_aln.15 | ||
| │ ├── colabfold_envdb_202108_db_aln.2 | ||
| │ ├── colabfold_envdb_202108_db_aln.3 | ||
| │ ├── colabfold_envdb_202108_db_aln.4 | ||
| │ ├── colabfold_envdb_202108_db_aln.5 | ||
| │ ├── colabfold_envdb_202108_db_aln.6 | ||
| │ ├── colabfold_envdb_202108_db_aln.7 | ||
| │ ├── colabfold_envdb_202108_db_aln.8 | ||
| │ ├── colabfold_envdb_202108_db_aln.9 | ||
| │ ├── colabfold_envdb_202108_db_aln.dbtype | ||
| │ ├── colabfold_envdb_202108_db_aln.index | ||
| │ ├── colabfold_envdb_202108_db.dbtype | ||
| │ ├── colabfold_envdb_202108_db_h | ||
| │ ├── colabfold_envdb_202108_db_h.dbtype | ||
| │ ├── colabfold_envdb_202108_db_h.index | ||
| │ ├── colabfold_envdb_202108_db.idx | ||
| │ ├── colabfold_envdb_202108_db.idx.dbtype | ||
| │ ├── colabfold_envdb_202108_db.idx.index | ||
| │ ├── colabfold_envdb_202108_db.index | ||
| │ ├── colabfold_envdb_202108_db_seq.0 | ||
| │ ├── colabfold_envdb_202108_db_seq.1 | ||
| │ ├── colabfold_envdb_202108_db_seq.10 | ||
| │ ├── colabfold_envdb_202108_db_seq.11 | ||
| │ ├── colabfold_envdb_202108_db_seq.12 | ||
| │ ├── colabfold_envdb_202108_db_seq.13 | ||
| │ ├── colabfold_envdb_202108_db_seq.14 | ||
| │ ├── colabfold_envdb_202108_db_seq.15 | ||
| │ ├── colabfold_envdb_202108_db_seq.2 | ||
| │ ├── colabfold_envdb_202108_db_seq.3 | ||
| │ ├── colabfold_envdb_202108_db_seq.4 | ||
| │ ├── colabfold_envdb_202108_db_seq.5 | ||
| │ ├── colabfold_envdb_202108_db_seq.6 | ||
| │ ├── colabfold_envdb_202108_db_seq.7 | ||
| │ ├── colabfold_envdb_202108_db_seq.8 | ||
| │ ├── colabfold_envdb_202108_db_seq.9 | ||
| │ ├── colabfold_envdb_202108_db_seq.dbtype | ||
| │ ├── colabfold_envdb_202108_db_seq_h | ||
| │ ├── colabfold_envdb_202108_db_seq_h.dbtype | ||
| │ ├── colabfold_envdb_202108_db_seq_h.index | ||
| │ ├── colabfold_envdb_202108_db_seq.index | ||
| │ ├── colabfold_envdb_202108_sample_aln.tsv | ||
| │ ├── colabfold_envdb_202108_sample_h.tsv | ||
| │ ├── colabfold_envdb_202108_sample_seq.tsv | ||
| │ └── colabfold_envdb_202108_sample.tsv | ||
| ├── colabfold_envdb_padded | ||
| │ ├── colabfold_envdb_202108_db | ||
| │ ├── colabfold_envdb_202108_db.dbtype | ||
| │ ├── colabfold_envdb_202108_db_h | ||
| │ ├── colabfold_envdb_202108_db_h.dbtype | ||
| │ ├── colabfold_envdb_202108_db_h.index | ||
| │ ├── colabfold_envdb_202108_db.index | ||
| │ ├── colabfold_envdb_202108_db.lookup | ||
| │ ├── colabfold_envdb_202108_db_seq | ||
| │ ├── colabfold_envdb_202108_db_seq.dbtype | ||
| │ ├── colabfold_envdb_202108_db_seq_h | ||
| │ ├── colabfold_envdb_202108_db_seq_h.dbtype | ||
| │ ├── colabfold_envdb_202108_db_seq_h.index | ||
| │ ├── colabfold_envdb_202108_db_seq.index | ||
| │ └── colabfold_envdb_202108_db_seq.lookup | ||
| ├── colabfold_uniref30 | ||
| │ ├── uniref30_2302_db | ||
| │ ├── uniref30_2302_db_aln | ||
| │ ├── uniref30_2302_db_aln.dbtype | ||
| │ ├── uniref30_2302_db_aln.index | ||
| │ ├── uniref30_2302_db.dbtype | ||
| │ ├── uniref30_2302_db.GPU_READY | ||
| │ ├── uniref30_2302_db_h | ||
| │ ├── uniref30_2302_db_h.dbtype | ||
| │ ├── uniref30_2302_db_h.index | ||
| │ ├── uniref30_2302_db.idx | ||
| │ ├── uniref30_2302_db.idx.dbtype | ||
| │ ├── uniref30_2302_db.idx.index | ||
| │ ├── uniref30_2302_db.index | ||
| │ ├── uniref30_2302_db.lookup | ||
| │ ├── uniref30_2302_db_mapping | ||
| │ ├── uniref30_2302_db_seq | ||
| │ ├── uniref30_2302_db_seq.dbtype | ||
| │ ├── uniref30_2302_db_seq_h | ||
| │ ├── uniref30_2302_db_seq_h.dbtype | ||
| │ ├── uniref30_2302_db_seq_h.index | ||
| │ ├── uniref30_2302_db_seq.index | ||
| │ └── uniref30_2302_db_taxonomy | ||
| └── colabfold_uniref30_padded | ||
| ├── uniref30_2302_db | ||
| ├── uniref30_2302_db.dbtype | ||
| ├── uniref30_2302_db_h | ||
| ├── uniref30_2302_db_h.dbtype | ||
| ├── uniref30_2302_db_h.index | ||
| ├── uniref30_2302_db.index | ||
| ├── uniref30_2302_db.lookup | ||
| ├── uniref30_2302_db_seq | ||
| ├── uniref30_2302_db_seq.dbtype | ||
| ├── uniref30_2302_db_seq_h | ||
| ├── uniref30_2302_db_seq_h.dbtype | ||
| ├── uniref30_2302_db_seq_h.index | ||
| ├── uniref30_2302_db_seq.index | ||
| └── uniref30_2302_db_seq.lookup | ||
| ``` | ||
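As a quick summary of what to expect, each padded database prefix in the listing above appears with seven files: the bare prefix plus `.dbtype`, `.index`, `.lookup`, `_h`, `_h.dbtype` and `_h.index`. The sketch below checks for exactly that set. It is based only on the file names in the listing, and `missing_padded_files` is a hypothetical helper, not an MMseqs2 or proteinfold command.

```bash
# Hypothetical helper: print every file from the padded listing above that is
# absent for database prefix $2 inside directory $1.
missing_padded_files() {
    dir=$1
    prefix=$2
    for suffix in "" .dbtype .index .lookup _h _h.dbtype _h.index; do
        [ -e "$dir/$prefix$suffix" ] || echo "$dir/$prefix$suffix"
    done
}

# Check both padded databases; no output means every listed file was found.
for db in uniref30_2302_db uniref30_2302_db_seq; do
    missing_padded_files ./colabfold_uniref30_padded "$db"
done
for db in colabfold_envdb_202108_db colabfold_envdb_202108_db_seq; do
    missing_padded_files ./colabfold_envdb_padded "$db"
done
```

Any path printed here points to a `makepaddedseqdb` step that may not have completed.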
|
|
||
| ## Running colabfold | ||
|
|
||
| You will need to set the `--colabfold_enable_gpu_search true` flag. Below is an example command you can use to run with GPU search enabled: | ||
|
|
||
| ```bash | ||
| nextflow run ./main.nf \ | ||
| --input "samplesheet.csv" \ | ||
| --outdir "output" \ | ||
| --mode "colabfold" \ | ||
| --use_gpu \ | ||
| --db /path/to/db/root \ | ||
| --use_msa_server false \ | ||
| --colabfold_enable_gpu_search true \ | ||
| --colabfold_model_preset alphafold2_ptm | ||
| ``` | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -3,12 +3,13 @@ process MMSEQS_COLABFOLDSEARCH { | |
| label 'process_high_memory' | ||
| label 'process_high' | ||
|
|
||
| container "nf-core/proteinfold_mmseqs_colabfoldsearch:2.0.0" | ||
| container "docker.io/nbtmsh/mmseqs_colabfoldsearch:latest" | ||
|
We should definitely place in quay.io/nf-core (or Seqera wave when it gets to it) when
Author
Just leave it as is for now?
Fine for now. Ping Jose a bit before the merge so the container can go to the standard location.
||
|
|
||
| input: | ||
| tuple val(meta), path(fasta) | ||
| path ('db/*') | ||
| path ('db/*') | ||
| val colabfold_enable_gpu_search | ||
|
|
||
| output: | ||
| tuple val(meta), path("**.a3m"), emit: a3m | ||
|
|
@@ -25,8 +26,13 @@ process MMSEQS_COLABFOLDSEARCH { | |
| def args = task.ext.args ?: '' | ||
|
|
||
| """ | ||
| GPU_ARG="" | ||
| if [ "${colabfold_enable_gpu_search}" == "1" ]; then | ||
| GPU_ARG="--gpu 1" | ||
| fi | ||
| colabfold_search \\ | ||
| $args \\ | ||
| \${GPU_ARG} \\ | ||
| --threads $task.cpus ${fasta} \\ | ||
| ./db \\ | ||
| --af3-json \\ | ||
|
|
||
Maybe a quick summary of extensions, something like: you must see the `db_h` files, the `.index`, etc. Just to quickly highlight what's not in the pre-DLed DBs.

Just a bit confused by this. All the DBs shown here have been downloaded from the mmseqs server. This is just copying the unpadded alignment files to the padded database, as these are also needed.