diff --git a/CHANGELOG.md b/CHANGELOG.md
index cd4bf045..f70fce39 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -120,6 +120,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - [[#503](https://github.com/nf-core/proteinfold/issues/503)] - Add checkIfExists validation to user-provided database paths across all prepare DB subworkflows.
 - [[#507](https://github.com/nf-core/proteinfold/issues/507)] - Implement missing full tests and check that the others work before release 2.0.0.
 - [[PR #509](https://github.com/nf-core/proteinfold/pull/509)] - Setup gpu environment for AWS full tests.
+- [[#505](https://github.com/nf-core/proteinfold/issues/505)] - Add Protenix v1 (ByteDance) protein structure prediction mode with GPU support, model weight download via ARIA2, and metrics extraction.

 ### Parameters

@@ -165,6 +166,19 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 | | `--boltz2_mols_link` |
 | | `--boltz_model_link` |
 | | `--boltz_ccd_link` |
+| | `--protenix_db` |
+| | `--protenix_model_name` |
+| | `--protenix_use_template` |
+| | `--protenix_model_link` |
+| | `--protenix_ccd_link` |
+| | `--protenix_ccd_rdkit_link` |
+| | `--protenix_clusters_link` |
+| | `--protenix_obsolete_link` |
+| | `--protenix_model_path` |
+| | `--protenix_ccd_path` |
+| | `--protenix_ccd_rdkit_path` |
+| | `--protenix_clusters_path` |
+| | `--protenix_obsolete_path` |

 > **NB:** Parameter has been **updated** if both old and new parameter information is present.
 > **NB:** Parameter has been **added** if just the new parameter information is present.
diff --git a/CITATIONS.md b/CITATIONS.md
index 1b1f9291..e94a7527 100644
--- a/CITATIONS.md
+++ b/CITATIONS.md
@@ -24,6 +24,10 @@

 > Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online].

+- [Protenix](https://github.com/bytedance/protenix)
+
+  > ByteDance Research. Protenix: An open-source implementation of AlphaFold3 for protein structure prediction. GitHub. 2024.
+
 - [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

 > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
diff --git a/README.md b/README.md
index e37b1a88..344baaab 100644
--- a/README.md
+++ b/README.md
@@ -52,6 +52,8 @@ On release, automated continuous integration tests run the pipeline on a full-si
    x. [RosettaFold2NA](https://github.com/uw-ipd/RoseTTAFold2NA) - Regular RF2NA

+   xi. [Protenix](https://github.com/bytedance/protenix) - ByteDance Protenix v1
+
 ## Usage

 > [!NOTE]

 nextflow run nf-core/proteinfold \
    --input samplesheet.csv \
    --outdir <OUTDIR>
 ```

-The pipeline takes care of downloading the databases and parameters required by AlphaFold2, Colabfold, ESMFold RoseTTAFold-All-Atom or RosettaFold2NA. In case you have already downloaded the required files, you can skip this step by providing the path to the databases using the corresponding parameter [`--alphafold2_db`], [`--colabfold_db`], [`--esmfold_db`] or ['--rosettafold_all_atom_db']. Please refer to the [usage documentation](https://nf-co.re/proteinfold/usage) to check the directory structure you must provide for each database.
+The pipeline takes care of downloading the databases and parameters required by AlphaFold2, Colabfold, ESMFold, RoseTTAFold-All-Atom, RosettaFold2NA, Boltz or Protenix. In case you have already downloaded the required files, you can skip this step by providing the path to the databases using the corresponding parameter [`--alphafold2_db`], [`--colabfold_db`], [`--esmfold_db`], [`--rosettafold_all_atom_db`] or [`--protenix_db`]. Please refer to the [usage documentation](https://nf-co.re/proteinfold/usage) to check the directory structure you must provide for each database.
 - The typical command to run AlphaFold2 mode is shown below:
@@ -211,6 +213,19 @@ The pipeline takes care of downloading the databases and parameters required by
      -profile <docker/singularity>
   ```

+- The protenix mode can be run using the command below:
+
+  ```console
+  nextflow run nf-core/proteinfold \
+      --input samplesheet.csv \
+      --outdir <OUTDIR> \
+      --mode protenix \
+      --protenix_db <DB_PATH> \
+      --protenix_model_name <MODEL_NAME> \
+      --use_gpu \
+      -profile <docker/singularity>
+  ```
+
 > [!WARNING]
 > Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_; see [docs](https://nf-co.re/docs/usage/getting_started/configuration#custom-configuration-files).

@@ -230,7 +245,7 @@ For details on how to contribute new modes to the pipeline please refer to the [

 nf-core/proteinfold was originally written by Athanasios Baltzis ([@athbaltzis](https://github.com/athbaltzis)), Jose Espinosa-Carrasco ([@JoseEspinosa](https://github.com/JoseEspinosa)), Luisa Santus ([@luisas](https://github.com/luisas)) and Leila Mansouri ([@l-mansouri](https://github.com/l-mansouri)) from [The Comparative Bioinformatics Group](https://www.crg.eu/en/cedric_notredame) at [The Centre for Genomic Regulation, Spain](https://www.crg.eu/) under the umbrella of the [BovReg project](https://www.bovreg.eu/) and Harshil Patel ([@drpatelh](https://github.com/drpatelh)) from [Seqera Labs, Spain](https://seqera.io/).

-Many thanks to others who have helped out and contributed along the way too, including (but not limited to): Norman Goodacre and Waleed Osman from Interline Therapeutics ([@interlinetx](https://github.com/interlinetx)), Martin Steinegger ([@martin-steinegger](https://github.com/martin-steinegger)) and Raoul J.P. Bonnal ([@rjpbonnal](https://github.com/rjpbonnal))
+Many thanks to others who have helped out and contributed along the way too, including (but not limited to): Norman Goodacre and Waleed Osman from Interline Therapeutics ([@interlinetx](https://github.com/interlinetx)), Martin Steinegger ([@martin-steinegger](https://github.com/martin-steinegger)), Raoul J.P. Bonnal ([@rjpbonnal](https://github.com/rjpbonnal)) and Seunghyun Kang ([@nan5895](https://github.com/nan5895))

 We would also like to thanks to the AWS Open Data Sponsorship Program for generously providing the resources necessary to host the data utilized in the testing, development, and deployment of nf-core proteinfold.
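For orientation, the Protenix JSON input that the new `bin/fasta_to_protenix_json.py` helper below emits looks like the following for a FASTA with one protein chain and one DNA chain. This is a minimal illustrative sketch: the sequences and MSA paths are placeholders, not real data, and the field names simply mirror what the script writes.

```json
[
  {
    "name": "sample_1",
    "sequences": [
      {
        "proteinChain": {
          "sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
          "count": 1,
          "pairedMsaPath": "msa_protenix/0/pairing.a3m",
          "unpairedMsaPath": "msa_protenix/0/non_pairing.a3m"
        }
      },
      {
        "dnaSequence": {
          "sequence": "ACGTACGTACGT",
          "count": 1
        }
      }
    ],
    "covalent_bonds": []
  }
]
```

The `pairedMsaPath`/`unpairedMsaPath` keys are only added when `--msa` CSV files from SPLIT_MSA are supplied; when they are not, the keys are simply omitted.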
diff --git a/assets/dummy_db_dir/params/clusters-by-entity-40.txt b/assets/dummy_db_dir/params/clusters-by-entity-40.txt
new file mode 100644
index 00000000..e69de29b
diff --git a/assets/dummy_db_dir/params/components.cif b/assets/dummy_db_dir/params/components.cif
new file mode 100644
index 00000000..e69de29b
diff --git a/assets/dummy_db_dir/params/components.cif.rdkit_mol.pkl b/assets/dummy_db_dir/params/components.cif.rdkit_mol.pkl
new file mode 100644
index 00000000..e69de29b
diff --git a/assets/dummy_db_dir/params/obsolete_release_date.csv b/assets/dummy_db_dir/params/obsolete_release_date.csv
new file mode 100644
index 00000000..e69de29b
diff --git a/assets/dummy_db_dir/params/protenix_base_default_v1.0.0.pt b/assets/dummy_db_dir/params/protenix_base_default_v1.0.0.pt
new file mode 100644
index 00000000..e69de29b
diff --git a/bin/fasta_to_protenix_json.py b/bin/fasta_to_protenix_json.py
new file mode 100755
index 00000000..9695bbd7
--- /dev/null
+++ b/bin/fasta_to_protenix_json.py
@@ -0,0 +1,205 @@
+#!/usr/bin/env python3
+"""
+Convert FASTA files to Protenix JSON input format.
+
+Protenix expects a JSON array where each element has:
+    - "name": job name
+    - "sequences": list of chain definitions
+    - "covalent_bonds": []
+
+Optionally includes pre-computed MSA paths (pairedMsaPath, unpairedMsaPath)
+for each protein chain when --msa CSV files are provided from SPLIT_MSA.
+
+Usage:
+    fasta_to_protenix_json.py <input.fasta> <sample_id> -o <output_dir> [--msa file1.csv file2.csv]
+"""
+
+import argparse
+import csv
+import json
+import os
+import sys
+
+
+def infer_entity_type(header, sequence):
+    """Infer entity type from FASTA header and sequence content."""
+    header_lower = header.lower()
+    if "dna" in header_lower:
+        return "dna"
+    if "rna" in header_lower:
+        return "rna"
+    if "ligand" in header_lower or "smiles" in header_lower:
+        return "ligand"
+
+    # Fall back to alphabet-based inference. Note the RNA check runs first,
+    # so ambiguous sequences containing only A/C/G/N are classified as RNA.
+    seq = sequence.strip().upper()
+    seq_set = set(seq)
+    if seq_set <= set("ACUGN") and len(seq) > 1:
+        return "rna"
+    if seq_set <= set("ACTGN") and len(seq) > 1:
+        return "dna"
+
+    return "protein"
+
+
+def parse_fasta(fasta_file):
+    """Parse a FASTA file into list of (header, sequence) tuples."""
+    entries = []
+    header = None
+    seq_lines = []
+
+    with open(fasta_file, "r") as f:
+        for line in f:
+            line = line.strip()
+            if line.startswith(">"):
+                if header is not None:
+                    entries.append((header, "".join(seq_lines)))
+                header = line[1:]
+                seq_lines = []
+            elif line:
+                seq_lines.append(line)
+
+    if header is not None:
+        entries.append((header, "".join(seq_lines)))
+
+    return entries
+
+
+def csv_to_a3m(csv_file, output_dir, chain_idx):
+    """Convert MSA CSV (from SPLIT_MSA/msa_manager.py) to paired/unpaired A3M files."""
+    paired = []
+    unpaired = []
+
+    with open(csv_file, "r") as f:
+        reader = csv.reader(f)
+        next(reader)  # skip header row (key,sequence)
+        for row in reader:
+            key = int(row[0])
+            seq = row[1]
+            if key == -1:
+                unpaired.append(seq)
+            else:
+                paired.append(seq)
+
+    chain_dir = os.path.join(output_dir, str(chain_idx))
+    os.makedirs(chain_dir, exist_ok=True)
+
+    pairing_path = os.path.join(chain_dir, "pairing.a3m")
+    non_pairing_path = os.path.join(chain_dir, "non_pairing.a3m")
+
+    with open(pairing_path, "w") as f:
+        for i, seq in enumerate(paired):
+            f.write(f">paired_{i}\n{seq}\n")
+
+    with open(non_pairing_path, "w") as f:
+        for i, seq in enumerate(unpaired):
+            f.write(f">unpaired_{i}\n{seq}\n")
+
+    return pairing_path, non_pairing_path
+
+
+def fasta_to_protenix_json(fasta_file, sample_id, msa_files=None, output_dir="."):
+    """Convert a FASTA file to
Protenix JSON format with optional MSA.""" + entries = parse_fasta(fasta_file) + + if not entries: + print(f"Error: No sequences found in {fasta_file}", file=sys.stderr) + sys.exit(1) + + msa_output_dir = os.path.join(output_dir, "msa_protenix") + os.makedirs(msa_output_dir, exist_ok=True) + + sequences = [] + protein_idx = 0 + unique_proteins = {} + msa_counter = 0 + + for header, sequence in entries: + entity_type = infer_entity_type(header, sequence) + + if entity_type == "protein": + chain_def = { + "proteinChain": { + "sequence": sequence, + "count": 1 + } + } + if msa_files: + if sequence not in unique_proteins: + unique_proteins[sequence] = msa_counter + msa_counter += 1 + this_msa_idx = unique_proteins[sequence] + if this_msa_idx < len(msa_files): + pairing_path, non_pairing_path = csv_to_a3m( + msa_files[this_msa_idx], msa_output_dir, protein_idx + ) + chain_def["proteinChain"]["pairedMsaPath"] = pairing_path + chain_def["proteinChain"]["unpairedMsaPath"] = non_pairing_path + protein_idx += 1 + sequences.append(chain_def) + elif entity_type == "dna": + sequences.append({ + "dnaSequence": { + "sequence": sequence, + "count": 1 + } + }) + elif entity_type == "rna": + sequences.append({ + "rnaSequence": { + "sequence": sequence, + "count": 1 + } + }) + elif entity_type == "ligand": + sequences.append({ + "ligand": { + "ligand": sequence, + "count": 1 + } + }) + + job = { + "name": sample_id, + "sequences": sequences, + "covalent_bonds": [] + } + + return [job] + + +def main(): + parser = argparse.ArgumentParser( + description="Convert FASTA to Protenix JSON format" + ) + parser.add_argument("FASTA", help="Input FASTA file") + parser.add_argument("ID", help="Sample identifier") + parser.add_argument( + "-o", "--output-dir", default=".", + help="Output directory (default: current dir)" + ) + parser.add_argument( + "--msa", + nargs='*', + default=[], + help="MSA CSV files for protein sequences (from SPLIT_MSA)." 
+    )
+    args = parser.parse_args()
+
+    os.makedirs(args.output_dir, exist_ok=True)
+
+    json_data = fasta_to_protenix_json(
+        args.FASTA, args.ID,
+        msa_files=args.msa if args.msa else None,
+        output_dir=args.output_dir
+    )
+    output_path = os.path.join(args.output_dir, f"{args.ID}.json")
+
+    with open(output_path, "w") as f:
+        json.dump(json_data, f, indent=2)
+
+    print(f"Generated: {output_path}")
+
+
+if __name__ == "__main__":
+    main()
+
diff --git a/bin/generate_report.py b/bin/generate_report.py
index a2c6e9db..a0602a29 100755
--- a/bin/generate_report.py
+++ b/bin/generate_report.py
@@ -379,7 +379,8 @@ def pdb_to_lddt(struct_files, generate_tsv):
     "rosettafold_all_atom": "RosettaFold All-Atom",
     "helixfold3": "HelixFold3",
     "rosettafold2na": "RoseTTAFold2NA",
-    "boltz": "Boltz"
+    "boltz": "Boltz",
+    "protenix": "Protenix"
 }

 parser = argparse.ArgumentParser()
diff --git a/bin/protenix_extract_metrics.py b/bin/protenix_extract_metrics.py
new file mode 100755
index 00000000..3bdf58fc
--- /dev/null
+++ b/bin/protenix_extract_metrics.py
@@ -0,0 +1,118 @@
+#!/usr/bin/env python3
+import argparse
+import csv
+import glob
+import json
+import os
+import sys
+import numpy as np
+
+def write_tsv(file_path, rows):
+    """Write a TSV file in the same format as the existing metric scripts."""
+    with open(file_path, 'w', newline='') as out_f:
+        writer = csv.writer(out_f, delimiter='\t')
+        writer.writerows(rows)
+
+def format_pae_rows(pae_data):
+    """Format PAE values as strings with four decimal places."""
+    if not pae_data:
+        return [["0.0000"]]
+    return [[f"{num:.4f}" for num in row] for row in pae_data]
+
+def extract_metrics(name, out_dir):
+    """
+    Extract data from a Protenix output directory and write it in the
+    format expected by the existing nf-core/proteinfold reporting.
+    """
+    # 1. Locate the Protenix confidence JSON files
+    json_files = sorted(
+        glob.glob(os.path.join(out_dir, "**", "*_summary_confidence_sample_*.json"), recursive=True)
+    )
+
+    if not json_files:
+        print(f"Warning: No Protenix confidence files found in {out_dir}", file=sys.stderr)
+        return
+
+    ptm_data = {}
+    iptm_data = {}
+    plddt_summary = []
+    pae_created = False
+
+    for idx, json_file in enumerate(json_files):
+        with open(json_file, 'r') as f:
+            try:
+                data = json.load(f)
+            except json.JSONDecodeError:
+                continue
+
+        model_id = idx  # keep numeric IDs so models sort as rank_0, rank_1, ...
+
+        # Extract pLDDT (Protenix reports a single value per sample, stored as one summary row)
+        if "plddt" in data:
+            val = data["plddt"]
+            plddt_summary.append([f"rank_{idx}", f"{val:.2f}"])
+
+        # Extract PAE (written only when present; a dummy file is created later otherwise)
+        if "pae" in data and data["pae"]:
+            write_tsv(f"{name}_{idx}_pae.tsv", format_pae_rows(data["pae"]))
+            pae_created = True
+
+        # Extract pTM / iPTM
+        if 'ptm' in data and data['ptm'] is not None:
+            ptm_data[model_id] = f"{np.round(data['ptm'], 3)}"
+        if 'iptm' in data and data['iptm'] is not None:
+            iptm_data[model_id] = f"{np.round(data['iptm'], 3)}"
+
+    # --- Guarantee that the files required for Nextflow output emission exist ---
+
+    # 1. pLDDT (for MultiQC)
+    if plddt_summary:
+        # Existing style: [["Positions", "pLDDT"], ["rank_0", "82.07"]]
+        write_tsv(f"{name}_plddt.tsv", [["Positions", "pLDDT"]] + plddt_summary)
+    else:
+        write_tsv(f"{name}_plddt.tsv", [["Positions", "pLDDT"]])
+
+    # 2. pTM & iPTM (keep the sort order used by the existing scripts)
+    if ptm_data:
+        ptm_rows = sorted([[k, v] for k, v in ptm_data.items()], key=lambda x: x[0])
+        write_tsv(f"{name}_ptm.tsv", ptm_rows)
+
+    if iptm_data:
+        iptm_rows = sorted([[k, v] for k, v in iptm_data.items()], key=lambda x: x[0])
+        write_tsv(f"{name}_iptm.tsv", iptm_rows)
+
+    # 3. PAE (the index-0 file must exist even without data, otherwise the run errors)
+    if not pae_created:
+        write_tsv(f"{name}_0_pae.tsv", [["0.0000"]])
+
+    # 4. Chainwise pTM/iPTM (Protenix-specific data)
+    try:
+        with open(json_files[0], 'r') as f:
+            data = json.load(f)
+
+        c_iptm, c_ptm = [], []
+        if "chain_pair_iptm" in data and isinstance(data["chain_pair_iptm"], list):
+            matrix = np.array(data["chain_pair_iptm"])
+            for i in range(matrix.shape[0]):
+                for j in range(matrix.shape[1]):
+                    val = f"{matrix[i][j]:.4f}"
+                    if i != j:
+                        c_iptm.append(val)
+                    else:
+                        c_ptm.append(val)
+
+        # Always create the files, even when empty (required for Nextflow output emission)
+        write_tsv(f"{name}_chainwise_ptm.tsv", [c_ptm] if c_ptm else [["0.0000"]])
+        write_tsv(f"{name}_chainwise_iptm.tsv", [c_iptm] if c_iptm else [["0.0000"]])
+    except Exception:
+        write_tsv(f"{name}_chainwise_ptm.tsv", [["0.0000"]])
+        write_tsv(f"{name}_chainwise_iptm.tsv", [["0.0000"]])
+
+def main():
+    parser = argparse.ArgumentParser(description="Extract metrics from Protenix output")
+    parser.add_argument("--name", required=True, help="Sample identifier (meta.id)")
+    parser.add_argument("--out_dir", required=True, help="Protenix output directory")
+    args = parser.parse_args()
+
+    extract_metrics(args.name, args.out_dir)
+
+if __name__ == "__main__":
+    main()
diff --git a/conf/dbs.config b/conf/dbs.config
index ff6aad46..095e4913 100644
--- a/conf/dbs.config
+++ b/conf/dbs.config
@@ -77,6 +77,20 @@ params {
     boltz2_conf_path = "${params.boltz_db}/params/boltz2_conf.ckpt"
     boltz2_mols_path = "${params.boltz_db}/params/mols/"

+    // Protenix links
+    protenix_model_link     = 'https://protenix.tos-cn-beijing.volces.com/checkpoint/protenix_base_default_v1.0.0.pt'
+    protenix_ccd_link       = 'https://protenix.tos-cn-beijing.volces.com/common/components.cif'
+    protenix_ccd_rdkit_link = 'https://protenix.tos-cn-beijing.volces.com/common/components.cif.rdkit_mol.pkl'
+    protenix_clusters_link  = 'https://protenix.tos-cn-beijing.volces.com/common/clusters-by-entity-40.txt'
+    protenix_obsolete_link  = 'https://protenix.tos-cn-beijing.volces.com/common/obsolete_release_date.csv'
+
+    // Protenix paths
+    protenix_model_path     = "${params.protenix_db}/params/protenix_base_default_v1.0.0.pt"
+    protenix_ccd_path       = "${params.protenix_db}/params/components.cif"
+    protenix_ccd_rdkit_path = "${params.protenix_db}/params/components.cif.rdkit_mol.pkl"
+    protenix_clusters_path  = "${params.protenix_db}/params/clusters-by-entity-40.txt"
+    protenix_obsolete_path  = "${params.protenix_db}/params/obsolete_release_date.csv"
+
     // Colabfold links
     colabfold_db_link       = 'https://opendata.mmseqs.org/colabfold/colabfold_envdb_202108.db.tar.gz'
     colabfold_uniref30_link = 'https://opendata.mmseqs.org/colabfold/uniref30_2302.db.tar.gz'
diff --git a/conf/modules_protenix.config b/conf/modules_protenix.config
new file mode 100644
index 00000000..c835aada
--- /dev/null
+++ b/conf/modules_protenix.config
@@ -0,0 +1,116 @@
+/*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    Config file for defining DSL2 per module options and publishing paths
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    Available keys to override module options:
+        ext.args   = Additional arguments appended to command in module.
+        ext.args2  = Second set of arguments appended to command in module (multi-tool modules).
+        ext.args3  = Third set of arguments appended to command in module (multi-tool modules).
+        ext.prefix = File name prefix for output files.
+----------------------------------------------------------------------------------------
+*/
+
+process {
+    // ColabFold database download configs (shared with other modes)
+    withName: '.*ARIA2_COLABFOLD_PARAMS:UNTAR' {
+        ext.prefix = { "${params.colabfold_alphafold2_params_tags[params.colabfold_model_preset]}" }
+        publishDir = [
+            path: { "${params.outdir}/DBs/${params.mode}/params" },
+            saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
+        ]
+    }
+    withName: '.*ARIA2_COLABFOLD_DB:UNTAR' {
+        ext.prefix = 'colabfold_envdb'
+    }
+    withName: '.*PREPARE_COLABFOLD_DBS:ARIA2_UNIREF30:UNTAR' {
+        ext.prefix = 'colabfold_uniref30'
+    }
+    withName: 'ARIA2_PROTENIX_MODEL' {
+        ext.args = '-o protenix_base_default_v1.0.0.pt'
+        publishDir = [
+            path: { "${params.outdir}/DBs/${params.mode}/params" },
+            pattern: 'protenix_base_default_v1.0.0.pt',
+        ]
+    }
+    withName: 'ARIA2_PROTENIX_CCD' {
+        ext.args = '-o components.cif'
+        publishDir = [
+            path: { "${params.outdir}/DBs/${params.mode}/params" },
+            pattern: 'components.cif',
+        ]
+    }
+    withName: 'ARIA2_PROTENIX_CCD_RDKIT' {
+        ext.args = '-o components.cif.rdkit_mol.pkl'
+        publishDir = [
+            path: { "${params.outdir}/DBs/${params.mode}/params" },
+            pattern: 'components.cif.rdkit_mol.pkl',
+        ]
+    }
+    withName: 'ARIA2_PROTENIX_CLUSTERS' {
+        ext.args = '-o clusters-by-entity-40.txt'
+        publishDir = [
+            path: { "${params.outdir}/DBs/${params.mode}/params" },
+            pattern: 'clusters-by-entity-40.txt',
+        ]
+    }
+    withName: 'ARIA2_PROTENIX_OBSOLETE' {
+        ext.args = '-o obsolete_release_date.csv'
+        publishDir = [
+            path: { "${params.outdir}/DBs/${params.mode}/params" },
+            pattern: 'obsolete_release_date.csv',
+        ]
+    }
+    withName: 'RUN_PROTENIX' {
+        if (params.use_gpu) {
+            accelerator = 1
+        }
+        ext.args = [
+            params.protenix_use_template ? "--use_template true" : "",
+            "--use_default_params true",
+        ].findAll { arg -> arg }.join(' ').trim()
+
+        publishDir = [
+            [
+                path: { "${params.outdir}/protenix/${meta.id}" },
+                mode: 'copy',
+                pattern: '*_plddt.tsv'
+            ],
+            [
+                path: { "${params.outdir}/protenix/${meta.id}" },
+                mode: 'copy',
+                pattern: '*{ptm,iptm}.tsv'
+            ],
+            [
+                path: { "${params.outdir}/protenix/${meta.id}/paes" },
+                mode: 'copy',
+                pattern: '*_[0-5]_pae.tsv'
+            ],
+            [
+                path: { "${params.outdir}/protenix/top_ranked_structures" },
+                mode: 'copy',
+                saveAs: { _filename -> "${meta.id}.pdb" },
+                pattern: '*_protenix.pdb'
+            ],
+            [
+                enabled: params.save_intermediates,
+                path: { "${params.outdir}/protenix/${meta.id}" },
+                mode: 'copy',
+                pattern: 'protenix_output',
+            ],
+        ]
+    }
+
+    withName: 'PROTENIX_FASTA|MULTIFASTA_TO_CSV|SPLIT_MSA' {
+        cpus   = 1
+        memory = 2.GB
+        time   = 1.h
+    }
+
+    withName: 'NFCORE_PROTEINFOLD:PROTENIX:MULTIQC' {
+        publishDir = [
+            path: { "${params.outdir}/multiqc" },
+            mode: 'copy',
+            saveAs: { filename -> filename.equals('versions.yml') ? null : "protenix_$filename" }
+        ]
+    }
+}
diff --git a/conf/test_full_protenix.config b/conf/test_full_protenix.config
new file mode 100644
index 00000000..20e13908
--- /dev/null
+++ b/conf/test_full_protenix.config
@@ -0,0 +1,23 @@
+/*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    Nextflow config file for running full-size tests
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    Defines input files and everything required to run a full size pipeline test.
+
+    Use as follows:
+        nextflow run nf-core/proteinfold -profile test_full_protenix,<docker/singularity> --outdir <OUTDIR>
+
+----------------------------------------------------------------------------------------
+*/
+
+params {
+    config_profile_name        = 'Full test profile for protenix'
+    config_profile_description = 'Full test dataset to check pipeline function'
+
+    // Input data for full test of protenix
+    mode                   = 'protenix'
+    colabfold_model_preset = 'alphafold2_ptm'
+    use_gpu                = true
+    input                  = params.pipelines_testdata_base_path + 'proteinfold/testdata/samplesheet/v1.2/samplesheet.csv'
+    colabfold_db           = 's3://proteinfold-dataset/test-data/mini_dbs'
+}
diff --git a/conf/test_protenix.config b/conf/test_protenix.config
new file mode 100644
index 00000000..a8e8ba3b
--- /dev/null
+++ b/conf/test_protenix.config
@@ -0,0 +1,39 @@
+/*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    Nextflow config file for running stub tests
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    Defines input files and everything required to run a stub pipeline test.
+
+    Use as follows:
+        nextflow run nf-core/proteinfold -profile test_protenix,<docker/singularity> --outdir <OUTDIR>
+
+----------------------------------------------------------------------------------------
+*/
+
+stubRun = true
+
+process {
+    resourceLimits = [
+        cpus: 1,
+        memory: '2.GB',
+        time: '1.h'
+    ]
+}
+
+params {
+    config_profile_name        = 'Stub test profile for protenix'
+    config_profile_description = 'Minimal test dataset to check pipeline function'
+
+    // Input data for stub test of protenix
+    mode                   = 'protenix'
+    colabfold_model_preset = 'alphafold2_ptm'
+    input                  = params.pipelines_testdata_base_path + 'proteinfold/testdata/samplesheet/v1.2/samplesheet.csv'
+    colabfold_db           = "${projectDir}/assets/dummy_db_dir"
+    protenix_db            = "${projectDir}/assets/dummy_db_dir"
+}
+
+process {
+    withName: 'MMSEQS_COLABFOLDSEARCH|RUN_PROTENIX' {
+        container = 'biocontainers/gawk:5.1.0'
+    }
+}
diff --git a/main.nf b/main.nf
index 9ddbee2c..c61d238c 100644
--- a/main.nf
+++ b/main.nf
@@ -21,6 +21,7 @@ include { PREPARE_ESMFOLD_DBS } from './subworkflows/local/prepare_
 include { PREPARE_ROSETTAFOLD_ALL_ATOM_DBS } from './subworkflows/local/prepare_rosettafold_all_atom_dbs'
 include { PREPARE_HELIXFOLD3_DBS           } from './subworkflows/local/prepare_helixfold3_dbs'
 include { PREPARE_BOLTZ_DBS                } from './subworkflows/local/prepare_boltz_dbs'
+include { PREPARE_PROTENIX_DBS             } from './subworkflows/local/prepare_protenix_dbs'
 include { PREPARE_COLABFOLD_DBS            } from './subworkflows/local/prepare_colabfold_dbs'
 include { PREPARE_ROSETTAFOLD2NA_DBS       } from './subworkflows/local/prepare_rosettafold2na_dbs'

@@ -31,6 +32,7 @@ include { ESMFOLD              } from './workflows/esmfold'
 include { ROSETTAFOLD_ALL_ATOM } from './workflows/rosettafold_all_atom'
 include { HELIXFOLD3           } from './workflows/helixfold3'
 include { BOLTZ                } from './workflows/boltz'
+include { PROTENIX             } from './workflows/protenix'
 include { ROSETTAFOLD2NA       } from './workflows/rosettafold2na'

 include { PIPELINE_INITIALISATION } from './subworkflows/local/utils_nfcore_proteinfold_pipeline'
@@ -560,6 +562,61 @@ workflow NFCORE_PROTEINFOLD {
         )
         ch_top_ranked_model = ch_top_ranked_model.mix(BOLTZ.out.top_ranked_pdb)
     }
+
+    //
+    // WORKFLOW: Run Protenix
+    //
+    if (params.mode.toLowerCase().split(",").contains("protenix")) {
+
+        PREPARE_PROTENIX_DBS(
+            params.protenix_db,
+            params.protenix_model_path,
+            params.protenix_ccd_path,
+            params.protenix_ccd_rdkit_path,
+            params.protenix_clusters_path,
+
params.protenix_obsolete_path, + params.protenix_model_link, + params.protenix_ccd_link, + params.protenix_ccd_rdkit_link, + params.protenix_clusters_link, + params.protenix_obsolete_link + ) + ch_versions = ch_versions.mix(PREPARE_PROTENIX_DBS.out.versions) + + PREPARE_COLABFOLD_DBS ( + params.colabfold_db, + params.use_msa_server, + params.colabfold_alphafold2_params_path, + params.colabfold_envdb_path, + params.colabfold_uniref30_path, + params.colabfold_alphafold2_params_link, + params.colabfold_db_link, + params.colabfold_uniref30_link, + params.colabfold_create_index + ) + ch_versions = ch_versions.mix(PREPARE_COLABFOLD_DBS.out.versions) + + PROTENIX( + ch_samplesheet, + ch_versions, + PREPARE_PROTENIX_DBS.out.protenix_model, + PREPARE_PROTENIX_DBS.out.protenix_ccd, + PREPARE_PROTENIX_DBS.out.protenix_ccd_rdkit, + PREPARE_PROTENIX_DBS.out.protenix_clusters, + PREPARE_PROTENIX_DBS.out.protenix_obsolete, + PREPARE_COLABFOLD_DBS.out.colabfold_db, + PREPARE_COLABFOLD_DBS.out.uniref30, + params.use_msa_server + ) + ch_multiqc = ch_multiqc.mix(PROTENIX.out.multiqc_report) + ch_versions = ch_versions.mix(PROTENIX.out.versions) + ch_report_input = ch_report_input.mix( + PROTENIX.out.pdb + .combine(ch_dummy_file) + .combine(ch_dummy_file_pae) + ) + ch_top_ranked_model = ch_top_ranked_model.mix(PROTENIX.out.top_ranked_pdb) + } // // POST PROCESSING: generate visualisation reports // diff --git a/modules/local/protenix_fasta/environment.yml b/modules/local/protenix_fasta/environment.yml new file mode 100644 index 00000000..012de092 --- /dev/null +++ b/modules/local/protenix_fasta/environment.yml @@ -0,0 +1,6 @@ +# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/modules/environment-schema.json +channels: + - conda-forge + - bioconda +dependencies: + - conda-forge::python=3.8.3 diff --git a/modules/local/protenix_fasta/main.nf b/modules/local/protenix_fasta/main.nf new file mode 100644 index 00000000..1bc4835f --- /dev/null +++ b/modules/local/protenix_fasta/main.nf @@ -0,0 +1,43 @@ +process PROTENIX_FASTA { + tag "$meta.id" + label 'process_single' + + conda "${moduleDir}/environment.yml" + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/python:3.8.3' : + 'biocontainers/python:3.8.3' }" + + input: + tuple val(meta), path(fasta), path(msa) + + output: + tuple val(meta), path ("${meta.id}.json"), path("msa_protenix"), emit: protenix_json + path "versions.yml" , emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + def msa_files = msa ? "--msa " + msa.join(' ') : '' + """ + mkdir -p msa_protenix + fasta_to_protenix_json.py ${fasta} ${meta.id} -o . 
${msa_files} + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + python: \$(python3 --version | sed 's/Python //g') + END_VERSIONS + """ + + stub: + """ + mkdir -p msa_protenix + echo '[]' > "${meta.id}.json" + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + python: \$(python3 --version | sed 's/Python //g') + END_VERSIONS + """ +} diff --git a/modules/local/run_protenix/Dockerfile b/modules/local/run_protenix/Dockerfile new file mode 100644 index 00000000..0ee15409 --- /dev/null +++ b/modules/local/run_protenix/Dockerfile @@ -0,0 +1,38 @@ +FROM vemlp-cn-beijing.cr.volces.com/preset-images/pytorch:2.7.1-cu12.6.3-py3.11-ubuntu22.04 + +# Set environment variables +ENV DEBIAN_FRONTEND=noninteractive \ + TZ=Asia/Shanghai \ + PYTHONDONTWRITEBYTECODE=1 \ + PYTHONUNBUFFERED=1 \ + CUTLASS_PATH=/opt/cutlass + +# Install system dependencies +RUN apt-get update && \ + apt-get install -y --no-install-recommends \ + git \ + g++ \ + gcc \ + libc6-dev \ + make \ + postgresql \ + hmmer \ + kalign \ + && apt-get clean \ + && rm -rf /var/lib/apt/lists/* + +# Set working directory +WORKDIR /app + +# Install Python dependencies +# Copy requirements.txt first to leverage Docker cache +COPY requirements.txt . +RUN pip3 install --no-cache-dir -r requirements.txt -i https://pypi.org/simple + +# Clone CUTLASS +RUN git clone -b v3.5.1 https://github.com/NVIDIA/cutlass.git /opt/cutlass + +RUN cd /app && git clone https://github.com/bytedance/Protenix.git + +RUN cd /app/Protenix && pip3 install -e . --no-index --find-links=/tmp/existing_package + diff --git a/modules/local/run_protenix/main.nf b/modules/local/run_protenix/main.nf new file mode 100644 index 00000000..6f5430c7 --- /dev/null +++ b/modules/local/run_protenix/main.nf @@ -0,0 +1,114 @@ +/* + * Run Protenix + */ +process RUN_PROTENIX { + tag "$meta.id" + label 'process_medium' + label 'process_gpu' + + container "nan5895/protenix:v1.0.6" + + input: + tuple val(meta), path(input_json) + path (files) + path (model_weights) + path (ccd_components) + path (ccd_rdkit_mol) + path (ccd_clusters) + path (ccd_obsolete) + + output: + tuple val(meta), path ("protenix_output") , optional: true, emit: intermediates + tuple val(meta), path ("protenix_output/*/seed_*/predictions/*_summary_confidence_sample_*.json") , emit: confidence + tuple val(meta), path ("${meta.id}_plddt.tsv") , emit: multiqc + tuple val(meta), path ("${meta.id}_protenix.pdb") , emit: top_ranked_pdb + tuple val(meta), path ("protenix_output/*/seed_*/predictions/*_sample_*.cif") , emit: cif + tuple val(meta), path ("${meta.id}_plddt.tsv") , emit: plddt_raw + tuple val(meta), path ("${meta.id}_*_pae.tsv") , emit: pae_raw + tuple val(meta), path ("${meta.id}_ptm.tsv") , emit: ptm_raw + tuple val(meta), path ("${meta.id}_iptm.tsv") , optional: true, emit: iptm_raw + tuple val(meta), path ("${meta.id}_chainwise_ptm.tsv") , optional: true, emit: summary_chainwise_ptm_raw + tuple val(meta), path ("${meta.id}_chainwise_iptm.tsv") , optional: true, emit: chainwise_iptm_raw + path "versions.yml" , emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + // Exit if running this module with -profile conda / -profile mamba + if (workflow.profile.tokenize(',').intersect(['conda', 'mamba']).size() >= 1) { + error("Local RUN_PROTENIX module does not support Conda. 
Please use Docker / Singularity / Podman instead.") + } + def args = task.ext.args ?: '' + def model_name = model_weights.baseName + """ + export HOME=/tmp/home + mkdir -p \${HOME} + export CUDA_CACHE_DISABLE=1 + + # Set up Protenix cache directory structure + export PROTENIX_ROOT_DIR=./protenix_cache + mkdir -p \${PROTENIX_ROOT_DIR}/checkpoint + mkdir -p \${PROTENIX_ROOT_DIR}/common + + # Symlink downloaded files into expected locations + ln -sf \$(realpath ${model_weights}) \${PROTENIX_ROOT_DIR}/checkpoint/${model_name}.pt + ln -sf \$(realpath ${ccd_components}) \${PROTENIX_ROOT_DIR}/common/components.cif + ln -sf \$(realpath ${ccd_rdkit_mol}) \${PROTENIX_ROOT_DIR}/common/components.cif.rdkit_mol.pkl + ln -sf \$(realpath ${ccd_clusters}) \${PROTENIX_ROOT_DIR}/common/clusters-by-entity-40.txt + ln -sf \$(realpath ${ccd_obsolete}) \${PROTENIX_ROOT_DIR}/common/obsolete_release_date.csv + + # Run Protenix prediction with JSON input from PROTENIX_FASTA + protenix pred \\ + -i ${input_json} \\ + -o ./protenix_output \\ + -n ${model_name} \\ + -s 101 \\ + ${args} + + # Convert top-ranked CIF (sample_0) to PDB using gemmi + BEST_CIF=\$(ls protenix_output/*/seed_*/predictions/*_sample_0.cif 2>/dev/null | head -1) + if [ -n "\${BEST_CIF}" ]; then + python3 -c " +import gemmi +doc = gemmi.cif.read('\${BEST_CIF}') +st = gemmi.make_structure_from_block(doc[0]) +st.write_pdb('./${meta.id}_protenix.pdb') +" + fi + + # Extract metrics from confidence JSON files + protenix_extract_metrics.py --name ${meta.id} --out_dir ./protenix_output + + + cat <<-EOF > versions.yml + "${task.process}": + protenix: \$(pip list | grep -i protenix | awk '{print \$2}' 2>/dev/null || echo "unknown") +EOF + """ + + stub: + """ + export HOME=/tmp/home + mkdir -p \${HOME} + export CUDA_CACHE_DISABLE=1 + + mkdir -p protenix_output/${meta.id}/seed_101/predictions/ + + touch protenix_output/${meta.id}/seed_101/predictions/${meta.id}_sample_0.cif + touch protenix_output/${meta.id}/seed_101/predictions/${meta.id}_summary_confidence_sample_0.json + + touch "${meta.id}_protenix.pdb" + touch "${meta.id}_plddt.tsv" + touch "${meta.id}_0_pae.tsv" + touch "${meta.id}_ptm.tsv" + touch "${meta.id}_iptm.tsv" + touch "${meta.id}_chainwise_ptm.tsv" + touch "${meta.id}_chainwise_iptm.tsv" + + cat <<-EOF > versions.yml + "${task.process}": + protenix: \$(pip list | grep -i protenix | awk '{print \$2}' 2>/dev/null || echo "unknown") +EOF + """ +} diff --git a/nextflow.config b/nextflow.config index f5908948..bb47f38d 100644 --- a/nextflow.config +++ b/nextflow.config @@ -102,6 +102,25 @@ params { boltz2_conf_path = null boltz2_mols_path = null + // Protenix parameters + protenix_model_name = 'protenix_base_default_v1.0.0' + protenix_use_template = false + + // Protenix links + protenix_model_link = 'https://protenix.tos-cn-beijing.volces.com/checkpoint/protenix_base_default_v1.0.0.pt' + protenix_ccd_link = 'https://protenix.tos-cn-beijing.volces.com/common/components.cif' + protenix_ccd_rdkit_link = 'https://protenix.tos-cn-beijing.volces.com/common/components.cif.rdkit_mol.pkl' + protenix_clusters_link = 'https://protenix.tos-cn-beijing.volces.com/common/clusters-by-entity-40.txt' + protenix_obsolete_link = 'https://protenix.tos-cn-beijing.volces.com/common/obsolete_release_date.csv' + + // Protenix paths + protenix_db = null + protenix_model_path = null + protenix_ccd_path = null + protenix_ccd_rdkit_path = null + protenix_clusters_path = null + protenix_obsolete_path = null + // Colabfold parameters colabfold_model_preset = "alphafold2_ptm" 
// {'alphafold2_ptm', 'alphafold2_multimer_v1', 'alphafold2_multimer_v2', 'alphafold2_multimer_v3'} colabfold_num_recycles = 3 @@ -398,6 +417,8 @@ profiles { test_rosettafold2na { includeConfig 'conf/test_rosettafold2na.config' } test_full_boltz { includeConfig 'conf/test_full_boltz.config' } test_boltz { includeConfig 'conf/test_boltz.config' } + test_full_protenix { includeConfig 'conf/test_full_protenix.config' } + test_protenix { includeConfig 'conf/test_protenix.config' } } // Load nf-core custom profiles from different institutions @@ -553,6 +574,13 @@ manifest { contribution: ['contributor'], orcid: '0000-0001-6104-9260' ], + [ + name: 'Seunghyun Kang', + affiliation: 'SungKyunKwan University, South Korea', + github: 'nan5895', + contribution: ['contributor'], + orcid: '0009-0002-6481-310X' + ], ] homePage = 'https://github.com/nf-core/proteinfold' description = """Protein 3D structure prediction pipeline""" @@ -580,7 +608,7 @@ params.alphafold2_full_dbs = params.mode.toLowerCase().split(",").contains("alph (params.alphafold2_full_dbs ?: params.full_dbs) : params.alphafold2_full_dbs params.alphafold3_db = params.mode.toLowerCase().split(",").contains("alphafold3") ? (params.alphafold3_db ?: params.db) : params.alphafold3_db -params.colabfold_db = (params.mode.toLowerCase().split(",").contains("colabfold") || params.mode.toLowerCase().split(",").contains("boltz")) ? +params.colabfold_db = (params.mode.toLowerCase().split(",").contains("colabfold") || params.mode.toLowerCase().split(",").contains("boltz") || params.mode.toLowerCase().split(",").contains("protenix")) ? (params.colabfold_db ?: params.db) : params.colabfold_db params.esmfold_db = params.mode.toLowerCase().split(",").contains("esmfold") ? (params.esmfold_db ?: params.db) : params.esmfold_db @@ -594,6 +622,7 @@ params.helixfold3_db = params.mode.toLowerCase().split(",").contains("helixfold3 //params.helixfold3_full_dbs = params.mode.toLowerCase().split(",").contains("helixfold3") ? // (params.helixfold3_full_dbs ?: params.full_dbs) : params.helixfold3_full_dbs params.boltz_db = params.mode.toLowerCase().split(",").contains("boltz") ? (params.boltz_db ?: params.db) : params.boltz_db +params.protenix_db = params.mode.toLowerCase().split(",").contains("protenix") ? (params.protenix_db ?: params.db) : params.protenix_db // Load modules.config for DSL2 module specific options includeConfig 'conf/modules.config' @@ -648,6 +677,13 @@ includeConfig ({ return '/dev/null' }()) +includeConfig ({ + if (params.mode.toLowerCase().split(",").contains("protenix")) { + return 'conf/modules_protenix.config' + } + return '/dev/null' +}()) + includeConfig ({ if (params.mode.toLowerCase().split(",").contains("rosettafold2na")) { return 'conf/modules_rosettafold2na.config' diff --git a/nextflow_schema.json b/nextflow_schema.json index 40b72bdd..5460bf6c 100644 --- a/nextflow_schema.json +++ b/nextflow_schema.json @@ -34,10 +34,10 @@ "mode": { "type": "string", "default": "alphafold2", - "description": "Specifies the mode in which the pipeline will be run. mode can be any combination of ['alphafold2', 'alphafold3', 'colabfold', 'esmfold', 'rosettafold_all_atom', 'boltz', 'helixfold3', 'rosettafold2na'] separated by a comma (',') with no spaces.", + "description": "Specifies the mode in which the pipeline will be run. 
mode can be any combination of ['alphafold2', 'alphafold3', 'colabfold', 'esmfold', 'rosettafold_all_atom', 'boltz', 'helixfold3', 'rosettafold2na', 'protenix'] separated by a comma (',') with no spaces.", "fa_icon": "fas fa-cogs", - "pattern": "^(alphafold2|alphafold3|colabfold|esmfold|rosettafold_all_atom|helixfold3|boltz|rosettafold2na|)(,(alphafold2|alphafold3|colabfold|esmfold|rosettafold_all_atom|helixfold3|boltz|rosettafold2na)?,?)*(? it[1] } + ch_versions = ch_versions.mix(ARIA2_PROTENIX_MODEL.out.versions) + + ARIA2_PROTENIX_CCD( + [ + [:], + protenix_ccd_link + ] + ) + ch_protenix_ccd = ARIA2_PROTENIX_CCD.out.downloaded_file.map { it -> it[1] } + ch_versions = ch_versions.mix(ARIA2_PROTENIX_CCD.out.versions) + + ARIA2_PROTENIX_CCD_RDKIT( + [ + [:], + protenix_ccd_rdkit_link + ] + ) + ch_protenix_ccd_rdkit = ARIA2_PROTENIX_CCD_RDKIT.out.downloaded_file.map { it -> it[1] } + ch_versions = ch_versions.mix(ARIA2_PROTENIX_CCD_RDKIT.out.versions) + + ARIA2_PROTENIX_CLUSTERS( + [ + [:], + protenix_clusters_link + ] + ) + ch_protenix_clusters = ARIA2_PROTENIX_CLUSTERS.out.downloaded_file.map { it -> it[1] } + ch_versions = ch_versions.mix(ARIA2_PROTENIX_CLUSTERS.out.versions) + + ARIA2_PROTENIX_OBSOLETE( + [ + [:], + protenix_obsolete_link + ] + ) + ch_protenix_obsolete = ARIA2_PROTENIX_OBSOLETE.out.downloaded_file.map { it -> it[1] } + ch_versions = ch_versions.mix(ARIA2_PROTENIX_OBSOLETE.out.versions) + } + + emit: + protenix_model = ch_protenix_model + protenix_ccd = ch_protenix_ccd + protenix_ccd_rdkit = ch_protenix_ccd_rdkit + protenix_clusters = ch_protenix_clusters + protenix_obsolete = ch_protenix_obsolete + versions = ch_versions +} diff --git a/tests/protenix.nf.test b/tests/protenix.nf.test new file mode 100644 index 00000000..9a950ca0 --- /dev/null +++ b/tests/protenix.nf.test @@ -0,0 +1,38 @@ +nextflow_pipeline { + + name "Test protenix mode stub" + script "../main.nf" + tag "pipeline" + tag "test_protenix" + profile "test_protenix" + + test("-profile test_protenix") { + + when { + params { + outdir = "$outputDir" + } + } + + then { + // stable_name: All files + folders in ${params.outdir}/ with a stable name + def stable_name = getAllFilesFromDir(params.outdir, relative: true, includeDir: true, ignore: ['pipeline_info/*.{html,json,txt}']) + // stable_path: All files in ${params.outdir}/ with stable content + def stable_path = getAllFilesFromDir(params.outdir, ignoreFile: 'tests/.nftignore') + // Early failure no need to test the rest of snapshots + assert workflow.success + assertAll( + { assert snapshot( + // Number of successful tasks + workflow.trace.succeeded().size(), + // pipeline versions.yml file for multiqc from which Nextflow version is removed because we test pipelines on multiple Nextflow versions + removeNextflowVersion("$outputDir/pipeline_info/nf_core_proteinfold_software_mqc_versions.yml"), + // All stable path name, with a relative path + stable_name, + // All files with stable contents + stable_path + ).match() } + ) + } + } +} diff --git a/workflows/protenix.nf b/workflows/protenix.nf new file mode 100644 index 00000000..93082ca8 --- /dev/null +++ b/workflows/protenix.nf @@ -0,0 +1,197 @@ +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + IMPORT LOCAL MODULES/SUBWORKFLOWS +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + +// +// MODULE: Loaded from modules/local/ +// + +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + 
IMPORT NF-CORE MODULES/SUBWORKFLOWS +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + +// +// MODULE: Installed directly from nf-core/modules +// +include { MULTIQC } from '../modules/nf-core/multiqc/main' +include { PROTENIX_FASTA } from '../modules/local/protenix_fasta' +include { SPLIT_MSA } from '../modules/local/split_msa' +include { MMSEQS_COLABFOLDSEARCH } from '../modules/local/mmseqs_colabfoldsearch' +include { MULTIFASTA_TO_CSV } from '../modules/local/multifasta_to_csv' + +// +// SUBWORKFLOW: Consisting entirely of nf-core/modules +// +include { paramsSummaryMap } from 'plugin/nf-schema' +include { paramsSummaryMultiqc } from '../subworkflows/nf-core/utils_nfcore_pipeline' +include { softwareVersionsToYAML } from '../subworkflows/nf-core/utils_nfcore_pipeline' +include { methodsDescriptionText } from '../subworkflows/local/utils_nfcore_proteinfold_pipeline' + +// +// MODULE: Protenix +// +include { RUN_PROTENIX } from '../modules/local/run_protenix' + +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + RUN MAIN WORKFLOW +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + +workflow PROTENIX { + + take: + ch_samplesheet // channel: samplesheet read from --input + ch_versions // channel: [ path(versions.yml) ] + ch_protenix_model // channel: [ path(model_weights) ] + ch_protenix_ccd // channel: [ path(components.cif) ] + ch_protenix_rdkit // channel: [ path(components.cif.rdkit_mol.pkl) ] + ch_protenix_clusters // channel: [ path(clusters-by-entity-40.txt) ] + ch_protenix_obsolete // channel: [ path(obsolete_release_date.csv) ] + ch_colabfold_db // channel: [ path(colabfold_db) ] + ch_uniref30 // channel: [ path(uniref30) ] + msa_server + + main: + ch_samplesheet + .branch { it -> + fasta: it[1].extension == "fasta" || it[1].extension == "fa" + json: it[1].extension == "json" + } + .set { ch_input_by_ext } + + ch_input_by_ext.fasta + .join( + ch_input_by_ext.fasta + .map { meta, file -> + [ + meta, + file.text.findAll { letter -> letter == ">" }.size() + ] + } + ) + .map { it -> + def meta = it[0].clone() + meta.cnt = it[2] + [meta, it[1]] + } + .branch { it -> + multimer: it[0].cnt > 1 + monomer: it[0].cnt == 1 + } + .set{ch_input} + + if (!msa_server){ + MULTIFASTA_TO_CSV( + ch_input.multimer + ) + ch_versions = ch_versions.mix(MULTIFASTA_TO_CSV.out.versions) + + MMSEQS_COLABFOLDSEARCH ( + ch_input.monomer.mix(MULTIFASTA_TO_CSV.out.input_csv), + ch_colabfold_db, + ch_uniref30 + ) + ch_versions = ch_versions.mix(MMSEQS_COLABFOLDSEARCH.out.versions) + + SPLIT_MSA( + MMSEQS_COLABFOLDSEARCH.out.a3m + ) + ch_versions = ch_versions.mix(SPLIT_MSA.out.versions) + ch_input.monomer + .join(SPLIT_MSA.out.msa_csv) + .mix( + ch_input.multimer.join(SPLIT_MSA.out.msa_csv) + ).set{ch_prepare_fasta} + + }else{ + ch_input + .multimer + .mix(ch_input.monomer) + .map { it -> + [it[0], it[1], []] + } + .set{ch_prepare_fasta} + } + + PROTENIX_FASTA( + ch_prepare_fasta + ) + ch_versions = ch_versions.mix(PROTENIX_FASTA.out.versions) + + ch_input_by_ext.json + .map { meta, file -> [ meta, file, [] ] } // already in JSON, no MSA + .mix(PROTENIX_FASTA.out.protenix_json) // newly converted from FASTA + .set { ch_protenix_input } + + RUN_PROTENIX( + ch_protenix_input.map { it -> [it[0], it[1]] }, + ch_protenix_input.map { it -> it[2] }, + ch_protenix_model, + ch_protenix_ccd, + ch_protenix_rdkit, + ch_protenix_clusters, + ch_protenix_obsolete + ) + + RUN_PROTENIX + .out + 
.cif + .map { it -> + it[0].model = "protenix" + it + } + .set {ch_cif} + + RUN_PROTENIX + .out + .top_ranked_pdb + .map { it -> + it[0].model = "protenix" + it + } + .set {ch_top_ranked_pdb} + + RUN_PROTENIX + .out + .pae_raw + .map { it -> + it[0].model = "protenix" + it + } + .set {ch_pae} + + RUN_PROTENIX + .out + .multiqc + .map { it -> it[1] } + .collect(sort: true) + .map { it -> [ [ "model": "protenix"], it.flatten() ] } + .set { ch_multiqc_report } + + ch_versions = ch_versions.mix(RUN_PROTENIX.out.versions) + + // Wrap top_ranked_pdb as a list to match report_input format [meta, [pdb]] + RUN_PROTENIX + .out + .top_ranked_pdb + .map { meta, pdb -> + def newMeta = meta.clone() + newMeta.model = "protenix" + [ newMeta, [ pdb ] ] + } + .set { ch_pdb } + + emit: + versions = ch_versions + confidence = RUN_PROTENIX.out.confidence + multiqc_report = ch_multiqc_report + top_ranked_pdb = ch_top_ranked_pdb + pdb = ch_pdb + pae = ch_pae + cif = ch_cif +}
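As a closing note on the metrics format: below is a self-contained sketch of how `bin/protenix_extract_metrics.py` above splits Protenix's `chain_pair_iptm` matrix into the chainwise TSV values. Diagonal entries are treated as per-chain pTM and off-diagonal entries as pairwise ipTM; the matrix values here are invented for illustration.

```python
import numpy as np

# Invented 2-chain example of Protenix's chain_pair_iptm matrix.
chain_pair = np.array([
    [0.91, 0.78],  # A vs A (pTM),  A vs B (ipTM)
    [0.78, 0.88],  # B vs A (ipTM), B vs B (pTM)
])

# Diagonal -> chainwise pTM; off-diagonal -> chainwise ipTM,
# formatted to four decimals exactly as the extraction script does.
chainwise_ptm = [f"{chain_pair[i][i]:.4f}" for i in range(chain_pair.shape[0])]
chainwise_iptm = [
    f"{chain_pair[i][j]:.4f}"
    for i in range(chain_pair.shape[0])
    for j in range(chain_pair.shape[1])
    if i != j
]

print(chainwise_ptm)   # ['0.9100', '0.8800']
print(chainwise_iptm)  # ['0.7800', '0.7800']
```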