14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -120,6 +120,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [[#503](https://github.com/nf-core/proteinfold/issues/503)] - Add checkIfExists validation to user-provided database paths across all prepare DB subworkflows.
- [[#507](https://github.com/nf-core/proteinfold/issues/507)] - Implement missing full tests and check that the others work before release 2.0.0.
- [[PR #509](https://github.com/nf-core/proteinfold/pulls/509)] - Setup gpu environment for AWS full tests.
- [[#505](https://github.com/nf-core/proteinfold/issues/505)] - Add Protenix v1 (ByteDance) protein structure prediction mode with GPU support, model weight download via ARIA2, and metrics extraction.

### Parameters

@@ -165,6 +166,19 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
| | `--boltz2_mols_link` |
| | `--boltz_model_link` |
| | `--boltz_ccd_link` |
| | `--protenix_db` |
| | `--protenix_model_name` |
| | `--protenix_use_template` |
| | `--protenix_model_link` |
| | `--protenix_ccd_link` |
| | `--protenix_ccd_rdkit_link` |
| | `--protenix_clusters_link` |
| | `--protenix_obsolete_link` |
| | `--protenix_model_path` |
| | `--protenix_ccd_path` |
| | `--protenix_ccd_rdkit_path` |
| | `--protenix_clusters_path` |
| | `--protenix_obsolete_path` |

> **NB:** Parameter has been **updated** if both old and new parameter information is present.
> **NB:** Parameter has been **added** if just the new parameter information is present.
4 changes: 4 additions & 0 deletions CITATIONS.md
@@ -24,6 +24,10 @@

> Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online].

- [Protenix](https://github.com/bytedance/protenix)

> ByteDance Research. Protenix: An open-source implementation of AlphaFold3 for protein structure prediction. GitHub. 2024.

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
19 changes: 17 additions & 2 deletions README.md
@@ -52,6 +52,8 @@ On release, automated continuous integration tests run the pipeline on a full-si

x. [RosettaFold2NA](https://github.com/uw-ipd/RoseTTAFold2NA) - Regular RF2NA

xi. [Protenix](https://github.com/bytedance/protenix) - ByteDance Protenix v1

## Usage

> [!NOTE]
@@ -66,7 +68,7 @@ nextflow run nf-core/proteinfold \
--outdir <OUTDIR>
```

The pipeline takes care of downloading the databases and parameters required by AlphaFold2, Colabfold, ESMFold RoseTTAFold-All-Atom or RosettaFold2NA. In case you have already downloaded the required files, you can skip this step by providing the path to the databases using the corresponding parameter [`--alphafold2_db`], [`--colabfold_db`], [`--esmfold_db`] or ['--rosettafold_all_atom_db']. Please refer to the [usage documentation](https://nf-co.re/proteinfold/usage) to check the directory structure you must provide for each database.
The pipeline takes care of downloading the databases and parameters required by AlphaFold2, ColabFold, ESMFold, RoseTTAFold-All-Atom, RosettaFold2NA, Boltz or Protenix. If you have already downloaded the required files, you can skip this step by providing the path to the databases using the corresponding parameter [`--alphafold2_db`], [`--colabfold_db`], [`--esmfold_db`], [`--rosettafold_all_atom_db`] or [`--protenix_db`]. Please refer to the [usage documentation](https://nf-co.re/proteinfold/usage) to check the directory structure you must provide for each database.

- The typical command to run AlphaFold2 mode is shown below:

@@ -211,6 +213,19 @@ The pipeline takes care of downloading the databases and parameters required by
-profile <docker/singularity/podman/shifter/charliecloud/conda/institute>
```

- The typical command to run Protenix mode is shown below:

```console
nextflow run nf-core/proteinfold \
--input samplesheet.csv \
--outdir <OUTDIR> \
--mode protenix \
--protenix_db <null (default) | PATH> \
--protenix_model_name <protenix_base_default_v1.0.0 (default) | MODEL_NAME> \
--use_gpu <true/false> \
-profile <docker/singularity/podman/shifter/charliecloud/conda/institute>
```
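Under the hood, Protenix mode converts each FASTA entry into Protenix's JSON job format via `bin/fasta_to_protenix_json.py` (added in this PR). A minimal sketch of the structure that converter emits for a single protein chain — the sample name and sequence here are hypothetical, and only the three top-level keys (`name`, `sequences`, `covalent_bonds`) are taken from the script:

```python
import json

# Hypothetical single-chain job; mirrors the structure built by
# bin/fasta_to_protenix_json.py (name, sequences, covalent_bonds).
job = {
    "name": "sample1",
    "sequences": [
        {"proteinChain": {"sequence": "MKTAYIAKQR", "count": 1}}
    ],
    "covalent_bonds": [],
}

# The script writes a JSON *array* of jobs, one per sample.
print(json.dumps([job], indent=2))
```

When pre-computed MSAs are available, the script additionally sets `pairedMsaPath`/`unpairedMsaPath` inside each `proteinChain`.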

> [!WARNING]
> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_; see [docs](https://nf-co.re/docs/usage/getting_started/configuration#custom-configuration-files).

@@ -230,7 +245,7 @@ For details on how to contribute new modes to the pipeline please refer to the [

nf-core/proteinfold was originally written by Athanasios Baltzis ([@athbaltzis](https://github.com/athbaltzis)), Jose Espinosa-Carrasco ([@JoseEspinosa](https://github.com/JoseEspinosa)), Luisa Santus ([@luisas](https://github.com/luisas)) and Leila Mansouri ([@l-mansouri](https://github.com/l-mansouri)) from [The Comparative Bioinformatics Group](https://www.crg.eu/en/cedric_notredame) at [The Centre for Genomic Regulation, Spain](https://www.crg.eu/) under the umbrella of the [BovReg project](https://www.bovreg.eu/) and Harshil Patel ([@drpatelh](https://github.com/drpatelh)) from [Seqera Labs, Spain](https://seqera.io/).

Many thanks to others who have helped out and contributed along the way too, including (but not limited to): Norman Goodacre and Waleed Osman from Interline Therapeutics ([@interlinetx](https://github.com/interlinetx)), Martin Steinegger ([@martin-steinegger](https://github.com/martin-steinegger)) and Raoul J.P. Bonnal ([@rjpbonnal](https://github.com/rjpbonnal))
Many thanks to others who have helped out and contributed along the way too, including (but not limited to): Norman Goodacre and Waleed Osman from Interline Therapeutics ([@interlinetx](https://github.com/interlinetx)), Martin Steinegger ([@martin-steinegger](https://github.com/martin-steinegger)), Raoul J.P. Bonnal ([@rjpbonnal](https://github.com/rjpbonnal)) and Seunghyun Kang ([@nan5895](https://github.com/nan5895))

We would also like to thank the AWS Open Data Sponsorship Program for generously providing the resources necessary to host the data utilized in the testing, development, and deployment of nf-core/proteinfold.

Empty file.
Empty file.
Empty file.
Empty file.
Empty file.
205 changes: 205 additions & 0 deletions bin/fasta_to_protenix_json.py
@@ -0,0 +1,205 @@
#!/usr/bin/env python3
"""
Convert FASTA files to Protenix JSON input format.

Protenix expects a JSON array where each element has:
- "name": job name
- "sequences": list of chain definitions
- "covalent_bonds": []

Optionally includes pre-computed MSA paths (pairedMsaPath, unpairedMsaPath)
for each protein chain when --msa CSV files are provided from SPLIT_MSA.

Usage:
fasta_to_protenix_json.py <FASTA> <ID> -o <OUTPUT_DIR> [--msa file1.csv file2.csv]
"""

import argparse
import csv
import json
import os
import sys


def infer_entity_type(header, sequence):
"""Infer entity type from FASTA header and sequence content."""
header_lower = header.lower()
if "dna" in header_lower:
return "dna"
if "rna" in header_lower:
return "rna"
if "ligand" in header_lower or "smiles" in header_lower:
return "ligand"

seq = sequence.strip().upper()
seq_set = set(seq)
if seq_set <= set("ACUGN") and len(seq) > 1:
return "rna"
if seq_set <= set("ACTGN") and len(seq) > 1:
return "dna"

return "protein"


def parse_fasta(fasta_file):
"""Parse a FASTA file into list of (header, sequence) tuples."""
entries = []
header = None
seq_lines = []

with open(fasta_file, "r") as f:
for line in f:
line = line.strip()
if line.startswith(">"):
if header is not None:
entries.append((header, "".join(seq_lines)))
header = line[1:]
seq_lines = []
elif line:
seq_lines.append(line)

if header is not None:
entries.append((header, "".join(seq_lines)))

return entries


def csv_to_a3m(csv_file, output_dir, chain_idx):
"""Convert MSA CSV (from SPLIT_MSA/msa_manager.py) to paired/unpaired A3M files."""
paired = []
unpaired = []

with open(csv_file, "r") as f:
reader = csv.reader(f)
        next(reader, None)  # skip header row (key,sequence); tolerate empty files
for row in reader:
key = int(row[0])
seq = row[1]
if key == -1:
unpaired.append(seq)
else:
paired.append(seq)

chain_dir = os.path.join(output_dir, str(chain_idx))
os.makedirs(chain_dir, exist_ok=True)

pairing_path = os.path.join(chain_dir, "pairing.a3m")
non_pairing_path = os.path.join(chain_dir, "non_pairing.a3m")

with open(pairing_path, "w") as f:
for i, seq in enumerate(paired):
f.write(f">paired_{i}\n{seq}\n")

with open(non_pairing_path, "w") as f:
for i, seq in enumerate(unpaired):
f.write(f">unpaired_{i}\n{seq}\n")

return pairing_path, non_pairing_path


def fasta_to_protenix_json(fasta_file, sample_id, msa_files=None, output_dir="."):
"""Convert a FASTA file to Protenix JSON format with optional MSA."""
entries = parse_fasta(fasta_file)

if not entries:
print(f"Error: No sequences found in {fasta_file}", file=sys.stderr)
sys.exit(1)

msa_output_dir = os.path.join(output_dir, "msa_protenix")
os.makedirs(msa_output_dir, exist_ok=True)

sequences = []
protein_idx = 0
unique_proteins = {}
msa_counter = 0

for header, sequence in entries:
entity_type = infer_entity_type(header, sequence)

if entity_type == "protein":
chain_def = {
"proteinChain": {
"sequence": sequence,
"count": 1
}
}
if msa_files:
if sequence not in unique_proteins:
unique_proteins[sequence] = msa_counter
msa_counter += 1
this_msa_idx = unique_proteins[sequence]
if this_msa_idx < len(msa_files):
pairing_path, non_pairing_path = csv_to_a3m(
msa_files[this_msa_idx], msa_output_dir, protein_idx
)
chain_def["proteinChain"]["pairedMsaPath"] = pairing_path
chain_def["proteinChain"]["unpairedMsaPath"] = non_pairing_path
protein_idx += 1
sequences.append(chain_def)
elif entity_type == "dna":
sequences.append({
"dnaSequence": {
"sequence": sequence,
"count": 1
}
})
elif entity_type == "rna":
sequences.append({
"rnaSequence": {
"sequence": sequence,
"count": 1
}
})
elif entity_type == "ligand":
sequences.append({
"ligand": {
"ligand": sequence,
"count": 1
}
})

job = {
"name": sample_id,
"sequences": sequences,
"covalent_bonds": []
}

return [job]


def main():
parser = argparse.ArgumentParser(
description="Convert FASTA to Protenix JSON format"
)
parser.add_argument("FASTA", help="Input FASTA file")
parser.add_argument("ID", help="Sample identifier")
parser.add_argument(
"-o", "--output-dir", default=".",
help="Output directory (default: current dir)"
)
parser.add_argument(
"--msa",
nargs='*',
default=[],
help="MSA CSV files for protein sequences (from SPLIT_MSA)."
)
args = parser.parse_args()

os.makedirs(args.output_dir, exist_ok=True)

json_data = fasta_to_protenix_json(
args.FASTA, args.ID,
msa_files=args.msa if args.msa else None,
output_dir=args.output_dir
)
output_path = os.path.join(args.output_dir, f"{args.ID}.json")

with open(output_path, "w") as f:
json.dump(json_data, f, indent=2)

print(f"Generated: {output_path}")


if __name__ == "__main__":
main()
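The header/alphabet heuristic in `infer_entity_type` is easy to sanity-check in isolation. Below is a standalone copy of that function with hypothetical inputs; note that ambiguous short sequences drawn from the shared A/C/G/N alphabet fall through to the RNA branch first:

```python
def infer_entity_type(header, sequence):
    # Same heuristic as in fasta_to_protenix_json.py:
    # header keywords win, then an alphabet check, else protein.
    header_lower = header.lower()
    if "dna" in header_lower:
        return "dna"
    if "rna" in header_lower:
        return "rna"
    if "ligand" in header_lower or "smiles" in header_lower:
        return "ligand"
    seq = sequence.strip().upper()
    seq_set = set(seq)
    if seq_set <= set("ACUGN") and len(seq) > 1:
        return "rna"
    if seq_set <= set("ACTGN") and len(seq) > 1:
        return "dna"
    return "protein"

print(infer_entity_type("chain_A", "MKTAYIAKQR"))   # protein
print(infer_entity_type("myRNA", "ACGUACGU"))       # rna (header keyword)
print(infer_entity_type("fragment", "ACGTACGT"))    # dna (alphabet)
print(infer_entity_type("ligand_ATP", "CCO"))       # ligand (header keyword)
```

Because the RNA alphabet check runs before the DNA one, a T-free sequence such as `ACG` classifies as RNA unless the header says otherwise — worth keeping in mind when naming FASTA records.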

3 changes: 2 additions & 1 deletion bin/generate_report.py
@@ -379,7 +379,8 @@ def pdb_to_lddt(struct_files, generate_tsv):
"rosettafold_all_atom": "RosettaFold All-Atom",
"helixfold3": "HelixFold3",
"rosettafold2na": "RoseTTAFold2NA",
"boltz": "Boltz"
"boltz": "Boltz",
    "protenix": "Protenix"
}

parser = argparse.ArgumentParser()