14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -120,6 +120,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [[#503](https://github.com/nf-core/proteinfold/issues/503)] - Add checkIfExists validation to user-provided database paths across all prepare DB subworkflows.
- [[#507](https://github.com/nf-core/proteinfold/issues/507)] - Implement missing full tests and check that the others work before release 2.0.0.
- [[PR #509](https://github.com/nf-core/proteinfold/pulls/509)] - Setup gpu environment for AWS full tests.
- [[#505](https://github.com/nf-core/proteinfold/issues/505)] - Add Protenix v1 (ByteDance) protein structure prediction mode with GPU support, model weight download via ARIA2, and metrics extraction.

### Parameters

@@ -165,6 +166,19 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
| | `--boltz2_mols_link` |
| | `--boltz_model_link` |
| | `--boltz_ccd_link` |
| | `--protenix_db` |
| | `--protenix_model_name` |
| | `--protenix_use_template` |
| | `--protenix_model_link` |
| | `--protenix_ccd_link` |
| | `--protenix_ccd_rdkit_link` |
| | `--protenix_clusters_link` |
| | `--protenix_obsolete_link` |
| | `--protenix_model_path` |
| | `--protenix_ccd_path` |
| | `--protenix_ccd_rdkit_path` |
| | `--protenix_clusters_path` |
| | `--protenix_obsolete_path` |

> **NB:** Parameter has been **updated** if both old and new parameter information is present.
> **NB:** Parameter has been **added** if just the new parameter information is present.
4 changes: 4 additions & 0 deletions CITATIONS.md
@@ -24,6 +24,10 @@

> Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online].

- [Protenix](https://github.com/bytedance/protenix)

> ByteDance Research. Protenix: An open-source implementation of AlphaFold3 for protein structure prediction. GitHub. 2024.

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
19 changes: 17 additions & 2 deletions README.md
@@ -52,6 +52,8 @@ On release, automated continuous integration tests run the pipeline on a full-si

x. [RosettaFold2NA](https://github.com/uw-ipd/RoseTTAFold2NA) - Regular RF2NA

xi. [Protenix](https://github.com/bytedance/protenix) - ByteDance Protenix v1

## Usage

> [!NOTE]
@@ -66,7 +68,7 @@ nextflow run nf-core/proteinfold \
--outdir <OUTDIR>
```

The pipeline takes care of downloading the databases and parameters required by AlphaFold2, Colabfold, ESMFold RoseTTAFold-All-Atom or RosettaFold2NA. In case you have already downloaded the required files, you can skip this step by providing the path to the databases using the corresponding parameter [`--alphafold2_db`], [`--colabfold_db`], [`--esmfold_db`] or ['--rosettafold_all_atom_db']. Please refer to the [usage documentation](https://nf-co.re/proteinfold/usage) to check the directory structure you must provide for each database.
The pipeline takes care of downloading the databases and parameters required by AlphaFold2, ColabFold, ESMFold, RoseTTAFold-All-Atom, RosettaFold2NA, Boltz or Protenix. If you have already downloaded the required files, you can skip this step by providing the path to the databases using the corresponding parameter [`--alphafold2_db`], [`--colabfold_db`], [`--esmfold_db`], [`--rosettafold_all_atom_db`] or [`--protenix_db`]. Please refer to the [usage documentation](https://nf-co.re/proteinfold/usage) to check the directory structure you must provide for each database.

- The typical command to run AlphaFold2 mode is shown below:

@@ -211,6 +213,19 @@ The pipeline takes care of downloading the databases and parameters required by
-profile <docker/singularity/podman/shifter/charliecloud/conda/institute>
```

- The typical command to run Protenix mode is shown below:

```console
nextflow run nf-core/proteinfold \
--input samplesheet.csv \
--outdir <OUTDIR> \
--mode protenix \
--protenix_db <null (default) | PATH> \
--protenix_model_name <protenix_base_default_v1.0.0 (default) | MODEL_NAME> \
--use_gpu <true/false> \
-profile <docker/singularity/podman/shifter/charliecloud/conda/institute>
```
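Under the hood, Protenix mode converts each FASTA entry into Protenix's JSON job format via `bin/fasta_to_protenix_json.py` (added in this PR). A minimal sketch of the structure that converter emits for a single protein chain — the sample name and sequence here are hypothetical, and only the three top-level keys (`name`, `sequences`, `covalent_bonds`) are taken from the script:

```python
import json

# Hypothetical single-chain job; mirrors the structure built by
# bin/fasta_to_protenix_json.py (name, sequences, covalent_bonds).
job = {
    "name": "sample1",
    "sequences": [
        {"proteinChain": {"sequence": "MKTAYIAKQR", "count": 1}}
    ],
    "covalent_bonds": [],
}

# The script writes a JSON *array* of jobs, one per sample.
print(json.dumps([job], indent=2))
```

When pre-computed MSAs are available, the script additionally sets `pairedMsaPath`/`unpairedMsaPath` inside each `proteinChain`.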

> [!WARNING]
> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_; see [docs](https://nf-co.re/docs/usage/getting_started/configuration#custom-configuration-files).

@@ -230,7 +245,7 @@ For details on how to contribute new modes to the pipeline please refer to the [

nf-core/proteinfold was originally written by Athanasios Baltzis ([@athbaltzis](https://github.com/athbaltzis)), Jose Espinosa-Carrasco ([@JoseEspinosa](https://github.com/JoseEspinosa)), Luisa Santus ([@luisas](https://github.com/luisas)) and Leila Mansouri ([@l-mansouri](https://github.com/l-mansouri)) from [The Comparative Bioinformatics Group](https://www.crg.eu/en/cedric_notredame) at [The Centre for Genomic Regulation, Spain](https://www.crg.eu/) under the umbrella of the [BovReg project](https://www.bovreg.eu/) and Harshil Patel ([@drpatelh](https://github.com/drpatelh)) from [Seqera Labs, Spain](https://seqera.io/).

Many thanks to others who have helped out and contributed along the way too, including (but not limited to): Norman Goodacre and Waleed Osman from Interline Therapeutics ([@interlinetx](https://github.com/interlinetx)), Martin Steinegger ([@martin-steinegger](https://github.com/martin-steinegger)) and Raoul J.P. Bonnal ([@rjpbonnal](https://github.com/rjpbonnal))
Many thanks to others who have helped out and contributed along the way too, including (but not limited to): Norman Goodacre and Waleed Osman from Interline Therapeutics ([@interlinetx](https://github.com/interlinetx)), Martin Steinegger ([@martin-steinegger](https://github.com/martin-steinegger)), Raoul J.P. Bonnal ([@rjpbonnal](https://github.com/rjpbonnal)) and Seunghyun Kang ([@nan5895](https://github.com/nan5895))

We would also like to thank the AWS Open Data Sponsorship Program for generously providing the resources necessary to host the data utilized in the testing, development, and deployment of nf-core/proteinfold.

Empty file.
Empty file.
Empty file.
Empty file.
Empty file.
205 changes: 205 additions & 0 deletions bin/fasta_to_protenix_json.py
@@ -0,0 +1,205 @@
#!/usr/bin/env python3
"""
Convert FASTA files to Protenix JSON input format.

Protenix expects a JSON array where each element has:
- "name": job name
- "sequences": list of chain definitions
- "covalent_bonds": []

Optionally includes pre-computed MSA paths (pairedMsaPath, unpairedMsaPath)
for each protein chain when --msa CSV files are provided from SPLIT_MSA.

Usage:
fasta_to_protenix_json.py <FASTA> <ID> -o <OUTPUT_DIR> [--msa file1.csv file2.csv]
"""

import argparse
import csv
import json
import os
import sys


def infer_entity_type(header, sequence):
"""Infer entity type from FASTA header and sequence content."""
header_lower = header.lower()
if "dna" in header_lower:
return "dna"
if "rna" in header_lower:
return "rna"
if "ligand" in header_lower or "smiles" in header_lower:
return "ligand"

seq = sequence.strip().upper()
seq_set = set(seq)
if seq_set <= set("ACUGN") and len(seq) > 1:
return "rna"
if seq_set <= set("ACTGN") and len(seq) > 1:
return "dna"

return "protein"


def parse_fasta(fasta_file):
"""Parse a FASTA file into list of (header, sequence) tuples."""
entries = []
header = None
seq_lines = []

with open(fasta_file, "r") as f:
for line in f:
line = line.strip()
if line.startswith(">"):
if header is not None:
entries.append((header, "".join(seq_lines)))
header = line[1:]
seq_lines = []
elif line:
seq_lines.append(line)

if header is not None:
entries.append((header, "".join(seq_lines)))

return entries


def csv_to_a3m(csv_file, output_dir, chain_idx):
"""Convert MSA CSV (from SPLIT_MSA/msa_manager.py) to paired/unpaired A3M files."""
paired = []
unpaired = []

with open(csv_file, "r") as f:
reader = csv.reader(f)
        next(reader, None)  # skip header row (key,sequence); tolerate empty files
for row in reader:
key = int(row[0])
seq = row[1]
if key == -1:
unpaired.append(seq)
else:
paired.append(seq)

chain_dir = os.path.join(output_dir, str(chain_idx))
os.makedirs(chain_dir, exist_ok=True)

pairing_path = os.path.join(chain_dir, "pairing.a3m")
non_pairing_path = os.path.join(chain_dir, "non_pairing.a3m")

with open(pairing_path, "w") as f:
for i, seq in enumerate(paired):
f.write(f">paired_{i}\n{seq}\n")

with open(non_pairing_path, "w") as f:
for i, seq in enumerate(unpaired):
f.write(f">unpaired_{i}\n{seq}\n")

return pairing_path, non_pairing_path


def fasta_to_protenix_json(fasta_file, sample_id, msa_files=None, output_dir="."):
"""Convert a FASTA file to Protenix JSON format with optional MSA."""
entries = parse_fasta(fasta_file)

if not entries:
print(f"Error: No sequences found in {fasta_file}", file=sys.stderr)
sys.exit(1)

msa_output_dir = os.path.join(output_dir, "msa_protenix")
os.makedirs(msa_output_dir, exist_ok=True)

sequences = []
protein_idx = 0
unique_proteins = {}
msa_counter = 0

for header, sequence in entries:
entity_type = infer_entity_type(header, sequence)

if entity_type == "protein":
chain_def = {
"proteinChain": {
"sequence": sequence,
"count": 1
}
}
if msa_files:
if sequence not in unique_proteins:
unique_proteins[sequence] = msa_counter
msa_counter += 1
this_msa_idx = unique_proteins[sequence]
if this_msa_idx < len(msa_files):
pairing_path, non_pairing_path = csv_to_a3m(
msa_files[this_msa_idx], msa_output_dir, protein_idx
)
chain_def["proteinChain"]["pairedMsaPath"] = pairing_path
chain_def["proteinChain"]["unpairedMsaPath"] = non_pairing_path
protein_idx += 1
sequences.append(chain_def)
elif entity_type == "dna":
sequences.append({
"dnaSequence": {
"sequence": sequence,
"count": 1
}
})
elif entity_type == "rna":
sequences.append({
"rnaSequence": {
"sequence": sequence,
"count": 1
}
})
elif entity_type == "ligand":
sequences.append({
"ligand": {
"ligand": sequence,
"count": 1
}
})

job = {
"name": sample_id,
"sequences": sequences,
"covalent_bonds": []
}

return [job]


def main():
parser = argparse.ArgumentParser(
description="Convert FASTA to Protenix JSON format"
)
parser.add_argument("FASTA", help="Input FASTA file")
parser.add_argument("ID", help="Sample identifier")
parser.add_argument(
"-o", "--output-dir", default=".",
help="Output directory (default: current dir)"
)
parser.add_argument(
"--msa",
nargs='*',
default=[],
help="MSA CSV files for protein sequences (from SPLIT_MSA)."
)
args = parser.parse_args()

os.makedirs(args.output_dir, exist_ok=True)

json_data = fasta_to_protenix_json(
args.FASTA, args.ID,
msa_files=args.msa if args.msa else None,
output_dir=args.output_dir
)
output_path = os.path.join(args.output_dir, f"{args.ID}.json")

with open(output_path, "w") as f:
json.dump(json_data, f, indent=2)

print(f"Generated: {output_path}")


if __name__ == "__main__":
main()
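The header/alphabet heuristic in `infer_entity_type` is easy to sanity-check in isolation. Below is a standalone copy of that function with hypothetical inputs; note that ambiguous short sequences drawn from the shared A/C/G/N alphabet fall through to the RNA branch first:

```python
def infer_entity_type(header, sequence):
    # Same heuristic as in fasta_to_protenix_json.py:
    # header keywords win, then an alphabet check, else protein.
    header_lower = header.lower()
    if "dna" in header_lower:
        return "dna"
    if "rna" in header_lower:
        return "rna"
    if "ligand" in header_lower or "smiles" in header_lower:
        return "ligand"
    seq = sequence.strip().upper()
    seq_set = set(seq)
    if seq_set <= set("ACUGN") and len(seq) > 1:
        return "rna"
    if seq_set <= set("ACTGN") and len(seq) > 1:
        return "dna"
    return "protein"

print(infer_entity_type("chain_A", "MKTAYIAKQR"))   # protein
print(infer_entity_type("myRNA", "ACGUACGU"))       # rna (header keyword)
print(infer_entity_type("fragment", "ACGTACGT"))    # dna (alphabet)
print(infer_entity_type("ligand_ATP", "CCO"))       # ligand (header keyword)
```

Because the RNA alphabet check runs before the DNA one, a T-free sequence such as `ACG` classifies as RNA unless the header says otherwise — worth keeping in mind when naming FASTA records.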

3 changes: 2 additions & 1 deletion bin/generate_report.py
@@ -379,7 +379,8 @@ def pdb_to_lddt(struct_files, generate_tsv):
"rosettafold_all_atom": "RosettaFold All-Atom",
"helixfold3": "HelixFold3",
"rosettafold2na": "RoseTTAFold2NA",
"boltz": "Boltz"
"boltz": "Boltz",
    "protenix": "Protenix"
}

parser = argparse.ArgumentParser()