diff --git a/README.md b/README.md
index fabc4354f..cb92b4b67 100644
--- a/README.md
+++ b/README.md
@@ -464,6 +464,17 @@ The earth sciences folder contain subfolders for different data formats encounte
         - Homo_sapiens.GRCh38.111_chr19_22.pc.gtf.gz: chr19+chr22 of Ensembl 111 GTF, subset to `gene_biotype "protein_coding"` with lean attributes
         - bams/SRX1178088{5,6,7,8}.chr19_22.ds50.bam(+ .bai): 4-sample Ribo-seq cohort (same upstream SRX accessions as the chr20 BAMs above) at 50% downsample, filtered to chr19+chr22 and to reads overlapping protein-coding gene loci. 4 samples is the empirical PRICE-cohort floor; 3 samples crashes its noise-model inference.
         - README.md: full derivation recipe, empirical justification for the chosen subset/cohort, and the PRICE invocation used for verification.
+      - rpbp
+        - reference.annotated.bed.gz: transcript-level annotated BED output by rpbp/preparegenome, for testing rpbp/extractmetageneprofiles
+        - reference.orfs-genomic.annotated.bed.gz: genomic-coordinate ORF BED output by rpbp/preparegenome, for testing rpbp/extractorfprofiles and rpbp/estimateorfbayesfactors
+        - reference.orfs-exons.annotated.bed.gz: exon-coordinate ORF BED output by rpbp/preparegenome, for testing rpbp/extractorfprofiles
+        - SRX11780888_chr20.metagene-profile.csv.gz: metagene profile output by rpbp/extractmetageneprofiles, for testing rpbp/estimatemetagenebayesfactors
+        - SRX11780888_chr20.metagene-periodicity-bayes-factors.csv.gz: per-length metagene Bayes-factor table output by rpbp/estimatemetagenebayesfactors, for testing rpbp/selectperiodicoffsets
+        - SRX11780888_chr20.periodic-offsets.csv.gz: per-length periodic-offset table output by rpbp/selectperiodicoffsets, for testing rpbp/getperiodiclengthsoffsets
+        - SRX11780888_chr20.periodic_lengths_offsets.tsv: filtered length/offset pairs (lenient thresholds for chr20) output by rpbp/getperiodiclengthsoffsets, for testing rpbp/extractorfprofiles
+        - SRX11780888_chr20.profiles.mtx.gz: per-ORF Ribo-seq read-count profile matrix output by rpbp/extractorfprofiles, for testing rpbp/estimateorfbayesfactors
+        - SRX11780888_chr20.bayes-factors.bed.gz: per-ORF Bayes-factor table output by rpbp/estimateorfbayesfactors, for testing rpbp/selectfinalpredictionset
+        - README.md: per-file derivation recipe.
       - ribocode
         - genome_updated.gtf.gz: GTF with gene names updated via ribocode/gtfupdate, compressed for efficient storage
         - annotation.tar.gz: Tarball containing annotation directory output from ribocode/prepare for testing ribocode/metaplots and ribocode/ribocode modules
diff --git a/data/genomics/homo_sapiens/riboseq_expression/rpbp/README.md b/data/genomics/homo_sapiens/riboseq_expression/rpbp/README.md
new file mode 100644
index 000000000..e05dff0ea
--- /dev/null
+++ b/data/genomics/homo_sapiens/riboseq_expression/rpbp/README.md
@@ -0,0 +1,56 @@
+# Test data for `rpbp/*` modules
+
+Per-stage intermediates from a single end-to-end `fasta_gtf_bam_rpbp` subworkflow run on the existing chr20 fixture (`SRX11780888_chr20.bam` + `Homo_sapiens.GRCh38.111_chr20.gtf`). Each fixture is the immediate-upstream input for one rpbp module, so module-level tests can fetch one static file rather than chain six upstream stages.
+
+## Why these fixtures?
+
+`modules/nf-core/rpbp/*` test setups used to chain `GUNZIP -> PREPAREGENOME -> EXTRACTMETAGENEPROFILES -> ESTIMATEMETAGENEBAYESFACTORS -> SELECTPERIODICOFFSETS -> GETPERIODICLENGTHSOFFSETS -> EXTRACTORFPROFILES -> ESTIMATEORFBAYESFACTORS` to test downstream modules like `SELECTFINALPREDICTIONSET`. Every chained run cost several minutes of CI time per module. With these fixtures, each module test fetches its one immediate-upstream output and runs in well under a minute.
+
+The end-to-end integration test still lives in `subworkflows/nf-core/fasta_gtf_bam_rpbp/tests/main.nf.test`, which exercises the full chain.
+
+## Files
+
+| File | Size | Description |
+|---|---|---|
+| `reference.annotated.bed.gz` | <500 KB | Transcript-level annotated BED output by `rpbp/preparegenome`. Consumed by `rpbp/extractmetageneprofiles`. |
+| `reference.orfs-genomic.annotated.bed.gz` | <500 KB | Genomic-coordinate ORF BED output by `rpbp/preparegenome`. Consumed by `rpbp/extractorfprofiles` and `rpbp/estimateorfbayesfactors`. |
+| `reference.orfs-exons.annotated.bed.gz` | <500 KB | Exon-coordinate ORF BED output by `rpbp/preparegenome`. Consumed by `rpbp/extractorfprofiles`. |
+| `SRX11780888_chr20.metagene-profile.csv.gz` | <50 KB | Per-read-length metagene profile output by `rpbp/extractmetageneprofiles`. Consumed by `rpbp/estimatemetagenebayesfactors`. |
+| `SRX11780888_chr20.metagene-periodicity-bayes-factors.csv.gz` | <50 KB | Per-read-length periodicity Bayes-factor table output by `rpbp/estimatemetagenebayesfactors`. Consumed by `rpbp/selectperiodicoffsets`. |
+| `SRX11780888_chr20.periodic-offsets.csv.gz` | <50 KB | Per-read-length periodic-offset table output by `rpbp/selectperiodicoffsets`. Consumed by `rpbp/getperiodiclengthsoffsets`. |
+| `SRX11780888_chr20.periodic_lengths_offsets.tsv` | <1 KB | Filtered length/offset pairs output by `rpbp/getperiodiclengthsoffsets` using lenient `'10 1 None 0.0'` thresholds (chr20 alone does not pass the rpbp defaults). Consumed by `rpbp/extractorfprofiles`. |
+| `SRX11780888_chr20.profiles.mtx.gz` | <2 MB | Per-ORF Ribo-seq read-count profile sparse matrix output by `rpbp/extractorfprofiles`. Consumed by `rpbp/estimateorfbayesfactors`. |
+| `SRX11780888_chr20.bayes-factors.bed.gz` | <2 MB | Per-ORF translation-Bayes-factor table output by `rpbp/estimateorfbayesfactors`. Consumed by `rpbp/selectfinalpredictionset`. |
+
+All files <4 MB. Total set <10 MB.
+
+## How they were derived
+
+A single `nf-core subworkflows test fasta_gtf_bam_rpbp` run, with input
+
+- BAM: `aligned_reads/SRX11780888_chr20.bam` (+ `.bai`)
+- FASTA: `Homo_sapiens.GRCh38.dna.chromosome.20.fa.gz` (gunzipped)
+- GTF: `Homo_sapiens.GRCh38.111_chr20.gtf`
+
+from this same `riboseq_expression/` folder, plus the module-level test config:
+
+```
+process {
+    withName: 'RPBP_GETPERIODICLENGTHSOFFSETS' {
+        ext.args = '10 1 None 0.0'
+    }
+    withName: 'RPBP_SELECTFINALPREDICTIONSET' {
+        ext.args = '--select-longest-by-stop --select-best-overlapping'
+    }
+}
+```
+
+The `RPBP_GETPERIODICLENGTHSOFFSETS` thresholds (`min_metagene_profile_count=10`, `min_metagene_bf_mean=1`, `max_metagene_bf_var=None`, `min_metagene_bf_likelihood=0.0`) are intentionally lenient so chr20 produces non-empty output; rpbp's production defaults are tuned for whole-genome data and reject chr20-only profiles.
+
+Output files were captured directly from each module's work directory.
+
+## rpbp version
+
+`rpbp 4.0.1` (Wave container `community.wave.seqera.io/library/rpbp:4.0.1--71297b462026e13b`).
+
+Used by `modules/nf-core/rpbp/*/tests/main.nf.test`.
diff --git a/data/genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.bayes-factors.bed.gz b/data/genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.bayes-factors.bed.gz
new file mode 100644
index 000000000..e03c0a2ca
Binary files /dev/null and b/data/genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.bayes-factors.bed.gz differ
diff --git a/data/genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.metagene-periodicity-bayes-factors.csv.gz b/data/genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.metagene-periodicity-bayes-factors.csv.gz
new file mode 100644
index 000000000..c49264d74
Binary files /dev/null and b/data/genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.metagene-periodicity-bayes-factors.csv.gz differ
diff --git a/data/genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.metagene-profile.csv.gz b/data/genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.metagene-profile.csv.gz
new file mode 100644
index 000000000..23cb89e59
Binary files /dev/null and b/data/genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.metagene-profile.csv.gz differ
diff --git a/data/genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.periodic-offsets.csv.gz b/data/genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.periodic-offsets.csv.gz
new file mode 100644
index 000000000..1a41d3239
Binary files /dev/null and b/data/genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.periodic-offsets.csv.gz differ
diff --git a/data/genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.periodic_lengths_offsets.tsv b/data/genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.periodic_lengths_offsets.tsv
new file mode 100644
index 000000000..1a085cbce
--- /dev/null
+++ b/data/genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.periodic_lengths_offsets.tsv
@@ -0,0 +1,10 @@
+length offset
+22 11
+24 12
+25 12
+26 12
+27 12
+28 12
+29 12
+30 13
+33 13
diff --git a/data/genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.profiles.mtx.gz b/data/genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.profiles.mtx.gz
new file mode 100644
index 000000000..ef4a99bad
Binary files /dev/null and b/data/genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.profiles.mtx.gz differ
diff --git a/data/genomics/homo_sapiens/riboseq_expression/rpbp/reference.annotated.bed.gz b/data/genomics/homo_sapiens/riboseq_expression/rpbp/reference.annotated.bed.gz
new file mode 100644
index 000000000..79c341ee8
Binary files /dev/null and b/data/genomics/homo_sapiens/riboseq_expression/rpbp/reference.annotated.bed.gz differ
diff --git a/data/genomics/homo_sapiens/riboseq_expression/rpbp/reference.orfs-exons.annotated.bed.gz b/data/genomics/homo_sapiens/riboseq_expression/rpbp/reference.orfs-exons.annotated.bed.gz
new file mode 100644
index 000000000..91ea8dbe4
Binary files /dev/null and b/data/genomics/homo_sapiens/riboseq_expression/rpbp/reference.orfs-exons.annotated.bed.gz differ
diff --git a/data/genomics/homo_sapiens/riboseq_expression/rpbp/reference.orfs-genomic.annotated.bed.gz b/data/genomics/homo_sapiens/riboseq_expression/rpbp/reference.orfs-genomic.annotated.bed.gz
new file mode 100644
index 000000000..74b4efa4f
Binary files /dev/null and b/data/genomics/homo_sapiens/riboseq_expression/rpbp/reference.orfs-genomic.annotated.bed.gz differ
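
Review note: the only text fixture in this change is `SRX11780888_chr20.periodic_lengths_offsets.tsv`, so it is the one file whose contents can be sanity-checked from the diff alone. The following is a minimal sketch of such a check, not part of the change itself. The function name `parse_lengths_offsets` is hypothetical, the rows are copied from the hunk above, and the parser assumes whitespace-separated columns (it accepts tabs or spaces, since the exact separator is not recoverable from the diff as shown).

```python
from typing import List, Tuple

# Rows copied from the periodic_lengths_offsets.tsv hunk in this diff.
FIXTURE_TEXT = """length offset
22 11
24 12
25 12
26 12
27 12
28 12
29 12
30 13
33 13
"""


def parse_lengths_offsets(text: str) -> List[Tuple[int, int]]:
    """Parse a length/offset table into (read_length, offset) integer pairs.

    Hypothetical helper: splits on any whitespace so that both tab- and
    space-separated files are accepted.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0].split() != ["length", "offset"]:
        raise ValueError("expected a 'length offset' header line")

    pairs: List[Tuple[int, int]] = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) != 2:
            raise ValueError(f"expected two columns, got: {ln!r}")
        pairs.append((int(fields[0]), int(fields[1])))
    return pairs


pairs = parse_lengths_offsets(FIXTURE_TEXT)
```

A check like this catches an empty table, which matters here because the lenient thresholds exist specifically so that chr20 yields a non-empty length/offset list.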