feat: ORF-level differential translation #189
Draft
pinin4fjords wants to merge 8 commits into
(Tier 1) Gene-level TE numerator re-aggregation: ORF_TO_GENE_CDS_COUNTS sums ONLY canonical_cds ORFs from the catalogue (via orf_to_gene.tsv + catalogue_tsv classification) back to gene level, replacing the plastid-derived gene-CDS counts before REPLACE_RIBOSEQ_COUNTS_IN_MATRIX. Keeps the gene-level TE clean of uORF / dORF dynamics.

(Tier 2) Per-ORF DTE: DTE_COUNTS_PREP joins per-ORF Ribo-seq P-site counts with gene-level Salmon RNA-seq counts via orf_to_gene.tsv. Feeds the existing DESEQ2_DELTATE / ANOTA2SEQ_ANOTA2SEQRUN engines (aliased) at ORF resolution.

Gated on --extended_orf_analysis + catalogue exists + --skip_plastid false. Adds --run_dotseq placeholder (no-op until #11742 lands).

Row-independence caveat: ORFs sharing a gene-level Salmon denominator are perfectly correlated after the join.
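For reviewers who want the two tiers in concrete terms, here is a minimal Python sketch of the data flow described above. It is not the code in this PR: the function names, the in-memory dict layout, the sample column names, and every class label other than canonical_cds are assumptions made for illustration, and the real modules read orf_to_gene.tsv and the catalogue TSV from disk. The "no host gene drops out" behaviour follows the PR description.

```python
from collections import defaultdict


def aggregate_canonical_cds(orf_counts, orf_to_gene, orf_class):
    """Tier 1 (sketch): sum per-ORF Ribo-seq counts to gene level,
    keeping only ORFs classified as canonical_cds."""
    gene_counts = defaultdict(lambda: defaultdict(int))
    for orf_id, per_sample in orf_counts.items():
        if orf_class.get(orf_id) != "canonical_cds":
            continue  # uORF / dORF etc. stay out of the gene-level numerator
        gene_id = orf_to_gene.get(orf_id)
        if gene_id is None:
            continue  # no host gene, nothing to aggregate into
        for sample, n in per_sample.items():
            gene_counts[gene_id][sample] += n
    return {gene: dict(samples) for gene, samples in gene_counts.items()}


def join_orf_with_gene_rna(orf_counts, rna_gene_counts, orf_to_gene):
    """Tier 2 (sketch): one row per ORF, with per-ORF Ribo-seq columns
    and the host gene's RNA-seq columns side by side.
    Assumes Ribo-seq and RNA-seq sample names are distinct."""
    rows = {}
    for orf_id, ribo in orf_counts.items():
        gene_id = orf_to_gene.get(orf_id)
        if gene_id is None or gene_id not in rna_gene_counts:
            continue  # ORFs without a host gene drop out of the matrix
        rows[orf_id] = {**ribo, **rna_gene_counts[gene_id]}
    return rows


# Hypothetical toy inputs (all IDs, labels, and counts are made up).
orf_counts = {
    "orf1": {"ribo_A": 10, "ribo_B": 20},
    "orf2": {"ribo_A": 3, "ribo_B": 1},
    "orf3": {"ribo_A": 5, "ribo_B": 5},
    "orf4": {"ribo_A": 7, "ribo_B": 2},
}
orf_to_gene = {"orf1": "geneX", "orf2": "geneX", "orf3": "geneX"}  # orf4 has no host gene
orf_class = {
    "orf1": "canonical_cds",
    "orf2": "uORF",
    "orf3": "canonical_cds",
    "orf4": "intergenic_novel",
}
rna_gene_counts = {"geneX": {"rna_A": 100, "rna_B": 120}}

gene_cds = aggregate_canonical_cds(orf_counts, orf_to_gene, orf_class)
orf_matrix = join_orf_with_gene_rna(orf_counts, rna_gene_counts, orf_to_gene)
```

In this toy case the uORF (orf2) is excluded from the gene-level sum but still gets its own row in the ORF-level matrix, while the ORF without a host gene (orf4) appears in neither.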
Warning: Newer version of the nf-core template is available. Your pipeline is using an old version of the nf-core template: 3.5.1. For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.
… failing

The previous strict-overlap check rejected the common riboseq wiring, where the secondary matrix is Salmon's gene-level all-sample quant fed in for its RNA-seq columns but still carries the Ribo-seq columns alongside. Primary's columns are authoritative for the primary role, so drop the overlap from secondary and log it to stderr rather than hard-erroring at runtime.

A degenerate-case guard remains: if secondary has zero columns left after the drop, the script exits with a clear "no role-specific samples left" message.

This unblocks the existing novel_gtf and stringtie_extended tests (which were failing on 25.04.8 CI at DTE_COUNTS_PREP (allsamples)) as well as the ORF-level dotseq / deltate / anota2seq paths.

[skip ci]

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
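The overlap handling in this commit can be sketched as follows. This is a hedged illustration, not the script shipped in the PR: the function name, argument names, and exact log wording are assumptions, and only the behaviour (drop overlapping columns from secondary, log to stderr, exit if nothing is left) comes from the commit message.

```python
import sys


def drop_overlap(primary_cols, secondary_cols, log=sys.stderr):
    """Return the secondary columns that are not already in primary.

    Primary is treated as authoritative, so overlapping columns are
    dropped from secondary and reported on stderr instead of raising.
    """
    primary = set(primary_cols)
    overlap = [c for c in secondary_cols if c in primary]
    kept = [c for c in secondary_cols if c not in primary]
    if overlap:
        print(
            f"dropping {len(overlap)} secondary column(s) already present "
            f"in primary: {', '.join(overlap)}",
            file=log,
        )
    if not kept:
        # Degenerate case: secondary contributed nothing of its own.
        raise SystemExit("no role-specific samples left in secondary matrix")
    return kept


# Hypothetical sample names: the Salmon all-sample matrix carries both
# the Ribo-seq and RNA-seq columns, the primary matrix only Ribo-seq.
rna_only = drop_overlap(
    ["ribo_1", "ribo_2"],
    ["ribo_1", "ribo_2", "rna_1", "rna_2"],
)
```

Raising SystemExit with a string makes the interpreter print the message to stderr and exit with a non-zero status, which is enough for the workflow engine to mark the task as failed.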
Wire DOTSEQ_DOTSEQ_ORF through DTE_COUNTS_PREP at ORF resolution.

Drop the --run_dotseq placeholder; dotseq is now a third value for --translational_efficiency_method. Module installed from nf-core/modules#11742-pending (registered under https://github.com/pinin4fjords/nf-core-modules.git so nf-core lint doesn't hit an interactive prompt under CI's no-TTY shell).

Adds withName blocks for the ORF-level DTE chain plus extra_orf_dte_args / extra_dotseq_args params, and brings tests/dotseq.nf.test + snapshot.

[skip ci]
Summary
Closes the modernisation arc on differential translation by adding two complementary paths:
- (Tier 1) Gene-level TE numerator re-aggregation: ORF_TO_GENE_CDS_COUNTS sums only canonical_cds-class ORFs from the catalogue back to gene level (via orf_to_gene.tsv + catalogue_tsv), replacing the plastid-derived gene-CDS Ribo-seq counts before REPLACE_RIBOSEQ_COUNTS_IN_MATRIX. Keeps the gene-level TE numerator clean of uORF / dORF dynamics.
- (Tier 2) Per-ORF DTE: DTE_COUNTS_PREP joins the per-ORF Ribo-seq P-site counts (#166) with the gene-level Salmon RNA-seq counts via orf_to_gene.tsv (novel intergenic ORFs without a host gene drop out). The matrix feeds the existing DESEQ2_DELTATE / ANOTA2SEQ_ANOTA2SEQRUN engines (aliased as *_ORF) at ORF resolution.

Changes
- Adds orf_to_gene_cds_counts and dte_counts_prep.
- Gated on --extended_orf_analysis true + catalogue exists + --skip_plastid false.
- Adds --run_dotseq (placeholder; no-op until nf-core/modules#11742, "Add dotseq/dotseq", lands the DOTSeq module).

Row-independence caveat
The per-ORF DTE path joins Ribo-seq ORF counts to a gene-level RNA-seq denominator. ORFs sharing a gene-level denominator are perfectly correlated after the join, so per-ORF test statistics underestimate uncertainty for sibling ORFs. Treat the per-ORF p-values as exploratory; the gene-level TE (Tier 1, with canonical-CDS-only aggregation) is the inference-grade output.
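One way to act on this caveat downstream is to mark which ORFs have siblings under the same gene, so their p-values can be labelled exploratory in reports. The helper below is a hypothetical sketch, not part of this PR; the function name and the dict form of the orf_to_gene mapping are assumptions.

```python
from collections import Counter


def flag_sibling_orfs(orf_to_gene):
    """Map each ORF to True if at least one other ORF shares its host
    gene, and therefore its gene-level RNA-seq denominator."""
    orfs_per_gene = Counter(orf_to_gene.values())
    return {orf: orfs_per_gene[gene] > 1 for orf, gene in orf_to_gene.items()}


# Hypothetical mapping: two ORFs on geneX, a single ORF on geneY.
shares_denominator = flag_sibling_orfs(
    {"orf1": "geneX", "orf2": "geneX", "orf9": "geneY"}
)
```

An ORF that is alone on its gene (orf9 here) has its own denominator, so the caveat applies only to the flagged rows.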
What's deferred
FILTER_COUNTS_CANONICAL (restrict gene-level DTE to canonical gene IDs) is on aggregation but layered in later; if you want it here, it can land as a follow-up commit.

Stacked PR notes
Thirteenth and final in the stack splitting #174. Targets #188 (feat/166-orf-quantification).

Closes #168
🤖 Generated with Claude Code