11 changes: 11 additions & 0 deletions README.md
@@ -464,6 +464,17 @@ The earth sciences folder contain subfolders for different data formats encounte
- Homo_sapiens.GRCh38.111_chr19_22.pc.gtf.gz: chr19+chr22 of Ensembl 111 GTF, subset to `gene_biotype "protein_coding"` with lean attributes
- bams/SRX1178088{5,6,7,8}.chr19_22.ds50.bam(+ .bai): 4-sample Ribo-seq cohort (same upstream SRX accessions as the chr20 BAMs above) at 50% downsample, filtered to chr19+chr22 and to reads overlapping protein-coding gene loci. 4 samples is the empirical PRICE-cohort floor; 3 samples crashes its noise-model inference.
- README.md: full derivation recipe, empirical justification for the chosen subset/cohort, and the PRICE invocation used for verification.
- rpbp
- reference.annotated.bed.gz: transcript-level annotated BED output by rpbp/preparegenome, for testing rpbp/extractmetageneprofiles
- reference.orfs-genomic.annotated.bed.gz: genomic-coordinate ORF BED output by rpbp/preparegenome, for testing rpbp/extractorfprofiles and rpbp/estimateorfbayesfactors
- reference.orfs-exons.annotated.bed.gz: exon-coordinate ORF BED output by rpbp/preparegenome, for testing rpbp/extractorfprofiles
- SRX11780888_chr20.metagene-profile.csv.gz: metagene profile output by rpbp/extractmetageneprofiles, for testing rpbp/estimatemetagenebayesfactors
- SRX11780888_chr20.metagene-periodicity-bayes-factors.csv.gz: per-length metagene Bayes-factor table output by rpbp/estimatemetagenebayesfactors, for testing rpbp/selectperiodicoffsets
- SRX11780888_chr20.periodic-offsets.csv.gz: per-length periodic-offset table output by rpbp/selectperiodicoffsets, for testing rpbp/getperiodiclengthsoffsets
- SRX11780888_chr20.periodic_lengths_offsets.tsv: filtered length/offset pairs (lenient thresholds for chr20) output by rpbp/getperiodiclengthsoffsets, for testing rpbp/extractorfprofiles
- SRX11780888_chr20.profiles.mtx.gz: per-ORF Ribo-seq read-count profile matrix output by rpbp/extractorfprofiles, for testing rpbp/estimateorfbayesfactors
- SRX11780888_chr20.bayes-factors.bed.gz: per-ORF Bayes-factor table output by rpbp/estimateorfbayesfactors, for testing rpbp/selectfinalpredictionset
- README.md: per-file derivation recipe.
- ribocode
- genome_updated.gtf.gz: GTF with gene names updated via ribocode/gtfupdate, compressed for efficient storage
- annotation.tar.gz: Tarball containing annotation directory output from ribocode/prepare for testing ribocode/metaplots and ribocode/ribocode modules
56 changes: 56 additions & 0 deletions data/genomics/homo_sapiens/riboseq_expression/rpbp/README.md
@@ -0,0 +1,56 @@
# Test data for `rpbp/*` modules

Per-stage intermediates from a single end-to-end `fasta_gtf_bam_rpbp` subworkflow run on the existing chr20 fixture (`SRX11780888_chr20.bam` + `Homo_sapiens.GRCh38.111_chr20.gtf`). Each fixture is the immediate-upstream input for one rpbp module, so module-level tests can fetch one static file rather than chain every upstream stage.

## Why these fixtures?

`modules/nf-core/rpbp/*` test setups used to chain `GUNZIP -> PREPAREGENOME -> EXTRACTMETAGENEPROFILES -> ESTIMATEMETAGENEBAYESFACTORS -> SELECTPERIODICOFFSETS -> GETPERIODICLENGTHSOFFSETS -> EXTRACTORFPROFILES -> ESTIMATEORFBAYESFACTORS` to test downstream modules like `SELECTFINALPREDICTIONSET`. Every chained run cost several minutes of CI time per module. With these fixtures, each module test fetches its one immediate-upstream output and runs in well under a minute.

The end-to-end integration test still lives in `subworkflows/nf-core/fasta_gtf_bam_rpbp/tests/main.nf.test`, which exercises the full chain.

## Files

| File | Size | Description |
|---|---|---|
| `reference.annotated.bed.gz` | <500 KB | Transcript-level annotated BED output by `rpbp/preparegenome`. Consumed by `rpbp/extractmetageneprofiles`. |
| `reference.orfs-genomic.annotated.bed.gz` | <500 KB | Genomic-coordinate ORF BED output by `rpbp/preparegenome`. Consumed by `rpbp/extractorfprofiles` and `rpbp/estimateorfbayesfactors`. |
| `reference.orfs-exons.annotated.bed.gz` | <500 KB | Exon-coordinate ORF BED output by `rpbp/preparegenome`. Consumed by `rpbp/extractorfprofiles`. |
| `SRX11780888_chr20.metagene-profile.csv.gz` | <50 KB | Per-read-length metagene profile output by `rpbp/extractmetageneprofiles`. Consumed by `rpbp/estimatemetagenebayesfactors`. |
| `SRX11780888_chr20.metagene-periodicity-bayes-factors.csv.gz` | <50 KB | Per-read-length periodicity Bayes-factor table output by `rpbp/estimatemetagenebayesfactors`. Consumed by `rpbp/selectperiodicoffsets`. |
| `SRX11780888_chr20.periodic-offsets.csv.gz` | <50 KB | Per-read-length periodic-offset table output by `rpbp/selectperiodicoffsets`. Consumed by `rpbp/getperiodiclengthsoffsets`. |
| `SRX11780888_chr20.periodic_lengths_offsets.tsv` | <1 KB | Filtered length/offset pairs output by `rpbp/getperiodiclengthsoffsets` using lenient `'10 1 None 0.0'` thresholds (chr20 alone does not pass the rpbp defaults). Consumed by `rpbp/extractorfprofiles`. |
| `SRX11780888_chr20.profiles.mtx.gz` | <2 MB | Per-ORF Ribo-seq read-count profile sparse matrix output by `rpbp/extractorfprofiles`. Consumed by `rpbp/estimateorfbayesfactors`. |
| `SRX11780888_chr20.bayes-factors.bed.gz` | <2 MB | Per-ORF translation-Bayes-factor table output by `rpbp/estimateorfbayesfactors`. Consumed by `rpbp/selectfinalpredictionset`. |

Every file is under 4 MB, and the full set is under 10 MB.
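
As an illustration of what a consumer of the smallest fixture sees, the following minimal sketch loads `SRX11780888_chr20.periodic_lengths_offsets.tsv` into a mapping from read length to offset. It assumes the file is tab-delimited with a `length offset` header row; the rows are inlined here as a string so the example runs without the file, and `read_length_offsets` is a hypothetical helper, not part of rpbp.

```python
import csv
import io

# Contents of SRX11780888_chr20.periodic_lengths_offsets.tsv, inlined for
# illustration. Assumption: columns are separated by a single tab.
TSV_TEXT = (
    "length\toffset\n"
    "22\t11\n"
    "24\t12\n"
    "25\t12\n"
    "26\t12\n"
    "27\t12\n"
    "28\t12\n"
    "29\t12\n"
    "30\t13\n"
    "33\t13\n"
)


def read_length_offsets(handle):
    """Hypothetical helper: map each read length to its offset."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {int(row["length"]): int(row["offset"]) for row in reader}


offsets = read_length_offsets(io.StringIO(TSV_TEXT))
for length, offset in sorted(offsets.items()):
    print(f"read length {length}: offset {offset}")
```

To read the real fixture instead, pass an open file handle (for example `open(path, newline="")`) in place of the `io.StringIO` object.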

## How they were derived

A single `nf-core subworkflows test fasta_gtf_bam_rpbp` run, with input

- BAM: `aligned_reads/SRX11780888_chr20.bam` (+ `.bai`)
- FASTA: `Homo_sapiens.GRCh38.dna.chromosome.20.fa.gz` (gunzipped)
- GTF: `Homo_sapiens.GRCh38.111_chr20.gtf`

from this same `riboseq_expression/` folder, plus the module-level test config:

```
process {
    withName: 'RPBP_GETPERIODICLENGTHSOFFSETS' {
        ext.args = '10 1 None 0.0'
    }
    withName: 'RPBP_SELECTFINALPREDICTIONSET' {
        ext.args = '--select-longest-by-stop --select-best-overlapping'
    }
}
```

The `RPBP_GETPERIODICLENGTHSOFFSETS` thresholds (`min_metagene_profile_count=10`, `min_metagene_bf_mean=1`, `max_metagene_bf_var=None`, `min_metagene_bf_likelihood=0.0`) are intentionally lenient so chr20 produces non-empty output; rpbp's production defaults are tuned for whole-genome data and reject chr20-only profiles.
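
To make the positional `ext.args` string easier to read, the sketch below pairs each token with the threshold name given in the paragraph above. This is only an illustration: `parse_thresholds` is a hypothetical helper, the positional order is taken from that paragraph, and rpbp's own argument handling may differ.

```python
# Threshold names in the positional order described above (an assumption
# based on this README, not on rpbp's source).
THRESHOLD_NAMES = (
    "min_metagene_profile_count",
    "min_metagene_bf_mean",
    "max_metagene_bf_var",
    "min_metagene_bf_likelihood",
)


def parse_thresholds(ext_args):
    """Hypothetical helper: label each positional token with its name."""
    tokens = ext_args.split()
    if len(tokens) != len(THRESHOLD_NAMES):
        raise ValueError(
            f"expected {len(THRESHOLD_NAMES)} values, got {len(tokens)}"
        )

    def convert(token):
        # The literal string 'None' disables a threshold; anything else
        # is treated as a number.
        return None if token == "None" else float(token)

    return {name: convert(tok) for name, tok in zip(THRESHOLD_NAMES, tokens)}


thresholds = parse_thresholds("10 1 None 0.0")
for name, value in thresholds.items():
    print(f"{name} = {value}")
```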

Output files were captured directly from each module's work directory.

## rpbp version

`rpbp 4.0.1` (Wave container `community.wave.seqera.io/library/rpbp:4.0.1--71297b462026e13b`).

Used by `modules/nf-core/rpbp/*/tests/main.nf.test`.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
@@ -0,0 +1,10 @@
length offset
22 11
24 12
25 12
26 12
27 12
28 12
29 12
30 13
33 13
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.