
Lower peak memory usage with sqlite3 #21

Open
wbazant wants to merge 4 commits into biobakery:master from wbazant:sqlite3-store

Conversation


@wbazant wbazant commented Feb 11, 2021

Solves #19

Summary of changes

  • Load the reactions database only at the end, as it needs 2 GB of RAM
  • Reimplement store.Alignments and store.Reads on top of sqlite3, a core module, instead of in-memory dicts, lists, and a temp file; replace the data access code with SQL queries
  • nucleotide.py and translated.py gain a little store logic marking the start and end of a bulk write, so that it can happen within a transaction
  • Small API change to the store.Reads constructor: instead of passing a FASTA file to the constructor, reads are loaded by calling store.Reads.add_from_fasta
  • Remove the initial read count state from store.Reads
  • Update tests: adjust setups for store modules where the API changed, remove tests made unnecessary by dropping the custom temp file logic, and change one assertEquals to assertAlmostEquals
  • Add unit tests for clearing the store and managing transactions
  • Reimplement utilities.double_sort: the previous implementation did not sort equal values alphabetically, despite promising to
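The sqlite3-backed store with explicit bulk-write transactions can be sketched roughly as below. This is a minimal illustration, not the PR's actual code: the class name matches store.Reads, but the schema, method names (start_bulk_write, end_bulk_write, add, count_reads), and the in-memory database are assumptions for the example.

```python
import sqlite3

class Reads:
    """Sketch of a read store backed by sqlite3 instead of in-memory dicts."""

    def __init__(self):
        # isolation_level=None puts sqlite3 in autocommit mode, so we can
        # control transactions explicitly with BEGIN/COMMIT
        self._db = sqlite3.connect(":memory:", isolation_level=None)
        self._db.execute(
            "CREATE TABLE reads (id TEXT PRIMARY KEY, length INTEGER)")

    def start_bulk_write(self):
        # group many inserts into one transaction so sqlite3 does not
        # have to commit after every row
        self._db.execute("BEGIN")

    def end_bulk_write(self):
        self._db.execute("COMMIT")

    def add(self, read_id, length):
        self._db.execute(
            "INSERT OR REPLACE INTO reads VALUES (?, ?)", (read_id, length))

    def count_reads(self):
        (n,) = self._db.execute("SELECT COUNT(*) FROM reads").fetchone()
        return n
```

This is why nucleotide.py and translated.py need to signal the start and end of a bulk write: the per-row inserts only stay fast when they land inside one enclosing transaction.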
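The promised double_sort behaviour - largest value first, ties broken alphabetically by key - can be expressed in one sorted() call. A sketch only; the real signature in utilities.py may differ.

```python
def double_sort(counts):
    """Return the keys of `counts` ordered by value, largest first,
    with equal values ordered alphabetically by key."""
    # negate the value so the numeric sort is descending while the
    # key tie-break stays ascending (alphabetical)
    return [k for k, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

# equal values 2 and 2: "apple" comes before "pear"
double_sort({"pear": 2, "apple": 2, "fig": 5})
```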

The unit tests pass :). I also did a run under a memory profiler and looked at the memory allocation. The issue no longer appears, and the main Python process does not start to use much memory until all the computation is done.
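The comment doesn't name the profiler used; as one stdlib option, Python's tracemalloc can confirm this kind of claim by reporting current versus peak allocation around a workload (the list-building workload below is just a stand-in):

```python
import tracemalloc

tracemalloc.start()
data = [list(range(1000)) for _ in range(100)]  # stand-in for a workload
current, peak = tracemalloc.get_traced_memory()
print(f"current={current} bytes, peak={peak} bytes")
tracemalloc.stop()
```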

I have also compared results of a Nextflow pipeline across five runs. The numerical results are the same, and the pipeline behaves better: now, when jobs are not given enough memory for their input size, they take longer instead of failing - I think they're swapping out to disk. I'll attach a report in a comment below, along with two commits for small issues that came up in the run.

Wojtek Bazant added 3 commits February 8, 2021 17:51
…eads_store a bit earlier - there are deletes to it when calculating blastx coverage
…ognise a common use case and an easier query

wbazant commented Feb 11, 2021

Here are the results for two runs of a Nextflow pipeline, with the previous and the new code, on five samples from the Human Microbiome Project.

The previous code is humann from pip as I had it before starting this work - I think a recent checkout of the master branch. The new code is the first commit on this PR; I installed it as:
python3 -m pip install git+https://github.com/wbazant/humann.git@f0ae27738efd7a38a1fc91673f45597b4285042d

The Nextflow pipeline runs humann and retries failing jobs: https://github.com/VEuPathDB/humann-nextflow

The numerical results are the same except for rounding - here's a sample to demonstrate.
results-are-the-same.txt

Here are the traces of the pipeline runs. They're a bit detailed, but in summary: the old run took six days and had lots of failures, and the new run went better.
pipeline-run-traces.txt

This is a report made from the timestamps of individual jobs.
timestamps-report.txt
The big jobs take longer, which I attribute to jobs succeeding with less memory by swapping out to disk - except for gene families, which takes longer because I used a query that turned out to be inefficient; that is fixed in the later commits.
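The PR doesn't show the inefficient query, but the later commit message mentions "an easier query". A common shape for this kind of fix in sqlite3 is replacing per-key round trips with one grouped query plus an index; the table and column names below are purely illustrative.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE alignments (gene TEXT, read_id TEXT, score REAL)")
db.executemany("INSERT INTO alignments VALUES (?, ?, ?)",
               [("g1", "r1", 0.9), ("g1", "r2", 0.8), ("g2", "r1", 0.5)])

# Slow pattern: one query per gene (N round trips through the query planner)
genes = [g for (g,) in db.execute("SELECT DISTINCT gene FROM alignments")]
slow = {g: db.execute("SELECT COUNT(*) FROM alignments WHERE gene = ?",
                      (g,)).fetchone()[0]
        for g in genes}

# Faster: a single grouped query, plus an index for later lookups by gene
db.execute("CREATE INDEX idx_alignments_gene ON alignments (gene)")
fast = dict(db.execute("SELECT gene, COUNT(*) FROM alignments GROUP BY gene"))
assert slow == fast
```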

%s/__stateful_ops_since_commit/__stateful_ops_in_bulk_write/g

wbazant commented Feb 18, 2021

I've done another run, here's an updated timestamps file:
timestamps-report-updated.txt
This was with the last two commits, --block-size 0.8, and a memory limit of 8 GB. It mostly tells me that those parameters are good, though that's probably only useful for me.

More generally, I see that with the change, computing gene families is now slower: it takes up to 5% of the total run time instead of almost none, and 88a3afe7 doesn't help. Also, my timestamps are not useful for comparing different runs - the jobs are probably running on machines with different specs and getting a totally different share of the CPU.
