Lower peak memory usage with sqlite3 #21
Conversation
- Load in the reactions database only at the end, as it needs 2GB of RAM
- Reimplement `store.Alignments` and `store.Reads` to use sqlite3, a core module, instead of in-memory dicts and lists and a temp file; replace data access code with SQL queries
- nucleotide.py and translated.py gain a tiny bit of store logic for the start and end of a bulk write so that it can happen within transactions (see the sketch below)
- Small API change to the `store.Reads` constructor: instead of passing a fasta to the constructor, it happens when calling `store.Reads.add_from_fasta`
- Remove the initial read count state from `store.Reads`
- Update tests: update setups for store modules where the API changed, remove the now-unnecessary tests for the custom temp file logic, change one assertEquals to assertAlmostEquals
- Add unit tests for clearing the store and managing transactions
- Reimplement `utilities.double_sort`: the previous implementation did not sort equal values alphabetically, unlike what it promises
…eads_store a bit earlier - there are deletes to it when calculating blastx coverage
…ognise a common use case and an easier query
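For illustration, here is a minimal sketch of the pattern described above: an sqlite3-backed reads store where a bulk write is bracketed by explicit start/end calls so that all inserts happen inside one transaction, and where reads are added via an `add_from_fasta` method rather than through the constructor. The class, method, and table names are simplified stand-ins, not the actual humann code:

```python
import sqlite3

def _parse_fasta(path):
    # Minimal FASTA reader: yields (read id, sequence) pairs.
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

class ReadsStore:
    """Sketch of an sqlite3-backed reads store; names are illustrative."""

    def __init__(self, path=":memory:"):
        # isolation_level=None puts sqlite3 in autocommit mode, so the
        # bulk write below can manage its own BEGIN/COMMIT transaction.
        self.connection = sqlite3.connect(path, isolation_level=None)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS reads (id TEXT PRIMARY KEY, sequence TEXT)")

    def start_bulk_write(self):
        self.connection.execute("BEGIN")

    def end_bulk_write(self):
        self.connection.execute("COMMIT")

    def add(self, read_id, sequence):
        self.connection.execute(
            "INSERT OR REPLACE INTO reads VALUES (?, ?)", (read_id, sequence))

    def add_from_fasta(self, fasta_path):
        # Reads come in here rather than through the constructor, and the
        # whole load happens inside a single transaction.
        self.start_bulk_write()
        for read_id, sequence in _parse_fasta(fasta_path):
            self.add(read_id, sequence)
        self.end_bulk_write()

    def count_reads(self):
        (count,) = self.connection.execute("SELECT COUNT(*) FROM reads").fetchone()
        return count

    def clear(self):
        self.connection.execute("DELETE FROM reads")
```

And a possible shape for the `double_sort` reimplementation, assuming it takes a dict of numeric values and returns the keys sorted by value in descending order, with ties broken alphabetically (the real signature may differ):

```python
def double_sort(values_by_key):
    # Sort keys by value, largest first; equal values fall back to
    # alphabetical order of the key.
    return sorted(values_by_key, key=lambda key: (-values_by_key[key], key))

# Example: the two equal values come out alphabetically.
assert double_sort({"b": 2, "a": 2, "c": 5}) == ["c", "a", "b"]
```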
Here are the results for two runs of a Nextflow pipeline, with the previous and the new code, on five samples from the Human Microbiome Project. The previous code is humann from pip as I had it before I started doing this, I think a recent checkout of the master branch. The new code is the first commit on this PR; I installed it as … The Nextflow pipeline runs humann and retries failing jobs, link here: https://github.com/VEuPathDB/humann-nextflow

The numerical results are the same except for rounding - here's a sample to demonstrate. Here are the traces of the pipeline runs; they're a bit detailed, but in summary: the old run took six days and had lots of failures, and the new run was better. This is a report made from the timestamps of individual jobs.
`%s/__stateful_ops_since_commit/__stateful_ops_in_bulk_write/g`
I've done another run, here's an updated timestamps file. More generally, I see that with the change, computing gene families is now slower: it takes up to 5% of the total run time instead of next to no time, and 88a3afe7 doesn't help. Also, my timestamps are useless for comparing different runs - the jobs are probably running on machines with different specs and getting a totally different share of the CPU.
Solves #19
Summary of changes
- Reimplement `store.Alignments` and `store.Reads` to use sqlite3, a core module, instead of in-memory dicts and lists and a temp file; replace data access code with SQL queries

The unit tests pass :). I also did a run under a memory profiler and looked at the memory allocation. The issue no longer appears, and the main Python process does not start to use memory until all the computation is done.
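As a rough sketch of this kind of check (not necessarily the profiler used here), the standard library's tracemalloc can report the current and peak allocations of the Python process; `run_pipeline_step` below is a placeholder for whatever computation is being measured:

```python
import tracemalloc

tracemalloc.start()

run_pipeline_step()  # placeholder for the code under test

current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")

# Top allocation sites, to see where the memory went.
for stat in tracemalloc.take_snapshot().statistics("lineno")[:10]:
    print(stat)

tracemalloc.stop()
```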
I have also compared the results of a Nextflow pipeline across five runs. The numerical results are the same, and the pipeline runs better: now, when jobs are not given enough memory for their input size, they take longer instead of failing; I think they're swapping out to disk. I'll attach a report in a comment below, and also two commits for small issues that came up in the run.