
Lower peak memory usage with sqlite3 #21

Open
wbazant wants to merge 4 commits into biobakery:master from wbazant:sqlite3-store

Conversation


@wbazant wbazant commented Feb 11, 2021

Solves #19

Summary of changes

  • Load the reactions database only at the end, as it needs 2 GB of RAM
  • Reimplement store.Alignments and store.Reads on top of sqlite3, a core module, instead of in-memory dicts, lists, and a temp file; replace the data access code with SQL queries
  • nucleotide.py and translated.py gain a little store logic marking the start and end of a bulk write, so that it can happen within a transaction
  • Small API change to the store.Reads constructor: instead of passing a FASTA file to the constructor, reads are loaded by calling store.Reads.add_from_fasta
  • Remove the initial read count state from store.Reads
  • Update tests: adjust setups for store modules where the API changed, remove tests made unnecessary by dropping the custom temp file logic, and change one assertEquals to assertAlmostEquals
  • Add unit tests for clearing the store and managing transactions
  • Reimplement utilities.double_sort: the previous implementation did not sort equal values alphabetically, despite promising to
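The sqlite3-backed store with explicit bulk-write transactions can be sketched roughly as below. This is a minimal illustration, not the PR's actual code: the class name matches store.Reads, but the schema, method names (start_bulk_write, end_bulk_write, add, count_reads), and the in-memory database are assumptions for the example.

```python
import sqlite3

class Reads:
    """Sketch of a read store backed by sqlite3 instead of in-memory dicts."""

    def __init__(self):
        # isolation_level=None puts sqlite3 in autocommit mode, so we can
        # control transactions explicitly with BEGIN/COMMIT
        self._db = sqlite3.connect(":memory:", isolation_level=None)
        self._db.execute(
            "CREATE TABLE reads (id TEXT PRIMARY KEY, length INTEGER)")

    def start_bulk_write(self):
        # group many inserts into one transaction so sqlite3 does not
        # have to commit after every row
        self._db.execute("BEGIN")

    def end_bulk_write(self):
        self._db.execute("COMMIT")

    def add(self, read_id, length):
        self._db.execute(
            "INSERT OR REPLACE INTO reads VALUES (?, ?)", (read_id, length))

    def count_reads(self):
        (n,) = self._db.execute("SELECT COUNT(*) FROM reads").fetchone()
        return n
```

This is why nucleotide.py and translated.py need to signal the start and end of a bulk write: the per-row inserts only stay fast when they land inside one enclosing transaction.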
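The promised double_sort behaviour - largest value first, ties broken alphabetically by key - can be expressed in one sorted() call. A sketch only; the real signature in utilities.py may differ.

```python
def double_sort(counts):
    """Return the keys of `counts` ordered by value, largest first,
    with equal values ordered alphabetically by key."""
    # negate the value so the numeric sort is descending while the
    # key tie-break stays ascending (alphabetical)
    return [k for k, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

# equal values 2 and 2: "apple" comes before "pear"
double_sort({"pear": 2, "apple": 2, "fig": 5})
```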

The unit tests pass :). I also did a run under a memory profiler and looked at the memory allocation. The issue no longer appears, and the main Python process does not start to use much memory until all the computation is done.
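The comment doesn't name the profiler used; as one stdlib option, Python's tracemalloc can confirm this kind of claim by reporting current versus peak allocation around a workload (the list-building workload below is just a stand-in):

```python
import tracemalloc

tracemalloc.start()
data = [list(range(1000)) for _ in range(100)]  # stand-in for a workload
current, peak = tracemalloc.get_traced_memory()
print(f"current={current} bytes, peak={peak} bytes")
tracemalloc.stop()
```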

I have also compared results of a Nextflow pipeline across five runs. The numerical results are the same, and the pipeline behaves better: now, when jobs are not given enough memory for their input size, they take longer instead of failing - I think they're swapping out to disk. I'll attach a report in a comment below, along with two commits for small issues that came up in the run.

Wojtek Bazant added 3 commits February 8, 2021 17:51
…eads_store a bit earlier - there are deletes to it when calculating blastx coverage
…ognise a common use case and an easier query

wbazant commented Feb 11, 2021

Here are the results for two runs of a Nextflow pipeline, with the previous and the new code, on five samples from the Human Microbiome Project.

The previous code is humann from pip as I had it before starting this work - I think a recent checkout of the master branch. The new code is the first commit on this PR; I installed it as:
python3 -m pip install git+https://github.com/wbazant/humann.git@f0ae27738efd7a38a1fc91673f45597b4285042d

The Nextflow pipeline runs humann and retries failing jobs: https://github.com/VEuPathDB/humann-nextflow

The numerical results are the same except for rounding - here's a sample to demonstrate.
results-are-the-same.txt

Here are the traces of the pipeline runs. They're a bit detailed, but in summary: the old run took six days and had lots of failures, and the new run went better.
pipeline-run-traces.txt

This is a report made from the timestamps of individual jobs.
timestamps-report.txt
The big jobs take longer, which I attribute to jobs succeeding with less memory by swapping out to disk - except for gene families, which takes longer because I used a query that turned out to be inefficient; that is fixed in the later commits.
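The PR doesn't show the inefficient query, but the later commit message mentions "an easier query". A common shape for this kind of fix in sqlite3 is replacing per-key round trips with one grouped query plus an index; the table and column names below are purely illustrative.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE alignments (gene TEXT, read_id TEXT, score REAL)")
db.executemany("INSERT INTO alignments VALUES (?, ?, ?)",
               [("g1", "r1", 0.9), ("g1", "r2", 0.8), ("g2", "r1", 0.5)])

# Slow pattern: one query per gene (N round trips through the query planner)
genes = [g for (g,) in db.execute("SELECT DISTINCT gene FROM alignments")]
slow = {g: db.execute("SELECT COUNT(*) FROM alignments WHERE gene = ?",
                      (g,)).fetchone()[0]
        for g in genes}

# Faster: a single grouped query, plus an index for later lookups by gene
db.execute("CREATE INDEX idx_alignments_gene ON alignments (gene)")
fast = dict(db.execute("SELECT gene, COUNT(*) FROM alignments GROUP BY gene"))
assert slow == fast
```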

%s/__stateful_ops_since_commit/__stateful_ops_in_bulk_write/g

wbazant commented Feb 18, 2021

I've done another run, here's an updated timestamps file:
timestamps-report-updated.txt
This was with the last two commits, --block-size 0.8, and a memory limit of 8 GB. It mostly tells me that those parameters are good, though that's probably only useful for me.

More generally, I see that with the change, computing gene families is now slower: it takes up to 5% of the total run time instead of almost none, and 88a3afe7 doesn't help. Also, my timestamps are not useful for comparing different runs - the jobs are probably running on machines with different specs and getting a totally different share of the CPU.
