Benchmarks for the blog post *Processing Large Data Using the Index*.
Compares strategies for syncing a large CSV file with an SQLite database — detecting inserts, updates, and deletes.
There are two benchmark suites: one compares Python sync strategies, the other compares the merge-join algorithm across languages (Python, Node.js, Rust).
| Strategy | Approach | Memory |
|---|---|---|
| Hash join | Load full CSV into a Python dict, scan DB and probe the dict | O(n) |
| Merge join | Stream both CSV and DB in sorted order, merge-join | O(1) |
| SQL join | Load CSV into a temp table, diff with SQL JOINs | O(n) in SQLite |
| Index lookup | Read CSV sequentially, SELECT each row by primary key | O(n) |
The merge join is also implemented in Node.js (`sync_merge_join.mjs`)
and Rust (`rust_merge_join/`), and benchmarked against the Python version.
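The streaming comparison at the heart of the merge join can be sketched in plain Python. This is a simplified illustration, not the actual code in `sync_merge_join.py`; it assumes both inputs yield `(id, row)` pairs sorted by `id`:

```python
def merge_join_diff(csv_rows, db_rows):
    """Compare two streams of (id, row) pairs sorted by id.

    Returns (inserts, updates, deletes) counts while holding at most
    one row from each stream in memory at a time.
    """
    inserts = updates = deletes = 0
    csv_it, db_it = iter(csv_rows), iter(db_rows)
    csv = next(csv_it, None)
    db = next(db_it, None)
    while csv is not None or db is not None:
        if db is None or (csv is not None and csv[0] < db[0]):
            inserts += 1                  # id only in CSV -> insert
            csv = next(csv_it, None)
        elif csv is None or db[0] < csv[0]:
            deletes += 1                  # id only in DB -> delete
            db = next(db_it, None)
        else:                             # same id on both sides
            if csv[1] != db[1]:
                updates += 1              # same id, changed fields
            csv = next(csv_it, None)
            db = next(db_it, None)
    return inserts, updates, deletes
```

Because each stream is advanced at most one row per iteration, only two rows are ever resident at once, which is where the O(1) memory figure in the table comes from.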
```sh
# Generate test data at multiple sizes (10K–10M rows)
python generate_test_data.py

# Run Python strategy benchmarks
python bench_strategies.py

# Run cross-language merge-join benchmarks
python bench_cross_language.py

# Chart either set of results
python chart_results.py --results bench_strategies_results.json --chart bench_strategies_chart.png
python chart_results.py --results bench_cross_language_results.json --chart bench_cross_language_chart.png
```

Or use the shell scripts that run all three steps:

```sh
./bench_strategies.sh       # Python strategies
./bench_cross_language.sh   # Cross-language merge join
```
- `generate_csv.py` creates a CSV with a sequential integer `id` and random `name`, `email`, and `amount` columns.
- `mutate_csv.py` produces a modified copy with random field changes, row deletions, and new row additions.
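To make the data layout concrete, here is a sketch of what the generator might look like. The column names come from the description above, but the value ranges and the helper name `write_test_csv` are illustrative assumptions, not the actual `generate_csv.py`:

```python
import csv
import io
import random

def write_test_csv(out, n_rows, seed=42):
    """Write n_rows of (id, name, email, amount) to a file-like object.

    ids are sequential; the other columns are random. The value ranges
    here are illustrative, not those used by generate_csv.py.
    """
    rng = random.Random(seed)
    writer = csv.writer(out)
    writer.writerow(["id", "name", "email", "amount"])
    for i in range(1, n_rows + 1):
        name = "user%06d" % rng.randrange(1_000_000)
        writer.writerow([i, name, f"{name}@example.com",
                         f"{rng.uniform(0, 1000):.2f}"])

buf = io.StringIO()
write_test_csv(buf, 3)
print(buf.getvalue())
```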
Each strategy loads the original CSV into SQLite, then compares the mutated CSV against the database, counting inserts, updates, and deletes:

- `sync_hash_join.py` — loads the entire mutated CSV into a Python dict (hash table), scans the DB, and probes the dict for changes.
- `sync_merge_join.py` — streams both the sorted CSV and DB cursor in parallel, using a merge join to pair rows by `id`.
- `sync_sql_join.py` — loads the mutated CSV into a temporary SQLite table, then uses SQL JOINs to count inserts, updates, and deletes entirely inside the database engine.
- `sync_index_lookup.py` — reads the CSV sequentially and issues an individual SELECT for each row to probe the database by primary key.
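For contrast with the merge join, the hash-join strategy can be sketched as follows (a simplified illustration of the approach, not the actual code in `sync_hash_join.py`):

```python
def hash_join_diff(csv_rows, db_rows):
    """Hash-join diff: build a dict from the CSV, probe it per DB row.

    Takes O(n) memory for the dict, unlike the O(1) merge join.
    """
    csv_by_id = dict(csv_rows)            # entire CSV held in memory
    updates = deletes = 0
    for row_id, db_row in db_rows:
        csv_row = csv_by_id.pop(row_id, None)
        if csv_row is None:
            deletes += 1                  # in DB but not in CSV
        elif csv_row != db_row:
            updates += 1                  # present in both, changed
    inserts = len(csv_by_id)              # leftovers exist only in CSV
    return inserts, updates, deletes
```

The dict holds the whole CSV at once, which is the O(n) memory cost in the table above; each probe is then a constant-time lookup per DB row.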
- `run_benchmarks.py` is the core benchmark runner. It runs each strategy as an independent subprocess across multiple data sizes, collecting wall time, CPU time, and peak memory (via `/usr/bin/time -v`). `bench_strategies.py` and `bench_cross_language.py` define which strategies to run and call into `run_benchmarks.run_suite()`.
- `chart_results.py` produces a 3-panel comparison chart (wall time, CPU time, peak RSS) from the results JSON.
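Collecting peak memory from `/usr/bin/time -v` comes down to parsing one line of its stderr output. A minimal sketch (the `Maximum resident set size` field is what GNU time actually prints; the helper name and the sample text are illustrative):

```python
import re

def parse_peak_rss_kb(time_v_output):
    """Extract peak RSS in kB from `/usr/bin/time -v` stderr output."""
    m = re.search(r"Maximum resident set size \(kbytes\): (\d+)",
                  time_v_output)
    if m is None:
        raise ValueError("no peak RSS line found")
    return int(m.group(1))

# Illustrative, truncated sample of GNU time's -v output:
sample = """\
\tCommand being timed: "python sync_merge_join.py"
\tMaximum resident set size (kbytes): 14336
"""
print(parse_peak_rss_kb(sample))  # prints 14336
```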
- Python 3.10+ with `matplotlib` (for charting only — the sync scripts use only the standard library).
- Linux `/usr/bin/time -v` (used for peak memory measurement).
- Node.js 22+ (optional, for the JS merge-join benchmark).
- Rust / Cargo (optional, for the Rust merge-join benchmark — build with `cargo build --release` inside `rust_merge_join/`).