ender672/application-level-merge-join

# Processing Large Data Using the Index

Benchmarks for the blog post *Processing Large Data Using the Index*.

Compares strategies for syncing a large CSV file with an SQLite database — detecting inserts, updates, and deletes.

There are two benchmark suites: one compares Python sync strategies, the other compares the merge-join algorithm across languages (Python, Node.js, Rust).

## Python strategies

| Strategy | Approach | Memory |
| --- | --- | --- |
| Hash join | Load full CSV into a Python dict, scan DB and probe the dict | O(n) |
| Merge join | Stream both CSV and DB in sorted order, merge-join | O(1) |
| SQL join | Load CSV into a temp table, diff with SQL JOINs | O(n) in SQLite |
| Index lookup | Read CSV sequentially, SELECT each row by primary key | O(n) |

## Cross-language merge join

The merge join is also implemented in Node.js (`sync_merge_join.mjs`) and Rust (`rust_merge_join/`), benchmarked against the Python version.

## Quick start

```sh
# Generate test data at multiple sizes (10K–10M rows)
python generate_test_data.py

# Run Python strategy benchmarks
python bench_strategies.py

# Run cross-language merge-join benchmarks
python bench_cross_language.py

# Chart either set of results
python chart_results.py --results bench_strategies_results.json --chart bench_strategies_chart.png
python chart_results.py --results bench_cross_language_results.json --chart bench_cross_language_chart.png
```

Or use the shell scripts that run all three steps:

```sh
./bench_strategies.sh         # Python strategies
./bench_cross_language.sh     # Cross-language merge join
```

## How it works

1. `generate_csv.py` creates a CSV with a sequential integer `id` plus random `name`, `email`, and `amount` columns.

2. `mutate_csv.py` produces a modified copy with random field changes, row deletions, and new row additions.

3. Each strategy loads the original CSV into SQLite, then compares the mutated CSV against the database, counting inserts, updates, and deletes:

   - `sync_hash_join.py` — loads the entire mutated CSV into a Python dict (hash table), scans the DB, and probes the dict for changes.
   - `sync_merge_join.py` — streams the sorted CSV and the DB cursor in parallel, using a merge join to pair rows by `id`.
   - `sync_sql_join.py` — loads the mutated CSV into a temporary SQLite table, then uses SQL JOINs to count inserts, updates, and deletes entirely inside the database engine.
   - `sync_index_lookup.py` — reads the CSV sequentially and issues an individual SELECT for each row, probing the database by primary key.

4. `run_benchmarks.py` is the core benchmark runner. It runs each strategy as an independent subprocess across multiple data sizes, collecting wall time, CPU time, and peak memory (via `/usr/bin/time -v`). `bench_strategies.py` and `bench_cross_language.py` define which strategies to run and call into `run_benchmarks.run_suite()`.

5. `chart_results.py` produces a 3-panel comparison chart (wall time, CPU time, peak RSS) from the results JSON.

## Requirements

- Python 3.10+ with `matplotlib` (for charting only — the sync scripts use only the standard library).
- Linux with `/usr/bin/time -v` (used for peak-memory measurement).
- Node.js 22+ (optional, for the JS merge-join benchmark).
- Rust/Cargo (optional, for the Rust merge-join benchmark — build with `cargo build --release` inside `rust_merge_join/`).
