ender672/application-level-merge-join

# Processing Large Data Using the Index

Benchmarks for the blog post *Processing Large Data Using the Index*.

Compares strategies for syncing a large CSV file with an SQLite database — detecting inserts, updates, and deletes.

There are two benchmark suites: one compares Python sync strategies, the other compares the merge-join algorithm across languages (Python, Node.js, Rust).

## Python strategies

| Strategy | Approach | Memory |
| --- | --- | --- |
| Hash join | Load full CSV into a Python dict, scan DB and probe the dict | O(n) |
| Merge join | Stream both CSV and DB in sorted order, merge-join | O(1) |
| SQL join | Load CSV into a temp table, diff with SQL JOINs | O(n) in SQLite |
| Index lookup | Read CSV sequentially, SELECT each row by primary key | O(n) |

## Cross-language merge join

The merge join is also implemented in Node.js (`sync_merge_join.mjs`) and Rust (`rust_merge_join/`), benchmarked against the Python version.

## Quick start

```sh
# Generate test data at multiple sizes (10K–10M rows)
python generate_test_data.py

# Run Python strategy benchmarks
python bench_strategies.py

# Run cross-language merge-join benchmarks
python bench_cross_language.py

# Chart either set of results
python chart_results.py --results bench_strategies_results.json --chart bench_strategies_chart.png
python chart_results.py --results bench_cross_language_results.json --chart bench_cross_language_chart.png
```

Or use the shell scripts that run all three steps:

```sh
./bench_strategies.sh         # Python strategies
./bench_cross_language.sh     # Cross-language merge join
```

## How it works

1. `generate_csv.py` creates a CSV with a sequential integer `id` plus random `name`, `email`, and `amount` columns.

2. `mutate_csv.py` produces a modified copy with random field changes, row deletions, and new row additions.

3. Each strategy loads the original CSV into SQLite, then compares the mutated CSV against the database, counting inserts, updates, and deletes:

   - `sync_hash_join.py` — loads the entire mutated CSV into a Python dict (hash table), scans the DB, and probes the dict for changes.
   - `sync_merge_join.py` — streams the sorted CSV and the DB cursor in parallel, using a merge join to pair rows by `id`.
   - `sync_sql_join.py` — loads the mutated CSV into a temporary SQLite table, then uses SQL JOINs to count inserts, updates, and deletes entirely inside the database engine.
   - `sync_index_lookup.py` — reads the CSV sequentially and issues an individual SELECT for each row, probing the database by primary key.

4. `run_benchmarks.py` is the core benchmark runner. It runs each strategy as an independent subprocess across multiple data sizes, collecting wall time, CPU time, and peak memory (via `/usr/bin/time -v`). `bench_strategies.py` and `bench_cross_language.py` define which strategies to run and call into `run_benchmarks.run_suite()`.

5. `chart_results.py` produces a 3-panel comparison chart (wall time, CPU time, peak RSS) from the results JSON.

## Requirements

- Python 3.10+ with `matplotlib` (for charting only — the sync scripts use only the standard library).
- Linux with `/usr/bin/time -v` (used for peak-memory measurement).
- Node.js 22+ (optional, for the JS merge-join benchmark).
- Rust/Cargo (optional, for the Rust merge-join benchmark — build with `cargo build --release` inside `rust_merge_join/`).
