
Architecture Comparison: scanner-rs vs. Competitors

An evidence-based analysis mapping design decisions to hardware performance counters. Every claim links to source code in both scanner-rs and the competitor codebase it is compared against.


1. Executive Summary

scanner-rs is a secret scanner for git repositories and filesystems. It was designed around the CPU — cache hierarchy, branch predictor, TLB, SIMD — rather than around programmer convenience. This report documents the measurable impact of that approach.

Headline results (128-run benchmark, 8 repositories, 2 scan modes, 2 cache states):

  • Faster wall-clock time than every competitor in all 128 test configurations; 1.3–60x faster on the warm git scans summarized in Section 2.3
  • 2.3x faster than Kingfisher (closest Rust competitor) on the representative vscode warm-cache git scan
  • 8–13x faster than TruffleHog and Gitleaks (Go) on the same workload
  • 3.4x fewer CPU cycles, 3.5x fewer instructions, and 4.2x fewer branch mispredictions per scan than the closest competitor on each metric

Honest callouts:

  • 2–3x more RSS memory. Pre-allocated pools, per-worker scratch, and fixed-capacity arenas trade memory for speed. This is deliberate and documented.
  • Different finding counts. scanner-rs currently reports more findings than competitors because it lacks false-positive filters that other scanners ship: entropy gates (applied to the secret span), safelists, and confidence scoring. These are planned additions. Throughput comparisons are unaffected by finding volume.
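To make the planned entropy gate concrete, here is a minimal sketch of a Shannon-entropy check applied to a candidate secret span. This is illustrative only — the function name and the ~3-bit threshold are assumptions, not scanner-rs code:

```rust
/// Shannon entropy in bits per byte over a candidate secret span.
/// A gate would drop findings whose span entropy falls below a threshold
/// (e.g. ~3.0 bits/byte), filtering out low-randomness strings.
fn shannon_entropy(span: &[u8]) -> f64 {
    let mut counts = [0u32; 256];
    for &b in span {
        counts[b as usize] += 1;
    }
    let n = span.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    // Repeated characters carry no information: entropy is 0.
    assert!(shannon_entropy(b"aaaaaaaa").abs() < 1e-9);
    // 16 distinct symbols in 16 bytes: exactly 4 bits/byte.
    assert!((shannon_entropy(b"abcdefghijklmnop") - 4.0).abs() < 1e-9);
}
```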

2. Benchmark Results

2.1 Test Environment

| Parameter | Value |
| --- | --- |
| Machine | ARM Graviton3 (aarch64), 16 vCPUs, 61 GiB RAM |
| L1d/L1i | 64 KiB each |
| L2 | 1 MiB |
| L3 | 32 MiB |
| Storage | EBS-backed NVMe (560 GiB, presents as /dev/nvme0n1 on EC2) |
| Rust | 1.90.0 |
| Go | 1.23.3 |
| Runs | 128 total (8 repos x 2 modes x 2 cache states x 4 scanners) |

2.2 Wall Time

| Repo | Mode | Cache | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
| --- | --- | --- | --- | --- | --- | --- |
| node | git | cold | 15.7s | 1m38s | 8m03s | 6m48s |
| node | git | warm | 13.3s | 1m17s | 7m54s | 6m39s |
| node | fs | cold | 15.2s | 20.2s | 23.9s | 41.1s |
| node | fs | warm | 1.5s | 5.0s | 18.6s | 38.5s |
| vscode | git | cold | 16.8s | 43.5s | 3m08s | 2m04s |
| vscode | git | warm | 13.7s | 31.1s | 2m59s | 1m54s |
| vscode | fs | cold | 3.0s | 6.9s | 15.1s | 9.6s |
| vscode | fs | warm | 0.9s | 4.6s | 13.3s | 10.4s |
| linux | git | cold | 2m51s | 7m22s | 28m54s | 21m20s |
| linux | git | warm | 2m38s | 5m57s | 27m57s | 20m27s |
| linux | fs | cold | 28.9s | 35.2s | 1m02s | 1m14s |
| linux | fs | warm | 2.2s | 5.2s | 1m02s | 1m09s |
| rocksdb | git | cold | 3.8s | 8.6s | 36.3s | 22.5s |
| rocksdb | git | warm | 3.1s | 7.2s | 33.8s | 21.0s |
| rocksdb | fs | cold | 0.8s | 7.0s | 5.1s | 2.8s |
| rocksdb | fs | warm | 0.7s | 6.5s | 4.0s | 2.8s |
| tensorflow | git | cold | 25.1s | 1m12s | 5m49s | 3m50s |
| tensorflow | git | warm | 21.0s | 50.7s | 5m36s | 3m40s |
| tensorflow | fs | cold | 10.4s | 15.7s | 21.5s | 26.4s |
| tensorflow | fs | warm | 1.1s | 5.3s | 18.9s | 27.0s |
| Babylon.js | git | cold | 12.7s | 21.2s | 2m14s | 2m05s |
| Babylon.js | git | warm | 10.9s | 14.6s | 2m07s | 2m01s |
| Babylon.js | fs | cold | 1.7s | 7.0s | 19.1s | 17.0s |
| Babylon.js | fs | warm | 0.8s | 4.6s | 17.4s | 16.4s |
| gcc | git | cold | 2m12s | 5m52s | 30m35s | 145m06s |
| gcc | git | warm | 2m25s | 4m34s | 30m04s | 145m21s |
| gcc | fs | cold | 53.5s | 59.3s | 1m02s | 141m59s |
| gcc | fs | warm | 2.7s | 6.8s | 44.7s | 142m06s |
| jdk | git | cold | 22.6s | 1m13s | 6m15s | 5m39s |
| jdk | git | warm | 19.8s | 33.8s | 6m04s | 5m19s |
| jdk | fs | cold | 24.9s | 32.3s | 35.9s | 41.1s |
| jdk | fs | warm | 1.8s | 7.9s | 19.2s | 32.3s |

2.3 Speedup Summary (warm git mode, scanner-rs as baseline)

How many times slower each competitor is vs scanner-rs:

| Repo | vs Kingfisher | vs TruffleHog | vs Gitleaks |
| --- | --- | --- | --- |
| node | 5.8x | 35.6x | 30.0x |
| vscode | 2.3x | 13.1x | 8.3x |
| linux | 2.3x | 10.6x | 7.8x |
| rocksdb | 2.3x | 10.8x | 6.7x |
| tensorflow | 2.4x | 16.0x | 10.5x |
| Babylon.js | 1.3x | 11.6x | 11.1x |
| gcc | 1.9x | 12.4x | 60.0x |
| jdk | 1.7x | 18.3x | 16.1x |

2.4 Throughput

| Repo | Mode | Cache | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
| --- | --- | --- | --- | --- | --- | --- |
| node | git | warm | 106.1 MiB/s | 18.3 MiB/s | 3.0 MiB/s | 3.5 MiB/s |
| vscode | git | warm | 84.1 MiB/s | 37.0 MiB/s | 6.4 MiB/s | 10.1 MiB/s |
| linux | git | warm | 39.0 MiB/s | 17.2 MiB/s | 3.7 MiB/s | 5.0 MiB/s |
| linux | fs | warm | 3.3 GiB/s | 1.4 GiB/s | 125.1 MiB/s | 111.7 MiB/s |
| vscode | fs | warm | 1.5 GiB/s | 283.7 MiB/s | 97.9 MiB/s | 125.0 MiB/s |
| gcc | fs | warm | 1.8 GiB/s | 715.7 MiB/s | 109.1 MiB/s | 0.6 MiB/s |

Peak filesystem throughput reaches 3.3 GiB/s on the linux kernel (warm cache).

2.5 Peak Memory Usage

| Repo | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
| --- | --- | --- | --- | --- |
| node | 5.5 GiB | 2.3 GiB | 1.7 GiB | 1.6 GiB |
| vscode | 5.4 GiB | 2.1 GiB | 1.6 GiB | 1.3 GiB |
| linux | 22.9 GiB | 8.1 GiB | 8.3 GiB | 7.2 GiB |
| rocksdb | 2.8 GiB | 1.6 GiB | 403 MiB | 403 MiB |
| tensorflow | 7.2 GiB | 2.4 GiB | 1.8 GiB | 1.4 GiB |
| Babylon.js | 4.5 GiB | 2.8 GiB | 1.5 GiB | 1.3 GiB |
| gcc | 15.8 GiB | 5.6 GiB | 4.8 GiB | 4.5 GiB |
| jdk | 6.2 GiB | 2.3 GiB | 1.8 GiB | 1.6 GiB |

scanner-rs uses 2–3x more RSS than competitors. This is the cost of pre-allocated pools, per-worker scratch memory, and fixed-capacity arenas. See Section 5 for analysis.


3. CPU-Level Analysis

All measurements on the vscode repository, git mode, warm cache (1.12 GiB scanned).

3.1 Raw Hardware Counters

| Metric | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
| --- | --- | --- | --- | --- |
| Cycles | 157,619,084,107 | 532,509,784,422 | 678,210,237,914 | 2,271,669,688,468 |
| Instructions | 411,789,892,684 | 1,426,719,985,424 | 1,696,903,305,992 | 10,690,140,886,959 |
| L1D loads | 123,210,234,217 | 452,557,253,455 | 498,049,596,053 | 4,452,278,795,146 |
| L1D misses | 1,209,415,888 | 2,447,094,121 | 5,128,182,208 | 8,775,249,857 |
| L1I loads | 80,809,948,119 | 330,390,689,117 | 355,390,878,506 | 1,523,868,766,465 |
| L1I misses | 283,561,600 | 745,428,529 | 5,392,181,661 | 2,888,133,830 |
| L2D refills | 328,843,938 | 497,284,829 | 2,618,442,067 | 1,850,602,731 |
| L2D writebacks | 713,524,135 | 1,517,956,160 | 4,007,442,708 | 3,145,739,388 |
| Branch predictions | 97,941,399,259 | 299,916,874,655 | 309,692,700,357 | 2,621,171,103,819 |
| Branch misses | 1,870,945,056 | 8,238,677,792 | 7,879,836,883 | 11,553,796,741 |
| Frontend stalls | 11,626,952,340 | 52,555,189,605 | 97,980,832,540 | 90,284,295,847 |
| Backend stalls | 64,097,407,942 | 141,440,571,268 | 230,734,465,747 | 663,045,657,519 |
| dTLB loads | 126,957,811,310 | 452,810,087,792 | 499,657,825,644 | 4,455,730,287,796 |
| dTLB misses | 826,966,710 | 1,533,221,981 | 4,008,936,174 | 4,157,538,969 |
| dTLB walks | 78,874,945 | 111,265,937 | 461,329,206 | 283,306,516 |
| iTLB loads | 25,600,430,196 | 143,805,579,347 | 167,599,942,712 | 167,910,816,063 |
| iTLB misses | 44,955,625 | 112,654,236 | 879,805,524 | 415,683,618 |

3.2 Derived Metrics

| Metric | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
| --- | --- | --- | --- | --- |
| IPC | 2.61 | 2.68 | 2.50 | 4.71 |
| L1D miss rate | 0.982% | 0.541% | 1.03% | 0.197% |
| Branch miss rate | 1.91% | 2.75% | 2.54% | 0.441% |
| Frontend stall % | 7.38% | 9.87% | 14.45% | 3.97% |
| Backend stall % | 40.67% | 26.56% | 34.02% | 29.19% |
| dTLB miss rate | 0.651% | 0.339% | 0.802% | 0.093% |
| Insns/L1D miss | 340.49 | 583.03 | 330.90 | 1,218.21 |
| Bytes/insn | 0.0029 | 0.0008 | 0.0007 | 0.0001 |

Reading these metrics correctly: per-instruction rates (miss rate, IPC) can be misleading across scanners that execute vastly different instruction counts. Gitleaks shows 4.71 IPC and a 0.197% L1D miss rate — but it executes 26x more instructions than scanner-rs on the same input. High IPC on wasted work is not an advantage. This report therefore relies on absolute counts, which directly determine wall-clock time.

3.3 Design Decision Summary

| # | Design Decision | Key Metric | scanner-rs | Closest Competitor | Advantage |
| --- | --- | --- | --- | --- | --- |
| 1 | Vectorscan multi-pattern DFA | Total cycles | 157,619,084,107 | Kingfisher: 532,509,784,422 | 3.4x fewer |
| 2 | Anchor-first scanning | Total instructions | 411,789,892,684 | Kingfisher: 1,426,719,985,424 | 3.5x fewer |
| 3 | Deterministic DFA transitions | Branch misses | 1,870,945,056 | TruffleHog: 7,879,836,883 | 4.2x fewer |
| 4 | Per-worker scratch (no sharing) | L2 refills | 328,843,938 | Kingfisher: 497,284,829 | 1.5x fewer |
| 5 | Compact packed metadata | L1D misses | 1,209,415,888 | Kingfisher: 2,447,094,121 | 2.0x fewer |
| 6 | Pre-allocated fixed-capacity pools | dTLB misses | 826,966,710 | Kingfisher: 1,533,221,981 | 1.9x fewer |
| 7 | Work-stealing + cache locality | Backend stall cycles | 64,097,407,942 | Kingfisher: 141,440,571,268 | 2.2x fewer |
| 8 | Cache-line aligned atomics | L2 writebacks | 713,524,135 | Kingfisher: 1,517,956,160 | 2.1x fewer |
| 9 | I/O hints (fadvise + madvise) | FS cold/warm ratio | 9.1x avg | Kingfisher: 3.8x avg | 2.4x larger |
| 10 | Custom git object pipeline | Git warm speedup | 1.3–5.8x vs KF | Kingfisher: gix library | Additive |

4. Evidence-Based Deep-Dive

Each subsection follows the same structure:

  1. What we measured — relevant perf counters
  2. scanner-rs code — the design, with file:line references
  3. Competitor code — the contrasting approach, with file:line references
  4. Why the design difference explains the measured outcome

4.1 Multi-Pattern DFA: Fewer Instructions, Fewer Branch Misses

What we measured:

| Metric | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
| --- | --- | --- | --- | --- |
| Instructions | 411,789,892,684 | 1,426,719,985,424 | 1,696,903,305,992 | 10,690,140,886,959 |
| Branch misses | 1,870,945,056 | 8,238,677,792 | 7,879,836,883 | 11,553,796,741 |
| Branch miss rate | 1.91% | 2.75% | 2.54% | 0.441% |

scanner-rs: Single Vectorscan DFA pass

All ~223 detection rules compile into a single Vectorscan (Hyperscan) multi-pattern database. The DFA scans the input buffer in one pass using SIMD-accelerated state transitions. Each byte advances the automaton state via a table lookup — no per-pattern branching.

src/engine/vectorscan_prefilter.rs:112-135 — VsPrefilterDb:

pub(crate) struct VsPrefilterDb {
    /// Compiled Vectorscan block-mode database.
    db: *mut vs::hs_database_t,
    /// Number of raw rule patterns in the database.
    raw_rule_count: u32,
    /// Per-raw-pattern metadata (rule id + width + seed radius).
    raw_meta: Vec<RawPatternMeta>,
    /// Rule ids that failed individual compilation (fallback path).
    raw_missing_rules: Vec<u32>,
    /// Pattern id where anchor literals begin (equals `raw_rule_count`).
    anchor_id_base: u32,
    /// Number of anchor literal patterns.
    anchor_pat_count: u32,
    /// Prefix-sum offsets into `anchor_targets`.
    anchor_pat_offsets: Vec<u32>,
    /// Byte length of each anchor pattern.
    anchor_pat_lens: Vec<u32>,
    /// Max bounded width across all rules.
    max_width: u32,
    /// True if any rule reports an unbounded width.
    unbounded: bool,
}

src/engine/vectorscan_prefilter.rs:89-100 — Per-pattern metadata, 12 bytes #[repr(C)]:

#[repr(C)]
#[derive(Clone, Copy, Debug)]
struct RawPatternMeta {
    rule_id: u32,
    match_width: u32,
    seed_radius: u32,
}

// Compile-time size guard: 3 x u32 = 12 bytes, no padding under #[repr(C)].
const _: () = assert!(std::mem::size_of::<RawPatternMeta>() == 12);

src/engine/core.rs:30-44 — Scan algorithm: prefilter seeds windows, regex only runs in hit windows:

// ### Scan phase (`scan_chunk_into`)
//
// Run Vectorscan prefilter on root buffer to populate touched pairs.
// Enqueue `ScanBuf(root)` into the work queue.
// Process work items in FIFO order:
//   - `ScanBuf`: validate regexes in prefilter windows (see
//     `buffer_scan`), then discover transform spans
//     and enqueue `DecodeSpan` items.
//   - `DecodeSpan`: decode the span, then enqueue a `ScanBuf` for the
//     decoded output.
// - Budgets (decode bytes, work items, depth) are enforced per-item so no
//   single input forces unbounded work.
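The budget-enforcement idea in the comment above can be sketched in isolation. The types and constants below are hypothetical stand-ins, not the actual scanner-rs work items:

```rust
use std::collections::VecDeque;

// Hypothetical work items mirroring the ScanBuf/DecodeSpan shape.
enum WorkItem {
    ScanBuf(Vec<u8>),
    DecodeSpan(Vec<u8>),
}

struct Budget {
    items_left: u32,
    decode_bytes_left: usize,
}

/// Drain the FIFO queue, enforcing per-item budgets so a pathological input
/// cannot enqueue unbounded follow-up work. Returns the number of items
/// processed before the queue emptied or a budget ran out.
fn drain(queue: &mut VecDeque<WorkItem>, budget: &mut Budget) -> u32 {
    let mut processed = 0;
    while budget.items_left > 0 {
        let Some(item) = queue.pop_front() else { break };
        budget.items_left -= 1;
        processed += 1;
        match item {
            WorkItem::ScanBuf(_buf) => {
                // Validate regexes in prefilter windows, discover transform
                // spans, enqueue DecodeSpan items (elided).
            }
            WorkItem::DecodeSpan(span) => {
                // Only decode if the byte budget allows it.
                if span.len() <= budget.decode_bytes_left {
                    budget.decode_bytes_left -= span.len();
                    queue.push_back(WorkItem::ScanBuf(span));
                }
            }
        }
    }
    processed
}

fn main() {
    let mut q = VecDeque::new();
    q.push_back(WorkItem::ScanBuf(vec![0u8; 4]));
    q.push_back(WorkItem::DecodeSpan(vec![0u8; 8]));
    let mut budget = Budget { items_left: 10, decode_bytes_left: 16 };
    // The decoded span re-enters the queue as a ScanBuf: 3 items total.
    assert_eq!(drain(&mut q, &mut budget), 3);
    assert_eq!(budget.decode_bytes_left, 8);
}
```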

TruffleHog: Aho-Corasick dispatch + per-detector regex

TruffleHog uses an Aho-Corasick automaton to pre-filter, but then dispatches to individual detectors — each running its own regex engine on the matched span. The per-detector dispatch creates O(detectors x spans) regex work.

../trufflehog/pkg/engine/engine.go:798-819:

matchingDetectors := e.AhoCorasickCore.FindDetectorMatches(decoded.Chunk.Data)
if len(matchingDetectors) > 1 && !e.verificationOverlap {
    wgVerificationOverlap.Add(1)
    e.verificationOverlapChunksChan <- verificationOverlapChunk{
        chunk:                       *decoded.Chunk,
        detectors:                   matchingDetectors,
        decoder:                     decoded.DecoderType,
        verificationOverlapWgDoneFn: wgVerificationOverlap.Done,
    }
    continue
}

for _, detector := range matchingDetectors {
    decoded.Chunk.Verify = e.shouldVerifyChunk(sourceVerify, detector, e.detectorVerificationOverrides)
    wgDetect.Add(1)
    e.detectableChunksChan <- detectableChunk{
        chunk:    *decoded.Chunk,
        detector: detector,
        decoder:  decoded.DecoderType,
        wgDoneFn: wgDetect.Done,
    }
}

Each detector in the loop runs its own regex engine internally. This is O(matched_detectors) regex invocations per chunk.

Gitleaks: Sequential rule iteration with per-rule regex

Gitleaks iterates all rules sequentially against each fragment, running Go's regexp package on every matched rule.

../gitleaks/detect/detect.go:327-347:

for _, rule := range d.Config.Rules {
    select {
    case <-ctx.Done():
        break ScanLoop
    default:
        if len(rule.Keywords) == 0 {
            findings = append(findings, d.detectRule(fragment, currentRaw, rule, encodedSegments)...)
            continue
        }

        for _, k := range rule.Keywords {
            if _, ok := keywords[strings.ToLower(k)]; ok {
                findings = append(findings, d.detectRule(fragment, currentRaw, rule, encodedSegments)...)
                break
            }
        }
    }
}

../gitleaks/detect/detect.go:442 — Each detectRule call runs regex:

matches := r.Regex.FindAllStringIndex(currentRaw, -1)

This is O(rules x fragments) regex invocations. Go's regexp uses NFA simulation (no DFA compilation), creating unpredictable branching.

Why this explains the measurements

scanner-rs compiles all patterns into a single DFA with deterministic state transitions (table lookup, no branching per pattern). Competitors dispatch to separate regex engines per rule or per detector, creating:

  • 3.5x more instructions (Kingfisher) to 26x more (Gitleaks): multiple regex engines execute redundant state machine setup
  • 4.2x more branch misses (TruffleHog): the CPU cannot predict which detector will match, causing speculation failures at each dispatch boundary

4.2 Per-Worker Scratch Memory: Lower L2 Cache Refills

What we measured:

| Metric | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
| --- | --- | --- | --- | --- |
| L2 refills | 328,843,938 | 497,284,829 | 2,618,442,067 | 1,850,602,731 |
| L2 miss rate | 6.24% | 4.27% | 9.55% | 6.42% |
| LLC misses | 329,841,031 | 495,869,342 | 2,592,819,199 | 1,843,850,762 |

scanner-rs: Thread-local scratch, no sharing

Each worker thread owns a WorkerCtx containing its own scratch buffers, Vectorscan scratch space, and memory pools — all accessed via Rc (not Arc), never shared across threads.

src/scheduler/executor.rs:472-508 — WorkerCtx:

pub struct WorkerCtx<T, S> {
    /// Worker ID (0..workers).
    pub worker_id: usize,
    /// User-defined per-worker scratch space.
    pub scratch: S,
    /// Per-worker RNG for randomized stealing.
    pub rng: XorShift64,
    /// Per-worker metrics (no cross-thread contention).
    pub metrics: WorkerMetricsLocal,
    local: Worker<T>,
    // ...
}

src/scratch_memory.rs:43-58 — ScratchVec: fixed-capacity, page-aligned, never reallocates:

pub struct ScratchVec<T> {
    ptr: NonNull<MaybeUninit<T>>,
    len: u32,
    cap: u32,
}

src/engine/vectorscan_prefilter.rs:229-252 — VsScratch: per-thread, Send but not Sync:

pub(crate) struct VsScratch {
    /// Opaque Vectorscan scratch handle (must not be shared across threads).
    scratch: *mut vs::hs_scratch_t,
    /// Database this scratch was allocated for (used for binding validation).
    db: *mut vs::hs_database_t,
}

// SAFETY: VsScratch exclusively owns its hs_scratch_t allocation.
// Transfer to another thread is safe; concurrent use is not (we don't impl Sync).
unsafe impl Send for VsScratch {}
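The ownership pattern — scratch moved into each worker thread, with no Arc and no locking — can be sketched with std threads alone (the Scratch type here is illustrative, not the real WorkerCtx):

```rust
use std::thread;

// Per-worker scratch: owned, Send, never shared between threads.
struct Scratch {
    buf: Vec<u8>,
}

fn run_workers(n: usize) -> Vec<usize> {
    let handles: Vec<_> = (0..n)
        .map(|id| {
            // Scratch is moved into the thread: exclusive ownership means
            // its cache lines never bounce between cores.
            let mut scratch = Scratch { buf: vec![0u8; 4096] };
            thread::spawn(move || {
                scratch.buf[0] = id as u8;
                scratch.buf[0] as usize
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let mut results = run_workers(4);
    results.sort_unstable();
    assert_eq!(results, vec![0, 1, 2, 3]);
}
```

Because nothing is shared, the compiler requires only `Send` (transfer of ownership), never `Sync` — the same property the `VsScratch` impl above encodes.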

TruffleHog: Shared state behind sync.RWMutex

TruffleHog shares metrics state across goroutines behind a sync.RWMutex.

../trufflehog/pkg/engine/engine.go:57-61:

type runtimeMetrics struct {
    mu sync.RWMutex
    Metrics
    detectorAvgTime sync.Map
}

../trufflehog/pkg/engine/engine.go:210 — LRU dedup cache shared across workers:

dedupeCache *lru.Cache[string, detectorspb.DecoderType]

Gitleaks: Mutex-guarded findings slice

../gitleaks/detect/detect.go:71-89:

// commitMutex is to prevent concurrent access to the
// commit map when adding commits
commitMutex *sync.Mutex

// findingMutex is to prevent concurrent access to the
// findings slice when adding findings.
findingMutex *sync.Mutex

// findings is a slice of report.Findings.
findings []report.Finding

Why this explains the measurements

When multiple goroutines contend on shared state (sync.RWMutex, sync.Mutex, shared *lru.Cache), the MOESI/MESI cache coherence protocol must transfer ownership of the contended cache lines between cores. Each transfer triggers an L2 refill as the line is fetched from the remote core's cache. scanner-rs avoids this entirely: each worker's scratch data stays in its own L1/L2 slice with no cross-core invalidation traffic, resulting in 1.5x fewer L2 refills than even Kingfisher (which also uses Rust but relies on Arc<Mutex> for shared stats).


4.3 Cache-Line Aligned Atomics: No False Sharing

What we measured:

| Metric | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
| --- | --- | --- | --- | --- |
| L2 writebacks | 713,524,135 | 1,517,956,160 | 4,007,442,708 | 3,145,739,388 |
| L2 allocations | 208,956,309 | 474,052,300 | 620,612,044 | 524,963,070 |

scanner-rs: #[repr(align(64))] padded counters

src/engine/core.rs:142-167:

/// Cache-line padded atomic counter to reduce false sharing between workers.
///
/// Each instance occupies exactly one 64-byte cache line so that concurrent
/// increments from different threads never contend on the same line.
#[cfg(feature = "stats")]
#[repr(align(64))]
#[derive(Default)]
pub(super) struct CachePaddedAtomicU64(AtomicU64);

// Compile-time size/alignment guard: each counter occupies exactly one cache line.
#[cfg(feature = "stats")]
const _: () = assert!(
    std::mem::align_of::<CachePaddedAtomicU64>() == 64
        && std::mem::size_of::<CachePaddedAtomicU64>() == 64
);

src/engine/core.rs:172-179 — Each counter field is independently padded:

pub(super) struct VectorscanCounters {
    pub(super) scans_attempted: CachePaddedAtomicU64,
    pub(super) scans_ok: CachePaddedAtomicU64,
    pub(super) scans_err: CachePaddedAtomicU64,
    pub(super) utf16_scans_attempted: CachePaddedAtomicU64,
    pub(super) utf16_scans_ok: CachePaddedAtomicU64,
    pub(super) utf16_scans_err: CachePaddedAtomicU64,
    // ...
}

src/scheduler/metrics.rs:1-44 — Worker metrics are also cache-line aligned:

// ## False Sharing Prevention
//
// `WorkerMetricsLocal` is aligned to 64 bytes (cache line size on x86-64).
// When workers store metrics in a contiguous array, this alignment ensures
// each worker's hot counters don't share cache lines with adjacent workers.

src/engine/scratch.rs:384-395 — Even struct layout uses cacheline boundaries:

/// Zero-sized alignment marker that forces a 64-byte cache-line boundary
/// between the hot and cold regions of `ScanScratch`.
#[repr(align(64))]
struct CachelineBoundary {
    _pad: [u8; 0],
}

TruffleHog: Packed atomic fields

In Go, atomic.Int64 fields are typically packed together in structs. When multiple goroutines increment adjacent counters, the 8-byte atomics share 64-byte cache lines, causing false-sharing invalidations on every store. Go has no alignment directive comparable to Rust's #[repr(align(64))]; line padding can only be simulated with manually inserted filler fields.
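The layout difference is easy to verify with `size_of`. A sketch of both layouts (a simplified analogue of the CachePaddedAtomicU64 pattern above):

```rust
use std::sync::atomic::AtomicU64;

// Padded: each counter owns a full 64-byte cache line.
#[repr(align(64))]
struct PaddedCounter(AtomicU64);

// Compile-time guard, mirroring the scanner-rs pattern.
const _: () = assert!(std::mem::size_of::<PaddedCounter>() == 64);

fn main() {
    // Packed (Go-style): 8 adjacent 8-byte atomics share one cache line,
    // so increments from different cores invalidate each other's copy.
    assert_eq!(std::mem::size_of::<[AtomicU64; 8]>(), 64);
    // Padded: 8 counters occupy 8 separate lines; no false sharing.
    assert_eq!(std::mem::size_of::<[PaddedCounter; 8]>(), 512);
}
```

The 8x size increase is the memory-for-speed trade documented in the Executive Summary: each counter burns 56 unused bytes to guarantee that concurrent increments never contend.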

Why this explains the measurements

False sharing causes L2 writebacks to spike: when one core modifies a cache line that another core also holds, the MOESI protocol forces a writeback of the invalidated line. scanner-rs eliminates this by ensuring each atomic counter occupies its own 64-byte cache line, verified at compile time. The result: 2.1x fewer L2 writebacks than the closest competitor.


4.4 Pre-Allocated Fixed-Capacity Pools: Lower dTLB Misses

What we measured:

| Metric | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
| --- | --- | --- | --- | --- |
| dTLB misses | 826,966,710 | 1,533,221,981 | 4,008,936,174 | 4,157,538,969 |
| dTLB miss rate | 0.651% | 0.339% | 0.802% | 0.093% |
| dTLB walks | 78,874,945 | 111,265,937 | 461,329,206 | 283,306,516 |

scanner-rs: Everything pre-allocated at startup

src/scratch_memory.rs:43-127 — ScratchVec: page-aligned, fixed capacity, never grows:

/// Fixed-capacity scratch vector backed by page-aligned storage.
///
/// This is a `Vec`-like API with a hard capacity. It never reallocates, so
/// once constructed it is safe to use in hot loops without risking
/// allocations.
pub struct ScratchVec<T> {
    ptr: NonNull<MaybeUninit<T>>,
    len: u32,
    cap: u32,
}

impl<T> ScratchVec<T> {
    pub fn with_capacity(cap: usize) -> Result<Self, ScratchMemoryError> {
        // ...
        // Page alignment keeps allocations predictable and makes it safe to
        // reuse scratch buffers for SIMD-friendly workloads.
        let align = PAGE_SIZE_MIN.max(align_of::<T>());
        let layout = Layout::from_size_align(size, align)
            .map_err(|_| ScratchMemoryError::InvalidLayout)?;
        let raw = unsafe { alloc(layout) };
        // ...
    }
}

src/pool/node_pool.rs:44-114 — Contiguous arena with bitset free-list, O(1) allocate/free:

/// Pre-allocated node pool backed by a contiguous buffer and bitset.
///
/// The bitset tracks free slots (set bit = available), enabling O(1)
/// first-fit allocation via "find first set".
pub struct NodePoolType<const NODE_SIZE: usize, const NODE_ALIGNMENT: usize> {
    buffer: NonNull<u8>,
    len: usize,
    free: DynamicBitSet,
}

impl<...> NodePoolType<...> {
    pub fn init(node_count: u32) -> Self {
        // All memory allocated upfront
        let size = NODE_SIZE.checked_mul(node_count as usize)
            .expect("node buffer size overflow");
        let layout = Layout::from_size_align(size, NODE_ALIGNMENT)...;
        let raw = unsafe { alloc(layout) };
        // ...
    }

    pub fn acquire(&mut self) -> NonNull<u8> {
        let node_index = Self::find_first_set(&self.free)
            .unwrap_or_else(|| panic!("node pool exhausted"));
        self.free.unset(node_index);
        unsafe { NonNull::new_unchecked(self.buffer.as_ptr().add(offset)) }
    }
}

src/runtime.rs:570-704 — BufferPoolInner: Rc+UnsafeCell, single-threaded, fixed capacity:

struct BufferPoolInner {
    pool: UnsafeCell<NodePoolType<BUFFER_LEN_MAX, BUFFER_ALIGN>>,
    available: Cell<u32>,
    capacity: u32,
}

pub struct BufferPool(Rc<BufferPoolInner>);

impl BufferPool {
    pub fn new(capacity: usize) -> Self {
        let pool = NodePoolType::<BUFFER_LEN_MAX, BUFFER_ALIGN>::init(capacity as u32);
        Self(Rc::new(BufferPoolInner {
            pool: UnsafeCell::new(pool),
            available: Cell::new(capacity as u32),
            capacity: capacity as u32,
        }))
    }
}

src/scheduler/alloc.rs:1-44 — AllocGuard enforces zero-allocation hot paths:

//! Allocation tracking for detecting hot-path allocations.
//!
//! This module provides:
//! - Global allocation counting (allocs, deallocs, reallocs, bytes)
//! - `AllocGuard` for asserting regions are allocation-free
//! - Snapshot-based delta measurement
//!
//! ```rust,ignore
//! let guard = AllocGuard::new();
//! // ... hot path code ...
//! guard.assert_no_alloc(); // Panics if any allocations occurred
//! ```

Kingfisher: Standard Vec + Arc<Mutex> stats

../kingfisher/src/matcher.rs:255-282:

let raw_matches_scratch = Vec::new();
let user_data = UserData { raw_matches_scratch, input_len: 0 };

Kingfisher's raw_matches_scratch uses a standard Vec that grows dynamically via push(). Each reallocation copies to a new virtual address, creating new page mappings.

../kingfisher/src/matcher.rs:226-233 — Stats behind Arc<Mutex>:

impl<'a> Drop for Matcher<'a> {
    fn drop(&mut self) {
        if let Some(global_stats) = self.global_stats {
            let mut global_stats = global_stats.lock().unwrap();
            global_stats.update(&self.local_stats);
        }
    }
}

Go: Allocation churn + append() reallocation

Go's heap collector is non-moving, but allocation churn spreads live objects across many pages, and the runtime copies goroutine stacks to new locations as they grow. append() on slices triggers reallocation whenever capacity is exceeded, copying data to new virtual addresses. All of these patterns fragment the virtual address space.

Why this explains the measurements

scanner-rs pre-allocates all major data structures once at startup:

  • ScratchVec: page-aligned, fixed capacity — same pages reused every scan
  • NodePoolType: single contiguous buffer — one allocation, stable addresses
  • BufferPool: fixed-size chunk buffers — never reallocated

The TLB entries for these pages stay warm throughout the scan. Competitors grow collections dynamically and (in Go's case) churn through freshly mapped heap pages, creating new page mappings that must be resolved through expensive TLB walks. Result: 1.9x fewer dTLB misses than Kingfisher, 4.8x fewer than TruffleHog.
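The underlying allocation pattern — one upfront, page-aligned allocation that is reused for the whole scan — can be sketched with std's allocator API (the 4 KiB page size is an assumption; scanner-rs's PAGE_SIZE_MIN may differ):

```rust
use std::alloc::{alloc, dealloc, Layout};

const PAGE: usize = 4096; // assumed page size

/// Allocate a fixed-capacity, page-aligned buffer and report whether the
/// returned pointer is page-aligned. Real code keeps the buffer alive for
/// the entire scan, so the dTLB entries mapping these pages stay warm.
fn page_aligned_buffer_is_aligned(pages: usize) -> bool {
    let layout = Layout::from_size_align(pages * PAGE, PAGE).unwrap();
    unsafe {
        let ptr = alloc(layout);
        assert!(!ptr.is_null(), "allocation failed");
        let aligned = (ptr as usize) % PAGE == 0;
        dealloc(ptr, layout);
        aligned
    }
}

fn main() {
    // One upfront allocation; hot loops then reuse the same pages and the
    // same virtual-to-physical mappings for every chunk scanned.
    assert!(page_aligned_buffer_is_aligned(16));
}
```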


4.5 Compact Packed Metadata: Better L1 Cache Density

What we measured:

| Metric | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
| --- | --- | --- | --- | --- |
| L1D misses | 1,209,415,888 | 2,447,094,121 | 5,128,182,208 | 8,775,249,857 |
| Insns/L1D miss | 340.49 | 583.03 | 330.90 | 1,218.21 |
| L1D miss rate | 0.982% | 0.541% | 1.03% | 0.197% |

scanner-rs: 4-byte and 12-byte packed structs

src/engine/hit_pool.rs:82-101 — PairMeta: 4 bytes, 16 pairs per cache line:

/// Per-pair hot metadata, collocated for single-load access.
///
/// Packing `len` and `coalesced` into 4 bytes means a single 32-bit load
/// gives both fields. 16 consecutive pairs fit in one cache line.
#[derive(Clone, Copy)]
#[repr(C)]
struct PairMeta {
    len: u16,
    coalesced: u8,
    _pad: u8,
}

const _: () = assert!(std::mem::size_of::<PairMeta>() == 4);

src/engine/vectorscan_prefilter.rs:89-100 — RawPatternMeta: 12 bytes, 5 per cache line:

#[repr(C)]
#[derive(Clone, Copy, Debug)]
struct RawPatternMeta {
    rule_id: u32,
    match_width: u32,
    seed_radius: u32,
}

const _: () = assert!(std::mem::size_of::<RawPatternMeta>() == 12);

src/engine/scratch.rs:48-70 — DedupKey: 32 bytes aligned to AEGIS-128L absorption rate:

/// Packed dedup key for finding deduplication.
///
/// Uses `#[repr(C)]` with `bytemuck::Pod` to guarantee a fixed 32-byte layout
/// aligned to the AEGIS-128L absorption rate (32 bytes = 2 x 128-bit AES
/// blocks) with no padding.
#[repr(C)]
#[derive(Clone, Copy, bytemuck::Pod, bytemuck::Zeroable)]
struct DedupKey {
    file_id: u32,
    rule_id_with_variant: u32,
    span_start: u32,
    span_end: u32,
    root_hint_start: u64,
    root_hint_end: u64,
}

const _: () = assert!(std::mem::size_of::<DedupKey>() == 32);

Every hot-path struct has #[repr(C)] and a compile-time size assertion.

Go: 16-byte interface headers + pointer chasing

In Go, each interface value carries a 16-byte header (type pointer + data pointer). A list of 223 regexp.Regexp detector interfaces occupies ~3.5 KiB of headers alone — over 50 cache lines — before any pattern data is touched. Each pattern access requires pointer chasing through the interface header to the underlying data.
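The cache-line arithmetic behind these figures checks out directly:

```rust
/// Number of 64-byte cache lines needed to hold `bytes` of contiguous data.
fn cache_lines(bytes: usize) -> usize {
    bytes.div_ceil(64)
}

fn main() {
    const RULES: usize = 223;
    // scanner-rs: 12-byte packed RawPatternMeta, laid out contiguously.
    assert_eq!(RULES * 12, 2676);
    assert_eq!(cache_lines(RULES * 12), 42);
    // Go-style: 16-byte interface headers alone, before any pattern data.
    assert_eq!(RULES * 16, 3568);
    assert_eq!(cache_lines(RULES * 16), 56);
}
```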

Why this explains the measurements

scanner-rs packs 223 rules of pattern metadata into 223 x 12 = 2,676 bytes (~42 cache lines) with guaranteed sequential layout. The equivalent Go interface slice requires 50+ cache lines of headers plus pointer-chased data. The compact layout means scanner-rs touches fewer cache lines per rule lookup, yielding 2.0x fewer L1D misses than Kingfisher and 4.2x fewer than TruffleHog.


4.6 Work-Stealing Scheduler: Lower Backend Stalls

What we measured:

| Metric | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
| --- | --- | --- | --- | --- |
| Backend stalls | 64,097,407,942 | 141,440,571,268 | 230,734,465,747 | 663,045,657,519 |
| Backend stall % | 40.67% | 26.56% | 34.02% | 29.19% |
| Frontend stalls | 11,626,952,340 | 52,555,189,605 | 97,980,832,540 | 90,284,295,847 |
| Frontend stall % | 7.38% | 9.87% | 14.45% | 3.97% |

Note on stall rates: scanner-rs shows a higher backend stall percentage (40.67% vs 26–34% for competitors). Since scanner-rs executes 3.4x fewer total cycles, the backend stall percentage is amplified — a larger share of a smaller denominator. The absolute backend stall count (64B cycles) is still 2.2x lower than Kingfisher and 3.6x lower than TruffleHog.

scanner-rs: Chase-Lev deques + tiered idle + cache locality

src/scheduler/executor.rs:3-54 — Architecture:

//!                    ┌─────────────────────────────────────────┐
//!                    │              Executor                    │
//!                    │                                         │
//!  External ────────┼──► Injector ───┬────────────────────────┤
//!  Producers        │   (Crossbeam)  │                         │
//!                    │                ▼                         │
//!                    │   ┌─────────────────────────────────┐   │
//!                    │   │ Worker 0  │ Worker 1  │ Worker N│   │
//!                    │   │ ┌──────┐  │ ┌──────┐  │ ┌──────┐│   │
//!                    │   │ │Deque │◄─┼─►│Deque │◄─┼─►│Deque ││   │
//!                    │   │ │(LIFO)│  │ │(LIFO)│  │ │(LIFO)││   │
//!                    │   │ └──┬───┘  │ └──┬───┘  │ └──┬───┘│   │
//!                    │   │ ┌──▼────┐ │ ┌──▼────┐ │ ┌──▼────┐│   │
//!                    │   │ │Worker │ │ │Worker │ │ │Worker ││   │
//!                    │   │ │Ctx    │ │ │Ctx    │ │ │Ctx    ││   │
//!                    │   │ │+scratch│ │ │+scratch│ │ │+scratch││   │
//!                    │   └─────────┴───────────┴───────────┘   │
//!                    └─────────────────────────────────────────┘

src/scheduler/executor.rs:74-142 — ExecutorConfig:

pub struct ExecutorConfig {
    pub workers: usize,
    pub seed: u64,
    pub steal_tries: u32,
    pub spin_iters: u32,
    pub park_timeout: Duration,
    pub pin_threads: bool,
}

impl Default for ExecutorConfig {
    fn default() -> Self {
        Self {
            workers: 1,
            seed: 0x853c49e6748fea9b,
            steal_tries: 4,
            spin_iters: 200,
            park_timeout: Duration::from_micros(200),
            pin_threads: super::affinity::default_pin_threads(),
        }
    }
}

Key design points:

  • LIFO local push/pop maximizes temporal locality (just-spawned work reuses warm cache)
  • FIFO steal from remote workers takes the oldest work (cooled data, reduces contention)
  • Randomized steal-victim selection avoids correlated contention
  • Tiered idle: spin (200 iters) -> yield -> park (200us timeout)
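The tiered idle strategy above can be sketched with std primitives, using the constants from ExecutorConfig (simplified: a real executor would unpark workers when new work is injected rather than relying on the timeout alone):

```rust
use std::hint;
use std::thread;
use std::time::Duration;

/// Escalating idle loop: cheap spinning first, then a scheduler yield,
/// then a timed park. `has_work` is polled between escalation tiers.
fn idle_wait(mut has_work: impl FnMut() -> bool) {
    // Tier 1: spin (spin_iters = 200) — cheapest, keeps the core hot.
    for _ in 0..200 {
        if has_work() {
            return;
        }
        hint::spin_loop();
    }
    // Tier 2: yield to the OS scheduler.
    thread::yield_now();
    if has_work() {
        return;
    }
    // Tier 3: park with timeout (park_timeout = 200us) so the worker
    // wakes periodically even without an explicit unpark.
    thread::park_timeout(Duration::from_micros(200));
}

fn main() {
    let mut calls = 0;
    idle_wait(|| {
        calls += 1;
        calls >= 3
    });
    // Work appeared on the third poll, during the spin tier.
    assert_eq!(calls, 3);
}
```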

TruffleHog: Goroutine pools with channel dispatch

../trufflehog/pkg/engine/engine.go:676-703:

func (e *Engine) startDetectorWorkers(ctx context.Context) {
    numWorkers := e.concurrency * e.detectorWorkerMultiplier

    for worker := 0; worker < numWorkers; worker++ {
        e.wgDetectorWorkers.Add(1)
        go func() {
            ctx := context.WithValue(ctx, "detector_worker_id", common.RandomID(5))
            defer common.Recover(ctx)
            defer e.wgDetectorWorkers.Done()
            e.detectorWorker(ctx)
        }()
    }
}

Workers consume from shared channels. Go's scheduler may migrate goroutines between OS threads, causing unpredictable cache invalidation.

Gitleaks: Semaphore-bounded goroutines

../gitleaks/detect/detect.go:99-130:

Sema *semgroup.Group

// ...
Sema: semgroup.NewGroup(ctx, 40),

Gitleaks limits concurrency to 40 goroutines via a semaphore group. There is no work-stealing — each goroutine processes its assigned fragment independently. No locality optimization.

Kingfisher: Tokio runtime

../kingfisher/src/main.rs:111-117:

let runtime = Builder::new_multi_thread()
    .worker_threads(num_jobs)
    .enable_all()
    .build()
    .context("Failed to create Tokio runtime")?;

Kingfisher uses Tokio's multi-threaded runtime. While Tokio does have work-stealing, it is optimized for async I/O workloads, not CPU-bound scanning. The async overhead (future state machines, waker registration) adds instruction count for compute-only tasks.

Why this explains the measurements

scanner-rs's LIFO-local scheduling keeps recently spawned tasks on the same core where their input data is still in L1/L2 cache. Competitors either use Go's runtime (goroutine migration between OS threads) or Tokio (async overhead for CPU-bound work). In absolute terms: 2.2x fewer backend stall cycles than Kingfisher, 3.6x fewer than TruffleHog, 10.3x fewer than Gitleaks.


4.7 Anchor-First Scanning: Fewer Total Instructions

What we measured:

| Metric | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
| --- | --- | --- | --- | --- |
| Instructions | 411,789,892,684 | 1,426,719,985,424 | 1,696,903,305,992 | 10,690,140,886,959 |
| vs scanner-rs | 1.0x | 3.5x | 4.1x | 26.0x |

#### scanner-rs: Prefilter seeds narrow windows for regex

The Vectorscan prefilter identifies literal anchor hits in a single SIMD pass over the entire buffer. Only the narrow windows around anchor hits are fed to the full regex engine. Most of the input buffer is never touched by regex.

`src/engine/buffer_scan.rs:1-16` — Pipeline:

```rust
// 1. Prefilter — Run Vectorscan on raw bytes to collect hit windows
// 2. Normalize — Sort, merge adjacent/overlapping windows
// 3. Two-phase confirm — Re-check narrow seed with memmem before expanding
// 4. Validate — Run full regex only within resulting windows
```

`src/engine/core.rs:30-44` — Only windows around anchor hits get regex:

```rust
// Run Vectorscan prefilter on root buffer to populate touched pairs.
// Process work items:
//   - ScanBuf: validate regexes in prefilter windows
//   - DecodeSpan: decode, then enqueue ScanBuf for decoded output
```
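The prefilter-then-window idea can be sketched in std-only Rust. This is a minimal sketch — a naive literal search and a hypothetical `WINDOW` constant stand in for Vectorscan and the real window sizing — but the shape is the same: seed windows from anchor hits, merge them, and validate only those bytes:

```rust
// Sketch of anchor-first scanning (assumed simplification of the
// real Vectorscan + regex pipeline): a cheap literal search seeds
// narrow windows, overlapping windows are merged, and only the
// merged windows would be handed to the full regex engine.

const WINDOW: usize = 8; // hypothetical context bytes kept per hit

/// Collect [start, end) windows around every occurrence of `anchor`.
fn anchor_windows(buf: &[u8], anchor: &[u8]) -> Vec<(usize, usize)> {
    let mut wins = Vec::new();
    let mut from = 0;
    while let Some(pos) = buf[from..]
        .windows(anchor.len())
        .position(|w| w == anchor)
    {
        let hit = from + pos;
        let start = hit.saturating_sub(WINDOW);
        let end = (hit + anchor.len() + WINDOW).min(buf.len());
        wins.push((start, end));
        from = hit + 1;
    }
    wins
}

/// Merge adjacent/overlapping windows so each byte is validated once.
fn merge(mut wins: Vec<(usize, usize)>) -> Vec<(usize, usize)> {
    wins.sort_unstable();
    let mut out: Vec<(usize, usize)> = Vec::new();
    for (s, e) in wins {
        match out.last_mut() {
            Some(last) if s <= last.1 => last.1 = last.1.max(e),
            _ => out.push((s, e)),
        }
    }
    out
}

fn main() {
    let buf = b"x AKIA1 yyyyyyyyyyyyyyyy AKIA2";
    for (s, e) in merge(anchor_windows(buf, b"AKIA")) {
        println!("validate bytes {s}..{e}");
    }
}
```

The bytes between the two windows are never seen by the validator — on real inputs, where anchors are rare, that is most of the buffer.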

#### TruffleHog: Every detector runs regex on every matched span

`../trufflehog/pkg/engine/engine.go:798-819` — After the Aho-Corasick prefilter, each matching detector runs its own regex:

```go
matchingDetectors := e.AhoCorasickCore.FindDetectorMatches(decoded.Chunk.Data)
for _, detector := range matchingDetectors {
    // Each detector internally runs regex on the full chunk
    e.detectableChunksChan <- detectableChunk{
        chunk:    *decoded.Chunk,
        detector: detector,
        // ...
    }
}
```

#### Gitleaks: All rules against full input

`../gitleaks/detect/detect.go:327-347` — Sequential rule loop, full-input regex:

```go
for _, rule := range d.Config.Rules {
    // ...
    findings = append(findings, d.detectRule(fragment, currentRaw, rule, encodedSegments)...)
}
```

Each `detectRule` call runs `r.Regex.FindAllStringIndex(currentRaw, -1)` — regex over the entire fragment for every matching rule. There is no window narrowing.

#### Why this explains the measurements

scanner-rs skips most of the input buffer entirely. The Vectorscan DFA identifies candidate regions in a single pass; only narrow windows around hits enter the regex engine. Competitors run regex over the full input for each matched rule/detector:

- Kingfisher: 3.5x more instructions (Vectorscan + per-rule regex, no window narrowing)
- TruffleHog: 4.1x more instructions (per-detector regex on full chunks)
- Gitleaks: 26x more instructions (all rules x full input, Go NFA regex)

This is the single largest performance differentiator. All other optimizations (cache alignment, pools, scratch memory) would matter less if the scanner were executing 4-26x more work to begin with.


### 4.8 I/O Hints and Sequential Access

**What we measured:**

Cold-to-warm wall time ratios in filesystem mode. A large ratio means the scanner is I/O-efficient (fast once data is cached); a ratio near 1.0 means the scanner is CPU-bound (I/O was never the bottleneck).

| Repo | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
|---|---|---|---|---|
| node | 10.1x | 4.0x | 1.3x | 1.1x |
| vscode | 3.3x | 1.5x | 1.1x | 0.9x |
| linux | 13.1x | 6.8x | 1.0x | 1.1x |
| rocksdb | 1.1x | 1.1x | 1.3x | 1.0x |
| tensorflow | 9.5x | 3.0x | 1.1x | 1.0x |
| Babylon.js | 2.1x | 1.5x | 1.1x | 1.0x |
| gcc | 19.8x | 8.7x | 1.4x | 1.0x |
| jdk | 13.8x | 4.1x | 1.9x | 1.3x |
| **Average** | **9.1x** | **3.8x** | **1.3x** | **1.0x** |

scanner-rs speeds up 9.1x on average once the page cache is warm; Gitleaks sees essentially no change (1.0x). Kingfisher sits between them at 3.8x. The Go scanners (TruffleHog, Gitleaks) show almost no cold/warm delta — they are purely CPU-bound, so I/O latency was never their bottleneck.

#### scanner-rs: Explicit prefetch hints on every file and mmap

scanner-rs calls `posix_fadvise(POSIX_FADV_SEQUENTIAL)` on every file descriptor and `madvise(MADV_SEQUENTIAL)` on every mmap'd region, consistently across all I/O paths.

`src/scheduler/local_fs_owner.rs:1044-1056` — `hint_sequential()` for local filesystem reads:

```rust
/// Advise the kernel that this file will be read sequentially.
///
/// On Linux this doubles the default readahead window and avoids
/// random-access penalties. Advisory and non-blocking; errors ignored.
#[cfg(target_os = "linux")]
fn hint_sequential(file: &File, len: u64) {
    use std::os::unix::io::AsRawFd;
    unsafe {
        let _ = libc::posix_fadvise(
            file.as_raw_fd(),
            0,
            len as libc::off_t,
            libc::POSIX_FADV_SEQUENTIAL,
        );
    }
}
```

`src/git_scan/runner_exec.rs:517-534` — `advise_sequential()` for pack file mmaps:

```rust
pub(super) fn advise_sequential(file: &File, reader: &Mmap) {
    unsafe {
        #[cfg(target_os = "linux")]
        let _ = libc::posix_fadvise(file.as_raw_fd(), 0, 0, libc::POSIX_FADV_SEQUENTIAL);
        #[cfg(not(target_os = "linux"))]
        let _ = file;
        let _ = libc::madvise(
            reader.as_ptr() as *mut libc::c_void,
            reader.len(),
            libc::MADV_SEQUENTIAL,
        );
    }
}
```

The same `advise_sequential` pattern is applied in two additional locations:

- `src/git_scan/pack_io.rs:421-436` — Pack cache entries: `posix_fadvise` + `madvise` on every pack file mmap
- `src/git_scan/spill_arena.rs:266-283` — Spill arena: `posix_fadvise` + `madvise` on spill file mmaps

`src/scheduler/local_fs_owner.rs:38-54` — Overlap-carry I/O pattern eliminates re-reading overlap bytes:

```rust
// # I/O Pattern: Overlap Carry
//
// Instead of seeking back for each chunk's overlap:
// 1. Acquire ONE buffer per file (blocking)
// 2. Read sequentially, carry overlap bytes forward via `copy_within`
// 3. Eliminates: seeks, re-reading overlap from kernel, per-chunk pool churn
```
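The overlap-carry loop can be sketched over any `Read` source. This is a minimal sketch, not the pooled implementation — `scan_with_overlap` and its parameters are hypothetical names — but it shows the core trick: the tail of each chunk is copied to the front of the same buffer instead of seeking back:

```rust
use std::io::Read;

/// Sketch of the overlap-carry loop (hypothetical function; a
/// simplification of the pooled version): one reusable buffer per
/// file. The last `overlap` bytes of each chunk are copied to the
/// front with `copy_within`, so matches spanning a chunk boundary
/// stay visible without seeks or re-reads from the kernel.
fn scan_with_overlap<R: Read>(
    mut src: R,
    chunk: usize,
    overlap: usize,
    mut scan: impl FnMut(&[u8]),
) -> std::io::Result<()> {
    assert!(overlap < chunk);
    let mut buf = vec![0u8; chunk];
    let mut carried = 0; // overlap bytes carried from the previous chunk
    loop {
        let n = src.read(&mut buf[carried..])?;
        if n == 0 {
            // Carried bytes were already scanned as the previous tail.
            return Ok(());
        }
        let filled = carried + n;
        scan(&buf[..filled]);
        // Carry the tail forward instead of seeking back.
        let keep = overlap.min(filled);
        buf.copy_within(filled - keep..filled, 0);
        carried = keep;
    }
}

fn main() -> std::io::Result<()> {
    let data = std::io::Cursor::new(b"abcdefghijklmnop".to_vec());
    // Each emitted chunk repeats the previous chunk's last 3 bytes.
    scan_with_overlap(data, 8, 3, |c| {
        println!("{}", String::from_utf8_lossy(c));
    })
}
```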

#### Kingfisher: Bare mmap without hints

Kingfisher uses `memmap2::Mmap::map()` without any `madvise` or `fadvise` calls. The kernel applies its default readahead policy (typically 128 KiB on Linux).

`../kingfisher/src/decompress.rs` — Standard mmap, no advisory hints:

```rust
let mmap = unsafe { Mmap::map(&file)? };
// No madvise or fadvise calls follow
```

#### TruffleHog: Standard Go buffered I/O

TruffleHog uses Go's standard `bufio` and `io.ReadAll` for file I/O. Go does not expose `posix_fadvise` or `madvise` in its standard library, so the kernel uses its default readahead policy.

#### Gitleaks: Go bufio.Scanner

Gitleaks reads files with `bufio.Scanner`. Like TruffleHog, it issues no explicit prefetch hints; the kernel applies default readahead.

#### Why this explains the measurements

On Linux, `POSIX_FADV_SEQUENTIAL` doubles the kernel's default readahead window (from ~128 KiB to ~256 KiB or more). For sequential scans over large files, this reduces the number of I/O round-trips by fetching more data per read. `MADV_SEQUENTIAL` does the same for mmap'd regions and additionally signals the kernel that already-scanned pages can be dropped aggressively, reducing page-cache pressure.

The benchmark storage is EBS (Elastic Block Store) — network-attached storage that presents as NVMe on EC2. Each I/O round-trip on EBS carries higher latency than local NVMe, so reducing round-trip count via larger readahead windows has proportionally more impact.

**Honest caveat:** the cold/warm ratio captures all I/O design choices together — fadvise/madvise hints, the overlap-carry read pattern, work-stealing I/O pipelining, and buffer pool reuse. We have not isolated the individual contribution of prefetch hints. We can say that scanner-rs is the only scanner making explicit prefetch hints, and the cold/warm ratios are consistent with this mattering, but we cannot attribute the full 9.1x ratio to fadvise alone. These numbers would also likely look different on local NVMe, where device-level prefetching is already aggressive.


### 4.9 Git Scanning Architecture: Custom Pack Pipeline

Sections 4.1–4.8 cover the detection engine — what happens after bytes reach the scanner. This section covers what happens before: how each scanner extracts those bytes from a git repository. Git-mode speedups (1.3–5.8x vs Kingfisher, 10–60x vs the Go scanners) are a headline result, and the git object pipeline is a major contributor.

**What we measured:**

The git warm-cache speedup table from Section 2.3 captures the combined effect:

| Repo | vs Kingfisher | vs TruffleHog | vs Gitleaks |
|---|---|---|---|
| node | 5.8x | 35.6x | 30.0x |
| vscode | 2.3x | 13.1x | 8.3x |
| linux | 2.3x | 10.6x | 7.8x |
| rocksdb | 2.3x | 10.8x | 6.7x |
| tensorflow | 2.4x | 16.0x | 10.5x |
| Babylon.js | 1.3x | 11.6x | 11.1x |
| gcc | 1.9x | 12.4x | 60.0x |
| jdk | 1.7x | 18.3x | 16.1x |

These speedups reflect both the detection engine advantages (Sections 4.1–4.7) and the git pipeline differences described below. We cannot separate the two from wall-time data alone.

#### scanner-rs: Custom pure-Rust pack parser with MIDX indexing

No external git I/O dependencies. scanner-rs implements its own MIDX parsing, pack inflate, and commit-graph walking. The only external git dependency is `gix_commitgraph`, used for commit-graph file format parsing (not object access).

- `src/git_scan/midx.rs` — Zero-copy multi-pack index parser
- `src/git_scan/pack_inflate.rs` — Custom zlib decompression and delta parsing
- `src/git_scan/commit_walk.rs:35` — Uses `gix_commitgraph::Position` for generation-ordered traversal

Two scan modes (`src/git_scan/runner.rs:336-343`):

```rust
pub enum GitScanMode {
    /// Current diff-history pipeline (tree diff + spill + mapping + pack plan).
    DiffHistory,
    /// ODB-blob fast path (unique-blob walk + pack-order scan).
    #[default]
    OdbBlobFast,
}
```

The ODB-blob pipeline (`src/git_scan/runner_odb_blob.rs:1-30`) — the default fast path — has four stages:

1. **Blob introduction** — walks the commit graph and emits (oid, pack_id, path) candidates for each unique blob. Workers share an `AtomicSeenSets` bitmap for deduplication, each with their own `ObjectStore` and tree cache.
2. **Pack planning** — candidates are bucketed by pack id, then a per-pack plan (topologically sorted decode order including delta base dependencies) is built on the runner thread.
3. **Pack execution** — plans are dispatched as scheduler tasks. The strategy selector chooses worker width (1 for serial, `pack_exec_workers` for parallel).
4. **Loose scan** — loose object candidates that did not map to any pack are scanned after all pack plans complete.
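The shared deduplication bitmap from stage 1 can be sketched with std atomics. This is a minimal sketch — `SeenBits` is a hypothetical name and the real `AtomicSeenSets` may differ in layout — but the test-and-set trick is the core of lock-free dedup:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Sketch of a lock-free "seen" bitmap (hypothetical type, assumed
/// simplification of `AtomicSeenSets`): workers test-and-set one bit
/// per object slot, so each unique blob is claimed by exactly one
/// worker with no mutex.
struct SeenBits {
    words: Vec<AtomicU64>,
}

impl SeenBits {
    fn new(capacity: usize) -> Self {
        let words = (capacity + 63) / 64;
        SeenBits {
            words: (0..words).map(|_| AtomicU64::new(0)).collect(),
        }
    }

    /// Returns true only for the first caller to mark `idx`.
    fn try_claim(&self, idx: usize) -> bool {
        let bit = 1u64 << (idx % 64);
        // fetch_or atomically sets the bit and returns the prior word,
        // so exactly one concurrent caller observes the bit unset.
        let prev = self.words[idx / 64].fetch_or(bit, Ordering::Relaxed);
        prev & bit == 0
    }
}

fn main() {
    let seen = SeenBits::new(1 << 10);
    assert!(seen.try_claim(42)); // first worker claims the blob
    assert!(!seen.try_claim(42)); // every later attempt is a duplicate
    println!("dedup ok");
}
```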

Commit traversal (`src/git_scan/commit_walk.rs:10-17`):

```rust
// The introduced-by walk mirrors `git rev-list <tip> ^<watermark>` using two
// generation-ordered heaps: an interesting frontier (commits reachable from
// `tip`) and an uninteresting frontier (commits reachable from `watermark`).
// Before emitting the highest-generation interesting commit, the algorithm
// advances the uninteresting heap down to that generation so any commit
// reachable from the watermark is marked and excluded.
```

Ordering is deterministic, and heap size is bounded by `CommitWalkLimits::max_heap_entries`.

Tree diff (`src/git_scan/tree_diff.rs:1-43`):

OID-only comparison — O(n) in the number of changed entries. Unchanged subtrees are skipped entirely, with no recursion and no blob reads:

```rust
// - O(n) where n is the number of changed entries
// - Skips unchanged subtrees entirely (no recursion)
// - No blob reads (OID comparison only)
// - Fixed-size stack allocation (bounded depth)
// - Stack is reused across diff_trees calls (no per-call allocation)
```
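The OID-only comparison can be sketched as a merge over two name-sorted entry lists. The types and the `diff_trees` signature here are hypothetical — the real parser works on raw tree objects and recurses into changed subtrees — but the key property survives: equal OIDs mean an entire subtree is skipped without reading a single blob:

```rust
/// Sketch of OID-only tree diffing (hypothetical types; the real
/// implementation parses raw git tree objects). Entries are sorted by
/// name, as in git trees; equal OIDs mean the subtree or blob is
/// unchanged and is skipped without recursion or blob reads.
struct Entry {
    name: &'static str,
    oid: [u8; 20],
    is_tree: bool,
}

fn diff_trees(old: &[Entry], new: &[Entry], changed: &mut Vec<&'static str>) {
    let (mut i, mut j) = (0, 0);
    while i < old.len() || j < new.len() {
        match (old.get(i), new.get(j)) {
            (Some(a), Some(b)) if a.name == b.name => {
                // Same name on both sides: identical OID means the
                // whole subtree (or blob) is unchanged — nothing to do.
                // A real implementation would recurse into changed
                // subtrees here; omitted for brevity.
                if a.oid != b.oid && !b.is_tree {
                    changed.push(b.name);
                }
                i += 1;
                j += 1;
            }
            (Some(a), Some(b)) if a.name < b.name => i += 1, // deleted
            (_, Some(b)) => {
                changed.push(b.name); // added entry
                j += 1;
            }
            (Some(_), None) => i += 1, // deleted at tail
            (None, None) => unreachable!(),
        }
    }
}

fn main() {
    let blob = |name, v: u8| Entry { name, oid: [v; 20], is_tree: false };
    let old = [blob("config.yml", 1), blob("readme", 3)];
    let new = [blob("config.yml", 2), blob("readme", 3), blob("secrets.env", 4)];
    let mut changed = Vec::new();
    diff_trees(&old, &new, &mut changed);
    println!("{changed:?}");
}
```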

MIDX lookups (`src/git_scan/midx.rs`):

Zero-copy multi-pack index with O(log N) object lookup via fanout-bucketed binary search. Resolves object IDs to (pack_id, offset) pairs without scanning pack files.
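The fanout-bucketed lookup can be sketched over an in-memory OID array. The layout here is an assumption based on the standard git index format — the real parser is zero-copy over the mmap'd MIDX file — but the search structure is the same: the fanout narrows to one first-byte bucket, then a binary search finds the OID:

```rust
/// Sketch of fanout-bucketed OID lookup (assumed layout, following
/// the standard git index fanout convention): `fanout[b]` holds the
/// count of OIDs whose first byte is <= b, so the bucket for a given
/// OID is `fanout[b-1]..fanout[b]`. A binary search inside that
/// bucket completes the O(log N) lookup without touching pack files.
fn lookup(fanout: &[u32; 256], oids: &[[u8; 20]], want: &[u8; 20]) -> Option<usize> {
    let b = want[0] as usize;
    let lo = if b == 0 { 0 } else { fanout[b - 1] as usize };
    let hi = fanout[b] as usize;
    oids[lo..hi].binary_search(want).ok().map(|i| lo + i)
}

fn main() {
    // Tiny synthetic index: four OIDs sorted lexicographically.
    let mut oids = [[0u8; 20]; 4];
    oids[1][0] = 0x01;
    oids[2][0] = 0x01;
    oids[2][19] = 0x09;
    oids[3][0] = 0xff;
    let mut fanout = [0u32; 256];
    fanout[0] = 1; // one OID with first byte 0x00
    for b in 1..0xff {
        fanout[b] = 3; // cumulative: three OIDs with first byte <= b
    }
    fanout[0xff] = 4;
    println!("{:?}", lookup(&fanout, &oids, &oids[2]));
}
```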

Pack planning (`src/git_scan/pack_plan.rs:1-20`):

Per-pack plans with a topological sort respecting delta dependencies. Delta chain depth is bounded at 64 (`pack_plan.rs:39`: `DEFAULT_MAX_DELTA_DEPTH`). Plans are sorted by offset within each pack for sequential I/O.
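The offset-sorted, dependency-respecting decode order can be sketched as a DFS over hypothetical entries. This is a simplification — the real planner also bounds chain depth at 64 and handles cross-pack bases — but it shows how a delta base is hoisted ahead of its dependent while everything else stays in sequential offset order:

```rust
/// Sketch of per-pack plan ordering (hypothetical structs; the real
/// planner also enforces a max delta depth and handles thin packs):
/// entries are visited by ascending pack offset for sequential I/O,
/// but a delta entry must decode after its base, so bases are
/// emitted first via a small DFS.
struct PackEntry {
    offset: u64,
    base: Option<usize>, // index of the delta base, if any
}

fn decode_order(entries: &[PackEntry]) -> Vec<usize> {
    let mut by_offset: Vec<usize> = (0..entries.len()).collect();
    by_offset.sort_by_key(|&i| entries[i].offset);
    let mut emitted = vec![false; entries.len()];
    let mut order = Vec::with_capacity(entries.len());
    for &i in &by_offset {
        emit(i, entries, &mut emitted, &mut order);
    }
    order
}

fn emit(i: usize, entries: &[PackEntry], emitted: &mut [bool], order: &mut Vec<usize>) {
    if emitted[i] {
        return;
    }
    if let Some(b) = entries[i].base {
        emit(b, entries, emitted, order); // base before its dependent
    }
    emitted[i] = true;
    order.push(i);
}

fn main() {
    // Entry 1 is a delta whose base (entry 0) sits at a higher
    // offset, forcing the base to be hoisted ahead of it.
    let entries = [
        PackEntry { offset: 100, base: None },
        PackEntry { offset: 50, base: Some(0) },
        PackEntry { offset: 10, base: None },
    ];
    println!("{:?}", decode_order(&entries));
}
```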

Multi-tier parallelism (`src/git_scan/runner.rs:167-303`):

- Blob introduction workers: 1–8 threads (lines 186-190), with `AtomicSeenSets` for lock-free deduplication
- Pack execution workers: auto-scaled by repository size — a multiplier of 2–4x cores depending on in-pack object count (lines 288-303)
- Symmetric threads: 2 per worker for I/O overlap (lines 167-174)

Spill arena (`src/git_scan/spill_arena.rs:1-24`):

An mmap-backed append-only arena with a dual-mapping strategy: `MmapMut` for the writer, `Arc<Mmap>` for readers. `posix_fadvise` + `madvise(MADV_SEQUENTIAL)` are applied to the spill file.

#### Kingfisher: gix pure-Rust library

Dependency: `gix` v0.73 (`Cargo.toml:68`). All git object I/O goes through the gix ODB abstraction.

Object enumeration (`src/git_repo_enumerator.rs:201-210`):

```rust
for oid_result in odb
    .iter()
    .context("Failed to iterate object database")?
    .with_ordering(Ordering::PackAscendingOffsetThenLooseLexicographical)
{
    let oid = match oid_result {
        Ok(oid) => oid,
        // ...
    };
    let hdr = match odb.header(oid) {
```

Kingfisher iterates ALL objects via `odb.iter()` with `PackAscendingOffsetThenLooseLexicographical` ordering. This achieves pack-order access similar to scanner-rs, through gix's abstraction rather than a custom parser.

Introduced-blob discovery (`src/git_metadata_graph.rs:288-354`):

Kingfisher builds a `petgraph` DAG of commits, walks from the root commits using a worklist, and maintains a per-commit `SeenObjectSet` inherited from parents. For each commit, it calls `visit_tree()` to discover blobs new to that commit:

```rust
let mut seen_sets: Vec<Option<SeenObjectSet>> = vec![None; num_commits];
let mut blobs_introduced: Vec<IntroducedBlobs> = vec![SmallVec::new(); num_commits];
// ...
while let Some((_, commit_idx)) = commit_worklist.pop() {
    let mut seen = seen_sets[commit_idx.index()].take().unwrap();
    // ...
    visit_tree(repo, &mut symbols, repo_index, /* ... */ &mut seen, introduced, /* ... */)?;
```

Parallelism: rayon `into_par_iter()` for per-blob scanning, with thread-local repo handles via `repo_sync.to_thread_local()`.

Delta resolution: abstracted by gix — not visible in Kingfisher application code. No `fadvise`/`madvise` calls; gix uses bare `Mmap::map()`.

#### TruffleHog: git CLI subprocess

Traversal (`pkg/gitparse/gitparse.go:247-269`):

```go
args := []string{
    "-C", source,
    "log",
    "--patch",
    "--full-history",
    "--date=iso-strict",
    "--pretty=fuller",
    "--notes",
}
// ...
cmd := exec.CommandContext(ctx, "git", args...)
```

TruffleHog shells out to `git log --patch --full-history`. All object decompression, delta resolution, and tree traversal is delegated to the git CLI process.

It scans diff hunks only, parsing the unified diff output from the `git log` pipe. It does NOT scan full file content — only the lines shown in patch output.

Binary files: fetched via `git cat-file blob`.

Parallelism: sequential commit processing from the pipe, with per-fragment concurrency via `semaphore.Weighted` at `runtime.NumCPU()`.

#### Gitleaks: git CLI subprocess + go-gitdiff parser

Traversal (`sources/git.go:93-94`):

```go
cmd = exec.CommandContext(ctx, "git", "-C", sourceClean, "log", "-p", "-U0",
    "--full-history", "--all", "--diff-filter=tuxdb")
```

Gitleaks shells out to `git log -p -U0 --full-history --all`.

It scans added lines only (`sources/git.go:394-402`):

```go
for _, textFragment := range gitdiffFile.TextFragments {
    fragment := Fragment{
        Raw: textFragment.Raw(gitdiff.OpAdd),
        // ...
    }
```

`gitdiff.OpAdd` extracts only the added lines from the unified diff output; full file content and deleted lines are never scanned.

Binary files: fetched via `git cat-file blob`.

Parallelism: sequential commits from the pipe, with per-fragment concurrency via `semgroup`.

#### Summary table

| Dimension | scanner-rs | Kingfisher | TruffleHog / Gitleaks |
|---|---|---|---|
| Git access | Custom pack parser | gix library (v0.73) | git CLI subprocess |
| What's scanned | Full blob (unique set) | Full blob (introduced) | Diff hunks only |
| Object discovery | MIDX O(log N) | Full ODB iteration | git log pipe |
| Parallelism | Multi-tier, auto-scaled | rayon per-blob | Sequential commits |
| Delta resolution | Custom, bounded cache | gix-abstracted | git CLI |
| I/O optimization | fadvise+madvise on mmaps | Bare mmap | CLI pipe |

#### Why this explains the measurements

1. **Process spawn overhead.** TruffleHog and Gitleaks spawn a `git log` subprocess; all object decompression, delta resolution, and tree traversal goes through the single-threaded git CLI. scanner-rs eliminates IPC by reading pack files directly. Kingfisher also avoids subprocess overhead by using gix in-process.

2. **Pack-order decode.** scanner-rs builds per-pack plans sorted by offset → sequential I/O, cache-line friendly, fadvise effective. Kingfisher's `odb.iter()` also uses `PackAscendingOffset`, achieving similar ordering through gix. The Go scanners process in commit-history order, which is effectively random relative to pack layout.

3. **Unique-blob dedup.** scanner-rs and Kingfisher deduplicate at the blob OID level — each unique blob is decoded and scanned exactly once. The Go scanners process per-commit diffs, so if the same change appears in multiple branches, it may be scanned more than once.

4. **Multi-tier parallelism.** scanner-rs has independent parallelism at blob introduction (atomic seen-sets, 1–8 workers) and pack execution (auto-scaled workers with symmetric I/O threads). Kingfisher uses rayon per-blob parallelism. The Go scanners are sequential at the commit level, with only per-fragment concurrency.

5. **Full-blob vs diff tradeoff (honest caveat).** scanner-rs and Kingfisher scan MORE total data than TruffleHog and Gitleaks: full-blob scanning reads the complete content of every unique blob, while diff scanning reads only added/changed lines. scanner-rs compensates with the detection engine advantages from Sections 4.1–4.7. The full-blob approach also provides complete file context for multi-line pattern matching that diff-only scanners cannot perform.

6. **Attribution caveat.** Git-mode speedups reflect both the detection engine (Sections 4.1–4.7) and the git pipeline. We do not have isolated perf counter data for the git pipeline alone. The 1.3–5.8x advantage over Kingfisher — which shares the full-blob scanning approach — is more directly attributable to the pipeline differences described here, since both scanners run the same conceptual workload (decode every unique blob, scan it).


## 5. The Memory Tradeoff

scanner-rs deliberately trades memory for speed. Every design decision in Section 4 contributes to higher RSS:

| Design Decision | Memory Cost |
|---|---|
| Per-worker `ScratchVec` (page-aligned, fixed capacity) | N workers x scratch size |
| Per-worker `VsScratch` (Vectorscan scratch space) | N workers x Vectorscan scratch |
| Per-worker `BufferPool` (8 MiB fixed-size chunks) | N workers x N buffers x 8 MiB |
| `NodePoolType` (contiguous arena, pre-allocated) | Full capacity allocated upfront |
| Cache-line padding (`CachePaddedAtomicU64`) | 64 bytes per counter (vs 8 bytes unpadded) |

### Memory comparison

| Repo | scanner-rs | Kingfisher | TruffleHog | Gitleaks | scanner-rs / avg(others) |
|---|---|---|---|---|---|
| node | 5.5 GiB | 2.3 GiB | 1.7 GiB | 1.6 GiB | 2.9x |
| vscode | 5.4 GiB | 2.1 GiB | 1.6 GiB | 1.3 GiB | 3.2x |
| linux | 22.9 GiB | 8.1 GiB | 8.3 GiB | 7.2 GiB | 2.9x |
| rocksdb | 2.8 GiB | 1.6 GiB | 403 MiB | 403 MiB | 3.5x |
| tensorflow | 7.2 GiB | 2.4 GiB | 1.8 GiB | 1.4 GiB | 3.9x |

### Why this is acceptable

1. **Memory is cheap; CPU cycles are not.** On modern cloud instances, memory is provisioned in fixed tiers. A scanner that uses 5 GiB vs 2 GiB fits on the same instance tier but finishes 2-60x faster.

2. **Pre-allocation eliminates allocation latency.** Every malloc/free in the hot path is a potential TLB miss, page fault, or mmap syscall. By pre-allocating at startup, scanner-rs converts those runtime costs into a one-time startup cost.

3. **Fixed capacity makes allocation-freedom verifiable.** `AllocGuard::assert_no_alloc()` can check that hot paths are truly allocation-free. Dynamic allocation makes this impossible to guarantee.

4. **Memory scales with worker count, not input size.** The memory footprint is proportional to N_workers x scratch_size, not to the size of the repository being scanned. For a given machine configuration, memory usage is predictable.


## 6. Methodology

### 6.1 Benchmark Design

- **128 total runs:** 8 repositories x 2 modes (git, filesystem) x 2 cache states (cold, warm)
- **Cold cache:** `sync && echo 3 > /proc/sys/vm/drop_caches` + 2s settle
- **Warm cache:** throwaway run first, then measured second run
- **Offline validation only:** no live HTTP checks for any scanner
- **Archive scanning:** enabled for all scanners
- **Decode depth:** 2 for scanner-rs/Gitleaks, default for Kingfisher/TruffleHog

### 6.2 Scanner Versions

| Scanner | Version/Commit |
|---|---|
| scanner-rs | e5d217c |
| Kingfisher | 88d3f78 |
| TruffleHog | 6961f2bac |
| Gitleaks | ca20267 |

### 6.3 Rule Set Normalization

- scanner-rs: 223 rules
- Kingfisher: 277 default rules (superset)
- TruffleHog: filtered to 98 matched detectors via `--include-detectors`
- Gitleaks: custom TOML config with 222 scanner-rs-matched rules (1 rule unmatched: `vault-service-token-legacy`)

scanner-rs's higher finding counts are primarily due to missing false-positive filters (entropy gates, safelists, confidence scoring) rather than rule coverage differences. These filters are planned additions.

### 6.4 perf stat Design

- **Machine:** same ARM Graviton3 as the benchmarks
- **Repo:** vscode (git mode, warm cache) — representative mid-size workload
- **Methodology:** 1 warmup + 1 measured run per (scanner, event group)
- **Event groups:** 4 groups x 6 events, time-multiplexed by the kernel
- **perf_event_paranoid:** 2 (user-space events, `:u` suffix)
- **Events measured:** cycles, instructions, L1D loads/misses, L1I loads/misses, L2 refills/writebacks/allocations, branch predictions/misses, frontend/backend stalls, dTLB loads/misses/walks, iTLB loads/misses, memory accesses

## 7. Appendix: Finding Count Comparison

Differences in finding counts reflect different rule sets, matching strategies, and deduplication approaches — not bugs.

| Repo | Mode | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
|---|---|---|---|---|---|
| node | git | 11,168 | 91,289 | 842 | 22,060 |
| vscode | git | 98,584 | 303 | 0 | 116 |
| linux | git | 199,422 | 169 | 38 | 463 |
| rocksdb | git | 71 | 142 | 14 | 29 |
| tensorflow | git | 14,239 | 225 | 5 | 46 |
| Babylon.js | git | 1,781 | 309 | 1 | 8 |
| gcc | git | 17,212 | 2,097 | 35 | 189 |
| jdk | git | 11,300 | 3,061 | 9 | 306 |

scanner-rs reports more findings primarily because it lacks false-positive reduction filters that competitors include: entropy gates on the secret span (not the full match window), safelists for known-benign patterns, and confidence scoring. These are planned additions — once entropy gating and safelists are implemented, we expect these counts to drop substantially. TruffleHog reports fewer because many detectors require live verification to confirm (which was disabled). Kingfisher reports more on node due to its larger rule set (277 rules).


Report generated from 128-run benchmark data and perf stat measurements on ARM Graviton3. All source code references are to specific file:line locations verified at report generation time.