Skip to content

[BUG] harden single-node deployment against memory-safety and DoS attacks#7098

Open
rescrv wants to merge 3 commits into
mainfrom
rescrv/bugfixes
Open

[BUG] harden single-node deployment against memory-safety and DoS attacks#7098
rescrv wants to merge 3 commits into
mainfrom
rescrv/bugfixes

Conversation

@rescrv
Copy link
Copy Markdown
Contributor

@rescrv rescrv commented May 20, 2026

Description of changes

Address five chained vulnerabilities in the single-node (sqlite + local HNSW)
deployment that together enable unauthenticated heap corruption with a
credible path to remote code execution.

Dimension TOCTOU (CWE-367 → CWE-787/CWE-125):

  • Add a per-collection mutex around dimension initialization in
    validate_embedding so concurrent /add requests cannot race to set
    conflicting dimensions.
  • Validate persisted HNSW header dimensionality on cold-load against the
    sysdb dimension; reject mismatches before constructing the index.
  • Stage id_map mutations in a pending copy during apply_log_chunk and
    commit only on success, preventing partial state on error.
  • Validate embedding dimensionality per-record in the HNSW writer.

Unbounded HNSW configuration (CWE-20/CWE-1284 → CWE-770):

  • Add upper bounds to all HNSW tunables: max_neighbors ≤ 128,
    ef_construction ≤ 4096, ef_search ≤ 4096, resize_factor ≤ 10.0,
    sync_threshold ≤ 4096.
  • Enforce bounds at every ingestion point: schema validation,
    InternalCollectionConfiguration, UpdateCollectionConfiguration,
    and the compaction pipeline.
  • Skip HNSW backfill for collections with invalid persisted config
    but still advance the metadata watermark so log purging proceeds.

Concurrent cache-miss race (CWE-362 → CWE-664):

  • Introduce a partitioned async mutex (AysncPartitionedMutex) in
    LocalSegmentManager keyed on IndexUuid. Double-check the cache
    after acquiring the lock to prevent two HnswIndex instances from
    sharing the same on-disk segment directory.

delete_collection resource leak (CWE-401/CWE-772 → CWE-400):

  • Actively evict and close HNSW indexes on collection deletion.
  • Delete on-disk segment directories via tombstone-rename + periodic
    background cleanup of orphaned HNSW index dirs (6-hour age gate).
  • Purge sqlite log rows for deleted collections.
  • Propagate segment deletion through delete_database as well.
  • Serialize database-level create/delete behind a per-database lock.

Log purge skipped on backfill error (CWE-754 → CWE-400):

  • Always dispatch PurgeLogsMessage after BackfillMessage regardless
    of backfill outcome; the purge handler independently computes
    watermarks from persisted segment state.
  • For collections with invalid HNSW config, fall back to the
    metadata-segment watermark so logs do not accumulate unboundedly.

Where-clause stack overflow:

  • Cap $or/$and recursion depth at 64 in both parse_where and
    parse_where_document to prevent attacker-triggered stack overflow
    that could crash the tokio runtime.

Test plan

Local testing of new tests + CI for our full suite.

Migration plan

This code migrates itself.

Observability plan

N/A

Documentation Changes

Will need to update docs to document limitations and recovery for corrupted hnsw segments.

Co-authored-by: AI

…acks

Address five chained vulnerabilities in the single-node (sqlite + local HNSW)
deployment that together enable unauthenticated heap corruption with a
credible path to remote code execution.

Dimension TOCTOU (CWE-367 → CWE-787/CWE-125):
- Add a per-collection mutex around dimension initialization in
  validate_embedding so concurrent /add requests cannot race to set
  conflicting dimensions.
- Validate persisted HNSW header dimensionality on cold-load against the
  sysdb dimension; reject mismatches before constructing the index.
- Stage id_map mutations in a pending copy during apply_log_chunk and
  commit only on success, preventing partial state on error.
- Validate embedding dimensionality per-record in the HNSW writer.

Unbounded HNSW configuration (CWE-20/CWE-1284 → CWE-770):
- Add upper bounds to all HNSW tunables: max_neighbors ≤ 128,
  ef_construction ≤ 4096, ef_search ≤ 4096, resize_factor ≤ 10.0,
  sync_threshold ≤ 4096.
- Enforce bounds at every ingestion point: schema validation,
  InternalCollectionConfiguration, UpdateCollectionConfiguration,
  and the compaction pipeline.
- Skip HNSW backfill for collections with invalid persisted config
  but still advance the metadata watermark so log purging proceeds.

Concurrent cache-miss race (CWE-362 → CWE-664):
- Introduce a partitioned async mutex (AysncPartitionedMutex) in
  LocalSegmentManager keyed on IndexUuid. Double-check the cache
  after acquiring the lock to prevent two HnswIndex instances from
  sharing the same on-disk segment directory.

delete_collection resource leak (CWE-401/CWE-772 → CWE-400):
- Actively evict and close HNSW indexes on collection deletion.
- Delete on-disk segment directories via tombstone-rename + periodic
  background cleanup of orphaned HNSW index dirs (6-hour age gate).
- Purge sqlite log rows for deleted collections.
- Propagate segment deletion through delete_database as well.
- Serialize database-level create/delete behind a per-database lock.

Log purge skipped on backfill error (CWE-754 → CWE-400):
- Always dispatch PurgeLogsMessage after BackfillMessage regardless
  of backfill outcome; the purge handler independently computes
  watermarks from persisted segment state.
- For collections with invalid HNSW config, fall back to the
  metadata-segment watermark so logs do not accumulate unboundedly.

Where-clause stack overflow:
- Cap $or/$and recursion depth at 64 in both parse_where and
  parse_where_document to prevent attacker-triggered stack overflow
  that could crash the tokio runtime.

Co-authored-by: AI
@github-actions
Copy link
Copy Markdown

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

@blacksmith-sh

This comment has been minimized.

@blacksmith-sh

This comment has been minimized.

@rescrv rescrv requested a review from HammadB May 22, 2026 14:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant