
Shift-left leak detection POC (Innovation Week) #927

Open
zachchentouf wants to merge 14 commits into main from zach.chentouf/shift-left-leak-detection-poc-innovation-week


@zachchentouf zachchentouf commented Jun 22, 2026


What problem are you trying to solve?

SDS detects and redacts sensitive data (PII, payment cards, secrets) at runtime. By then the leak has already happened: a secret was logged, a card was written to a table. Datadog currently has no way to tell a developer at PR time: "this line will create a sensitive-data leak, fix it before it ships."

What is your solution?

A deterministic Privacy Code Scanner built on the analyzer's existing Java taint engine (ddsa.getTaintSources). It flags risky sensitive-data → sink flows in source code:

  • Detects PII/secrets flowing into logs, database writes, and outbound HTTP, attaching the full source→sink taint path to each finding.
  • "What counts as sensitive" is generated from an SDS rule export (a converter maps each SDS rule's name + default_included_keywords onto identifier-name heuristics), keeping the scanner consistent with SDS. Only the credentialed fetch of the live catalog remains a manual step.
  • Ships fully self-contained under misc/shift-left-leak-detection/ (editable rule + build script + sample repo + one-command demo + local findings UI), plus kernel unit tests asserting the rule fires.

This is a POC / Innovation Week artifact — nothing in the shipping crates changes except added test coverage.

Alternatives considered

An LLM-per-PR approach (see dd-source#262313). The deterministic route was chosen for fewer false positives (tunable, SDS-consistent), better scaling on long PRs, and no per-PR cost. The two are complementary: deterministic as a high-recall first layer, LLM as a precision filter — captured in the roadmap.

What the reviewer should know

  • Taint analysis is Java-only and intra-method today; a Python rule demonstrates the same idea with lighter direct + shallow matching until the engine gains a Python MethodFlow.
  • SDS value-regexes are intentionally not used for matching — source code at PR time has identifier names, not runtime values — so the integration maps SDS keywords, not patterns.
  • See misc/shift-left-leak-detection/DESIGN.md for decisions, tradeoffs, the UI story, and the roadmap; PLAN.md for the implementation plan.
  • Verified end-to-end: 11 findings across the Java + Python samples, zero on the clean cases; cargo test privacy_leak_poc passes (12 tests).

Real-world validation

Tested against real Datadog Java code, not just the demo samples:

  • The original motivating PR (logs-backend#109418): caught both real leak lines (raw JWT token + decoded payload). A naive first pass also flagged ~20 claims extracted from the verified token; modeling a laundering boundary (pass-through ops like split/decode keep data tainted; verify/getClaim end the trail) cut it to exactly the 2 real leaks, 0 false positives.
  • Scale test — 5,465 logs-backend Java files (~1.6s): first pass = 43 findings (5 real: emails + a token value; 38 FPs). The FPs fell into four systematic classes; targeted fixes (whole-word camelCase matching, metadata-suffix suppression, logger-receiver check, skip tests) brought it to 19 findings with all 5 true positives kept — precision ~12% → ~26%, recall preserved. Each fix is locked in by a regression test.
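One of the fixes listed above, the logger-receiver check, can be sketched roughly as follows. This is an illustrative Python sketch, not the rule's actual JavaScript: the `LOG_METHODS` set, the `LOGGER_RECEIVER` pattern, and the `is_logging_sink` helper are all assumptions made for the example.

```python
import re

# Hypothetical sink method names and receiver pattern, for illustration only.
LOG_METHODS = {"info", "warn", "error", "debug", "trace"}
# A receiver "looks like a logger" if it is, or ends in, log / logger.
LOGGER_RECEIVER = re.compile(r"(^|[._])(log|logger)$", re.IGNORECASE)


def is_logging_sink(receiver: str, method: str) -> bool:
    """Treat receiver.method(...) as a log sink only for logger-like receivers."""
    if method not in LOG_METHODS:
        return False
    return bool(LOGGER_RECEIVER.search(receiver))


if __name__ == "__main__":
    for receiver in ("logger", "this.log", "Result", "ValidationResult"):
        print(receiver, is_logging_sink(receiver, "error"))
```

With a check of this shape, `Result.error(...)` shares a method name with a logging call but is no longer treated as a sink, because its receiver does not look like a logger.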

Takeaway: deterministic detection finds real leaks fast and its noise is systematic and fixable; the residual semantic FPs (e.g. a pagination continuationToken) are the natural handoff to an LLM second-layer filter.

🤖 Generated with Claude Code

zachchentouf and others added 4 commits June 22, 2026 14:19
Privacy Code Scanner: detect sensitive-data (PII, payment data, secrets)
flows into logging sinks in Java source code at PR time, before the data
leaks at runtime. Built as a local rules-file on top of the analyzer's
existing Java intra-method taint engine (ddsa.getTaintSources) — no
compiler changes, runs fully offline via `-r`.

Self-contained under misc/shift-left-leak-detection/: editable rule
sources (tree-sitter query + visit() JS), a build script that packages
them, a sample Java repo (leaky + clean), a one-command demo runner, and
PLAN/DESIGN/README docs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Generalize the rule from log-only to multiple sink categories: logging,
  database writes (prepareStatement/executeQuery/JPA save/persist/merge),
  and outbound HTTP (RestTemplate). Rename pii-into-log -> pii-into-sink.
- Externalize the sensitive-data vocabulary into sensitive-patterns.json,
  an SDS-shaped category/keyword library injected into the rule at build
  time, so the real SDS catalog can be dropped in without code changes.
- Add DbService/ApiClient sample files (leaky + clean cases per sink type).
- Add a self-contained local findings UI (ui/ + view-findings.sh) that
  renders findings and source->sink taint paths from the SARIF.

Verified: 8 findings (UserService 4 log, DbService 2 db, ApiClient 2 http);
clean methods and SafeService produce zero.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
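For readers unfamiliar with how a taint path is carried in SARIF, here is a minimal Python sketch of pulling the sink line and the path lines out of a SARIF log. It relies only on the standard SARIF 2.1.0 layout (`runs[].results[].codeFlows[].threadFlows[].locations[]`); the `taint_paths` helper is hypothetical, and the order in which the analyzer emits path steps is not assumed, so steps are kept in file order of appearance.

```python
def _start_line(physical_holder):
    """Read physicalLocation.region.startLine from a SARIF location-like dict."""
    return (
        physical_holder.get("physicalLocation", {})
        .get("region", {})
        .get("startLine")
    )


def taint_paths(sarif):
    """Return (rule_id, sink_line, path_lines) for every result in a SARIF log."""
    out = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            locations = result.get("locations", [])
            sink_line = _start_line(locations[0]) if locations else None
            path = []
            for code_flow in result.get("codeFlows", []):
                for thread_flow in code_flow.get("threadFlows", []):
                    for step in thread_flow.get("locations", []):
                        line = _start_line(step.get("location", {}))
                        # Keep each line once, in the order the steps appear.
                        if line is not None and line not in path:
                            path.append(line)
            out.append((result.get("ruleId"), sink_line, path))
    return out
```

A viewer can then look up the source text for each returned line to render the path next to the finding.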
Loads the actual rule sources from misc/shift-left-leak-detection/ (with
the same SENSITIVE_PATTERNS injection as build-rules.sh) and runs them
through the kernel JS runtime, so the tests can't drift from the shipped
rule. Asserts findings on PII->log/db/http flows (including the taint
path) and zero findings on clean code, string-literal-only matches, and
sensitive values flowing into non-sink methods.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The ddsa taint engine (flow/java.js) is Java-only, so full taint analysis
isn't available for other languages yet. Add a Python rule that
demonstrates the same source->sink idea with a lighter, documented
technique: direct matches plus a shallow 1-hop assignment lookback within
the enclosing function. Shares the SENSITIVE_PATTERNS library with the
Java rule.

- build-rules.sh now emits a 2-rule ruleset (Java + Python).
- Add a Python sample (leaky + clean) and 3 Python kernel tests.

Verified: 11 findings total (8 Java + 3 Python); clean cases stay silent.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
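The "direct match plus 1-hop assignment lookback" technique described in this commit can be illustrated with Python's own `ast` module. This is only a sketch of the idea, not the PR's tree-sitter rule: the `SENSITIVE` and `SINKS` sets and the `find_leaks` helper are hypothetical, and the lookback is flow-insensitive (it does not check that the assignment precedes the sink call).

```python
import ast

SENSITIVE = {"ssn", "password", "email"}                 # hypothetical keywords
SINKS = {"print", "info", "warning", "error", "debug"}   # hypothetical sink names


def _name_of(node):
    """Identifier a value is known by: `x` for x, `attr` for obj.attr, else None."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        return node.attr
    return None


def find_leaks(source):
    """Return sorted line numbers of sink calls that receive a sensitive value."""
    findings = set()
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, ast.FunctionDef):
            continue
        # 1-hop lookback table: local variable -> name of the value assigned to it.
        assigned = {}
        for node in ast.walk(func):
            if isinstance(node, ast.Assign) and len(node.targets) == 1:
                target, origin = node.targets[0], _name_of(node.value)
                if isinstance(target, ast.Name) and origin:
                    assigned[target.id] = origin
        for node in ast.walk(func):
            if not (isinstance(node, ast.Call) and _name_of(node.func) in SINKS):
                continue
            for arg in node.args:
                name = _name_of(arg)  # string literals yield None and are skipped
                if name is None:
                    continue
                candidates = (name, assigned.get(name, ""))
                if any(c.lower() in SENSITIVE for c in candidates):
                    findings.add(node.lineno)
    return sorted(findings)
```

String literals never match because only identifier names are inspected, which mirrors the "string-literal-only matches produce zero findings" behavior the kernel tests assert for the real rule.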
@zachchentouf zachchentouf requested a review from a team as a code owner June 22, 2026 19:26
Copilot AI review requested due to automatic review settings June 22, 2026 19:26

datadog-datadog-prod-us1-2 Bot commented Jun 22, 2026


🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 85.65% (+0.65%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: f3289a4


def audit(account):
    # VULNERABLE: SSN attribute logged directly.
    print(account.ssn)
Replace the hand-curated pattern stand-in with a real integration: a
converter (rules/sds/sds-to-patterns.py) maps an SDS rule export
(SecretRule schema — rule name + default_included_keywords) onto the
scanner's identifier-name heuristics. SDS value-regexes are intentionally
ignored, since source code at PR time has names, not values.

- sync-sds-patterns.sh regenerates rules/src/sensitive-patterns.json from
  an SDS export (defaults to the checked-in schema-faithful sample);
  build-rules.sh can auto-sync via SDS_RULES_FILE.
- The committed library is now generated from rules/sds/sds-rules.sample.json.
- README documents fetching the live catalog (GET /api/v2/static-analysis/
  secrets/rules) as the only remaining manual step.

Also add the UI piece to the README quick start.

Verified: 11 findings unchanged; cargo test privacy_leak_poc passes (10).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
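A rough sketch of the name-and-keyword mapping this commit describes is shown below. It is not the contents of `sds-to-patterns.py`: the input field `pattern` and the output shape (`category` / `keywords`) are assumptions chosen for illustration; only `name` and `default_included_keywords` come from the description above.

```python
import json


def sds_to_patterns(rules):
    """Map each SDS rule's name + default_included_keywords to a keyword entry.

    Any value regex on the rule (a hypothetical "pattern" field here) is
    deliberately ignored: source code carries identifier names, not values.
    """
    patterns = []
    for rule in rules:
        keywords = sorted(
            {kw.strip().lower() for kw in rule.get("default_included_keywords", []) if kw.strip()}
        )
        if keywords:  # a rule without keywords gives nothing to match names against
            patterns.append({"category": rule["name"], "keywords": keywords})
    return patterns


if __name__ == "__main__":
    export = [
        {"name": "Email Address", "pattern": r"\S+@\S+", "default_included_keywords": ["Email", "e-mail"]},
    ]
    print(json.dumps(sds_to_patterns(export), indent=2))
```

Normalizing keywords to lowercase and de-duplicating them keeps the generated library stable across re-syncs, which matters when the generated file is checked in.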

Copilot AI left a comment


Pull request overview

Adds an Innovation Week “Privacy Code Scanner” POC that demonstrates deterministic sensitive-data → sink detection using the existing analyzer rule system (tree-sitter + JS) and Java taint engine, plus a lightweight Python variant. It ships as a self-contained demo under misc/shift-left-leak-detection/ and adds kernel unit tests to prevent the POC rule from drifting.

Changes:

  • Adds POC rule sources (Java taint + Python shallow), an SDS-shaped sensitive-pattern library, and a build script that packages them into a local rules JSON.
  • Adds runnable demo assets: vulnerable/clean sample repos, run-demo.sh, and a local findings viewer UI + view-findings.sh.
  • Adds kernel tests that execute the actual POC rule sources through the JS runtime.

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 14 comments.

Show a summary per file
| File | Description |
| --- | --- |
| misc/shift-left-leak-detection/view-findings.sh | Generates UI findings from SARIF and serves the local viewer |
| misc/shift-left-leak-detection/ui/style.css | Styling for the local findings viewer |
| misc/shift-left-leak-detection/ui/index.html | Static UI shell for viewing findings |
| misc/shift-left-leak-detection/ui/findings.json | Checked-in sample/enriched findings payload for the UI |
| misc/shift-left-leak-detection/ui/app.js | Renders findings and taint paths in the browser |
| misc/shift-left-leak-detection/sample/src/main/java/com/example/UserService.java | Vulnerable Java logging-sink examples |
| misc/shift-left-leak-detection/sample/src/main/java/com/example/SafeService.java | Clean Java examples (false-positive guard) |
| misc/shift-left-leak-detection/sample/src/main/java/com/example/DbService.java | Vulnerable + clean DB-write sink examples |
| misc/shift-left-leak-detection/sample/src/main/java/com/example/ApiClient.java | Vulnerable + clean outbound-HTTP sink examples |
| misc/shift-left-leak-detection/sample/python/user_service.py | Vulnerable + clean Python logging examples |
| misc/shift-left-leak-detection/run-demo.sh | One-command scan + SARIF summarizer |
| misc/shift-left-leak-detection/rules/src/sensitive-patterns.json | SDS-shaped sensitive-data categories/keywords |
| misc/shift-left-leak-detection/rules/src/pii-into-sink.tsquery | Java tree-sitter query capturing method invocations |
| misc/shift-left-leak-detection/rules/src/pii-into-sink.js | Java rule logic: sink catalog + taint walk + matcher |
| misc/shift-left-leak-detection/rules/src/pii-into-sink-python.tsquery | Python tree-sitter query capturing calls |
| misc/shift-left-leak-detection/rules/src/pii-into-sink-python.js | Python rule logic: direct + shallow assignment matching |
| misc/shift-left-leak-detection/rules/privacy-leak-rules.json | Generated packaged rules file consumed by --rules |
| misc/shift-left-leak-detection/README.md | POC instructions and explanation |
| misc/shift-left-leak-detection/PLAN.md | Implementation plan (now partially out of sync with filenames) |
| misc/shift-left-leak-detection/DESIGN.md | Design/decisions/roadmap (now partially out of sync with filenames) |
| misc/shift-left-leak-detection/build-rules.sh | Packager injecting patterns + computing checksums |
| crates/static-analysis-kernel/src/analysis/ddsa_lib/privacy_leak_poc_tests.rs | Kernel tests that execute the POC rules |
| crates/static-analysis-kernel/src/analysis/ddsa_lib.rs | Wires the new test module behind cfg(test) |


Comment on lines +32 to +35
import json, os, functools

sarif = json.load(open(os.environ["SARIF_OUT"]))
sample = os.environ["SAMPLE"]
Comment on lines +69 to +75
for step in reversed(tf.get("locations", [])):
    p = step.get("location", {}).get("physicalLocation", {})
    su = p.get("artifactLocation", {}).get("uri", uri)
    sl = p.get("region", {}).get("startLine")
    if sl and sl not in seen:
        seen.add(sl)
        flow.append({"line": sl, "code": src(su, sl)})
Comment on lines +37 to +42
@functools.lru_cache(maxsize=None)
def lines_of(uri):
    try:
        return open(os.path.join(sample, uri)).read().splitlines()
    except OSError:
        return []

RULES="$HERE/rules/privacy-leak-rules.json"
SAMPLE="$HERE/sample"
SARIF_OUT="/tmp/leak.sarif"
Comment on lines +72 to +79
const head = document.createElement("div");
head.className = "card-head";
head.innerHTML =
  `<span class="badge ${f.sink}">${SINK_LABELS[f.sink] || f.sink}</span>` +
  `<span class="badge sev">${(f.severity || "warning").toLowerCase()}</span>` +
  `<span class="category">${escapeHtml(f.category || "sensitive data")}</span>` +
  `<span class="loc"><code>${escapeHtml(f.file)}:${f.line}</code></span>`;
el.appendChild(head);
Comment on lines +71 to +74
┌─────────────────────┐     build-rules.sh      ┌──────────────────────────┐
│ rules/src/          │     base64 + sha256     │ rules/                   │
│ pii-into-log.tsquery│ ───────────────────────▶│ privacy-leak-rules.json  │
│ pii-into-log.js     │   (assemble ruleset)    │ (the -r input)           │
Comment on lines +156 to +157
| 1 | Tree-sitter query (sink finder) | `rules/src/pii-into-log.tsquery` | Compiles; matches `logger.x(...)` calls in the sample |
| 2 | Rule logic (`visit()` + taint walk + PII heuristics) | `rules/src/pii-into-log.js` | Flags PII→log flows, ignores clean logs |
Comment on lines +54 to +60
@functools.lru_cache(maxsize=None)
def lines_of(uri):
    try:
        with open(os.path.join(sample, uri)) as fh:
            return fh.read().splitlines()
    except OSError:
        return []
<main>
<div id="filters" class="filters"></div>
<section id="findings" class="findings"></section>
<p id="empty" class="empty" hidden>No findings loaded. Run <code>./view-findings.sh</code>.</p>
Comment on lines +26 to +33
// ---- Sink catalog: method name -> human label for the message ----
const SINKS = new Map([
// Logging (slf4j / log4j / java.util.logging / System.out|err)
["info", "a log statement"], ["warn", "a log statement"], ["warning", "a log statement"],
["error", "a log statement"], ["debug", "a log statement"], ["trace", "a log statement"],
["fatal", "a log statement"], ["severe", "a log statement"], ["config", "a log statement"],
["fine", "a log statement"], ["finer", "a log statement"], ["finest", "a log statement"],
["log", "a log statement"], ["println", "a log statement"], ["print", "a log statement"],
zachchentouf and others added 9 commits June 23, 2026 11:31
Validated against the real logs-backend#109418 leak. First pass caught
the actual leak (raw JWT token + decoded payload logged) but also fired
~20 false positives on claims extracted from the *verified* token.

Two fixes in pii-into-sink.js:
- Match a value's name, not its full expression text: a method_invocation
  matches on its method name (verify/getStringClaim), not its receiver or
  args, so tokenVerifier.verify(jwtToken) is no longer read as a credential.
- Laundering boundary: stop walking a taint flow at the first non-pass-through
  call. split/decode/substring preserve the raw secret; verify/getStringClaim
  produce a derived value and end the trail.

Result on that file: 2 real findings, 0 false positives (was 22). Locked in
by a new kernel test (flags_raw_token_but_not_verified_claims). Sample still
11 findings; all 11 kernel tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
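The laundering boundary can be pictured as a walk over a taint flow that stops at the first call that is not a pass-through. The Python sketch below models a flow as a list of `(kind, name)` steps ordered from the sink back toward the source. That representation, the `SENSITIVE_WORDS` set, and the deliberately naive substring check are assumptions for illustration; only the pass-through method names come from the commit message.

```python
# Pass-through operations named in the commit message: they preserve the raw secret.
PASS_THROUGH = {"split", "decode", "substring"}
SENSITIVE_WORDS = {"token", "password"}  # hypothetical keywords


def first_sensitive_origin(flow):
    """Walk a flow from the sink back toward its sources.

    Each step is ("var", name) or ("call", method_name); calls are matched on
    the method name only, never on their receiver or arguments.
    Returns the first sensitive name reached, or None if the trail ends first.
    """
    for kind, name in flow:
        if any(word in name.lower() for word in SENSITIVE_WORDS):
            return name
        if kind == "call" and name not in PASS_THROUGH:
            return None  # laundering boundary: the value is derived, stop here
    return None


if __name__ == "__main__":
    leaked = [("var", "payload"), ("call", "decode"), ("var", "parts"),
              ("call", "split"), ("var", "jwtToken")]
    verified = [("var", "subject"), ("call", "getStringClaim"), ("var", "claims"),
                ("call", "verify"), ("var", "jwtToken")]
    print(first_sensitive_origin(leaked))    # reaches jwtToken through pass-throughs
    print(first_sensitive_origin(verified))  # trail ends at getStringClaim
```

The design choice is that a value produced by a non-pass-through call is treated as a new, derived value, so claims read from a verified token do not inherit the raw token's sensitivity.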
Scanned a 5,465-file logs-backend slice. Caught real leaks (emails logged
in an auth filter / GCP clients, a token value in a converter) but also
~38 false positives. Four fixes, each targeting one FP class:

- Whole-word, camelCase-aware matching instead of substrings: "pan" no
  longer matches "span"/"skippedPaNotEnabled", "ssn" not "className".
- Metadata suppression: a keyword followed by a metadata word (path/type/
  id/class/...) is skipped (secretPath, credentialType, tokenId), while
  whole-name matches (userId, apiKey, secretKey) still fire.
- Logger-receiver check: logging sinks must have a logger-like receiver,
  so Result.error(...) / ValidationResult.error(...) are no longer sinks.
- Skip test sources.
- Drop ultra-generic SDS keywords (pan/mobile/swift) in the converter.

Result on the slice: 43 -> 19 findings, all 5 true positives kept
(precision ~12% -> ~26%). Sample still 11; logs-backend#109418 still 2;
12 kernel tests pass (new precision_fixes_from_scale_test).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
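The first two fixes (whole-word camelCase matching and metadata suppression) can be sketched in Python as below. The `KEYWORDS` list is a hypothetical vocabulary, `METADATA` holds only the four metadata words the commit names (the commit's own list is longer and elided), and the helper names are invented for the example; the real rule is JavaScript.

```python
import re

# Hypothetical vocabulary: each keyword is a sequence of lowercase words.
KEYWORDS = [["ssn"], ["pan"], ["secret"], ["credential"], ["token"],
            ["user", "id"], ["api", "key"], ["secret", "key"]]
# Metadata words named in the commit message (its full list is elided there).
METADATA = {"path", "type", "id", "class"}

_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def words(identifier):
    """Split a camelCase or snake_case identifier into lowercase words."""
    return [w.lower() for w in _WORD.findall(identifier)]


def matches_sensitive(identifier):
    """True if a keyword appears as whole words and is not followed by a metadata word."""
    ws = words(identifier)
    for kw in KEYWORDS:
        n = len(kw)
        for i in range(len(ws) - n + 1):
            if ws[i:i + n] != kw:
                continue
            follower = ws[i + n] if i + n < len(ws) else None
            if follower in METADATA:
                continue  # e.g. secretPath describes a secret, it does not hold one
            return True
    return False


if __name__ == "__main__":
    for name in ("span", "className", "secretPath", "tokenId", "userId", "apiKey"):
        print(name, matches_sensitive(name))
```

Because multi-word keywords such as `user id` are matched as a whole before suppression is considered, `userId` still fires even though `id` is a metadata word, while `tokenId` is skipped.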
…ft-leak-detection-poc-innovation-week

improve Python scanning
…ft-leak-detection-poc-innovation-week

add LLM calls scanning
Acting on PR review feedback:

- LLM classifier (llm/classify.py): sends each deterministic finding +
  code context to an AI-Gateway / OpenAI-compatible endpoint for a
  detect/ignore verdict. Prompt + one-word DETECT/IGNORE protocol mirror
  Datadog's generic-secrets validator (sds-shared-library), adapted from
  "is it a secret literal" to "is this a real leak", and cover the log/
  file/db/http/LLM sink types. Caching, dry-run, and --self-test against
  10 human-labeled logs-backend findings. Validated with a stand-in model:
  3/3 true positives kept, 7/7 false positives removed (precision 30->100%).
- Developer "ignore" affordance: documented + demonstrated inline
  no-dd-sa:privacy-leak/pii-into-sink suppression (SuppressionExample.java);
  run-demo/view-findings exclude suppressed findings (11 active + 1 suppressed).
- PR integration: example diff-aware GitHub Actions workflow that uploads
  SARIF to Code Scanning, plus docs for the Datadog Code Security path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
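The one-word DETECT/IGNORE protocol with caching and dry-run might look roughly like the Python sketch below. It is not `llm/classify.py`: the model call is abstracted as a caller-supplied `ask_model` callable (a stand-in for the real endpoint), the finding shape is hypothetical, and treating an unclear reply as "detect" is an assumed fail-safe choice.

```python
import hashlib
import json


def parse_verdict(reply):
    """Interpret a one-word DETECT/IGNORE reply; anything unclear keeps the finding."""
    tokens = reply.strip().split()
    word = tokens[0].strip(".,!:").upper() if tokens else ""
    return "ignore" if word == "IGNORE" else "detect"


def classify(findings, ask_model, cache=None, dry_run=False):
    """Return the findings to keep, consulting the cache before calling the model.

    `ask_model(finding) -> str` is supplied by the caller.
    In dry-run mode no model call is made and uncached findings are kept.
    """
    cache = {} if cache is None else cache
    kept = []
    for finding in findings:
        key = hashlib.sha256(json.dumps(finding, sort_keys=True).encode()).hexdigest()
        if key not in cache:
            if dry_run:
                kept.append(finding)
                continue
            cache[key] = parse_verdict(ask_model(finding))
        if cache[key] == "detect":
            kept.append(finding)
    return kept
```

Keying the cache on a hash of the finding means re-running the classifier on an unchanged PR makes no further model calls, and any finding whose code context changes is re-evaluated.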
- run-demo.sh / view-findings.sh take an optional DIR arg (default sample/),
  so the demo can point at a real repo and show real findings in the UI.
- Auto-handle the *.datadog.yml-vs---rules conflict: move the repo's config
  aside for the scan, restore it after (keeps the target repo clean).
- Add fetch-logs-backend-slice.sh: blobless + sparse checkout of a few
  logs-backend domains over HTTPS for a quick real-repo demo.
- README: document scanning a real repo / logs-backend.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
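The "move the config aside, restore it after" behavior is a natural fit for a try/finally pattern. The scripts in this PR are shell; the Python context manager below is only a sketch of the same idea, and the default file name `static-analysis.datadog.yml`, the backup suffix, and the helper name are assumptions.

```python
import contextlib
import os


@contextlib.contextmanager
def config_moved_aside(repo_dir, config_name="static-analysis.datadog.yml"):
    """Temporarily rename a repo's analyzer config, restoring it on exit.

    Yields True if a config file was found and moved, False otherwise.
    """
    original = os.path.join(repo_dir, config_name)
    backup = original + ".leak-poc.bak"  # hypothetical backup suffix
    moved = os.path.exists(original)
    if moved:
        os.replace(original, backup)
    try:
        yield moved
    finally:
        # Restore even if the scan raised, so the target repo stays clean.
        if moved:
            os.replace(backup, original)
```

The scan would run inside the `with config_moved_aside(repo):` block; because restoration sits in `finally`, a failed scan still leaves the repository as it was found.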