Shift-left leak detection POC (Innovation Week) by zachchentouf · Pull Request #927 · DataDog/datadog-static-analyzer

zachchentouf · 2026-06-22T19:26:56Z

What problem are you trying to solve?

SDS detects and redacts sensitive data (PII, payment cards, secrets) at runtime. By then the leak already happened — a secret was logged, a card was written to a table. There is no Datadog way to tell a developer at PR time: "this line will create a sensitive-data leak, fix it before it ships."

What is your solution?

A deterministic Privacy Code Scanner built on the analyzer's existing Java taint engine (ddsa.getTaintSources). It flags risky sensitive-data → sink flows in source code:

Detects PII/secrets flowing into logs, database writes, and outbound HTTP, attaching the full source→sink taint path to each finding.
"What counts as sensitive" is generated from an SDS rule export (a converter maps each SDS rule's name + default_included_keywords onto identifier-name heuristics), keeping the scanner consistent with Code Security. Only the credentialed fetch of the live catalog remains a manual step.
Ships fully self-contained under misc/shift-left-leak-detection/ (editable rule + build script + sample repo + one-command demo + local findings UI), plus kernel unit tests asserting the rule fires.

This is a POC / Innovation Week artifact — nothing in the shipping crates changes except added test coverage.

Alternatives considered

An LLM-per-PR approach (see dd-source#262313). The deterministic route was chosen for fewer false positives (tunable, SDS-consistent), better scaling on long PRs, and no per-PR cost. The two are complementary: deterministic as a high-recall first layer, LLM as a precision filter — captured in the roadmap.

What the reviewer should know

Taint analysis is Java-only and intra-method today; a Python rule demonstrates the same idea with lighter direct + shallow matching until the engine gains a Python MethodFlow.
SDS value-regexes are intentionally not used for matching — source code at PR time has identifier names, not runtime values — so the integration maps SDS keywords, not patterns.
See misc/shift-left-leak-detection/DESIGN.md for decisions, tradeoffs, the UI story, and the roadmap; PLAN.md for the implementation plan.
Verified end-to-end: 11 findings across the Java + Python samples, zero on the clean cases; cargo test privacy_leak_poc passes (12 tests).

Real-world validation

Tested against real Datadog Java code, not just the demo samples:

The original motivating PR (logs-backend#109418): caught both real leak lines (raw JWT token + decoded payload). A naive first pass also flagged ~20 claims extracted from the verified token; modeling a laundering boundary (pass-through ops like split/decode keep data tainted; verify/getClaim end the trail) cut it to exactly the 2 real leaks, 0 false positives.
Scale test — 5,465 logs-backend Java files (~1.6s): first pass = 43 findings (5 real: emails + a token value; 38 FPs). The FPs fell into four systematic classes; targeted fixes (whole-word camelCase matching, metadata-suffix suppression, logger-receiver check, skip tests) brought it to 19 findings with all 5 true positives kept — precision ~12% → ~26%, recall preserved. Each fix is locked in by a regression test.
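The precision figures in the scale test follow directly from the counts given there. A quick check, using only numbers stated in this description (the helper name `precision` is just for illustration):

```python
def precision(true_positives, findings):
    """Fraction of reported findings that are real leaks."""
    return true_positives / findings


# Scale test: 5 real leaks out of 43 findings, then 5 out of 19 after the fixes.
first_pass = precision(5, 43)
after_fixes = precision(5, 19)
print(f"first pass: {first_pass:.1%}, after fixes: {after_fixes:.1%}")
```

Recall is unchanged because the numerator (5 true positives) is the same in both runs; only the denominator shrinks.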

Takeaway: deterministic detection finds real leaks fast and its noise is systematic and fixable; the residual semantic FPs (e.g. a pagination continuationToken) are the natural handoff to an LLM second-layer filter.

🤖 Generated with Claude Code

Privacy Code Scanner: detect sensitive-data (PII, payment data, secrets) flows into logging sinks in Java source code at PR time, before the data leaks at runtime. Built as a local rules-file on top of the analyzer's existing Java intra-method taint engine (ddsa.getTaintSources): no compiler changes, runs fully offline via `-r`. Self-contained under misc/shift-left-leak-detection/: editable rule sources (tree-sitter query + visit() JS), a build script that packages them, a sample Java repo (leaky + clean), a one-command demo runner, and PLAN/DESIGN/README docs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- Generalize the rule from log-only to multiple sink categories: logging, database writes (prepareStatement/executeQuery/JPA save/persist/merge), and outbound HTTP (RestTemplate). Rename pii-into-log -> pii-into-sink.
- Externalize the sensitive-data vocabulary into sensitive-patterns.json, an SDS-shaped category/keyword library injected into the rule at build time, so the real SDS catalog can be dropped in without code changes.
- Add DbService/ApiClient sample files (leaky + clean cases per sink type).
- Add a self-contained local findings UI (ui/ + view-findings.sh) that renders findings and source->sink taint paths from the SARIF.

Verified: 8 findings (UserService 4 log, DbService 2 db, ApiClient 2 http); clean methods and SafeService produce zero.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Loads the actual rule sources from misc/shift-left-leak-detection/ (with the same SENSITIVE_PATTERNS injection as build-rules.sh) and runs them through the kernel JS runtime, so the tests can't drift from the shipped rule. Asserts findings on PII->log/db/http flows (including the taint path) and zero findings on clean code, string-literal-only matches, and sensitive values flowing into non-sink methods.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The ddsa taint engine (flow/java.js) is Java-only, so full taint analysis isn't available for other languages yet. Add a Python rule that demonstrates the same source->sink idea with a lighter, documented technique: direct matches plus a shallow 1-hop assignment lookback within the enclosing function. Shares the SENSITIVE_PATTERNS library with the Java rule.

- build-rules.sh now emits a 2-rule ruleset (Java + Python).
- Add a Python sample (leaky + clean) and 3 Python kernel tests.

Verified: 11 findings total (8 Java + 3 Python); clean cases stay silent.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
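The "direct match plus 1-hop assignment lookback" idea can be illustrated outside the analyzer. The sketch below is not the PR's rule (which is a tree-sitter query plus JS); it uses Python's standard `ast` module as a stand-in, and the `SENSITIVE` and `SINKS` sets are hypothetical placeholders for the generated pattern library and sink catalog.

```python
import ast

# Hypothetical vocabulary and sink list, for illustration only.
SENSITIVE = {"ssn", "email", "password", "token"}
SINKS = {"print", "info", "warning", "error", "debug"}


def leaf_name(node):
    """Name a value by its last identifier: `x` -> "x", `a.ssn` -> "ssn"."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        return node.attr
    return None  # literals, calls, etc. carry no identifier name here


def find_leaks(source):
    """Return (line, sensitive_name, how) for sensitive values passed to a sink."""
    findings = []
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, ast.FunctionDef):
            continue
        # Simple `target = value` assignments in this function, in line order.
        assigns = sorted(
            ((n.lineno, n.targets[0].id, leaf_name(n.value))
             for n in ast.walk(func)
             if isinstance(n, ast.Assign) and len(n.targets) == 1
             and isinstance(n.targets[0], ast.Name)),
            key=lambda a: a[0],
        )
        for call in ast.walk(func):
            if not (isinstance(call, ast.Call) and leaf_name(call.func) in SINKS):
                continue
            for arg in call.args:
                name = leaf_name(arg)
                if name is None:
                    continue
                if name.lower() in SENSITIVE:
                    findings.append((call.lineno, name, "direct"))
                    continue
                # 1-hop lookback: latest assignment to `name` before the sink.
                prior = [src for line, tgt, src in assigns
                         if tgt == name and line < call.lineno and src]
                if prior and prior[-1].lower() in SENSITIVE:
                    findings.append((call.lineno, prior[-1], "1-hop"))
    return findings


SAMPLE = '''
def audit(account):
    print(account.ssn)

def notify(user):
    addr = user.email
    print(addr)

def clean(user):
    count = user.visits
    print(count)
    print("ssn")
'''
print(find_leaks(SAMPLE))
```

A string literal such as `"ssn"` is not flagged because only identifier names are inspected, which mirrors the "string-literal-only matches" clean case the kernel tests assert. Anything that needs more than one assignment hop is out of reach for this technique by design.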

datadog-datadog-prod-us1-2 · 2026-06-22T19:27:06Z

🎯 Code Coverage (details)
• Patch Coverage: 100.00%
• Overall Coverage: 85.65% (+0.65%)

This comment will be updated automatically if new data arrives.

🔗 Commit SHA: f3289a4

+
+def audit(account):
+    # VULNERABLE: SSN attribute logged directly.
+    print(account.ssn)


Replace the hand-curated pattern stand-in with a real integration: a converter (rules/sds/sds-to-patterns.py) maps an SDS rule export (SecretRule schema: rule name + default_included_keywords) onto the scanner's identifier-name heuristics. SDS value-regexes are intentionally ignored, since source code at PR time has names, not values.

- sync-sds-patterns.sh regenerates rules/src/sensitive-patterns.json from an SDS export (defaults to the checked-in schema-faithful sample); build-rules.sh can auto-sync via SDS_RULES_FILE.
- The committed library is now generated from rules/sds/sds-rules.sample.json.
- README documents fetching the live catalog (GET /api/v2/static-analysis/secrets/rules) as the only remaining manual step.

Also add the UI piece to the README quick start.

Verified: 11 findings unchanged; cargo test privacy_leak_poc passes (10).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
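A minimal sketch of what such a keyword-only conversion could look like. This is not the contents of sds-to-patterns.py: the `name` and `default_included_keywords` fields come from the commit message, while the `pattern` field name, the output shape (category mapped to a keyword list), and the `normalize` helper are assumptions made for illustration. The generic-keyword filter reflects the pan/mobile/swift drop mentioned in a later commit.

```python
import re

# Keywords a later commit reports dropping as too generic for identifier matching.
GENERIC_KEYWORDS = {"pan", "mobile", "swift"}


def normalize(keyword):
    """Lower-case and strip separators so "credit card" -> "creditcard" (assumed form)."""
    return re.sub(r"[^a-z0-9]", "", keyword.lower())


def sds_to_patterns(sds_rules):
    """Map SDS-style rules to {rule name: sorted keywords}; value regexes are ignored."""
    library = {}
    for rule in sds_rules:
        keywords = {normalize(k) for k in rule.get("default_included_keywords", [])}
        keywords -= GENERIC_KEYWORDS
        keywords.discard("")
        if keywords:  # a rule with no usable keywords contributes no category
            library[rule["name"]] = sorted(keywords)
    return library


# Hypothetical export entries; "pattern" stands in for the unused value regex.
EXPORT = [
    {"name": "Credit Card Number", "pattern": r"\d{16}",
     "default_included_keywords": ["credit card", "PAN", "cc number"]},
    {"name": "SWIFT Code", "pattern": r"[A-Z]{8}",
     "default_included_keywords": ["swift"]},
]
print(sds_to_patterns(EXPORT))
```

Here the second rule disappears entirely because its only keyword is on the generic list, which is the trade-off that filter makes: fewer substring-style false positives at the cost of not covering that category by name.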

Copilot

Pull request overview

Adds an Innovation Week “Privacy Code Scanner” POC that demonstrates deterministic sensitive-data → sink detection using the existing analyzer rule system (tree-sitter + JS) and Java taint engine, plus a lightweight Python variant. It ships as a self-contained demo under misc/shift-left-leak-detection/ and adds kernel unit tests to prevent the POC rule from drifting.

Changes:

Adds POC rule sources (Java taint + Python shallow), an SDS-shaped sensitive-pattern library, and a build script that packages them into a local rules JSON.
Adds runnable demo assets: vulnerable/clean sample repos, run-demo.sh, and a local findings viewer UI + view-findings.sh.
Adds kernel tests that execute the actual POC rule sources through the JS runtime.

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 14 comments.

Show a summary per file

File	Description
misc/shift-left-leak-detection/view-findings.sh	Generates UI findings from SARIF and serves the local viewer
misc/shift-left-leak-detection/ui/style.css	Styling for the local findings viewer
misc/shift-left-leak-detection/ui/index.html	Static UI shell for viewing findings
misc/shift-left-leak-detection/ui/findings.json	Checked-in sample/enriched findings payload for the UI
misc/shift-left-leak-detection/ui/app.js	Renders findings and taint paths in the browser
misc/shift-left-leak-detection/sample/src/main/java/com/example/UserService.java	Vulnerable Java logging-sink examples
misc/shift-left-leak-detection/sample/src/main/java/com/example/SafeService.java	Clean Java examples (false-positive guard)
misc/shift-left-leak-detection/sample/src/main/java/com/example/DbService.java	Vulnerable + clean DB-write sink examples
misc/shift-left-leak-detection/sample/src/main/java/com/example/ApiClient.java	Vulnerable + clean outbound-HTTP sink examples
misc/shift-left-leak-detection/sample/python/user_service.py	Vulnerable + clean Python logging examples
misc/shift-left-leak-detection/run-demo.sh	One-command scan + SARIF summarizer
misc/shift-left-leak-detection/rules/src/sensitive-patterns.json	SDS-shaped sensitive-data categories/keywords
misc/shift-left-leak-detection/rules/src/pii-into-sink.tsquery	Java tree-sitter query capturing method invocations
misc/shift-left-leak-detection/rules/src/pii-into-sink.js	Java rule logic: sink catalog + taint walk + matcher
misc/shift-left-leak-detection/rules/src/pii-into-sink-python.tsquery	Python tree-sitter query capturing calls
misc/shift-left-leak-detection/rules/src/pii-into-sink-python.js	Python rule logic: direct + shallow assignment matching
misc/shift-left-leak-detection/rules/privacy-leak-rules.json	Generated packaged rules file consumed by `--rules`
misc/shift-left-leak-detection/README.md	POC instructions and explanation
misc/shift-left-leak-detection/PLAN.md	Implementation plan (now partially out of sync with filenames)
misc/shift-left-leak-detection/DESIGN.md	Design/decisions/roadmap (now partially out of sync with filenames)
misc/shift-left-leak-detection/build-rules.sh	Packager injecting patterns + computing checksums
crates/static-analysis-kernel/src/analysis/ddsa_lib/privacy_leak_poc_tests.rs	Kernel tests that execute the POC rules
crates/static-analysis-kernel/src/analysis/ddsa_lib.rs	Wires the new test module behind `cfg(test)`

+import json, os, functools
+
+sarif = json.load(open(os.environ["SARIF_OUT"]))
+sample = os.environ["SAMPLE"]


+                for step in reversed(tf.get("locations", [])):
+                    p = step.get("location", {}).get("physicalLocation", {})
+                    su = p.get("artifactLocation", {}).get("uri", uri)
+                    sl = p.get("region", {}).get("startLine")
+                    if sl and sl not in seen:
+                        seen.add(sl)
+                        flow.append({"line": sl, "code": src(su, sl)})


+@functools.lru_cache(maxsize=None)
+def lines_of(uri):
+    try:
+        return open(os.path.join(sample, uri)).read().splitlines()
+    except OSError:
+        return []


+
+RULES="$HERE/rules/privacy-leak-rules.json"
+SAMPLE="$HERE/sample"
+SARIF_OUT="/tmp/leak.sarif"


+  const head = document.createElement("div");
+  head.className = "card-head";
+  head.innerHTML =
+    `<span class="badge ${f.sink}">${SINK_LABELS[f.sink] || f.sink}</span>` +
+    `<span class="badge sev">${(f.severity || "warning").toLowerCase()}</span>` +
+    `<span class="category">${escapeHtml(f.category || "sensitive data")}</span>` +
+    `<span class="loc"><code>${escapeHtml(f.file)}:${f.line}</code></span>`;
+  el.appendChild(head);


+  ┌─────────────────────┐     build-rules.sh      ┌──────────────────────────┐
+  │ rules/src/           │  base64 + sha256        │ rules/                   │
+  │  pii-into-log.tsquery│ ───────────────────────▶│  privacy-leak-rules.json │
+  │  pii-into-log.js     │  (assemble ruleset)     │  (the -r input)          │


+| 1 | Tree-sitter query (sink finder) | `rules/src/pii-into-log.tsquery` | Compiles; matches `logger.x(...)` calls in the sample |
+| 2 | Rule logic (`visit()` + taint walk + PII heuristics) | `rules/src/pii-into-log.js` | Flags PII→log flows, ignores clean logs |


+@functools.lru_cache(maxsize=None)
+def lines_of(uri):
+    try:
+        with open(os.path.join(sample, uri)) as fh:
+            return fh.read().splitlines()
+    except OSError:
+        return []


+  <main>
+    <div id="filters" class="filters"></div>
+    <section id="findings" class="findings"></section>
+    <p id="empty" class="empty" hidden>No findings loaded. Run <code>./view-findings.sh</code>.</p>


+    // ---- Sink catalog: method name -> human label for the message ----
+    const SINKS = new Map([
+        // Logging (slf4j / log4j / java.util.logging / System.out|err)
+        ["info", "a log statement"], ["warn", "a log statement"], ["warning", "a log statement"],
+        ["error", "a log statement"], ["debug", "a log statement"], ["trace", "a log statement"],
+        ["fatal", "a log statement"], ["severe", "a log statement"], ["config", "a log statement"],
+        ["fine", "a log statement"], ["finer", "a log statement"], ["finest", "a log statement"],
+        ["log", "a log statement"], ["println", "a log statement"], ["print", "a log statement"],


Validated against the real logs-backend#109418 leak. First pass caught the actual leak (raw JWT token + decoded payload logged) but also fired ~20 false positives on claims extracted from the *verified* token. Two fixes in pii-into-sink.js:

- Match a value's name, not its full expression text: a method_invocation matches on its method name (verify/getStringClaim), not its receiver or args, so tokenVerifier.verify(jwtToken) is no longer read as a credential.
- Laundering boundary: stop walking a taint flow at the first non-pass-through call. split/decode/substring preserve the raw secret; verify/getStringClaim produce a derived value and end the trail.

Result on that file: 2 real findings, 0 false positives (was 22). Locked in by a new kernel test (flags_raw_token_but_not_verified_claims). Sample still 11 findings; all 11 kernel tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
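The two fixes can be sketched on a simplified model of a taint path. This is an illustration in Python rather than the rule's JS: the `Step` type, the exact-match `SENSITIVE` set, and the path representation (sink side first, source side last) are assumptions. Only the pass-through method names are taken from the commit message.

```python
from dataclasses import dataclass

# Pass-through operations named in the commit message: they keep the raw secret.
PASS_THROUGH = {"split", "decode", "substring"}
# Hypothetical sensitive names, matched exactly (lower-cased) for brevity.
SENSITIVE = {"jwttoken", "token", "password"}


@dataclass
class Step:
    kind: str  # "identifier" or "call"
    name: str  # variable name, or the method name only (never receiver or args)


def find_tainted_source(path):
    """Walk from the sink back toward the source; return a sensitive name or None."""
    for step in path:
        if step.name.lower() in SENSITIVE:
            return step.name
        if step.kind == "call" and step.name not in PASS_THROUGH:
            return None  # laundering boundary: the value is derived, stop walking
    return None


# log(payload); payload = decode(parts[1]); parts = jwtToken.split(".")
raw_flow = [Step("identifier", "payload"), Step("call", "decode"),
            Step("identifier", "parts"), Step("call", "split"),
            Step("identifier", "jwtToken")]

# log(sub); sub = claims.getStringClaim("sub"); claims = tokenVerifier.verify(jwtToken)
verified_flow = [Step("identifier", "sub"), Step("call", "getStringClaim"),
                 Step("identifier", "claims"), Step("call", "verify"),
                 Step("identifier", "jwtToken")]

print(find_tainted_source(raw_flow), find_tainted_source(verified_flow))
```

The second flow ends at `getStringClaim` before the walk ever reaches `jwtToken`, which is the behavior the commit describes for claims extracted from a verified token.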

Scanned a 5,465-file logs-backend slice. Caught real leaks (emails logged in an auth filter / GCP clients, a token value in a converter) but also ~38 false positives. Four fixes, each targeting one FP class, plus a converter cleanup:

- Whole-word, camelCase-aware matching instead of substrings: "pan" no longer matches "span"/"skippedPaNotEnabled", and "ssn" no longer matches "className".
- Metadata suppression: a keyword followed by a metadata word (path/type/id/class/...) is skipped (secretPath, credentialType, tokenId), while whole-name matches (userId, apiKey, secretKey) still fire.
- Logger-receiver check: logging sinks must have a logger-like receiver, so Result.error(...) / ValidationResult.error(...) are no longer sinks.
- Skip test sources.
- Drop ultra-generic SDS keywords (pan/mobile/swift) in the converter.

Result on the slice: 43 -> 19 findings, all 5 true positives kept (precision ~12% -> ~26%). Sample still 11; logs-backend#109418 still 2; 12 kernel tests pass (new precision_fixes_from_scale_test).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
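The first two fixes (whole-word camelCase matching and metadata suppression) can be sketched as a small matcher. This is a Python illustration, not the code in pii-into-sink.js: the word-splitting regex, the `KEYWORDS` set, and the function name are assumptions, and the metadata list contains only the four words the commit names. "pan" is kept in the vocabulary here purely to show that it no longer matches "span".

```python
import re

# Splits camelCase, snake_case and acronyms: "userSSNValue" -> user, SSN, Value.
WORD_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")

# Metadata words named in the commit message (its list is open-ended).
METADATA_WORDS = {"path", "type", "id", "class"}
# Hypothetical vocabulary; multi-word entries are stored joined ("userid").
KEYWORDS = {"ssn", "token", "secret", "credential", "pan",
            "userid", "apikey", "secretkey"}


def match_keyword(identifier):
    """Return the keyword an identifier matches as whole words, else None."""
    words = [w.lower() for w in WORD_RE.findall(identifier)]
    whole = "".join(words)
    if whole in KEYWORDS:  # whole-name matches always fire (userId, apiKey)
        return whole
    for i, word in enumerate(words):
        if word not in KEYWORDS:
            continue
        following = words[i + 1] if i + 1 < len(words) else None
        if following in METADATA_WORDS:
            continue  # secretPath, tokenId: describes the secret, is not the secret
        return word
    return None


for name in ["span", "className", "secretPath", "userId", "jwtToken",
             "continuationToken"]:
    print(name, "->", match_keyword(name))
```

A name like `continuationToken` still matches under these rules, since nothing syntactic separates it from a credential. That is the kind of residual semantic false positive the description hands off to the LLM layer.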

Acting on PR review feedback:

- LLM classifier (llm/classify.py): sends each deterministic finding + code context to an AI-Gateway / OpenAI-compatible endpoint for a detect/ignore verdict. Prompt + one-word DETECT/IGNORE protocol mirror Datadog's generic-secrets validator (sds-shared-library), adapted from "is it a secret literal" to "is this a real leak", and cover the log/file/db/http/LLM sink types. Caching, dry-run, and --self-test against 10 human-labeled logs-backend findings. Validated with a stand-in model: 3/3 true positives kept, 7/7 false positives removed (precision 30->100%).
- Developer "ignore" affordance: documented + demonstrated inline no-dd-sa:privacy-leak/pii-into-sink suppression (SuppressionExample.java); run-demo/view-findings exclude suppressed findings (11 active + 1 suppressed).
- PR integration: example diff-aware GitHub Actions workflow that uploads SARIF to Code Scanning, plus docs for the Datadog Code Security path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
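A minimal sketch of the second-layer filter's control flow, assuming the one-word DETECT/IGNORE protocol described above. It is not llm/classify.py: the prompt text, the finding fields, the cache key, and the choice to keep a finding when the reply is unrecognized are all assumptions, and the model is injected as a plain callable (`ask_model`) so no real endpoint or client library is implied.

```python
import hashlib
import json


def build_prompt(finding):
    """Hypothetical prompt; the real wording lives in the PR's classifier."""
    return (
        "Is this a real sensitive-data leak? Answer DETECT or IGNORE.\n"
        f"Sink: {finding.get('sink')}\nCode: {finding.get('code')}"
    )


def parse_verdict(reply):
    """True keeps the finding. Anything other than IGNORE keeps it (assumed policy)."""
    words = reply.strip().split()
    return not (words and words[0].upper().strip(".") == "IGNORE")


def classify(findings, ask_model, cache=None):
    """Return the findings the model does not reject, asking once per unique finding."""
    cache = {} if cache is None else cache
    kept = []
    for finding in findings:
        key = hashlib.sha256(
            json.dumps(finding, sort_keys=True).encode()
        ).hexdigest()
        if key not in cache:
            cache[key] = parse_verdict(ask_model(build_prompt(finding)))
        if cache[key]:
            kept.append(finding)
    return kept


# Stand-in model for the example: rejects the pagination token, accepts the rest.
def stub_model(prompt):
    return "IGNORE" if "continuationToken" in prompt else "DETECT"


FINDINGS = [
    {"file": "A.java", "line": 10, "sink": "log", "code": "log.info(jwtToken)"},
    {"file": "B.java", "line": 20, "sink": "log",
     "code": "log.info(continuationToken)"},
]
print([f["file"] for f in classify(FINDINGS, stub_model)])
```

Keeping a finding on an unrecognized reply favors recall over precision; a stricter deployment could drop or retry instead.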

- run-demo.sh / view-findings.sh take an optional DIR arg (default sample/), so the demo can point at a real repo and show real findings in the UI.
- Auto-handle the `*.datadog.yml` vs `--rules` conflict: move the repo's config aside for the scan, restore it after (keeps the target repo clean).
- Add fetch-logs-backend-slice.sh: blobless + sparse checkout of a few logs-backend domains over HTTPS for a quick real-repo demo.
- README: document scanning a real repo / logs-backend.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
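The "move the config aside, restore it after" step is done in the PR's shell scripts. The same idea, sketched here as a Python context manager so the restore also runs when the scan fails: the function name, the use of a temporary stash directory, and the default glob are assumptions for illustration.

```python
import contextlib
import shutil
import tempfile
from pathlib import Path


@contextlib.contextmanager
def config_moved_aside(repo_dir, pattern="*.datadog.yml"):
    """Temporarily move matching top-level config files out of repo_dir."""
    repo = Path(repo_dir)
    stash = Path(tempfile.mkdtemp(prefix="ddsa-config-"))
    moved = []
    try:
        # Materialize the glob first so moving files cannot disturb the iteration.
        for cfg in list(repo.glob(pattern)):
            dest = stash / cfg.name
            shutil.move(str(cfg), str(dest))
            moved.append((dest, cfg))
        yield
    finally:
        # Restore even if the scan raised, so the target repo is left as it was.
        for dest, original in moved:
            shutil.move(str(dest), str(original))
        shutil.rmtree(stash, ignore_errors=True)
```

A caller would wrap the scan invocation in `with config_moved_aside(repo):` and run the analyzer with the local rules file inside the block.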

zachchentouf and others added 4 commits June 22, 2026 14:19

zachchentouf requested a review from a team as a code owner June 22, 2026 19:26

Copilot AI review requested due to automatic review settings June 22, 2026 19:26

Copilot started reviewing on behalf of zachchentouf June 22, 2026 19:27 View session

github-advanced-security AI found potential problems Jun 22, 2026


Comment thread misc/shift-left-leak-detection/sample/python/user_service.py

Copilot AI reviewed Jun 22, 2026

zachchentouf and others added 9 commits June 23, 2026 11:31

improve Python scanning

0b46281

add LLM calls scanning

e924604

Merge pull request #929 from DataDog/origin/alexandre.fouchs/shift-le…

d82a4e6

…ft-leak-detection-poc-innovation-week improve Python scanning

add tests

58e8579

Merge pull request #930 from DataDog/origin/alexandre.fouchs/shift-le…

71a6911

…ft-leak-detection-poc-innovation-week add LLM calls scanning

zachchentouf wants to merge 14 commits into main from zach.chentouf/shift-left-leak-detection-poc-innovation-week


4 participants
