Shift-left leak detection POC (Innovation Week) #927
Open
zachchentouf wants to merge 14 commits into
Privacy Code Scanner: detect sensitive-data (PII, payment data, secrets) flows into logging sinks in Java source code at PR time, before the data leaks at runtime. Built as a local rules-file on top of the analyzer's existing Java intra-method taint engine (ddsa.getTaintSources) — no compiler changes, runs fully offline via `-r`. Self-contained under misc/shift-left-leak-detection/: editable rule sources (tree-sitter query + visit() JS), a build script that packages them, a sample Java repo (leaky + clean), a one-command demo runner, and PLAN/DESIGN/README docs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Generalize the rule from log-only to multiple sink categories: logging, database writes (prepareStatement/executeQuery/JPA save/persist/merge), and outbound HTTP (RestTemplate). Rename pii-into-log -> pii-into-sink.
- Externalize the sensitive-data vocabulary into sensitive-patterns.json, an SDS-shaped category/keyword library injected into the rule at build time, so the real SDS catalog can be dropped in without code changes.
- Add DbService/ApiClient sample files (leaky + clean cases per sink type).
- Add a self-contained local findings UI (ui/ + view-findings.sh) that renders findings and source->sink taint paths from the SARIF.

Verified: 8 findings (UserService 4 log, DbService 2 db, ApiClient 2 http); clean methods and SafeService produce zero.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Loads the actual rule sources from misc/shift-left-leak-detection/ (with the same SENSITIVE_PATTERNS injection as build-rules.sh) and runs them through the kernel JS runtime, so the tests can't drift from the shipped rule. Asserts findings on PII->log/db/http flows (including the taint path) and zero findings on clean code, string-literal-only matches, and sensitive values flowing into non-sink methods. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The ddsa taint engine (flow/java.js) is Java-only, so full taint analysis isn't available for other languages yet. Add a Python rule that demonstrates the same source->sink idea with a lighter, documented technique: direct matches plus a shallow 1-hop assignment lookback within the enclosing function. Shares the SENSITIVE_PATTERNS library with the Java rule.
- build-rules.sh now emits a 2-rule ruleset (Java + Python).
- Add a Python sample (leaky + clean) and 3 Python kernel tests.

Verified: 11 findings total (8 Java + 3 Python); clean cases stay silent.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
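The "direct match plus 1-hop assignment lookback" idea can be sketched as follows. This is only an illustration under stated assumptions: the shipped rule is tree-sitter + JS, whereas this sketch uses Python's standard `ast` module, and `SENSITIVE`, `SINKS`, `value_name` and `find_leaks` are hypothetical names, not taken from the PR.

```python
import ast

# Hypothetical stand-ins for the shared SENSITIVE_PATTERNS library and the sink list.
SENSITIVE = {"ssn", "email", "password", "token"}
SINKS = {"print", "info", "warning", "error", "debug"}


def value_name(node):
    """Name a value by its identifier: `x` -> "x", `obj.attr` -> "attr"."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        return node.attr
    return None  # literals, calls, etc. carry no name in this sketch


def is_sensitive(name):
    return name is not None and name.lower() in SENSITIVE


def find_leaks(source):
    """Return sorted (line, sensitive_name) pairs for sink calls in each function."""
    findings = []
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        # Collect simple `target = value` assignments inside the function.
        assignments = []
        for node in ast.walk(func):
            if (isinstance(node, ast.Assign) and len(node.targets) == 1
                    and isinstance(node.targets[0], ast.Name)):
                assignments.append(
                    (node.lineno, node.targets[0].id, value_name(node.value)))
        assignments.sort(key=lambda a: a[0])
        for node in ast.walk(func):
            if not (isinstance(node, ast.Call) and value_name(node.func) in SINKS):
                continue
            for arg in node.args:
                name = value_name(arg)
                if is_sensitive(name):  # direct match
                    findings.append((node.lineno, name))
                    continue
                # One hop back: the latest assignment to `name` before the sink.
                origin = None
                for lineno, target, src in assignments:
                    if target == name and lineno < node.lineno:
                        origin = src
                if is_sensitive(origin):
                    findings.append((node.lineno, origin))
    return sorted(findings)
```

The lookback deliberately stops after one assignment, so a value that passes through two intermediate variables is missed; that is the documented trade-off of the lighter technique compared with the Java taint engine.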
    def audit(account):
        # VULNERABLE: SSN attribute logged directly.
        print(account.ssn)
Replace the hand-curated pattern stand-in with a real integration: a converter (rules/sds/sds-to-patterns.py) maps an SDS rule export (SecretRule schema: rule name + default_included_keywords) onto the scanner's identifier-name heuristics. SDS value-regexes are intentionally ignored, since source code at PR time has names, not values.
- sync-sds-patterns.sh regenerates rules/src/sensitive-patterns.json from an SDS export (defaults to the checked-in schema-faithful sample); build-rules.sh can auto-sync via SDS_RULES_FILE.
- The committed library is now generated from rules/sds/sds-rules.sample.json.
- README documents fetching the live catalog (GET /api/v2/static-analysis/secrets/rules) as the only remaining manual step.

Also add the UI piece to the README quick start.

Verified: 11 findings unchanged; cargo test privacy_leak_poc passes (10).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
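A minimal sketch of what such a converter might do, assuming each exported rule is a JSON object with a `name` key, a `default_included_keywords` list, and a value regex under a key called `pattern`. Only `default_included_keywords` is named in the commit message; the other key names, the output shape, and the function name `sds_to_patterns` are assumptions for illustration.

```python
def sds_to_patterns(sds_rules):
    """Map SDS-style rules onto a {category: [keywords]} name-heuristic library.

    The value regex (assumed here to live under "pattern") is ignored on
    purpose: at PR time the scanner sees identifier names, not values.
    """
    patterns = {}
    for rule in sds_rules:
        keywords = sorted({
            kw.strip().lower()
            for kw in rule.get("default_included_keywords", [])
            if kw.strip()
        })
        if keywords:  # a rule without keywords gives the scanner nothing to match
            patterns[rule["name"]] = keywords
    return patterns
```

Normalising to lowercase and de-duplicating keeps the generated library stable across exports, which matters if the output file is committed and diffed.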
Pull request overview
Adds an Innovation Week “Privacy Code Scanner” POC that demonstrates deterministic sensitive-data → sink detection using the existing analyzer rule system (tree-sitter + JS) and Java taint engine, plus a lightweight Python variant. It ships as a self-contained demo under misc/shift-left-leak-detection/ and adds kernel unit tests to prevent the POC rule from drifting.
Changes:
- Adds POC rule sources (Java taint + Python shallow), an SDS-shaped sensitive-pattern library, and a build script that packages them into a local rules JSON.
- Adds runnable demo assets: vulnerable/clean sample repos, run-demo.sh, and a local findings viewer UI + view-findings.sh.
- Adds kernel tests that execute the actual POC rule sources through the JS runtime.
Reviewed changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| misc/shift-left-leak-detection/view-findings.sh | Generates UI findings from SARIF and serves the local viewer |
| misc/shift-left-leak-detection/ui/style.css | Styling for the local findings viewer |
| misc/shift-left-leak-detection/ui/index.html | Static UI shell for viewing findings |
| misc/shift-left-leak-detection/ui/findings.json | Checked-in sample/enriched findings payload for the UI |
| misc/shift-left-leak-detection/ui/app.js | Renders findings and taint paths in the browser |
| misc/shift-left-leak-detection/sample/src/main/java/com/example/UserService.java | Vulnerable Java logging-sink examples |
| misc/shift-left-leak-detection/sample/src/main/java/com/example/SafeService.java | Clean Java examples (false-positive guard) |
| misc/shift-left-leak-detection/sample/src/main/java/com/example/DbService.java | Vulnerable + clean DB-write sink examples |
| misc/shift-left-leak-detection/sample/src/main/java/com/example/ApiClient.java | Vulnerable + clean outbound-HTTP sink examples |
| misc/shift-left-leak-detection/sample/python/user_service.py | Vulnerable + clean Python logging examples |
| misc/shift-left-leak-detection/run-demo.sh | One-command scan + SARIF summarizer |
| misc/shift-left-leak-detection/rules/src/sensitive-patterns.json | SDS-shaped sensitive-data categories/keywords |
| misc/shift-left-leak-detection/rules/src/pii-into-sink.tsquery | Java tree-sitter query capturing method invocations |
| misc/shift-left-leak-detection/rules/src/pii-into-sink.js | Java rule logic: sink catalog + taint walk + matcher |
| misc/shift-left-leak-detection/rules/src/pii-into-sink-python.tsquery | Python tree-sitter query capturing calls |
| misc/shift-left-leak-detection/rules/src/pii-into-sink-python.js | Python rule logic: direct + shallow assignment matching |
| misc/shift-left-leak-detection/rules/privacy-leak-rules.json | Generated packaged rules file consumed by --rules |
| misc/shift-left-leak-detection/README.md | POC instructions and explanation |
| misc/shift-left-leak-detection/PLAN.md | Implementation plan (now partially out of sync with filenames) |
| misc/shift-left-leak-detection/DESIGN.md | Design/decisions/roadmap (now partially out of sync with filenames) |
| misc/shift-left-leak-detection/build-rules.sh | Packager injecting patterns + computing checksums |
| crates/static-analysis-kernel/src/analysis/ddsa_lib/privacy_leak_poc_tests.rs | Kernel tests that execute the POC rules |
| crates/static-analysis-kernel/src/analysis/ddsa_lib.rs | Wires the new test module behind cfg(test) |
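The packaging step described for build-rules.sh (inject the pattern library, base64-encode the sources, compute a checksum) could look roughly like this. It is a sketch under assumptions: the placeholder token, the output field names, and the choice to hash the base64 text are all hypothetical, and the real script is shell, not Python.

```python
import base64
import hashlib
import json

PLACEHOLDER = "__SENSITIVE_PATTERNS__"  # hypothetical injection marker in the JS source


def b64(text):
    """Base64-encode a UTF-8 string and return ASCII text."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def package_rule(name, query_src, js_src, patterns):
    """Inject the pattern library into the rule code, then encode and checksum it."""
    js = js_src.replace(PLACEHOLDER, json.dumps(patterns, sort_keys=True))
    code = b64(js)
    return {
        "name": name,                         # assumed field names throughout
        "tree_sitter_query": b64(query_src),
        "code": code,
        # Assumption: the checksum covers the encoded code, not the raw source.
        "checksum": hashlib.sha256(code.encode("ascii")).hexdigest(),
    }
```

Injecting at build time means the kernel tests and the packaged ruleset can share one pattern file, which is the drift-prevention property the PR calls out.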
Comment on lines +32 to +35
    import json, os, functools

    sarif = json.load(open(os.environ["SARIF_OUT"]))
    sample = os.environ["SAMPLE"]
Comment on lines +69 to +75
    for step in reversed(tf.get("locations", [])):
        p = step.get("location", {}).get("physicalLocation", {})
        su = p.get("artifactLocation", {}).get("uri", uri)
        sl = p.get("region", {}).get("startLine")
        if sl and sl not in seen:
            seen.add(sl)
            flow.append({"line": sl, "code": src(su, sl)})
Comment on lines +37 to +42
    @functools.lru_cache(maxsize=None)
    def lines_of(uri):
        try:
            return open(os.path.join(sample, uri)).read().splitlines()
        except OSError:
            return []

    RULES="$HERE/rules/privacy-leak-rules.json"
    SAMPLE="$HERE/sample"
    SARIF_OUT="/tmp/leak.sarif"
Comment on lines +72 to +79
    const head = document.createElement("div");
    head.className = "card-head";
    head.innerHTML =
      `<span class="badge ${f.sink}">${SINK_LABELS[f.sink] || f.sink}</span>` +
      `<span class="badge sev">${(f.severity || "warning").toLowerCase()}</span>` +
      `<span class="category">${escapeHtml(f.category || "sensitive data")}</span>` +
      `<span class="loc"><code>${escapeHtml(f.file)}:${f.line}</code></span>`;
    el.appendChild(head);
Comment on lines +71 to +74
    ┌─────────────────────┐     build-rules.sh      ┌──────────────────────────┐
    │ rules/src/          │   base64 + sha256       │ rules/                   │
    │ pii-into-log.tsquery│ ───────────────────────▶│ privacy-leak-rules.json  │
    │ pii-into-log.js     │   (assemble ruleset)    │ (the -r input)           │
Comment on lines +156 to +157
| 1 | Tree-sitter query (sink finder) | `rules/src/pii-into-log.tsquery` | Compiles; matches `logger.x(...)` calls in the sample |
| 2 | Rule logic (`visit()` + taint walk + PII heuristics) | `rules/src/pii-into-log.js` | Flags PII→log flows, ignores clean logs |
Comment on lines +54 to +60
    @functools.lru_cache(maxsize=None)
    def lines_of(uri):
        try:
            with open(os.path.join(sample, uri)) as fh:
                return fh.read().splitlines()
        except OSError:
            return []

    <main>
      <div id="filters" class="filters"></div>
      <section id="findings" class="findings"></section>
      <p id="empty" class="empty" hidden>No findings loaded. Run <code>./view-findings.sh</code>.</p>
Comment on lines +26 to +33
    // ---- Sink catalog: method name -> human label for the message ----
    const SINKS = new Map([
      // Logging (slf4j / log4j / java.util.logging / System.out|err)
      ["info", "a log statement"], ["warn", "a log statement"], ["warning", "a log statement"],
      ["error", "a log statement"], ["debug", "a log statement"], ["trace", "a log statement"],
      ["fatal", "a log statement"], ["severe", "a log statement"], ["config", "a log statement"],
      ["fine", "a log statement"], ["finer", "a log statement"], ["finest", "a log statement"],
      ["log", "a log statement"], ["println", "a log statement"], ["print", "a log statement"],
Validated against the real logs-backend#109418 leak. First pass caught the actual leak (raw JWT token + decoded payload logged) but also fired ~20 false positives on claims extracted from the *verified* token. Two fixes in pii-into-sink.js:
- Match a value's name, not its full expression text: a method_invocation matches on its method name (verify/getStringClaim), not its receiver or args, so tokenVerifier.verify(jwtToken) is no longer read as a credential.
- Laundering boundary: stop walking a taint flow at the first non-pass-through call. split/decode/substring preserve the raw secret; verify/getStringClaim produce a derived value and end the trail.

Result on that file: 2 real findings, 0 false positives (was 22). Locked in by a new kernel test (flags_raw_token_but_not_verified_claims). Sample still 11 findings; all 11 kernel tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
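The laundering boundary can be illustrated with a small backward walk over a flow. This is a sketch, not the rule's JS: the flow is modelled as a hypothetical list of `("var", name)` / `("call", method_name)` steps ordered from the sink back toward the source, `SENSITIVE_WORDS` is an invented stand-in vocabulary, and the sensitive-name check is plain substring matching for brevity. Only the pass-through names come from the commit message.

```python
# Pass-through operations named in the commit message: they keep the raw secret.
PASS_THROUGH = {"split", "decode", "substring"}
# Hypothetical stand-in for the sensitive-name vocabulary.
SENSITIVE_WORDS = ("token", "jwt", "password")


def looks_sensitive(name):
    lowered = name.lower()
    return any(word in lowered for word in SENSITIVE_WORDS)


def reaches_sensitive_source(flow):
    """Walk from the sink back toward the source.

    Each step is matched on its own name only (a call is matched on its method
    name, never on its receiver or arguments). The walk ends at the first call
    that is not a pass-through, because its result is a derived value.
    """
    for kind, name in flow:
        if looks_sensitive(name):
            return True
        if kind == "call" and name not in PASS_THROUGH:
            return False  # laundering boundary: stop following the trail
    return False
```

With this model, `log(payload)` where `payload = decode(parts)` and `parts = jwtToken.split(".")` is still reported, while `log(subject)` where `subject = claims.getStringClaim("sub")` is not, because the walk never gets past `getStringClaim`.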
Scanned a 5,465-file logs-backend slice. Caught real leaks (emails logged in an auth filter / GCP clients, a token value in a converter) but also ~38 false positives. Four fixes, each targeting one FP class:
- Whole-word, camelCase-aware matching instead of substrings: "pan" no longer matches "span"/"skippedPaNotEnabled", "ssn" not "className".
- Metadata suppression: a keyword followed by a metadata word (path/type/id/class/...) is skipped (secretPath, credentialType, tokenId), while whole-name matches (userId, apiKey, secretKey) still fire.
- Logger-receiver check: logging sinks must have a logger-like receiver, so Result.error(...) / ValidationResult.error(...) are no longer sinks.
- Skip test sources.
- Drop ultra-generic SDS keywords (pan/mobile/swift) in the converter.

Result on the slice: 43 -> 19 findings, all 5 true positives kept (precision ~12% -> ~26%). Sample still 11; logs-backend#109418 still 2; 12 kernel tests pass (new precision_fixes_from_scale_test).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
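The first two fixes (whole-word camelCase-aware matching and metadata suppression) can be sketched like this. The keyword and metadata sets below are illustrative assumptions built from the examples in the commit message, multi-word keywords such as "user id" are a guess at how whole-name matches are expressed, and the function names are hypothetical; the shipped logic lives in pii-into-sink.js.

```python
import re

# Illustrative vocabulary; "pan" is included only to show the whole-word behaviour.
KEYWORDS = {"pan", "ssn", "token", "secret", "credential", "email",
            "user id", "api key", "secret key"}
METADATA = {"path", "type", "id", "class"}  # metadata words listed in the commit


def split_words(identifier):
    """Split camelCase / snake_case into lowercase words (acronym runs stay whole)."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [p.lower() for p in parts]


def matches_sensitive(identifier):
    words = split_words(identifier)
    if " ".join(words) in KEYWORDS:
        return True  # whole-name match (userId, apiKey, secretKey) always fires
    for keyword in KEYWORDS:
        kw = keyword.split()
        for i in range(len(words) - len(kw) + 1):
            if words[i:i + len(kw)] != kw:
                continue
            nxt = words[i + len(kw)] if i + len(kw) < len(words) else None
            if nxt in METADATA:
                continue  # e.g. secretPath names a location, not the secret
            return True
    return False
```

For reference, the precision figures in the commit follow from the counts it gives: 5 true positives out of 43 findings is about 12%, and 5 out of 19 is about 26%.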
…ft-leak-detection-poc-innovation-week improve Python scanning
…ft-leak-detection-poc-innovation-week add LLM calls scanning
Acting on PR review feedback:
- LLM classifier (llm/classify.py): sends each deterministic finding + code context to an AI-Gateway / OpenAI-compatible endpoint for a detect/ignore verdict. Prompt + one-word DETECT/IGNORE protocol mirror Datadog's generic-secrets validator (sds-shared-library), adapted from "is it a secret literal" to "is this a real leak", and cover the log/file/db/http/LLM sink types. Caching, dry-run, and --self-test against 10 human-labeled logs-backend findings. Validated with a stand-in model: 3/3 true positives kept, 7/7 false positives removed (precision 30->100%).
- Developer "ignore" affordance: documented + demonstrated inline no-dd-sa:privacy-leak/pii-into-sink suppression (SuppressionExample.java); run-demo/view-findings exclude suppressed findings (11 active + 1 suppressed).
- PR integration: example diff-aware GitHub Actions workflow that uploads SARIF to Code Scanning, plus docs for the Datadog Code Security path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
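A rough sketch of the second-layer filter with a one-word DETECT/IGNORE protocol and a verdict cache. The endpoint call is abstracted into an injected `classify` callable so no real API is assumed; the finding fields (`context`, `label`), the cache key, and the fail-open default for unrecognised replies are all assumptions of this sketch, not documented behaviour of llm/classify.py.

```python
import hashlib


def parse_verdict(reply):
    """Return True to keep a finding. Only an explicit IGNORE drops it.

    Assumption: unknown or empty replies fail open (the finding is kept).
    """
    tokens = reply.strip().split()
    word = tokens[0].strip(".,!:").upper() if tokens else ""
    return word != "IGNORE"


def filter_findings(findings, classify, cache=None):
    """Keep findings the classifier does not IGNORE, caching verdicts by context."""
    cache = {} if cache is None else cache
    kept = []
    for finding in findings:
        key = hashlib.sha256(finding["context"].encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = parse_verdict(classify(finding))
        if cache[key]:
            kept.append(finding)
    return kept


def precision(findings):
    """Share of findings that are human-labelled as real leaks."""
    if not findings:
        return 0.0
    return sum(1 for f in findings if f["label"] == "leak") / len(findings)
```

On the self-test numbers quoted above, 3 true positives among 10 findings is a precision of 30%; if all 7 false positives are ignored and all 3 leaks kept, precision becomes 3/3 = 100%.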
- run-demo.sh / view-findings.sh take an optional DIR arg (default sample/), so the demo can point at a real repo and show real findings in the UI.
- Auto-handle the *.datadog.yml-vs---rules conflict: move the repo's config aside for the scan, restore it after (keeps the target repo clean).
- Add fetch-logs-backend-slice.sh: blobless + sparse checkout of a few logs-backend domains over HTTPS for a quick real-repo demo.
- README: document scanning a real repo / logs-backend.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
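The "move the config aside, restore it after" step is the kind of thing a context manager expresses well. The demo scripts are shell, so this Python version is only an illustration; the backup suffix and the function name `config_moved_aside` are hypothetical, and the glob pattern is taken as written in the commit message.

```python
import contextlib
import pathlib


@contextlib.contextmanager
def config_moved_aside(repo_dir, pattern="*.datadog.yml", suffix=".poc-bak"):
    """Temporarily rename matching config files, restoring them even on failure."""
    repo = pathlib.Path(repo_dir)
    moved = []
    try:
        for cfg in sorted(repo.glob(pattern)):
            backup = cfg.with_name(cfg.name + suffix)
            cfg.rename(backup)
            moved.append((cfg, backup))
        yield [cfg for cfg, _ in moved]  # the scan would run inside the `with` body
    finally:
        # Restore in reverse order so the target repo is left as it was found.
        for cfg, backup in reversed(moved):
            backup.rename(cfg)
```

Putting the restore in `finally` is the important design choice: a scan that crashes or is interrupted should not leave the target repository with its config missing.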
What problem are you trying to solve?
SDS detects and redacts sensitive data (PII, payment cards, secrets) at runtime. By then the leak already happened — a secret was logged, a card was written to a table. There is no Datadog way to tell a developer at PR time: "this line will create a sensitive-data leak, fix it before it ships."
What is your solution?
A deterministic Privacy Code Scanner built on the analyzer's existing Java taint engine (`ddsa.getTaintSources`). It flags risky sensitive-data → sink flows in source code:
- […] `default_included_keywords` onto identifier-name heuristics), keeping the scanner consistent with Code Security. Only the credentialed fetch of the live catalog remains a manual step.
- […] `misc/shift-left-leak-detection/` (editable rule + build script + sample repo + one-command demo + local findings UI), plus kernel unit tests asserting the rule fires.

This is a POC / Innovation Week artifact: nothing in the shipping crates changes except added test coverage.
Alternatives considered
An LLM-per-PR approach (see dd-source#262313). The deterministic route was chosen for fewer false positives (tunable, SDS-consistent), better scaling on long PRs, and no per-PR cost. The two are complementary: deterministic as a high-recall first layer, LLM as a precision filter, as captured in the roadmap.
What the reviewer should know
- […] `MethodFlow`.
- […] `misc/shift-left-leak-detection/DESIGN.md` for decisions, tradeoffs, the UI story, and the roadmap; `PLAN.md` for the implementation plan.
- `cargo test privacy_leak_poc` passes (12 tests).
Real-world validation
Tested against real Datadog Java code, not just the demo samples:
- […] logs-backend#109418): caught both real leak lines (raw JWT token + decoded payload). A naive first pass also flagged ~20 claims extracted from the verified token; modeling a laundering boundary (pass-through ops like `split`/`decode` keep data tainted; `verify`/`getClaim` end the trail) cut it to exactly the 2 real leaks, 0 false positives.

Takeaway: deterministic detection finds real leaks fast and its noise is systematic and fixable; the residual semantic FPs (e.g. a pagination `continuationToken`) are the natural handoff to an LLM second-layer filter.

🤖 Generated with Claude Code