Add HTML decoder for secret detection in HTML-formatted sources#4840
Sources like MS Teams and Confluence emit HTML rather than plain text, causing secrets split across tags or embedded in attributes to be missed. This adds an HTML decoder to the pipeline that extracts text nodes, high-signal attribute values, script/style/comment content, and code blocks. It handles syntax-highlight boundary detection, zero-width character stripping, and double-encoded HTML entity decoding.

Made-with: Cursor
2e42a11 to cd28c03
- Remove unreachable "xlink:href" map entry: the html parser splits namespace-prefixed attributes into separate Namespace/Key fields, so attr.Key is "href" (already in the map), never "xlink:href".
- Switch url.QueryUnescape to url.PathUnescape: QueryUnescape converts '+' to space per form-encoding spec, corrupting secrets that contain literal '+' characters (e.g. base64 values, API keys).

Made-with: Cursor
casey-tran
left a comment
Looks pretty good to me! Left a few comments.
pkg/decoders/html.go
Outdated
// Enabled controls whether the decoder is active. When nil, the decoder
// is always active. Inject a function that checks a feature flag to
// allow dynamic toggling without restarting the scanner.
Enabled func() bool
Is Enabled really needed if nothing by default uses this HTML package?
I think the reason to include Enabled is so we have a kill switch for EE (my understanding is that rollout will be gradual), but OSS will be able to use it out of the box.
Yes, but for EE blocking you should only need to deal with the feature flag in thog, if the intention is for OSS to not have a flag at all.
Ah okay, I hear you. I think the tradeoff is that a flag check in pipeline.go is only evaluated at startup, so after toggling in ConfigCat we would need a scanner restart for it to take effect (which may be the behavior customers expect). That would be fine for the initial release, but the Enabled callback would let us toggle per-customer via ConfigCat at runtime without restarts. I imagine something similar to this has been debated before, so it is very possible my position is known to be no good.
Confirming our offline chat, I agree with your suggestion to move this to thog.
The ConfigCat flags should be checked every 10 seconds here.
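For readers following the thread, the callback pattern under discussion can be sketched roughly as below. This is a minimal illustration, not the PR's code: the `HTMLDecoder` type here is a stand-in, `Active` is a hypothetical method name, and the `atomic.Bool` merely simulates a flag value that some background poller refreshes.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// HTMLDecoder is a stand-in for the decoder under review; only the
// Enabled field mirrors the snippet quoted in this thread.
type HTMLDecoder struct {
	// Enabled is consulted on every call. A nil value means always active.
	Enabled func() bool
}

// Active (hypothetical name) reports whether the decoder should run
// for the chunk currently being processed.
func (d *HTMLDecoder) Active() bool {
	return d.Enabled == nil || d.Enabled()
}

func main() {
	// Simulates a cached flag value that a poller updates at runtime.
	var flag atomic.Bool

	d := &HTMLDecoder{Enabled: flag.Load}
	fmt.Println(d.Active()) // flag starts off

	flag.Store(true) // flag flipped without restarting anything
	fmt.Println(d.Active())
}
```

Because the callback is evaluated per call, a flag change takes effect on the next chunk. The thread settles on doing the flag check in thog instead, which achieves the same effect if the flag is re-read periodically.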
// syntaxHighlightPrefixes lists CSS class prefixes used by syntax highlighting
// libraries. Elements with these classes mark logical line boundaries in code
// blocks where the platform (e.g. Teams) strips actual newlines.
var syntaxHighlightPrefixes = []string{"hljs-"}
Do you think it would be helpful to note that hljs- is a MS Teams specific use case? In case people need to append more to the slice over time as sources change.
I agree, I edited the comment to call out the Teams use case and guide future non-Teams additions.
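As a rough illustration of how such a prefix list might be consulted, the sketch below checks each class token of an element's `class` attribute against the slice. Only `syntaxHighlightPrefixes` comes from the quoted diff; `hasSyntaxHighlightClass` is a hypothetical helper, and the PR's actual matching logic may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// Copied from the quoted diff; append further prefixes here as other
// sources turn out to need them.
var syntaxHighlightPrefixes = []string{"hljs-"}

// hasSyntaxHighlightClass (hypothetical name) reports whether any
// whitespace-separated token in classAttr starts with a known prefix.
func hasSyntaxHighlightClass(classAttr string) bool {
	for _, class := range strings.Fields(classAttr) {
		for _, prefix := range syntaxHighlightPrefixes {
			if strings.HasPrefix(class, prefix) {
				return true
			}
		}
	}
	return false
}

func main() {
	fmt.Println(hasSyntaxHighlightClass("token hljs-keyword"))
	fmt.Println(hasSyntaxHighlightClass("plain-text"))
}
```

Matching per token rather than with a substring search avoids false positives on classes that merely contain the prefix somewhere in the middle.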
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
…rruption

- Add script/style to blockElements so they get newline boundaries instead of concatenating with adjacent inline text.
- Remove redundant `|| n.Data == "br"` since br is already in blockElements.
- Move residual entity decoding into walkNode per text node, skipping it for script/style raw-text content where the HTML parser does not decode entities.

Made-with: Cursor
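The last bullet can be pictured with a small standard-library sketch. It assumes the parser has already decoded entities once for ordinary text nodes, so a second `html.UnescapeString` pass cleans up double-encoded input, while script/style content is left untouched. `rawTextElements` and `decodeTextNode` are hypothetical names, not the PR's.

```go
package main

import (
	"fmt"
	"html"
)

// rawTextElements (hypothetical name) lists elements whose content the
// HTML parser passes through without entity decoding.
var rawTextElements = map[string]bool{"script": true, "style": true}

// decodeTextNode (hypothetical name) applies a residual entity-decoding
// pass to a text node unless its parent is a raw-text element.
func decodeTextNode(parentTag, text string) string {
	if rawTextElements[parentTag] {
		return text
	}
	return html.UnescapeString(text)
}

func main() {
	// Source was "key=a&amp;amp;b"; the parser's first pass left "key=a&amp;b".
	fmt.Println(decodeTextNode("p", "key=a&amp;b"))

	// Script content is returned verbatim.
	fmt.Println(decodeTextNode("script", `var s = "a&amp;b";`))
}
```

Deciding per text node, rather than decoding the whole extracted output at the end, is what keeps script and style content from being altered.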

Summary
- Adds an `HTML` decoder to the decoder pipeline that extracts clean text from HTML content, enabling secret detection in sources like MS Teams and Confluence that emit HTML rather than plain text.
- Extracts text nodes, high-signal attribute values (`href`, `src`, `value`, `xlink:href`, etc.), script/style/comment content, and code/pre blocks with proper token boundary preservation.
- Adds `HTML = 5` to the `DecoderType` protobuf enum and registers the decoder in `DefaultDecoders()`.
- Gated behind the `htmlDecoder` feature flag in ConfigCat.

Test plan
Made with Cursor
Note
Medium Risk
Adds a new decoder to the default decoding pipeline and extends the `DecoderType` protobuf enum, which can affect scanning output/telemetry and downstream consumers despite being gated behind a runtime feature flag.

Overview
Adds a new, feature-flagged `HTML` decoder to the default decoder chain to normalize HTML-formatted source content into scan-friendly text.

The decoder parses HTML and emits visible text plus high-signal attribute values and script/style/comment content, with heuristics for token boundaries (block elements, `hljs-*` spans), URL-unescaping, double-encoded entity cleanup, and stripping invisible Unicode characters. This introduces `DecoderType_HTML` in the protobuf enum and a new `feature.HTMLDecoderEnabled` toggle, with extensive unit tests covering real-world Teams/Confluence-style inputs.

Reviewed by Cursor Bugbot for commit c0d437a.
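To make the "stripping invisible Unicode characters" step concrete, here is a minimal sketch using `strings.Map`. The set of code points is an assumption chosen for illustration (zero-width space, non-joiner, joiner, word joiner, and BOM); the PR's actual list is not shown on this page, and `stripInvisible` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// stripInvisible (hypothetical name) drops a small, assumed set of
// zero-width code points that can split a secret without being visible.
func stripInvisible(s string) string {
	return strings.Map(func(r rune) rune {
		switch r {
		case '\u200B', '\u200C', '\u200D', '\u2060', '\uFEFF':
			return -1 // a negative return value removes the rune
		}
		return r
	}, s)
}

func main() {
	// A zero-width space sits between "AKIA" and "ABCD".
	fmt.Println(stripInvisible("AKIA\u200BABCD"))
}
```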