Skip to content

Add HTML decoder for secret detection in HTML-formatted sources#4840

Open
alafiand wants to merge 7 commits intomainfrom
dl.276-new-html-decoder
Open

Add HTML decoder for secret detection in HTML-formatted sources#4840
alafiand wants to merge 7 commits intomainfrom
dl.276-new-html-decoder

Conversation

@alafiand
Copy link
Copy Markdown

@alafiand alafiand commented Mar 25, 2026

Summary

  • Adds a new HTML decoder to the decoder pipeline that extracts clean text from HTML content, enabling secret detection in sources like MS Teams and Confluence that emit HTML rather than plain text.
  • Parses HTML to extract text nodes, high-signal attribute values (href, src, value, xlink:href, etc.), script/style/comment content, and code/pre blocks with proper token boundary preservation.
  • Handles syntax-highlight boundary detection (hljs-* classes), zero-width/invisible character stripping, URL decoding in attributes, and double-encoded HTML entity cleanup.
  • Adds HTML = 5 to the DecoderType protobuf enum and registers the decoder in DefaultDecoders().
  • feature is gated behind htmlDecoder feature flag in ConfigCat.

Test plan

  • Unit tests covering: split secrets across tags, attribute extraction, URL decoding, script/style/comment content, code blocks, syntax-highlight boundaries, zero-width character stripping, double-encoded entities, feature flag gating
  • Integration tested via thog dev deploy with live MS Teams and Confluence scanning (companion thog PR)

Made with Cursor


Note

Medium Risk
Adds a new decoder to the default decoding pipeline and extends the DecoderType protobuf enum, which can affect scanning output/telemetry and downstream consumers despite being gated behind a runtime feature flag.

Overview
Adds a new, feature-flagged HTML decoder to the default decoder chain to normalize HTML-formatted source content into scan-friendly text.

The decoder parses HTML and emits visible text plus high-signal attribute values and script/style/comment content, with heuristics for token boundaries (block elements, hljs-* spans), URL-unescaping, double-encoded entity cleanup, and stripping invisible Unicode characters. This introduces DecoderType_HTML in the protobuf enum and a new feature.HTMLDecoderEnabled toggle, with extensive unit tests covering real-world Teams/Confluence-style inputs.

Reviewed by Cursor Bugbot for commit c0d437a. Bugbot is set up for automated code reviews on this repo. Configure here.

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Mar 25, 2026

CLA assistant check
All committers have signed the CLA.

Sources like MS Teams and Confluence emit HTML rather than plain text,
causing secrets split across tags or embedded in attributes to be missed.
This adds an HTML decoder to the pipeline that extracts text nodes,
high-signal attribute values, script/style/comment content, and code blocks.
It handles syntax-highlight boundary detection, zero-width character stripping,
and double-encoded HTML entity decoding.

Made-with: Cursor
@alafiand alafiand force-pushed the dl.276-new-html-decoder branch from 2e42a11 to cd28c03 Compare April 1, 2026 17:28
alafiand and others added 2 commits April 1, 2026 12:20
- Remove unreachable "xlink:href" map entry: the html parser splits
  namespace-prefixed attributes into separate Namespace/Key fields,
  so attr.Key is "href" (already in the map), never "xlink:href".
- Switch url.QueryUnescape to url.PathUnescape: QueryUnescape converts
  '+' to space per form-encoding spec, corrupting secrets that contain
  literal '+' characters (e.g. base64 values, API keys).

Made-with: Cursor
@alafiand alafiand marked this pull request as ready for review April 1, 2026 20:15
@alafiand alafiand requested a review from a team April 1, 2026 20:15
@alafiand alafiand requested review from a team as code owners April 1, 2026 20:15
Copy link
Copy Markdown
Contributor

@casey-tran casey-tran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good to me! Left a few comments.

// Enabled controls whether the decoder is active. When nil, the decoder
// is always active. Inject a function that checks a feature flag to
// allow dynamic toggling without restarting the scanner.
Enabled func() bool
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is Enabled really needed if nothing by default uses this HTML package?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reason to include Enabled is so we have a kill switch for EE (my understanding is that rollout will be gradual), but OSS will be able to use it out of the box.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but for EE blocking, you should just need to deal with the feature flag in thog. If the intention for OSS is to not have a flag at all.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay I hear you. I think the tradeoff is that checking the flag in pipeline.go only evaluates at startup, so after toggling in configcat, we would require a scanner restart for it to take effect (which may be expected behavior from customers). That would be fine for the initial release, but the Enabled callback would let us toggle per-customer via ConfigCat at runtime without restarts. I imagine something similar to this has been debated before, so it is very possible my position is known to be no good.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirming our offline chat, I agree with your suggestion to move this to thog.

Copy link
Copy Markdown
Contributor

@casey-tran casey-tran Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ConfigCat flags should be checked every 10 seconds here.

// syntaxHighlightPrefixes lists CSS class prefixes used by syntax highlighting
// libraries. Elements with these classes mark logical line boundaries in code
// blocks where the platform (e.g. Teams) strips actual newlines.
var syntaxHighlightPrefixes = []string{"hljs-"}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it would be helpful to note that hljs- is a MS Teams specific use case? In case people need to append more to the slice over time as sources change.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I edited the comment to call out the Teams use case and guide future non-Teams additions.

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

…rruption

- Add script/style to blockElements so they get newline boundaries
  instead of concatenating with adjacent inline text.
- Remove redundant `|| n.Data == "br"` since br is already in blockElements.
- Move residual entity decoding into walkNode per text node, skipping
  it for script/style raw-text content where the HTML parser does not
  decode entities.

Made-with: Cursor
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants