Add HTML decoder for secret detection in HTML-formatted sources by alafiand · Pull Request #4840 · trufflesecurity/trufflehog

alafiand · 2026-03-25T21:00:13Z

Summary

Adds a new HTML decoder to the decoder pipeline that extracts clean text from HTML content, enabling secret detection in sources like MS Teams and Confluence that emit HTML rather than plain text.
Parses HTML to extract text nodes, high-signal attribute values (href, src, value, xlink:href, etc.), script/style/comment content, and code/pre blocks with proper token boundary preservation.
Handles syntax-highlight boundary detection (hljs-* classes), zero-width/invisible character stripping, URL decoding in attributes, and double-encoded HTML entity cleanup.
Adds HTML = 5 to the DecoderType protobuf enum and registers the decoder in DefaultDecoders().
The feature is gated behind the htmlDecoder feature flag in ConfigCat.
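As a rough illustration of the double-encoded entity cleanup mentioned in the summary, the sketch below unescapes entities repeatedly until the text stops changing. The function name `decodeEntities` and the `maxPasses` limit are assumptions made for this example; the PR's actual implementation is not shown on this page.

```go
package main

import "html"

// decodeEntities is a hypothetical helper, not the PR's code. It unescapes
// HTML entities repeatedly until the text stops changing or maxPasses is
// reached, so a double-encoded "&amp;amp;" collapses to "&".
// The pass limit is an illustrative safeguard chosen for this sketch.
func decodeEntities(s string, maxPasses int) string {
	for i := 0; i < maxPasses; i++ {
		decoded := html.UnescapeString(s)
		if decoded == s {
			// Nothing left to decode.
			break
		}
		s = decoded
	}
	return s
}
```

Calling `decodeEntities("a&amp;amp;b", 3)` takes two passes to reach `a&b`; the third pass sees no change and the loop stops.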

Test plan

Unit tests covering: split secrets across tags, attribute extraction, URL decoding, script/style/comment content, code blocks, syntax-highlight boundaries, zero-width character stripping, double-encoded entities, feature flag gating
Integration tested via thog dev deploy with live MS Teams and Confluence scanning (companion thog PR)

Made with Cursor

Note

Medium Risk
Adds a new decoder to the default decoding pipeline and extends the DecoderType protobuf enum, which can affect scanning output/telemetry and downstream consumers despite being gated behind a runtime feature flag.

Overview
Adds a new, feature-flagged HTML decoder to the default decoder chain to normalize HTML-formatted source content into scan-friendly text.

The decoder parses HTML and emits visible text plus high-signal attribute values and script/style/comment content, with heuristics for token boundaries (block elements, hljs-* spans), URL-unescaping, double-encoded entity cleanup, and stripping invisible Unicode characters. This introduces DecoderType_HTML in the protobuf enum and a new feature.HTMLDecoderEnabled toggle, with extensive unit tests covering real-world Teams/Confluence-style inputs.
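The invisible-character stripping step could look something like the following minimal sketch. The set of runes and the name `stripInvisible` are assumptions for illustration; the page does not list which characters the decoder actually removes.

```go
package main

import "strings"

// invisibleRunes is an illustrative set of zero-width characters.
// It is an assumption for this sketch, not the list used in the PR.
var invisibleRunes = map[rune]bool{
	'\u200B': true, // zero width space
	'\u200C': true, // zero width non-joiner
	'\u200D': true, // zero width joiner
	'\u2060': true, // word joiner
	'\uFEFF': true, // zero width no-break space (byte order mark)
}

// stripInvisible (hypothetical name) drops every rune in invisibleRunes.
// strings.Map removes a rune when the mapping function returns a negative value.
func stripInvisible(s string) string {
	return strings.Map(func(r rune) rune {
		if invisibleRunes[r] {
			return -1
		}
		return r
	}, s)
}
```

With this in place, a token that a rich-text editor has split with zero-width spaces is rejoined into one contiguous string before detectors see it.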

Reviewed by Cursor Bugbot for commit c0d437a. Bugbot is set up for automated code reviews on this repo.

CLAassistant · 2026-03-25T21:00:20Z

All committers have signed the CLA.

pkg/decoders/html.go

Sources like MS Teams and Confluence emit HTML rather than plain text, causing secrets split across tags or embedded in attributes to be missed. This adds an HTML decoder to the pipeline that extracts text nodes, high-signal attribute values, script/style/comment content, and code blocks. It handles syntax-highlight boundary detection, zero-width character stripping, and double-encoded HTML entity decoding.

Made-with: Cursor

pkg/decoders/html.go

- Remove unreachable "xlink:href" map entry: the html parser splits namespace-prefixed attributes into separate Namespace/Key fields, so attr.Key is "href" (already in the map), never "xlink:href".
- Switch url.QueryUnescape to url.PathUnescape: QueryUnescape converts '+' to space per form-encoding spec, corrupting secrets that contain literal '+' characters (e.g. base64 values, API keys).

Made-with: Cursor
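The difference between the two standard-library functions named in this commit can be seen in a small sketch. The wrapper names `decodeAttrPath` and `decodeAttrQuery` are hypothetical, and falling back to the raw value on a malformed escape is an assumption of this example rather than confirmed behavior of the decoder.

```go
package main

import "net/url"

// decodeAttrPath (hypothetical name) percent-decodes an attribute value with
// url.PathUnescape, which leaves literal '+' characters untouched.
// On a malformed escape it falls back to the raw input (an assumption here).
func decodeAttrPath(raw string) string {
	decoded, err := url.PathUnescape(raw)
	if err != nil {
		return raw
	}
	return decoded
}

// decodeAttrQuery shows the earlier behavior for contrast: url.QueryUnescape
// additionally turns '+' into a space, following form-encoding rules.
func decodeAttrQuery(raw string) string {
	decoded, err := url.QueryUnescape(raw)
	if err != nil {
		return raw
	}
	return decoded
}
```

For the input `token=ab%2Bcd+ef%3D%3D`, the path variant yields `token=ab+cd+ef==`, while the query variant yields `token=ab+cd ef==`, breaking the base64-style value at the space.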

casey-tran

Looks pretty good to me! Left a few comments.

casey-tran · 2026-04-02T18:26:36Z

pkg/decoders/html.go

+	// Enabled controls whether the decoder is active. When nil, the decoder
+	// is always active. Inject a function that checks a feature flag to
+	// allow dynamic toggling without restarting the scanner.
+	Enabled func() bool


Is Enabled really needed if nothing by default uses this HTML package?

I think the reason to include Enabled is so we have a kill switch for EE (my understanding is that rollout will be gradual), but OSS will be able to use it out of the box.

Yes, but for EE blocking you should only need to deal with the feature flag in thog, if the intention for OSS is to not have a flag at all.

Ah okay, I hear you. I think the tradeoff is that checking the flag in pipeline.go only evaluates at startup, so after toggling in ConfigCat we would require a scanner restart for it to take effect (which may be expected behavior from customers). That would be fine for the initial release, but the Enabled callback would let us toggle per-customer via ConfigCat at runtime without restarts. I imagine something similar to this has been debated before, so it is very possible my position is known to be no good.

Confirming our offline chat, I agree with your suggestion to move this to thog.

The ConfigCat flags should be checked every 10 seconds here.
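For readers following the thread, the runtime-toggle idea behind the Enabled callback can be sketched as below. This is only an illustration of the pattern that was discussed (and later removed from the struct); the type name `htmlDecoder`, the `Decode` signature, and the `htmlFlag` variable are all hypothetical and do not reflect the real decoder interface.

```go
package main

import "sync/atomic"

// htmlFlag stands in for a value that a flag-polling client refreshes
// periodically. It is a placeholder for this sketch only.
var htmlFlag atomic.Bool

// htmlDecoder is a hypothetical stand-in for the decoder struct.
type htmlDecoder struct {
	// Enabled is consulted on every call; nil means always active.
	Enabled func() bool
}

// active reports whether the decoder should run right now.
func (d *htmlDecoder) active() bool {
	return d.Enabled == nil || d.Enabled()
}

// Decode returns the input unchanged as a placeholder for real decoding,
// plus a boolean saying whether the decoder ran at all.
func (d *htmlDecoder) Decode(data []byte) ([]byte, bool) {
	if !d.active() {
		return nil, false
	}
	return data, true
}
```

Constructing the decoder as `&htmlDecoder{Enabled: htmlFlag.Load}` means a later `htmlFlag.Store(true)` takes effect on the next call, with no restart, whereas a check done once at pipeline construction would keep whatever value it saw at startup.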

casey-tran · 2026-04-02T18:37:40Z

pkg/decoders/html.go

+// syntaxHighlightPrefixes lists CSS class prefixes used by syntax highlighting
+// libraries. Elements with these classes mark logical line boundaries in code
+// blocks where the platform (e.g. Teams) strips actual newlines.
+var syntaxHighlightPrefixes = []string{"hljs-"}


Do you think it would be helpful to note that hljs- is a MS Teams specific use case? In case people need to append more to the slice over time as sources change.

I agree, I edited the comment to call out the Teams use case and guide future non-Teams additions.
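A minimal sketch of how a prefix slice like this might be consulted follows. Only `syntaxHighlightPrefixes` and its `"hljs-"` entry come from the diff above; the helper name `hasSyntaxHighlightClass` and the approach of splitting the class attribute on whitespace are assumptions for illustration.

```go
package main

import "strings"

// syntaxHighlightPrefixes mirrors the slice shown in the diff.
var syntaxHighlightPrefixes = []string{"hljs-"}

// hasSyntaxHighlightClass (hypothetical helper) reports whether any class in
// a space-separated class attribute starts with a known highlight prefix.
func hasSyntaxHighlightClass(classAttr string) bool {
	for _, class := range strings.Fields(classAttr) {
		for _, prefix := range syntaxHighlightPrefixes {
			if strings.HasPrefix(class, prefix) {
				return true
			}
		}
	}
	return false
}
```

Supporting another highlighter would then only require appending its class prefix to the slice, which is the kind of future addition the review comment anticipates.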

pkg/decoders/html.go

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).


pkg/decoders/html.go

…rruption

- Add script/style to blockElements so they get newline boundaries instead of concatenating with adjacent inline text.
- Remove redundant `|| n.Data == "br"` since br is already in blockElements.
- Move residual entity decoding into walkNode per text node, skipping it for script/style raw-text content where the HTML parser does not decode entities.

Made-with: Cursor

cursor bot reviewed Mar 25, 2026


alafiand force-pushed the dl.276-new-html-decoder branch from 2e42a11 to cd28c03 Compare April 1, 2026 17:28

cursor bot reviewed Apr 1, 2026


alafiand and others added 2 commits April 1, 2026 12:20

Merge branch 'main' into dl.276-new-html-decoder

86a9a9f

alafiand marked this pull request as ready for review April 1, 2026 20:15

alafiand requested a review from a team April 1, 2026 20:15

alafiand requested review from a team as code owners April 1, 2026 20:15

casey-tran reviewed Apr 2, 2026


Merge branch 'main' into dl.276-new-html-decoder

779dcef

cursor bot reviewed Apr 2, 2026


alafiand added 2 commits April 2, 2026 14:46

updated comment around syntaxHighlightPrefixes to guide future additions

38a1763

removed Enabled func from HTML struct to follow normal flag conventions

2a2997c

cursor bot reviewed Apr 3, 2026



3 participants
