Conversation
```go
case <-ctx.Done():
	ctx.Logger().Info("Context cancelled or timeout reached")
	<-done // Wait for goroutine to finish cleanup
	return ctx.Err()
```
Timeout blocks indefinitely waiting for Colly cleanup
Medium Severity
When the context timeout fires, crawlURL enters the <-ctx.Done() case and then blocks on <-done, waiting for collector.Wait() to return. However, Colly's async collector has no context-awareness and no per-request HTTP timeout configured, so collector.Wait() blocks until all in-flight HTTP requests naturally complete. Against a slow or unresponsive server, this effectively makes the --timeout flag unreliable and can cause the crawl to hang well beyond the configured duration.
```go
	if _, err := url.Parse(u); err != nil {
		return fmt.Errorf("invalid URL %q: %w", u, err)
	}
}
```
URL validation too permissive to catch invalid input
Medium Severity
The URL validation uses url.Parse, which succeeds for almost any string — including empty strings, relative paths, and bare words like "not-a-url". This means truly invalid inputs pass validation silently, leading to confusing runtime failures instead of clear init-time errors. Checking for a non-empty scheme (e.g., http or https) and host would catch these cases.
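A stricter check along the lines suggested above could look like this (`validateSeedURL` is a hypothetical name, not the PR's function):

```go
package main

import (
	"fmt"
	"net/url"
)

// validateSeedURL rejects inputs that url.Parse alone would accept:
// empty strings, relative paths, and bare words all parse "successfully"
// but have no scheme or host, so require both explicitly.
func validateSeedURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("invalid URL %q: %w", raw, err)
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return fmt.Errorf("URL %q must use http or https", raw)
	}
	if u.Host == "" {
		return fmt.Errorf("URL %q has no host", raw)
	}
	return nil
}

func main() {
	fmt.Println(validateSeedURL("https://example.com")) // <nil>
	fmt.Println(validateSeedURL("not-a-url") != nil)    // true: no scheme, rejected
	fmt.Println(validateSeedURL("") != nil)             // true: empty, rejected
}
```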
MuneebUllahKhan222
left a comment
Just need to address a couple of small changes.
```go
defer cancel()

eg, _ := errgroup.WithContext(crawlCtx)
```
Need to add this line to enforce the max number of goroutines: `eg.SetLimit(s.concurrency)`.
```go
	return
}

if err := e.Request.Visit(link); err != nil {
```
I think we should log the error even when err does not satisfy colly.AlreadyVisitedError.
```go
ctx.Logger().Error(err, "Visit failed")
}
collector.Wait() // blocks until all requests finish
close(done)
```
It should be `defer close(done)`, outside the goroutine.
```go
	s.conn.Timeout = 30
}

if s.conn.GetIgnoreRobots() {
```
I think we should remove this to avoid probable misuse.
```go
}

// request validations
collector.OnRequest(func(r *colly.Request) {
```
Not sure, but I think we can avoid this logic by doing:

```go
c := colly.NewCollector(
	colly.AllowedDomains("foo.com", "bar.com"),
)
```
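Since the seed URLs here are user-supplied rather than fixed, the list passed to `colly.AllowedDomains` would have to be derived from their hosts at startup; a hypothetical helper (`deriveAllowedDomains` is not in the PR) shows one way:

```go
package main

import (
	"fmt"
	"net/url"
)

// deriveAllowedDomains extracts the unique hosts from seed URLs so they
// can be handed to colly.AllowedDomains. Unparseable or host-less seeds
// are silently skipped here; real code would surface them as errors.
func deriveAllowedDomains(seeds []string) []string {
	seen := make(map[string]bool)
	var hosts []string
	for _, s := range seeds {
		u, err := url.Parse(s)
		if err != nil || u.Hostname() == "" {
			continue
		}
		if h := u.Hostname(); !seen[h] {
			seen[h] = true
			hosts = append(hosts, h)
		}
	}
	return hosts
}

func main() {
	fmt.Println(deriveAllowedDomains([]string{
		"https://foo.com/a",
		"https://foo.com/b",
		"http://bar.com",
		"not-a-url",
	})) // [foo.com bar.com]
}
```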
Description:
Adds a new `web` source that crawls and scans websites for exposed secrets. The source uses the Colly framework to fetch pages starting from one or more seed URLs, with configurable crawl depth, per-domain request delay, and a per-URL timeout. Link following is opt-in via `--crawl`, robots.txt is respected by default, and linked JavaScript files are enqueued alongside HTML pages since they are a common location for hardcoded credentials. Each scanned page produces a chunk carrying the page title, URL, content type, crawl depth, and a UTC timestamp in the metadata.

Checklist:
- Did you run the tests (`make test-community`)?
- Did you run the linter (`make lint`; this requires golangci-lint)?

Note
Medium Risk
Adds a new networked crawling source and CLI command that fetches and scans arbitrary URLs, introducing operational risk (traffic/timeout/robots handling) and new third-party scraping dependencies.
Overview
Adds a new `web` scan mode that can fetch one or more URLs and optionally crawl same-domain links (including linked scripts) to emit page content as chunks for secret scanning.

Wires the source end-to-end: a new CLI command and flags in `main.go`, an engine entrypoint `ScanWeb`, new `sourcespb`/`source_metadata` protobuf types (with validation stubs) and `WebConfig` for configuration, plus a Prometheus metric `web_urls_scanned`. Includes a full `pkg/sources/web` implementation with timeout/delay/depth/robots controls and comprehensive unit tests, and updates `go.mod`/`go.sum` with Colly and HTML parsing dependencies.

Reviewed by Cursor Bugbot for commit 7564078.