Conversation
```go
case <-ctx.Done():
	ctx.Logger().Info("Context cancelled or timeout reached")
	<-done // Wait for goroutine to finish cleanup
	return ctx.Err()
```
Timeout blocks indefinitely waiting for Colly cleanup
Medium Severity
When the context timeout fires, crawlURL enters the <-ctx.Done() case and then blocks on <-done, waiting for collector.Wait() to return. However, Colly's async collector has no context-awareness and no per-request HTTP timeout configured, so collector.Wait() blocks until all in-flight HTTP requests naturally complete. Against a slow or unresponsive server, this effectively makes the --timeout flag unreliable and can cause the crawl to hang well beyond the configured duration.
```go
	if _, err := url.Parse(u); err != nil {
		return fmt.Errorf("invalid URL %q: %w", u, err)
	}
}
```
URL validation too permissive to catch invalid input
Medium Severity
The URL validation uses url.Parse, which succeeds for almost any string — including empty strings, relative paths, and bare words like "not-a-url". This means truly invalid inputs pass validation silently, leading to confusing runtime failures instead of clear init-time errors. Checking for a non-empty scheme (e.g., http or https) and host would catch these cases.
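A stricter check along the lines suggested above could look like this (`validateSeedURL` is a hypothetical name, not the PR's function):

```go
package main

import (
	"fmt"
	"net/url"
)

// validateSeedURL rejects inputs that url.Parse alone would accept:
// empty strings, relative paths, and bare words all parse "successfully"
// but have no scheme or host, so require both explicitly.
func validateSeedURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("invalid URL %q: %w", raw, err)
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return fmt.Errorf("URL %q must use http or https", raw)
	}
	if u.Host == "" {
		return fmt.Errorf("URL %q has no host", raw)
	}
	return nil
}

func main() {
	fmt.Println(validateSeedURL("https://example.com")) // <nil>
	fmt.Println(validateSeedURL("not-a-url") != nil)    // true: no scheme, rejected
	fmt.Println(validateSeedURL("") != nil)             // true: empty, rejected
}
```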
MuneebUllahKhan222
left a comment
Just need to address a couple of small changes.
```go
defer cancel()

eg, _ := errgroup.WithContext(crawlCtx)
```
Need to add this line to enforce the max number of goroutines: `eg.SetLimit(s.concurrency)`.
```go
	return
}

if err := e.Request.Visit(link); err != nil {
```
I think we should log the error even when err does not satisfy colly.AlreadyVisitedError.
```go
ctx.Logger().Error(err, "Visit failed")
}
collector.Wait() // blocks until all requests finish
close(done)
```
It should be `defer close(done)`, outside the goroutine.
```go
	s.conn.Timeout = 30
}

if s.conn.GetIgnoreRobots() {
```
I think we should remove this to avoid probable misuse.
```go
}

// request validations
collector.OnRequest(func(r *colly.Request) {
```
Not sure, but I think we can avoid this logic by doing:

```go
c := colly.NewCollector(
	colly.AllowedDomains("foo.com", "bar.com"),
)
```
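Since the seed URLs here are user-supplied rather than fixed, the list passed to `colly.AllowedDomains` would have to be derived from their hosts at startup; a hypothetical helper (`deriveAllowedDomains` is not in the PR) shows one way:

```go
package main

import (
	"fmt"
	"net/url"
)

// deriveAllowedDomains extracts the unique hosts from seed URLs so they
// can be handed to colly.AllowedDomains. Unparseable or host-less seeds
// are silently skipped here; real code would surface them as errors.
func deriveAllowedDomains(seeds []string) []string {
	seen := make(map[string]bool)
	var hosts []string
	for _, s := range seeds {
		u, err := url.Parse(s)
		if err != nil || u.Hostname() == "" {
			continue
		}
		if h := u.Hostname(); !seen[h] {
			seen[h] = true
			hosts = append(hosts, h)
		}
	}
	return hosts
}

func main() {
	fmt.Println(deriveAllowedDomains([]string{
		"https://foo.com/a",
		"https://foo.com/b",
		"http://bar.com",
		"not-a-url",
	})) // [foo.com bar.com]
}
```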
Description:
Adds a new `web` source that crawls and scans websites for exposed secrets. The source uses the Colly framework to fetch pages starting from one or more seed URLs, with configurable crawl depth, per-domain request delay, and a per-URL timeout. Link following is opt-in via `--crawl`, robots.txt is respected by default, and linked JavaScript files are enqueued alongside HTML pages since they are a common location for hardcoded credentials. Each scanned page produces a chunk carrying the page title, URL, content type, crawl depth, and a UTC timestamp in the metadata.

Checklist:
- Did you run the tests (`make test-community`)?
- Did you run the linter (`make lint`; this requires golangci-lint)?

Note
Medium Risk
Adds a new networked crawling source and CLI command that fetches and scans arbitrary URLs, introducing operational risk (traffic/timeout/robots handling) and new third-party scraping dependencies.
Overview
Adds a new `web` scan mode that can fetch one or more URLs and optionally crawl same-domain links (including linked scripts) to emit page content as chunks for secret scanning.

Wires the source end-to-end: a new CLI command and flags in `main.go`, an engine entrypoint `ScanWeb`, new `sourcespb`/`source_metadata` protobuf types (with validation stubs) and `WebConfig` for configuration, plus a Prometheus metric `web_urls_scanned`. Includes a full `pkg/sources/web` implementation with timeout/delay/depth/robots controls and comprehensive unit tests, and updates `go.mod`/`go.sum` with Colly and HTML parsing dependencies.

Reviewed by Cursor Bugbot for commit 7564078.