Add blog post: 2026-04-29-Building-an-AI-Powered-Exploratory-Tester-P… by Trimble006 · Pull Request #425 · ScottLogic/blog

Trimble006 · 2026-05-12T09:56:15Z

https://trimble006.github.io/blog/2026/04/29/Building-an-AI-Powered-Exploratory-Tester-Part-1.html

Have you (please tick each box to show completion):

[ y] Added your blog post to a single category?
[ y] Added a brief summary for your post? Summaries should be roughly two sentences in length and give potential readers a good idea of the contents of your post.
[ y] Checked that the build passes?
[ y] Checked your spelling (you can use npm install followed by npx mdspell "**/{FILE_NAME}.md" --en-gb -a -n -x -t if that's your thing)
[ profile updated] Ensured that your author profile contains a profile image, and a brief description of yourself? (make it more interesting than just your job title!)
[ y] Optimised any images in your post? They should be less than 100KBytes as a general guide.
Posts are reviewed / approved by your Regional Tech Lead.

…art-1.md

csalt-scottlogic

I think overall I agree with most of your main points!

My big question - as someone who is a dev, not a tester - is: how is an AI exploratory testing agent different from automated or semi-automated fuzz testing? It obviously is different, but some parts of what you described read as if the AI was just doing adversarial fuzzing in one way or another, and that left me wondering why you weren't talking about fuzzing directly.

(Personally I am a big advocate of some degree of fuzzing in unit testing, but I'm aware that some developers really don't like it because they stick to the line that unit tests should always be deterministic. In personal projects, fuzzy unit tests have caught so many bugs for me though)
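One common way to reconcile fuzzing with the determinism objection is to seed the random generator, so every run produces the same inputs and any failure can be replayed. A minimal sketch of that idea, assuming a hypothetical `normalise_slug` function as the code under test (it is an illustration, not something taken from the post):

```python
import random
import string


def normalise_slug(raw: str) -> str:
    """Hypothetical function under test: lower-case, hyphen-separated slug."""
    cleaned = "".join(ch.lower() if ch.isalnum() else "-" for ch in raw)
    return "-".join(part for part in cleaned.split("-") if part)


def fuzz_normalise_slug(seed: int, cases: int = 500) -> int:
    """Check a few properties against random inputs generated from a fixed seed."""
    rng = random.Random(seed)  # fixed seed: the same inputs on every run
    alphabet = string.ascii_letters + string.digits + " -_/.!"
    for _ in range(cases):
        raw = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 30)))
        slug = normalise_slug(raw)
        # Property 1: normalising an already-normalised slug changes nothing.
        assert normalise_slug(slug) == slug, (seed, raw)
        # Property 2: only alphanumerics and hyphens survive.
        assert all(ch.isalnum() or ch == "-" for ch in slug), (seed, raw)
        # Property 3: no leading or trailing hyphen.
        assert not slug.startswith("-") and not slug.endswith("-"), (seed, raw)
    return cases


fuzz_normalise_slug(seed=425)
```

Because the seed and the offending input are included in the assertion message, a failure can be reproduced exactly, which addresses most of the objection to random inputs in unit tests.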

The post is very long, and as I was reading I wasn't sure it had a strong through line, a clear direction from start to end. It has the potential to tell a really engaging story, but in its current form I didn't feel you were leading me firmly in the direction you wanted to go.

Happy to chat about ways I think you could improve it and make it more engaging. I admit I haven't yet read the rest of the series, so it might be easier to do that over a call, if you're interested.

You also have some inconsistency in your header nesting that confused me slightly: under "What I've Learned" you have a few short h3-level paragraphs, and a few with bolded headings, then a longer part, but then you continue with longer sections that still all have h3-level headings - at that point are we still inside "What I've Learned" or should those be h2 level?

Oh, and finally, you'll need to update the post date when we reach the point of publishing it.

csalt-scottlogic · 2026-05-12T15:14:59Z

+
+The distinction matters because the two approaches have fundamentally different blind spots:
+
+**It tests the deployed system at the behavioural layer, not the source code.** Unit tests verify `cancelBooking()` returns the right status code. The explorer found that the API returns 200 but the frontend doesn't update: the cancellation "succeeds" but the booking stays on screen as "confirmed." That bug lives in the *integration between* components that individual tests don't cover. Copilot knows your code; the explorer knows how SaaS platforms, RBAC, multi-tenant architectures, and booking systems *typically* behave, and flags departures from those patterns based on domain experience.


Is this a good example? In most cases, this bug could be caught by unit testing the behaviour of the UI component against a mocked API service rather than by needing an integration test.
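To make the suggestion concrete, here is a rough sketch of a component-level test against a mocked API service. Everything in it is assumed for illustration: `BookingListView`, `cancel_booking`, and the booking id are hypothetical stand-ins, and the real frontend is presumably not written in Python.

```python
from unittest.mock import Mock


class BookingListView:
    """Hypothetical stand-in for the frontend component that lists bookings."""

    def __init__(self, api, bookings):
        self.api = api
        self.statuses = dict(bookings)  # booking id -> status shown on screen

    def cancel(self, booking_id):
        response = self.api.cancel_booking(booking_id)
        # The bug described in the post would correspond to omitting this update.
        if response.status_code == 200:
            self.statuses[booking_id] = "cancelled"


def test_cancel_updates_displayed_status():
    api = Mock()  # mocked API service: no network, no backend
    api.cancel_booking.return_value = Mock(status_code=200)
    view = BookingListView(api, {"b-1": "confirmed"})

    view.cancel("b-1")

    api.cancel_booking.assert_called_once_with("b-1")
    assert view.statuses["b-1"] == "cancelled"


test_cancel_updates_displayed_status()
```

A test of this shape only catches the bug if the mocked response matches what the real API returns, which is the gap an exploratory run against the deployed system can still cover.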

csalt-scottlogic · 2026-05-12T15:17:58Z

+
+**It tests the deployed system at the behavioural layer, not the source code.** Unit tests verify `cancelBooking()` returns the right status code. The explorer found that the API returns 200 but the frontend doesn't update: the cancellation "succeeds" but the booking stays on screen as "confirmed." That bug lives in the *integration between* components that individual tests don't cover. Copilot knows your code; the explorer knows how SaaS platforms, RBAC, multi-tenant architectures, and booking systems *typically* behave, and flags departures from those patterns based on domain experience.
+
+**It finds what nobody thought to test.** A developer writing tests thinks: "admin can access `/admin/settings`, member cannot." The explorer, operating without that assumption, just tries navigating a member to every URL it discovers: including ones nobody explicitly listed as protected. It found tenant admin pages accessible to regular members via slug manipulation. No unit test existed for that path because it wasn't in anyone's threat model.


In my opinion, a good developer will have already considered that and coded against it; "security by obscurity is not enough" has been drilled into us all for years now. I honestly don't think most developers would make that assumption.

csalt-scottlogic · 2026-05-12T15:27:16Z

+
+Here's what that looks like in practice. The BookingPlatform's dev team provided a complexity map: 22 pages, 91 API handlers across 5 roles, 9 feature flags, a payment engine, a content CMS, and 18 forms with 80+ fields. The full role-boundary test surface, every API handler probed from every role, is 455 combinations. A developer or architect can describe this complexity. They know the module boundaries, the entity count, the integration points. That's a technical complexity map.
+
+But technical complexity alone doesn't set the testing budget. The tester adds the business priority overlay: payments is where the revenue comes from, so even moderate technical complexity gets maximum iterations. Content CMS is technically richer than user management, but for this client the CMS is low-risk while user suspension has regulatory implications. The explorer's April 28 run against the conrol app covered roughly 25–30% of that surface in 48 minutes for $0.29. The dev team's complexity map tells you *what exists*. The tester's business overlay tells you *what matters*. Neither has the full picture alone, and the explorer's iteration budgets are where those two perspectives meet.


Typo: "conrol" should be "control" (I'm assuming)
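For reference, the 455 figure in the quoted passage is 91 API handlers multiplied by 5 roles, and 25–30% of that surface works out to roughly 113 to 137 probes. A small sketch of the arithmetic follows; the handler and role names are placeholders, since the post does not list them.

```python
from itertools import product

HANDLER_COUNT = 91  # from the quoted passage
ROLE_COUNT = 5      # from the quoted passage

# Placeholder names: the real handlers and roles are not listed in the post.
handlers = [f"handler_{i}" for i in range(1, HANDLER_COUNT + 1)]
roles = [f"role_{i}" for i in range(1, ROLE_COUNT + 1)]

# Every API handler probed from every role.
probes = list(product(roles, handlers))
total = len(probes)

# Integer arithmetic avoids floating-point rounding surprises.
covered_low = total * 25 // 100          # 25% of the surface, rounded down
covered_high = (total * 30 + 99) // 100  # 30% of the surface, rounded up

print(total, covered_low, covered_high)
```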

Add blog post: 2026-04-29-Building-an-AI-Powered-Exploratory-Tester-P…

cc524f7

…art-1.md

csalt-scottlogic requested changes May 12, 2026

Trimble006 wants to merge 1 commit into ScottLogic:gh-pages from Trimble006:post/2026-04-29-building-an-ai-powered-exploratory-tester-part-1

2 participants

