zaydabash/toolbox

prompt-eval

A behavioral evaluation system for LLM agents. Two tools in one: a real-time observability dashboard and a prompt regression suite with a CI/CD gate.

prompt-eval dashboard

What it does

  • Takes any LLM-powered app, auto-generates adversarial prompt variants, runs them, and clusters the outputs by meaning with HDBSCAN to detect behavioral drift (no string matching).
  • Surfaces four signals:
    • stability score: 1 - normalized entropy of the cluster-size distribution (how concentrated a run's behavior is).
    • inter-cluster collapse: a rise in mean cosine similarity between cluster centroids, i.e. the model losing discriminative capacity on a behavior dimension.
    • cluster count delta: a change in the number of behavioral modes between two runs, which is flagged even when the reassignment drift reads as stable.
    • variant coverage: how widely the generated variants actually span the prompt space.
  • The regression suite compares runs before and after a prompt change using the Adjusted Rand Index over the same inputs, and fails the CI build on a semantic regression rather than on a string diff.
  • Ships a GitHub Actions workflow (.github/workflows/prompt-gate.yml) that runs the gate on changes under prompts/.
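
The stability score above can be sketched in a few lines. This is a minimal illustration, not the project's code: the function name `stability_score` is hypothetical, and it assumes the entropy is normalized by the log of the number of observed clusters and that every label (including any noise label) counts as its own group.

```python
import math
from collections import Counter


def stability_score(labels):
    """Return 1 - normalized Shannon entropy of the cluster-size distribution.

    Assumption: normalization divides by log(k), where k is the number of
    distinct labels seen in this run.
    """
    counts = Counter(labels)
    k = len(counts)
    if k <= 1:
        # A single cluster (or an empty run) is treated as fully concentrated.
        return 1.0
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return 1.0 - entropy / math.log(k)
```

Under these assumptions a run whose outputs all land in one cluster scores 1.0, and a run split evenly across its clusters scores 0.0.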

Prompt gate (CI)

$ python backend/prompt_gate.py --baseline old.txt --candidate new.txt --inputs inputs.txt

Prompt behavior gate: FAIL
- reassignment drift: 1.000 (regression)
- behavioral modes: 4 -> 4 (delta +0)
- inter-cluster similarity: 0.285 -> 0.285 (delta -0.000)
behavior changes:
- shifted: 17 -> 19 outputs   "I can help with that. Here is a concise, direct answer..."
- shifted:  9 ->  7 outputs   "Sure, here is the information you asked for, summarized..."

The gate exits non-zero on a regression, so the CI build fails and the report is posted to the job summary.
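
A rough sketch of how an Adjusted Rand Index gate can produce an exit code, using only the standard library. The names `adjusted_rand_index` and `gate`, the `max_drift` threshold of 0.2, and the mapping drift = 1 - max(ARI, 0) are all assumptions made for illustration; the source does not state how `prompt_gate.py` derives its drift value.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(a, b):
    """Adjusted Rand Index between two labelings of the same inputs."""
    if len(a) != len(b):
        raise ValueError("labelings must cover the same inputs")
    total = comb(len(a), 2)
    if total == 0:
        return 1.0
    # Pairs of items that share a cluster in both, in a only, and in b only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / total
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Both partitions are trivial and identical.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def gate(baseline_labels, candidate_labels, max_drift=0.2):
    """Return a process exit code: 1 on regression, 0 otherwise.

    Assumption: drift is 1 - max(ARI, 0); max_drift is a hypothetical threshold.
    """
    ari = adjusted_rand_index(baseline_labels, candidate_labels)
    drift = 1.0 - max(ari, 0.0)
    status = "FAIL" if drift > max_drift else "PASS"
    print(f"Prompt behavior gate: {status}")
    print(f"- reassignment drift: {drift:.3f}")
    return 1 if status == "FAIL" else 0
```

A CLI wrapper would pass the result to `sys.exit(...)`, which is what makes a CI step fail. Because ARI ignores label names, renaming clusters between runs does not register as drift; only changes in which outputs group together do.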

Stack

FastAPI + React + Arize Phoenix + Context.ai, with a provider-agnostic LLM client (Claude or any OpenAI-compatible endpoint via a one-method protocol).
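
A one-method protocol of this kind might look like the sketch below. The names `LLMClient`, `complete`, `EchoClient`, and `run_variants` are hypothetical and chosen only for illustration; the project's actual interface is not shown in this README.

```python
from typing import Protocol


class LLMClient(Protocol):
    """Anything with a matching `complete` method satisfies this protocol."""

    def complete(self, prompt: str) -> str: ...


class EchoClient:
    """Toy provider used here in place of a real Claude or OpenAI-compatible client."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


def run_variants(client: LLMClient, variants: list[str]) -> list[str]:
    # The runner depends only on the protocol, never on a concrete provider.
    return [client.complete(v) for v in variants]
```

Because `Protocol` uses structural typing, a provider adapter does not need to inherit from `LLMClient`; implementing the single method is enough.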

Integrations

  • Tracing (Arize Phoenix): launches with the backend and serves its UI at http://localhost:6006. The pipeline emits OpenTelemetry spans for variant generation, the agent runner, clustering, and drift detection.
  • Conversation analytics (Context.ai): set CONTEXT_AI_API_KEY and every variant output is logged as a conversation turn, with the eval run id as the thread id. It no-ops if the key is not set.

Research basis

The clustering approach and the inter-cluster collapse metric derive from semantic evaluation research: the inter/intra cluster similarity gap is used as a signal for a model losing discriminative capacity on a behavior dimension.
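
As a rough illustration of the inter-cluster half of that signal, the sketch below computes the mean pairwise cosine similarity between cluster centroids. The function names are hypothetical, points labeled -1 (the HDBSCAN noise label) are skipped by assumption, and returning 0.0 when fewer than two clusters exist is an arbitrary choice made here.

```python
import math
from collections import defaultdict
from itertools import combinations


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroids(embeddings, labels):
    groups = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        if label != -1:  # assumption: noise points do not form a centroid
            groups[label].append(vec)
    # Average each dimension across the vectors in a cluster.
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }


def mean_inter_cluster_similarity(embeddings, labels):
    cents = list(centroids(embeddings, labels).values())
    pairs = list(combinations(cents, 2))
    if not pairs:
        return 0.0  # fewer than two clusters: no pair to compare
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

Comparing this value between a baseline run and a candidate run gives the kind of delta shown in the gate report; a rise means the centroids are moving closer together.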

Run

cp backend/.env.example backend/.env          # set LLM_API_KEY and LLM_MODEL
cd backend && pip install -r requirements.txt && uvicorn api.main:app --reload --port 8000
cd frontend && npm install && npm run dev

Alternatively, run docker compose up (backend on :8000, frontend on :5173, Phoenix on :6006).

Test

cd backend && pytest --cov=core --cov=api

The suite is fully mocked: the LLM client and the embedder are injected, so no API key or model download is needed.
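
The injection pattern can be pictured roughly as follows. Everything here is a hypothetical stand-in (`FakeClient`, `FakeEmbedder`, `embed_outputs`, and the test name); the real test doubles and pipeline entry points live in the repository and may look different.

```python
class FakeClient:
    """Returns canned replies instead of calling a provider."""

    def __init__(self, replies):
        self.replies = replies

    def complete(self, prompt):
        return self.replies[prompt]


class FakeEmbedder:
    """Deterministic stand-in for an embedding model (no download needed)."""

    def embed(self, text):
        return [float(len(text)), float(text.count(" "))]


def embed_outputs(client, embedder, prompts):
    # Hypothetical pipeline step: run each prompt, then embed the reply.
    return [embedder.embed(client.complete(p)) for p in prompts]


def test_embed_outputs_uses_injected_doubles():
    client = FakeClient({"hi": "hello there", "bye": "goodbye"})
    vectors = embed_outputs(client, FakeEmbedder(), ["hi", "bye"])
    assert vectors == [[11.0, 1.0], [7.0, 0.0]]
```

Since both collaborators are passed in as arguments, the test is deterministic and runs offline.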

Layout

backend/   FastAPI app, core eval pipeline (clustering, scoring, gate), SQLite persistence
frontend/  React dashboard and regression UI
prompts/   prompt templates
docs/      API reference (docs/api.md) and design records (docs/adr/)
