prompt-eval

A behavioral evaluation system for LLM agents. Two tools in one: a real-time observability dashboard and a prompt regression suite with a CI/CD gate.

What it does

Takes any LLM-powered app, auto-generates adversarial prompt variants, runs them, and clusters the outputs by meaning with HDBSCAN, so behavioral drift is detected semantically rather than by string matching.
Surfaces four signals:
- stability score: 1 - normalized entropy of the cluster-size distribution (how concentrated a run's behavior is).
- inter-cluster collapse: a rise in mean cosine similarity between cluster centroids, i.e. the model losing discriminative capacity on a behavior dimension.
- cluster count delta: a change in the number of behavioral modes between two runs (caught even when reassignment drift reads stable).
- variant coverage: how widely the generated variants actually span the prompt space.
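
As a rough illustration of the first two signals, here is a minimal stdlib-only sketch. The function names and the handling of edge cases (a single cluster, zero-length centroids) are assumptions made for this example, not the project's actual code, and HDBSCAN noise points are assumed to have been filtered out beforehand.

```python
import math
from itertools import combinations


def stability_score(cluster_sizes):
    """1 - normalized Shannon entropy of the cluster-size distribution."""
    sizes = [s for s in cluster_sizes if s > 0]
    if len(sizes) <= 1:
        # Assumption: one cluster (or no data) counts as fully concentrated.
        return 1.0
    total = sum(sizes)
    probs = [s / total for s in sizes]
    entropy = -sum(p * math.log(p) for p in probs)
    # Dividing by log(k) scales entropy to [0, 1] for k clusters.
    return 1.0 - entropy / math.log(len(sizes))


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_centroid_similarity(centroids):
    """Mean pairwise cosine similarity between cluster centroids."""
    pairs = list(combinations(centroids, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

Evenly sized clusters give a stability score near 0, while one dominant cluster pushes it toward 1. A rise in `mean_centroid_similarity` between two runs is what the collapse signal would flag.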
The regression suite compares runs before and after a prompt change using the Adjusted Rand Index over the same inputs, and fails the CI build on semantic regression, not a string diff.
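
For reference, the Adjusted Rand Index between two cluster assignments of the same inputs can be computed from pair counts, as in the sketch below. This is the standard ARI formula written with the standard library only. The function name is hypothetical, and how the gate turns ARI into the reported drift value and pass/fail threshold is not specified here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two labelings of the same items (1.0 = identical partitions)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both runs must label the same inputs")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    # Count item pairs that share a cluster in both runs, in run A, and in run B.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / total_pairs  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Both partitions are trivial and identical in structure.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

Because ARI depends only on which outputs are grouped together, relabeling clusters between runs does not affect it: `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` is 1.0.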
Ships a GitHub Actions workflow (.github/workflows/prompt-gate.yml) that runs the gate on changes under prompts/.

Prompt gate (CI)

$ python backend/prompt_gate.py --baseline old.txt --candidate new.txt --inputs inputs.txt

Prompt behavior gate: FAIL
- reassignment drift: 1.000 (regression)
- behavioral modes: 4 -> 4 (delta +0)
- inter-cluster similarity: 0.285 -> 0.285 (delta -0.000)
behavior changes:
- shifted: 17 -> 19 outputs   "I can help with that. Here is a concise, direct answer..."
- shifted:  9 ->  7 outputs   "Sure, here is the information you asked for, summarized..."

The gate exits non-zero on a regression, so the CI build fails and the report is posted to the job summary.
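
A minimal sketch of that exit behavior follows, assuming the report is a plain string and that the job summary is written through the `GITHUB_STEP_SUMMARY` file path that GitHub Actions sets for each step. The `finish_gate` helper is hypothetical and is not taken from `backend/prompt_gate.py`.

```python
import os


def finish_gate(report: str, regressed: bool) -> int:
    """Print the report, append it to the job summary if available, return exit code."""
    print(report)
    # GitHub Actions exposes a file path here; outside CI the variable is unset.
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if summary_path:
        with open(summary_path, "a", encoding="utf-8") as fh:
            fh.write(report + "\n")
    # Any non-zero exit code fails the CI step.
    return 1 if regressed else 0
```

A command-line entry point would pass the return value to `sys.exit`, so a regression fails the step while a clean run exits 0.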

Stack

FastAPI + React + Arize Phoenix + Context.ai, with a provider-agnostic LLM client (Claude or any OpenAI-compatible endpoint via a one-method protocol).
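
A one-method protocol of this kind might look like the sketch below. The names `LLMClient`, `complete`, `EchoClient`, and `run_variants` are assumptions made for illustration; the project's actual protocol and method name may differ.

```python
from typing import Protocol


class LLMClient(Protocol):
    """Hypothetical one-method protocol: anything with complete() qualifies."""

    def complete(self, prompt: str) -> str: ...


class EchoClient:
    """Stand-in provider for illustration; a real adapter would call an API."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


def run_variants(client: LLMClient, variants: list[str]) -> list[str]:
    # The runner depends only on the protocol, never on a concrete provider.
    return [client.complete(v) for v in variants]
```

With structural typing, a Claude adapter and an OpenAI-compatible adapter would not need a shared base class, only a matching `complete` method.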

Integrations

Tracing (Arize Phoenix): launches with the backend and serves its UI at http://localhost:6006. The pipeline emits OpenTelemetry spans for variant generation, the agent runner, clustering, and drift detection.
Conversation analytics (Context.ai): set CONTEXT_AI_API_KEY and every variant output is logged as a conversation turn, with the eval run id as the thread id. It no-ops if the key is not set.

Research basis

The clustering approach and the inter-cluster collapse metric derive from semantic evaluation research, where the gap between intra-cluster and inter-cluster similarity is used as a signal: as that gap narrows, the model is losing discriminative capacity on a behavior dimension.

Run

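cp backend/.env.example backend/.env          # set LLM_API_KEY and LLM_MODEL
# terminal 1, from the repo root: backend on :8000
cd backend && pip install -r requirements.txt && uvicorn api.main:app --reload --port 8000
# terminal 2, from the repo root: frontend on :5173
cd frontend && npm install && npm run dev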

Or docker compose up (backend :8000, frontend :5173, Phoenix :6006).

Test

cd backend && pytest --cov=core --cov=api

The suite is fully mocked: the LLM client and the embedder are injected, so no API key or model download is needed.
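
To show what injecting both dependencies can look like, here is a small pytest-style sketch. `FakeLLMClient`, `FakeEmbedder`, and `collect_embeddings` are hypothetical names invented for this example and do not come from the repository's test suite.

```python
import hashlib


class FakeLLMClient:
    """Returns canned outputs, so no API key or network call is needed."""

    def __init__(self, canned):
        self.canned = canned

    def complete(self, prompt):
        return self.canned[prompt]


class FakeEmbedder:
    """Deterministic 4-dim vectors derived from a hash; no model download."""

    def embed(self, text):
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [b / 255 for b in digest[:4]]


def collect_embeddings(client, embedder, prompts):
    # Hypothetical pipeline step: run each prompt, then embed each output.
    return [embedder.embed(client.complete(p)) for p in prompts]


def test_collect_embeddings_is_offline_and_deterministic():
    client = FakeLLMClient({"hi": "hello", "bye": "goodbye"})
    first = collect_embeddings(client, FakeEmbedder(), ["hi", "bye"])
    second = collect_embeddings(client, FakeEmbedder(), ["hi", "bye"])
    assert first == second
    assert all(len(vec) == 4 for vec in first)
```

Since both fakes are deterministic, assertions on clustering or scoring built on top of them stay stable across runs and machines.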

Layout

backend/   FastAPI app, core eval pipeline (clustering, scoring, gate), SQLite persistence
frontend/  React dashboard and regression UI
prompts/   prompt templates
docs/      API reference (docs/api.md) and design records (docs/adr/)
