A black-box optimization approach for protecting Large Language Model (LLM) system prompts against adversarial extraction attacks.
This repository implements Prompt Sensitivity Minimization (PSM), a defense mechanism that uses LLM-as-optimizer to automatically generate protective "shields" around system prompts. PSM minimizes observable leakage while preserving task utility through an iterative optimization process.
- LLM-as-Optimizer: Uses language models to generate and refine protective shields
- Multi-objective Optimization: Balances leakage minimization with utility preservation
- Comprehensive Evaluation: Tests against multiple attack types and defense strategies
- Smooth Aggregation: Implements various leakage score aggregation methods (logsumexp, top-k mean, max)
- Efficient Caching: Reduces redundant API calls through intelligent result caching
- Multiple Metrics: Supports both ROUGE-based and LLM judge evaluation metrics
- Python 3.8+
- Virtual environment (recommended)
- Clone the repository:
git clone <repository-url>
cd psm
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Configure API keys:
Create a config/config.yaml file with your LLM API configurations:
llms:
  gpt-4o-mini:
    model: "gpt-4o-mini"
    model_provider: "openai"
    api_key: "your-api-key"
    temperature: 0.7
  # Add other model configurations as needed

Generate protected system prompts using PSM:
python run.py
This will:
- Load system prompts from data/victim_prompts/
- Generate protective shields using LLM optimization
- Save optimized prompts to data/defense_prompts/
Evaluate defense mechanisms against adversarial attacks:
python experiments/evaluate_defenses.py
This will:
- Test multiple defense strategies (None, Baseline, PSM, N-gram Filter, FAKE)
- Run various attack types (RACCON, LIANG, ZHANG, etc.)
- Generate comprehensive results in results/
from run import PSMRunner, RunPSMConfig

# Configure PSM
config = RunPSMConfig(
    dataset_name="unnatural-test",
    attack_samples=50,
    validation_samples=10,
    target_dataset_samples=30,
    llm_target="gpt-5-mini",
    llm_optimizer="gpt-4o-mini",
    llm_validation="gpt-4o",
    llm_judge="gpt-4o-mini",
    n_optimization_iterations=10,
    n_initial_shields=5,
    n_shields_per_step=5,
    max_candidates_per_step=10,
    target_utility_threshold=0.9,
    target_leakage_threshold=0.65,
    leakage_aggregation="logsumexp",
    logsumexp_temperature=10.0,
)

# Run PSM
runner = PSMRunner(config)
runner.run_experiments()

from experiments.evaluate_defenses import (
    DefenseEvaluator, ExperimentConfig,
    DefenseConfig, DefenseStrategy,
    AttackType, Dataset, TargetModel
)
import asyncio

# Configure experiment
config = ExperimentConfig(
    attack_types=[AttackType.RACCON, AttackType.LIANG],
    defense_configs=[
        DefenseConfig(DefenseStrategy.PSM, Dataset.UNNATURAL_TEST, TargetModel.GPT_4O_MINI)
    ],
    max_victim_samples=30,
    max_attack_samples=60,
    use_am_metric=True,
    am_threshold=0.9,
    use_jm_metric=True,
    jm_threshold=0.7,
)

# Run evaluation
evaluator = DefenseEvaluator(config)
# run_experiments() is awaited in the original example, so it is a coroutine;
# outside an async context (for example in a plain script) drive it with asyncio.run()
asyncio.run(evaluator.run_experiments())
evaluator.save_results()
evaluator.print_results_summary()

The main optimization engine that:
- Generates initial protective shields
- Iteratively refines shields using LLM-as-optimizer
- Evaluates candidates on utility and leakage metrics
- Optimizes a multi-objective fitness function
Key Classes:
- PSM: Main optimization class
- PromptCandidate: Represents a candidate solution with scores
- PSMConfig: Configuration parameters
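As an illustration of how a multi-objective fitness of this kind could be scored, here is a minimal sketch. It assumes the fitness is negative leakage minus a weighted penalty for any utility shortfall below the threshold; the `Candidate` class, `fitness`, and `select_best` are hypothetical names for this example and are not the repository's `PromptCandidate` or `PSM` code. The default values mirror `target_utility_threshold` and `utility_penalty_weight` from the configuration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Candidate:
    """Hypothetical stand-in for a scored shield candidate."""
    shield: str
    utility: float  # higher is better, in [0, 1]
    leakage: float  # lower is better, in [0, 1]


def fitness(c: Candidate, utility_threshold: float = 0.9,
            penalty_weight: float = 100.0) -> float:
    """Assumed form: reward low leakage, penalize utility below the threshold."""
    shortfall = max(0.0, utility_threshold - c.utility)
    return -c.leakage - penalty_weight * shortfall


def select_best(candidates: Iterable[Candidate]) -> Candidate:
    """Pick the candidate with the highest fitness."""
    return max(candidates, key=fitness)
```

With a large penalty weight, a shield that leaks very little but degrades utility below the threshold still ranks below one that keeps utility intact.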
Handles:
- Generation of baseline validation inputs
- Attack input collection
- Query-answer pair creation
Evaluation metrics:
- ROUGE Recall: Measures leakage based on token overlap
- LLM Judge: Binary evaluation using LLM as judge
- Approximate Match: Threshold-based success detection
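The overlap-based metrics above can be sketched roughly as follows. This example assumes ROUGE-L recall computed on lowercased whitespace tokens and treats an attack as successful when recall reaches a threshold; the repository may use a different ROUGE variant or tokenizer, and the function names here are illustrative only.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(reference: str, candidate: str) -> float:
    """Fraction of reference tokens recovered, in order, by the candidate."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref:
        return 0.0
    return lcs_length(ref, cand) / len(ref)


def approximate_match(system_prompt: str, response: str, threshold: float = 0.9) -> bool:
    """Threshold-based success detection: True if the response leaks enough of the prompt."""
    return rouge_l_recall(system_prompt, response) >= threshold
```

Recall is measured against the system prompt, so a long response that embeds the full prompt still scores 1.0.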
Comprehensive evaluation framework:
- Supports multiple defense strategies
- Tests various attack types
- Implements result caching
- Generates detailed reports
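Result caching of this kind is often implemented by keying each model call on a hash of its inputs. The sketch below is one way to do that with a JSON file; `ResultCache` and its methods are hypothetical and do not describe the repository's actual cache format or location.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Callable


class ResultCache:
    """Hypothetical JSON-backed cache keyed by (model, system prompt, attack)."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)
        # Load earlier results if the cache file already exists
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}

    @staticmethod
    def key(model: str, system_prompt: str, attack: str) -> str:
        payload = json.dumps([model, system_prompt, attack])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, model: str, system_prompt: str, attack: str,
                       compute: Callable[[], Any]) -> Any:
        k = self.key(model, system_prompt, attack)
        if k not in self.data:
            # Only call the (expensive) function on a cache miss, then persist
            self.data[k] = compute()
            self.path.write_text(json.dumps(self.data))
        return self.data[k]
```

Hashing the full input tuple means that changing the shield, the attack, or the target model invalidates the entry automatically.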
- None: No defense (baseline)
- Baseline: Simple instruction addition (Liang et al., 2024)
- PSM: Prompt Sensitivity Minimization (this work)
- N-gram Filter: Filters responses containing prompt n-grams
- FAKE: Decoy prompt insertion (Liang et al., 2024)
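To make the N-gram Filter strategy concrete, here is a minimal sketch of an output filter. It assumes word-level 5-grams on lowercased whitespace tokens and a fixed refusal string; the n-gram size, tokenization, and replacement text used in the repository are not stated here, so treat these as placeholders.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """All contiguous n-token windows; empty if there are fewer than n tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def filter_response(system_prompt: str, response: str, n: int = 5,
                    refusal: str = "I can't share that.") -> str:
    """Replace the response if it shares any n-gram with the system prompt."""
    prompt_grams = ngrams(system_prompt.lower().split(), n)
    response_grams = ngrams(response.lower().split(), n)
    return refusal if prompt_grams & response_grams else response
```

A filter like this only catches verbatim overlap, so paraphrased or translated leaks pass through unchanged.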
Supported attack datasets:
- raccon.json: RACCON attack prompts
- liang.json: Liang et al. attack patterns
- zhang.json: Zhang et al. attack patterns
- ours.json: Custom attack prompts
- raccon_language.json: Language-specific RACCON attacks
from dataclasses import dataclass

@dataclass
class RunPSMConfig:
    # Dataset
    dataset_name: str = "unnatural-test"
    # Sample sizes
    attack_samples: int = 50  # Attack inputs to use
    validation_samples: int = 10  # Baseline validation queries
    target_dataset_samples: int = 30  # System prompts to process
    # Model selection
    llm_target: str = "gpt-5-mini"  # Target model to protect
    llm_optimizer: str = "gpt-4o-mini"  # Model for shield generation
    llm_validation: str = "gpt-4o"  # Model for validation
    llm_judge: str = "gpt-4o-mini"  # Model for judging
    # Optimization parameters
    n_optimization_iterations: int = 10  # Max optimization iterations
    n_initial_shields: int = 5  # Initial shield population
    n_shields_per_step: int = 5  # New shields per iteration
    max_candidates_per_step: int = 10  # Max candidates to consider
    # Thresholds
    target_utility_threshold: float = 0.9  # Minimum utility score
    target_leakage_threshold: float = 0.65  # Maximum leakage score
    utility_penalty_weight: float = 100.0  # Weight for utility penalty
    # Leakage aggregation
    leakage_aggregation: str = "logsumexp"  # Aggregation method
    logsumexp_temperature: float = 10.0  # LogSumExp temperature
    top_k: int = 3  # Top-k for top_k_mean

- max: Maximum leakage score (non-smooth)
- logsumexp: Smooth approximation of max (recommended)
- top_k_mean: Average of top-k worst scores
- mean: Simple average (smoothest)
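The four aggregation methods can be compared with a short sketch. It assumes per-attack leakage scores in [0, 1] and uses the mean-normalized form of LogSumExp, (1/t) * log(mean(exp(t * s))), which lies between the mean and the max; the repository's exact formula is not shown here and may differ (for example, by omitting the normalization). The function name `aggregate_leakage` is illustrative.

```python
import math
from typing import Sequence


def aggregate_leakage(scores: Sequence[float], method: str = "logsumexp",
                      temperature: float = 10.0, top_k: int = 3) -> float:
    """Collapse per-attack leakage scores into one number (assumed definitions)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if method == "max":
        return max(scores)
    if method == "mean":
        return sum(scores) / len(scores)
    if method == "top_k_mean":
        worst = sorted(scores, reverse=True)[:top_k]
        return sum(worst) / len(worst)
    if method == "logsumexp":
        # Subtract the max before exponentiating for numerical stability
        m = max(scores)
        total = sum(math.exp(temperature * (s - m)) for s in scores)
        return m + math.log(total / len(scores)) / temperature
    raise ValueError(f"unknown aggregation method: {method}")
```

Under this definition, raising `logsumexp_temperature` pushes the result toward the worst-case score, while lowering it moves the result toward the mean.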
config = ExperimentConfig(
    # Attack and defense selection
    attack_types=[AttackType.RACCON, AttackType.LIANG],
    defense_configs=[...],
    # Sample limits
    max_victim_samples=30,
    max_attack_samples=60,
    # AM metric
    use_am_metric=True,
    am_threshold=0.9,
    # JM metric
    use_jm_metric=True,
    jm_threshold=0.7,
    # Output options
    output_dir="results",
    save_detailed_results=True,
    save_per_prompt_results=True
)

This section provides detailed instructions for reproducing the experimental results from this repository.
- Set up API Keys:
Create config/config.yaml with your LLM API configurations:
llms:
  gpt-4o-mini:
    model: "gpt-4o-mini"
    model_provider: "openai"
    api_key: "your-api-key"
    temperature: 0
  gpt-4o:
    model: "gpt-4o"
    model_provider: "openai"
    api_key: "your-api-key"
    temperature: 0
  gpt-5-mini:
    model: "gpt-5-mini"
    model_provider: "openai"
    api_key: "your-api-key"
    temperature: 0
  sentence-transformers/all-MiniLM-L6-v2:
    model_provider: "huggingface"

- Verify Data Files: Ensure the following data files exist:
  - data/victim_prompts/unnatural-test.jsonl
  - data/victim_prompts/syntentic-system-prompt.jsonl
  - data/attack_prompts/raccon.json
  - data/attack_prompts/raccon_language.json
  - data/attack_prompts/liang.json
  - data/attack_prompts/zhang.json
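The file check can be automated with a small helper. This is a sketch, not a script shipped with the repository: `missing_files` is a hypothetical function, and the paths are copied verbatim from the list above and resolved relative to the repository root.

```python
from pathlib import Path
from typing import List, Sequence

# Paths copied from the list above, relative to the repository root
REQUIRED_FILES = [
    "data/victim_prompts/unnatural-test.jsonl",
    "data/victim_prompts/syntentic-system-prompt.jsonl",
    "data/attack_prompts/raccon.json",
    "data/attack_prompts/raccon_language.json",
    "data/attack_prompts/liang.json",
    "data/attack_prompts/zhang.json",
]


def missing_files(root: str = ".", required: Sequence[str] = REQUIRED_FILES) -> List[str]:
    """Return the required paths that do not exist as regular files under root."""
    base = Path(root)
    return [p for p in required if not (base / p).is_file()]
```

Calling `missing_files()` from the repository root returns an empty list when everything is in place.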
Run the PSM optimization process:
python run.py
Expected Output:
- Processed prompts saved to data/defense_prompts/
- Filename format: psm_target_{model}_dataset_{dataset}.jsonl
- Each entry contains:
  - instruction: Optimized prompt with protective shield
  - original_instruction: Original system prompt
  - utility_score: Utility preservation score
  - leakage_score: Leakage minimization score
  - fitness_score: Combined fitness score
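For a quick look at a generated file, the entries can be parsed line by line and averaged. The sketch below assumes one JSON object per line with the score fields named above; `load_entries` and `mean_scores` are illustrative helpers, not part of the repository.

```python
import json
from typing import Dict, Iterable, List

SCORE_FIELDS = ("utility_score", "leakage_score", "fitness_score")


def load_entries(lines: Iterable[str]) -> List[dict]:
    """Parse JSONL content, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def mean_scores(entries: List[dict]) -> Dict[str, float]:
    """Average each score field across all entries."""
    if not entries:
        raise ValueError("no entries to summarize")
    return {f: sum(e[f] for e in entries) / len(entries) for f in SCORE_FIELDS}
```

Because `load_entries` accepts any iterable of lines, an open file handle for one of the output files can be passed to it directly.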
Default Configuration:
- n_optimization_iterations: 10
- n_initial_shields: 5
- n_shields_per_step: 5
- attack_samples: 50
- validation_samples: 10
- leakage_aggregation: "logsumexp"
Evaluate defenses against attacks:
python experiments/evaluate_defenses.py
Expected Output:
- Results saved to the results/ directory
- Excel files with detailed attack results
- JSON summary of all experiments
- Cached results for faster subsequent runs
For programmatic control, modify the configuration in run.py:
config = RunPSMConfig(
    dataset_name="unnatural-test",
    attack_samples=50,
    validation_samples=10,
    target_dataset_samples=30,
    llm_target="gpt-5-mini",
    llm_optimizer="gpt-4o-mini",
    llm_validation="gpt-4o",
    llm_judge="gpt-4o-mini",
    n_optimization_iterations=10,
    leakage_aggregation="logsumexp",
    logsumexp_temperature=10.0
)