Prompt Sensitivity Minimization (PSM)

A black-box optimization approach for protecting Large Language Model (LLM) system prompts against adversarial extraction attacks.

Overview

This repository implements Prompt Sensitivity Minimization (PSM), a defense mechanism that uses an LLM as the optimizer to automatically generate protective "shields" around system prompts. Through iterative optimization, PSM minimizes observable leakage of the system prompt while preserving task utility.

Key Features

  • LLM-as-Optimizer: Uses language models to generate and refine protective shields
  • Multi-objective Optimization: Balances leakage minimization with utility preservation
  • Comprehensive Evaluation: Tests against multiple attack types and defense strategies
  • Smooth Aggregation: Implements various leakage score aggregation methods (logsumexp, top-k mean, max)
  • Efficient Caching: Reduces redundant API calls through intelligent result caching
  • Multiple Metrics: Supports both ROUGE-based and LLM judge evaluation metrics

Installation

Prerequisites

  • Python 3.8+
  • Virtual environment (recommended)

Setup

  1. Clone the repository:
git clone <repository-url>
cd psm
  2. Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Configure API keys: Create a config/config.yaml file with your LLM API configurations:
llms:
  gpt-4o-mini:
    model: "gpt-4o-mini"
    model_provider: "openai"
    api_key: "your-api-key"
    temperature: 0.7
  # Add other model configurations as needed

Quick Start

Running PSM Defense

Generate protected system prompts using PSM:

python run.py

This will:

  1. Load system prompts from data/victim_prompts/
  2. Generate protective shields using LLM optimization
  3. Save optimized prompts to data/defense_prompts/

Evaluating Defenses

Evaluate defense mechanisms against adversarial attacks:

python experiments/evaluate_defenses.py

This will:

  1. Test multiple defense strategies (None, Baseline, PSM, N-gram Filter, FAKE)
  2. Run various attack types (RACCON, LIANG, ZHANG, etc.)
  3. Generate comprehensive results in results/

Usage

Programmatic Usage

Running PSM for a Single Prompt

from run import PSMRunner, RunPSMConfig

# Configure PSM
config = RunPSMConfig(
    dataset_name="unnatural-test",
    attack_samples=50,
    validation_samples=10,
    target_dataset_samples=30,
    llm_target="gpt-5-mini",
    llm_optimizer="gpt-4o-mini",
    llm_validation="gpt-4o",
    llm_judge="gpt-4o-mini",
    n_optimization_iterations=10,
    n_initial_shields=5,
    n_shields_per_step=5,
    max_candidates_per_step=10,
    target_utility_threshold=0.9,
    target_leakage_threshold=0.65,
    leakage_aggregation="logsumexp",
    logsumexp_temperature=10.0
)

# Run PSM
runner = PSMRunner(config)
runner.run_experiments()

Evaluating a Defense Strategy

from experiments.evaluate_defenses import (
    DefenseEvaluator, ExperimentConfig, 
    DefenseConfig, DefenseStrategy, 
    AttackType, Dataset, TargetModel
)

# Configure experiment
config = ExperimentConfig(
    attack_types=[AttackType.RACCON, AttackType.LIANG],
    defense_configs=[
        DefenseConfig(DefenseStrategy.PSM, Dataset.UNNATURAL_TEST, TargetModel.GPT_4O_MINI)
    ],
    max_victim_samples=30,
    max_attack_samples=60,
    use_am_metric=True,
    am_threshold=0.9,
    use_jm_metric=True,
    jm_threshold=0.7
)

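# Run evaluation. run_experiments() is awaited, so it needs an event loop;
# asyncio.run() provides one when this is executed as a plain script.
import asyncio

async def main():
    evaluator = DefenseEvaluator(config)
    await evaluator.run_experiments()
    evaluator.save_results()
    evaluator.print_results_summary()

asyncio.run(main())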

Architecture

Core Components

1. PSM (src/defense/psm/psm.py)

The main optimization engine that:

  • Generates initial protective shields
  • Iteratively refines shields using LLM-as-optimizer
  • Evaluates candidates on utility and leakage metrics
  • Optimizes a multi-objective fitness function
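
The loop described by these bullets can be sketched roughly as follows. This is an illustrative outline, not the code in src/defense/psm/psm.py: `optimize_shield`, `propose`, `evaluate`, and `cost` are hypothetical names, and the two callables stand in for the LLM calls that generate shields and score them. The keyword defaults mirror the values in RunPSMConfig, and the sketch assumes that lower cost is better.

```python
from typing import Callable, List, Tuple

# (cost, shield, utility, leakage); lower cost is better in this sketch.
Entry = Tuple[float, str, float, float]


def optimize_shield(
    propose: Callable[[List[Entry], int], List[str]],   # hypothetical: asks the optimizer LLM for shields
    evaluate: Callable[[str], Tuple[float, float]],     # hypothetical: returns (utility, leakage)
    cost: Callable[[float, float], float],              # hypothetical: combines both into one number
    n_optimization_iterations: int = 10,
    n_initial_shields: int = 5,
    n_shields_per_step: int = 5,
    max_candidates_per_step: int = 10,
    target_utility_threshold: float = 0.9,
    target_leakage_threshold: float = 0.65,
) -> Entry:
    pool: List[Entry] = []
    # Step 0 builds the initial population; later steps refine it.
    for step in range(n_optimization_iterations + 1):
        n_new = n_initial_shields if step == 0 else n_shields_per_step
        # The current pool is passed back so the optimizer can learn from it.
        for shield in propose(list(pool), n_new):
            utility, leakage = evaluate(shield)
            pool.append((cost(utility, leakage), shield, utility, leakage))
        if not pool:
            raise ValueError("propose() returned no shields")
        # Keep only the best candidates for the next round.
        pool.sort(key=lambda entry: entry[0])
        del pool[max_candidates_per_step:]
        best = pool[0]
        # Stop early once the best candidate meets both thresholds.
        if best[2] >= target_utility_threshold and best[3] <= target_leakage_threshold:
            break
    return best
```

Passing the scoring functions in as arguments keeps the loop itself free of API calls, which makes it easy to exercise with deterministic stubs.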

Key Classes:

  • PSM: Main optimization class
  • PromptCandidate: Represents a candidate solution with scores
  • PSMConfig: Configuration parameters
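
As a rough illustration of how a candidate's two scores could be folded into one fitness value, the sketch below penalizes any utility shortfall below the threshold and otherwise ranks by leakage. The formula is an assumption inferred from the `utility_penalty_weight` and `target_utility_threshold` parameters, not taken from the repository; `Candidate` is a hypothetical stand-in for PromptCandidate, and lower fitness is assumed to be better.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical stand-in for PromptCandidate; the field names are assumptions."""
    shield: str
    utility: float  # in [0, 1], higher is better
    leakage: float  # aggregated leakage, lower is better


def fitness(
    candidate: Candidate,
    target_utility_threshold: float = 0.9,
    utility_penalty_weight: float = 100.0,
) -> float:
    """Assumed penalty form: leakage plus a weighted utility shortfall."""
    shortfall = max(0.0, target_utility_threshold - candidate.utility)
    return candidate.leakage + utility_penalty_weight * shortfall


candidates = [
    Candidate("shield A", utility=0.95, leakage=0.30),
    Candidate("shield B", utility=0.80, leakage=0.10),  # leaks less but hurts utility
]
best = min(candidates, key=fitness)
print(best.shield)
```

With a penalty weight as large as 100, a candidate that falls below the utility threshold is effectively ruled out no matter how little it leaks, which matches the stated goal of preserving task utility.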

2. Data Creation (src/defense/psm/psm_data_creation.py)

Handles:

  • Generation of baseline validation inputs
  • Attack input collection
  • Query-answer pair creation

3. Metrics (src/metrics.py)

Evaluation metrics:

  • ROUGE Recall: Measures leakage based on token overlap
  • LLM Judge: Binary evaluation using LLM as judge
  • Approximate Match: Threshold-based success detection
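
To make the overlap-based metrics concrete, here is a small self-contained sketch of ROUGE-L recall and a threshold-based approximate match. It assumes whitespace tokenization, lowercasing, and the ROUGE-L variant (longest common subsequence); src/metrics.py may use a different ROUGE variant or a library implementation. The function names are hypothetical.

```python
from typing import List


def _lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(system_prompt: str, response: str) -> float:
    """Fraction of system-prompt tokens recovered, in order, by the response."""
    reference = system_prompt.lower().split()
    hypothesis = response.lower().split()
    if not reference:
        return 0.0
    return _lcs_length(reference, hypothesis) / len(reference)


def approximate_match(system_prompt: str, response: str, threshold: float = 0.9) -> bool:
    """Count an attack as successful when recall reaches the threshold."""
    return rouge_l_recall(system_prompt, response) >= threshold
```

Recall is measured against the system prompt rather than the response, so a long response that happens to embed the full prompt still scores 1.0.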

4. Evaluation (experiments/evaluate_defenses.py)

Comprehensive evaluation framework:

  • Supports multiple defense strategies
  • Tests various attack types
  • Implements result caching
  • Generates detailed reports
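
One common way to implement this kind of result caching is to key each stored result on a hash of the inputs that determine it. The sketch below illustrates that idea only; `ResultCache`, the key fields, and the on-disk layout are assumptions, not the scheme used in experiments/evaluate_defenses.py.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Callable


class ResultCache:
    """Hypothetical file-backed cache keyed on a hash of the request fields."""

    def __init__(self, cache_dir: str) -> None:
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, key_fields: dict) -> Path:
        # sort_keys makes the hash independent of argument order.
        blob = json.dumps(key_fields, sort_keys=True).encode("utf-8")
        return self.cache_dir / (hashlib.sha256(blob).hexdigest() + ".json")

    def get_or_compute(self, compute: Callable[[], Any], **key_fields: str) -> Any:
        path = self._path(key_fields)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
        result = compute()  # e.g. the API call to the target model
        path.write_text(json.dumps(result), encoding="utf-8")
        return result
```

A call such as `cache.get_or_compute(run_attack, model="gpt-4o-mini", defense="PSM", attack="...")` (with `run_attack` a hypothetical zero-argument function) would then hit the API only the first time that combination is seen.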

Defense Strategies

  1. None: No defense (baseline)
  2. Baseline: Simple instruction addition (Liang et al., 2024)
  3. PSM: Prompt Sensitivity Minimization (this work)
  4. N-gram Filter: Filters responses containing prompt n-grams
  5. FAKE: Decoy prompt insertion (Liang et al., 2024)
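
The N-gram Filter strategy can be illustrated with a short output filter that withholds any response sharing a word n-gram with the system prompt. This is a sketch of the general technique; the n-gram length of 5, whitespace tokenization, and the refusal text are assumptions, and the function names are hypothetical.

```python
from typing import List, Set, Tuple


def word_ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """All contiguous n-token windows; empty when the text is shorter than n."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_filter(
    system_prompt: str,
    response: str,
    n: int = 5,  # assumed window size
    refusal: str = "Sorry, I can't share that.",
) -> str:
    """Return the response unless it shares an n-gram with the system prompt."""
    prompt_grams = word_ngrams(system_prompt.lower().split(), n)
    response_grams = word_ngrams(response.lower().split(), n)
    if prompt_grams & response_grams:
        return refusal
    return response
```

A filter like this only catches verbatim overlap, so it does not stop paraphrased or translated leaks, which is one reason to evaluate it against varied attack types.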

Attack Types

Supported attack datasets:

  • raccon.json: RACCON attack prompts
  • liang.json: Liang et al. attack patterns
  • zhang.json: Zhang et al. attack patterns
  • ours.json: Custom attack prompts
  • raccon_language.json: Language-specific RACCON attacks

Configuration

PSM Configuration Parameters

@dataclass
class RunPSMConfig:
    # Dataset
    dataset_name: str = "unnatural-test"
    
    # Sample sizes
    attack_samples: int = 50              # Attack inputs to use
    validation_samples: int = 10          # Baseline validation queries
    target_dataset_samples: int = 30      # System prompts to process
    
    # Model selection
    llm_target: str = "gpt-5-mini"       # Target model to protect
    llm_optimizer: str = "gpt-4o-mini"   # Model for shield generation
    llm_validation: str = "gpt-4o"       # Model for validation
    llm_judge: str = "gpt-4o-mini"       # Model for judging
    
    # Optimization parameters
    n_optimization_iterations: int = 10  # Max optimization iterations
    n_initial_shields: int = 5           # Initial shield population
    n_shields_per_step: int = 5          # New shields per iteration
    max_candidates_per_step: int = 10    # Max candidates to consider
    
    # Thresholds
    target_utility_threshold: float = 0.9    # Minimum utility score
    target_leakage_threshold: float = 0.65  # Maximum leakage score
    utility_penalty_weight: float = 100.0   # Weight for utility penalty
    
    # Leakage aggregation
    leakage_aggregation: str = "logsumexp"   # Aggregation method
    logsumexp_temperature: float = 10.0     # LogSumExp temperature
    top_k: int = 3                          # Top-k for top_k_mean

Leakage Aggregation Methods

  1. max: Maximum leakage score (non-smooth)
  2. logsumexp: Smooth approximation of max (recommended)
  3. top_k_mean: Average of top-k worst scores
  4. mean: Simple average (smoothest)
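
The four aggregation methods can be compared with a small sketch like the one below. The logsumexp form used here, (1/T) · log Σ exp(T · s_i), is the standard smooth upper bound on the maximum; whether the repository normalizes it differently (for example by dividing the sum by the number of scores) is not stated, so treat the exact formula and the function name `aggregate_leakage` as assumptions.

```python
import math
from typing import Sequence


def aggregate_leakage(
    scores: Sequence[float],
    method: str = "logsumexp",
    temperature: float = 10.0,
    top_k: int = 3,
) -> float:
    """Collapse per-attack leakage scores into a single number."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if method == "max":
        return max(scores)
    if method == "mean":
        return sum(scores) / len(scores)
    if method == "top_k_mean":
        worst = sorted(scores, reverse=True)[:top_k]
        return sum(worst) / len(worst)
    if method == "logsumexp":
        # (1/T) * log(sum(exp(T * s))), computed stably by factoring out the max.
        peak = max(scores)
        total = sum(math.exp(temperature * (s - peak)) for s in scores)
        return peak + math.log(total) / temperature
    raise ValueError(f"unknown aggregation method: {method}")
```

Under this form the result always lies between max(scores) and max(scores) + ln(n)/T, so a higher temperature tracks the worst-case attack more closely, while a lower one spreads weight across more attacks.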

Experiment Configuration

config = ExperimentConfig(
    # Attack and defense selection
    attack_types=[AttackType.RACCON, AttackType.LIANG],
    defense_configs=[...],
    
    # Sample limits
    max_victim_samples=30,
    max_attack_samples=60,
    
    # AM (Approximate Match) metric: threshold on the overlap score
    use_am_metric=True,
    am_threshold=0.9,
    
    # JM (LLM judge) metric
    use_jm_metric=True,
    jm_threshold=0.7,
    
    # Output options
    output_dir="results",
    save_detailed_results=True,
    save_per_prompt_results=True
)

Reproducibility

This section provides detailed instructions for reproducing the experimental results from this repository.

Configuration

  1. Set up API Keys: Create config/config.yaml with your LLM API configurations:
llms:
  gpt-4o-mini:
    model: "gpt-4o-mini"
    model_provider: "openai"
    api_key: "your-api-key"
    temperature: 0
  gpt-4o:
    model: "gpt-4o"
    model_provider: "openai"
    api_key: "your-api-key"
    temperature: 0
  gpt-5-mini:
    model: "gpt-5-mini"
    model_provider: "openai"
    api_key: "your-api-key"
    temperature: 0
  sentence-transformers/all-MiniLM-L6-v2:
    model_provider: "huggingface"
  2. Verify Data Files: Ensure the following data files exist:
  • data/victim_prompts/unnatural-test.jsonl
  • data/victim_prompts/syntentic-system-prompt.jsonl
  • data/attack_prompts/raccon.json
  • data/attack_prompts/raccon_language.json
  • data/attack_prompts/liang.json
  • data/attack_prompts/zhang.json
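
A quick way to check the files listed above before starting a long run is a short script like this one. It is a convenience sketch, not part of the repository; `missing_data_files` is a hypothetical helper, and the paths are copied verbatim from the list.

```python
from pathlib import Path
from typing import List

REQUIRED_FILES = [
    "data/victim_prompts/unnatural-test.jsonl",
    "data/victim_prompts/syntentic-system-prompt.jsonl",
    "data/attack_prompts/raccon.json",
    "data/attack_prompts/raccon_language.json",
    "data/attack_prompts/liang.json",
    "data/attack_prompts/zhang.json",
]


def missing_data_files(repo_root: str = ".") -> List[str]:
    """Return the required paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing data files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All required data files are present.")
```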

Reproducing PSM Defense Generation

Run the PSM optimization process:

python run.py

Expected Output:

  • Processed prompts saved to data/defense_prompts/
  • Filename format: psm_target_{model}_dataset_{dataset}.jsonl
  • Each entry contains:
    • instruction: Optimized prompt with protective shield
    • original_instruction: Original system prompt
    • utility_score: Utility preservation score
    • leakage_score: Aggregated leakage score (lower means less of the prompt leaks)
    • fitness_score: Combined fitness score
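
The output file can be inspected with a few lines of Python. The sketch below assumes one JSON object per line with the fields listed above; `load_defense_prompts` and `summarize` are hypothetical helpers, and the path in the usage note is only an example of the filename pattern.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional


def load_defense_prompts(path: str) -> List[dict]:
    """Read a JSONL file, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize(records: List[dict]) -> Dict[str, Optional[float]]:
    """Average the utility and leakage scores across all optimized prompts."""
    count = len(records)
    if count == 0:
        return {"count": 0, "mean_utility": None, "mean_leakage": None}
    return {
        "count": count,
        "mean_utility": sum(r["utility_score"] for r in records) / count,
        "mean_leakage": sum(r["leakage_score"] for r in records) / count,
    }
```

For example, `summarize(load_defense_prompts("data/defense_prompts/psm_target_gpt-5-mini_dataset_unnatural-test.jsonl"))` would report how many prompts were processed and their average scores, assuming a run with the default configuration produced that file.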

Default Configuration:

  • n_optimization_iterations: 10
  • n_initial_shields: 5
  • n_shields_per_step: 5
  • attack_samples: 50
  • validation_samples: 10
  • leakage_aggregation: "logsumexp"

Reproducing Defense Evaluation

Evaluate defenses against attacks:

python experiments/evaluate_defenses.py

Expected Output:

  • Results saved to results/ directory
  • Excel files with detailed attack results
  • JSON summary of all experiments
  • Cached results for faster subsequent runs

Reproducing Specific Experiments

For programmatic control, modify the configuration in run.py:

config = RunPSMConfig(
    dataset_name="unnatural-test",
    attack_samples=50,
    validation_samples=10,
    target_dataset_samples=30,
    llm_target="gpt-5-mini",
    llm_optimizer="gpt-4o-mini",
    llm_validation="gpt-4o",
    llm_judge="gpt-4o-mini",
    n_optimization_iterations=10,
    leakage_aggregation="logsumexp",
    logsumexp_temperature=10.0
)
