Security Benchmark Suite

A comprehensive evaluation framework for testing Large Language Models' ability to identify security vulnerabilities in code, with a focus on SQL injection, access control, and multi-tenant data isolation issues.

Overview

This benchmark suite evaluates how well LLMs can detect security vulnerabilities in database queries and other code fragments. It tests various security patterns including:

  • SQL injection vulnerabilities
  • Cross-tenant data leakage
  • Missing access control checks
  • Improper handling of soft-deleted records
  • Exposure of sensitive fields
  • Missing pagination limits
  • Incorrect permission validations
  • State transition guards
  • Temporal access control

Setup

Prerequisites

  • Node.js 24+ (with built-in TypeScript support)
  • An OpenRouter API key

Installation

  1. Clone the repository
  2. Install dependencies:
    npm install
  3. Configure your OpenRouter API key in .env:
    OPENROUTER_API_KEY=your-api-key-here
    

Running Evaluations

Evaluation Prompts

The system supports multiple evaluation prompts to test different prompting strategies. Prompts are stored in the eval-prompts/ directory:

  • default.txt - Standard evaluation prompt covering common vulnerabilities
  • chain-of-thought.txt - Step-by-step analysis prompt with 10 structured steps
  • concise.txt - Minimal prompt for quick evaluation (1-2 sentence explanations)
  • rule-by-rule.txt - Analyzes code against each individual security rule from the specification
  • security-checklist.txt - Uses a checklist approach marking each control as PASS/FAIL/N.A.
  • attacker-mindset.txt - Evaluates code from an attacker's perspective looking for exploits

When you run evaluations, the system automatically iterates through ALL prompts in the eval-prompts/ directory. Each model × prompt combination is evaluated and tracked separately.

Creating Custom Evaluation Prompts

To add your own evaluation prompts:

  1. Create a new .txt file in the eval-prompts/ directory
  2. Write your prompt that will be sent after the schema/requirements
  3. End with "Here is the code to evaluate:" or similar
  4. The prompt must request JSON output with explanation and assessment fields (in that order)

Example custom prompt (eval-prompts/security-focused.txt):

Focus specifically on authentication and authorization vulnerabilities.
Check if the query properly validates user identity and permissions.
Look for any way an attacker could bypass access controls.

Respond with a JSON object:
- "explanation": describe any authorization issues found
- "assessment": "good" if secure, "bad" if vulnerable

Here is the code to evaluate:
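Whichever wording you use, the evaluator expects the reply to parse into the same two-field object. Below is a minimal TypeScript sketch of that contract and a parsing step; the helper name and validation logic are illustrative, not the project's actual code:

```typescript
// Shape every evaluation prompt must request: explanation first, then assessment.
interface EvalResult {
  explanation: string;
  assessment: "good" | "bad";
}

// Hypothetical helper: parse a model reply and check it honours the contract.
function parseEvalResult(raw: string): EvalResult {
  const parsed = JSON.parse(raw) as Partial<EvalResult>;
  if (typeof parsed.explanation !== "string") {
    throw new Error("reply is missing the 'explanation' field");
  }
  if (parsed.assessment !== "good" && parsed.assessment !== "bad") {
    throw new Error("'assessment' must be \"good\" or \"bad\"");
  }
  return { explanation: parsed.explanation, assessment: parsed.assessment };
}
```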

Evaluate a Model

Run the full benchmark suite against a specific model (uses all prompts automatically):

npm run evaluate -- --model gpt-4o-mini

# Force re-evaluation even if results already exist
npm run evaluate -- --model gpt-4o-mini --force

Run evaluations with multiple models:

# Evaluate with multiple models sequentially
npm run evaluate -- --model gpt-4o --model claude-3-opus --model gemini-pro

# Multiple models with filter
npm run evaluate -- --model gpt-4o-mini --model claude-3-haiku --filter approve-po

Run a filtered subset of tests:

# Only run tests matching a pattern
npm run evaluate -- --model claude-3-opus --filter approve-po

# Only run "good" test cases
npm run evaluate -- --model gpt-4o-mini --filter 01-good

Note: When using multiple models, each model is evaluated sequentially with clear separation in the output. All results are stored independently in the database for each model.

Generate Reports

Generate an HTML report from evaluation results:

npm run report

Reports are saved in the reports/ directory with timestamps. Each report includes:

  • Summary statistics showing overall accuracy percentages
  • Number of evaluation prompts used and their names
  • Results grouped by model with individual test file links
  • Results grouped by application showing performance across all models
  • Detailed pages for each test file showing all evaluations with their prompts
  • Each evaluation shows which prompt was used

Fix Test Files

The fix command can automatically rewrite test code when a model's assessment differs from the expected result:

# Fix using default model (anthropic/claude-opus-4.1) and default prompt
npm run fix approve-po-01-good.md

# Fix using a specific model
npm run fix approve-po-01-good.md -- --model gpt-4o

# Fix using a specific prompt
npm run fix approve-po-01-good.md -- --prompt chain-of-thought.txt

# Fix using both specific model and prompt
npm run fix approve-po-01-good.md -- --model gpt-4o --prompt concise.txt

# Fix using full or relative path
npm run fix tests/purchase-order/approve-po-01-good.md

How it works:

  1. Retrieves any previous evaluation results for context (filtered by prompt)
  2. Deletes all existing database entries for that test file and prompt combination
  3. Evaluates the test file with the specified model and prompt
  4. If the assessment matches expected: reports no fix needed
  5. If the assessment differs: asks the model to rewrite the code to match expectations
  6. Updates the test file with corrected code
  7. Verifies the fix by re-evaluating

The fix command will:

  • Make vulnerable code secure (if expected is "good")
  • Introduce realistic vulnerabilities (if expected is "bad")
  • Add appropriate SQL comments (accurate for secure code, misleading for vulnerable code)

Verify Query Behavior

The verify command runs all test queries against an actual database to ensure that:

  • "Good" query variants produce identical results
  • "Bad" queries expose vulnerabilities by producing different results
  • All queries are syntactically valid SQL

npm run verify

This command:

  1. Creates an in-memory PostgreSQL database using PGlite
  2. Loads test data with multiple organizations, users, and purchase orders
  3. For each query type with a parameter file:
    • Runs all good queries and verifies they return identical results
    • Runs bad queries and checks they differ from good queries
    • For INSERT/UPDATE/DELETE operations, uses optional Verify queries to check actual data changes
  4. Reports any queries that fail to expose vulnerabilities

Batch Fix with Autofix

The autofix command finds and fixes multiple test files based on their correctness percentage:

# Fix all files with ≤50% correctness using default model and prompt
npm run autofix -- 50

# Fix all files with 0% correctness (completely failing)
npm run autofix -- 0

# Use a specific model for autofix
npm run autofix -- 30 --model gpt-4o

# Use a specific prompt for autofix
npm run autofix -- 50 --prompt chain-of-thought.txt

# Use both specific model and prompt
npm run autofix -- 0 --model gpt-4o --prompt concise.txt

How it works:

  1. Queries the database for all test files with correctness ≤ the specified percentage (filtered by prompt if specified)
  2. Shows a summary of files to be fixed with their current accuracy
  3. For >10 files, prompts for confirmation (in interactive terminals only)
  4. Processes each file sequentially:
    • Shows progress indicator [n/total]
    • Calls the fix logic (including DB cleanup for that prompt)
    • Continues on error
  5. Displays final summary with success/failure counts

Example output:

Finding test files with ≤50% correctness...

Found 3 files to fix:
  - approve-po-03-bad.md (0.0% correct, 0/2)
  - buyer-approval-queue-04-bad.md (25.0% correct, 1/4)
  - get-messages-in-po-05-bad.md (50.0% correct, 1/2)

Starting autofix with model: anthropic/claude-opus-4.1

[1/3] Fixing approve-po-03-bad.md (0.0% correct)
...
[2/3] Fixing buyer-approval-queue-04-bad.md (25.0% correct)
...
[3/3] Fixing get-messages-in-po-05-bad.md (50.0% correct)
...

========================================
Autofix complete!
  ✓ Successfully fixed: 3
========================================

Query Verification System

The verification system ensures test queries are not only theoretically correct but also behave as expected against a real database.

Creating Parameter Files

Each query type needs a parameter file (e.g., approve-po.md) in the test directory:

# Description
Query to approve a purchase order with proper security controls

# Parameters

{ "user_id": "22222222-0000-0000-0000-000000000001", "org_id": "11111111-0000-0000-0000-000000000001", "role": "buyer_admin", "po_id": "44444444-0000-0000-0000-000000000002", "now": "2024-01-15T12:00:00Z" }
{ "user_id": "22222222-0000-0000-0000-000000000002", "org_id": "11111111-0000-0000-0000-000000000001", "role": "buyer_user", "po_id": "44444444-0000-0000-0000-000000000002", "now": "2024-01-15T12:00:00Z" }
{ "user_id": "22222222-0000-0000-0000-000000000005", "org_id": "11111111-0000-0000-0000-000000000002", "role": "buyer_admin", "po_id": "44444444-0000-0000-0000-000000000001", "now": "2024-01-15T12:00:00Z" }

# Verify (Optional - for INSERT/UPDATE/DELETE)
```sql
SELECT id, status, approved_by, buyer_org_id
FROM purchase_orders
WHERE id = :po_id
```

Each parameter set should test different access scenarios to expose vulnerabilities in bad queries.
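
As an illustration of the format, each non-empty line under the # Parameters heading is a standalone JSON object of named query parameters. A rough sketch of how such a file could be read (this is not the project's actual parser):

```typescript
import { readFile } from "node:fs/promises";

type ParamSet = Record<string, string>;

// Illustrative parser: collect every line under the "# Parameters" heading until
// the next "#" heading, and JSON-parse each non-empty line into a parameter set.
async function readParameterSets(path: string): Promise<ParamSet[]> {
  const lines = (await readFile(path, "utf8")).split("\n");
  const sets: ParamSet[] = [];
  let inParameters = false;
  for (const line of lines) {
    if (line.startsWith("# ")) {
      inParameters = line.trim() === "# Parameters";
      continue;
    }
    if (inParameters && line.trim() !== "") {
      sets.push(JSON.parse(line) as ParamSet);
    }
  }
  return sets;
}
```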

Database Setup

Create an `index.ts` file in the test directory to set up the database:

```typescript
import { PGlite } from '@electric-sql/pglite';

export async function createDatabase(): Promise<PGlite> {
  const db = new PGlite();
  
  // Create schema
  await db.exec(`
    CREATE TABLE purchase_orders (
      id TEXT PRIMARY KEY,
      buyer_org_id TEXT NOT NULL,
      status TEXT NOT NULL,
      created_by TEXT,
      is_deleted BOOLEAN DEFAULT false
    );
  `);
  
  // Insert test data
  await db.exec(`
    INSERT INTO purchase_orders VALUES
      ('44444444-0000-0000-0000-000000000001', '11111111-0000-0000-0000-000000000001', 'DRAFT', '22222222-0000-0000-0000-000000000002', false),
      ('44444444-0000-0000-0000-000000000002', '11111111-0000-0000-0000-000000000001', 'PENDING_APPROVAL', '22222222-0000-0000-0000-000000000002', false);
  `);
  
  return db;
}
```

Verification Process

The verify command:

  1. Parses parameter files to get test parameters and optional verify queries
  2. Creates fresh databases for modifying queries (INSERT/UPDATE/DELETE) to avoid cross-test contamination
  3. Converts named parameters (:param) to positional parameters ($1) for PGlite compatibility (see the sketch after this list)
  4. Compares results using deep equality checking to ensure good queries return identical data
  5. Reports vulnerabilities when bad queries produce different results or errors
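
As a rough illustration of steps 3 and 4, the sketch below rewrites named placeholders into positional ones and compares two result sets with fast-deep-equal; the helper names are hypothetical and the real implementation may handle more edge cases:

```typescript
import { PGlite } from "@electric-sql/pglite";
import equal from "fast-deep-equal";

// Rewrite :named placeholders as $1, $2, ... and collect values in order.
// The lookbehind skips Postgres "::type" casts.
function toPositional(sql: string, params: Record<string, unknown>) {
  const values: unknown[] = [];
  const text = sql.replace(/(?<!:):(\w+)/g, (_, name: string) => {
    values.push(params[name]);
    return `$${values.length}`;
  });
  return { text, values };
}

// Run two query variants with the same parameter set and deep-compare their rows.
// Good variants are expected to match; bad variants are expected to differ.
async function resultsMatch(
  db: PGlite,
  goodSql: string,
  otherSql: string,
  params: Record<string, unknown>
): Promise<boolean> {
  const good = toPositional(goodSql, params);
  const other = toPositional(otherSql, params);
  const goodRows = (await db.query(good.text, good.values)).rows;
  const otherRows = (await db.query(other.text, other.values)).rows;
  return equal(goodRows, otherRows);
}
```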

Key Features

  • Deep equality checking: Uses fast-deep-equal to compare actual query results, not just row counts
  • Fresh databases for mutations: Each INSERT/UPDATE/DELETE gets a clean database to prevent test pollution
  • Verify queries: Optional SELECT queries to check actual data changes after mutations
  • Comprehensive reporting: Shows which parameter sets expose vulnerabilities and which don't

Adding New Benchmarks

Directory Structure

Each benchmark application lives in tests/<app-name>/ with:

  • SPEC.md - Application specification with schema and security requirements
  • Individual test files following the naming pattern: <query-name>-<01-06>-<good|bad>.md

Creating a New Application Benchmark

  1. Create the application directory:

    mkdir tests/my-new-app
  2. Write the SPEC.md file:

    # Application Name
    
    Description of the application...
    
    # Prompt
    
    ## Schema (Postgres)
    
    ```sql
    CREATE TABLE users (
      id UUID PRIMARY KEY,
      org_id UUID NOT NULL,
      ...
    );
    ```

    ## Security Requirements

    1. All queries must filter by organization ID
    2. Soft-deleted records must be excluded
    3. ...
    
    
  3. Create test cases following the naming convention:

    • 2 good examples: query-name-01-good.md, query-name-02-good.md
    • 4 bad examples: query-name-03-bad.md through query-name-06-bad.md

Test File Format

Each test file must follow this structure:

# Description
Explanation of what this test case validates or the vulnerability it contains.

# Code
```sql
-- SQL query or code fragment
SELECT * FROM users WHERE id = $1;
```

# Expected
good

Or for vulnerable code:

# Description
This query is missing tenant isolation, allowing cross-tenant data access.

# Code
```sql
-- SAFE: User lookup query
SELECT * FROM users WHERE id = $1;
```

# Expected
bad


### Guidelines for Test Cases

1. **Good test cases** should demonstrate secure, compliant implementations
2. **Bad test cases** should contain realistic vulnerabilities that might appear in production
3. Include misleading "SAFE" comments in vulnerable code to test if evaluators look beyond documentation
4. Each vulnerability type should be distinct and test a specific security concept
5. Avoid obvious markers like "VULNERABILITY HERE" - make the tests realistic

### Supporting Multiple Languages

While the current suite focuses on SQL, the framework is language-agnostic. To add tests for other languages:

1. Use the same directory structure and file format
2. Update the code blocks with the appropriate language identifier
3. Adjust the security requirements in SPEC.md accordingly

Example for JavaScript:

# Code
```javascript
// User authentication endpoint
app.get('/api/user/:id', (req, res) => {
  const user = db.query(`SELECT * FROM users WHERE id = '${req.params.id}'`);
  res.json(user);
});
```

## How It Works

1. **Test Discovery**: The evaluator scans the `tests/` directory for applications
2. **Prompt Loading**: The evaluator loads all `.txt` files from `eval-prompts/` directory
3. **Evaluation Loop**: For each model and each prompt:
   - Iterates through all test applications
   - Sends three separate messages per test:
     - The SPEC.md's Prompt section (schema and requirements)
     - The evaluation prompt from the current prompt file
     - The code fragment from the test file
4. **LLM Evaluation** (see the request sketch after this list):
   - Sends the combined prompt to the specified model via OpenRouter
   - Uses structured JSON output for consistent responses
   - Processes up to 5 concurrent API requests per application
5. **Result Storage**: Stores the model's assessment, explanation, and prompt filename in SQLite
6. **Reporting**: Generates HTML reports with:
   - Overall statistics and accuracy percentages
   - Evaluation prompt information
   - Results grouped by model and application
   - Individual pages for each test file with full details including prompts used
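
For orientation, here is a simplified sketch of what one such request could look like against OpenRouter's OpenAI-compatible chat endpoint; the three user messages mirror step 3 above, while the exact roles, structured-output settings, and error handling in the real evaluator may differ:

```typescript
// Simplified sketch of one evaluation request: three user messages (spec,
// eval prompt, code fragment) sent to OpenRouter's chat completions endpoint.
async function evaluateOnce(
  model: string,
  spec: string,
  evalPrompt: string,
  code: string,
) {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model,
      messages: [
        { role: "user", content: spec },       // SPEC.md "# Prompt" section
        { role: "user", content: evalPrompt }, // contents of the eval-prompts/*.txt file
        { role: "user", content: code },       // code fragment from the test file
      ],
    }),
  });
  const data = await res.json();
  // The reply should be the JSON object requested by the evaluation prompt.
  return JSON.parse(data.choices[0].message.content);
}
```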

## Database Schema

Results are stored in `results.db` with the following schema:

| Column | Type | Description |
|--------|------|-------------|
| id | INTEGER | Primary key |
| timestamp | DATETIME | When the evaluation was run |
| test_file | TEXT | Full path to the test file |
| model_name | TEXT | Name of the model used |
| eval_prompt | TEXT | Prompt file used (e.g., 'default.txt', 'chain-of-thought.txt') |
| expected_result | TEXT | Expected result ("good" or "bad") |
| actual_result | TEXT | Model's assessment |
| explanation | TEXT | Model's reasoning |
| request | TEXT | Complete API request body (JSON) |
| response | TEXT | Complete API response body (JSON) |

Key features:
- **Deduplication**: The evaluator checks for existing results (by model, file, and prompt) before making API calls (a sketch of this lookup follows the list)
- **Full audit trail**: Request/response bodies are stored for debugging
- **Cleanup on fix**: The fix and autofix commands delete entries for a specific file+prompt combination before rewriting
- **Multi-prompt support**: Each evaluation tracks which prompt was used, allowing comparison of prompting strategies
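
As an illustration of the deduplication check, here is a sketch using Node's built-in `node:sqlite` module; the driver choice and the `results` table name are assumptions, not details confirmed by this README:

```typescript
import { DatabaseSync } from "node:sqlite";

// Open the results database (table name "results" is assumed for illustration).
const db = new DatabaseSync("results.db");

// Return true if an evaluation already exists for this model/file/prompt combination.
function alreadyEvaluated(testFile: string, model: string, prompt: string): boolean {
  const row = db
    .prepare(
      `SELECT id FROM results
       WHERE test_file = ? AND model_name = ? AND eval_prompt = ?
       LIMIT 1`
    )
    .get(testFile, model, prompt);
  return row !== undefined;
}
```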

## Troubleshooting

- **JSON parsing errors**: The evaluator handles multiline JSON responses, but some models may return malformed JSON. Check the console output for details.
- **Rate limiting**: The evaluator implements exponential backoff for rate limits (a sketch of the pattern follows this list). If you hit persistent rate limits, wait a few minutes or use `--filter` to run smaller batches.
- **Missing prompts**: Ensure each application directory has a `SPEC.md` file with a `# Prompt` section.
- **"No content in OpenRouter response"**: Some models like `google/gemini-2.5-pro` use extensive reasoning that may exhaust the default token limit. The evaluator automatically sets 30,000 max tokens for these models.
- **Autofix confirmation prompts**: For safety, autofix requires confirmation when processing >10 files. Run in an interactive terminal or process smaller batches.
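
For reference, a generic exponential-backoff wrapper of the kind described above; this is an illustrative pattern, not the evaluator's actual retry code:

```typescript
// Retry an async operation with exponential backoff (1s, 2s, 4s, ...).
async function withBackoff<T>(fn: () => Promise<T>, maxRetries = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      const delayMs = 1000 * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```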
