Abhinandan-Khurana/rag-based-ai-pentest-report-generator

RAG-based Penetration Testing Report Generator 🛡️

A sophisticated tool that leverages RAG (Retrieval Augmented Generation) to analyze penetration testing data and generate comprehensive security reports.

🎯 Problem Statement

Traditional penetration testing report generation is:

  • Time-consuming and labor-intensive
  • Prone to inconsistencies in reporting format
  • Challenging to maintain standardization across reports
  • Difficult to process large volumes of test data efficiently

💡 Solution

This tool provides:

  • Automated analysis of penetration testing data
  • Standardized report generation using RAG
  • Parallel processing of large datasets
  • Intelligent caching and optimization
  • Structured output with comprehensive security insights

🚀 Features

  • Parallel Document Processing: Efficiently handles multiple files simultaneously
  • Smart Caching: Implements document hashing for faster subsequent runs
  • Vector Storage: Uses ChromaDB for efficient similarity search
  • Memory Optimization: Batch processing with automatic memory management
  • Device Adaptation: Supports CUDA, MPS (M-series chips), and CPU
  • Comprehensive Reports: Generates detailed reports with:
    • Executive Summary
    • Methodology
    • Findings and Vulnerabilities
    • Risk Analysis
    • Technical Details
    • Remediation Steps
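
The Smart Caching feature can be pictured with a short sketch. It is an illustration only, not the tool's actual implementation: the names `file_digest` and `changed_files` and the JSON cache layout are assumptions made for this example. Each file is hashed with SHA-256, and only files whose digest differs from the cached one are reported as needing reprocessing.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 64 KiB blocks."""
    sha = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(65536), b""):
            sha.update(block)
    return sha.hexdigest()


def changed_files(paths, cache_file: Path):
    """Return the paths whose content hash differs from the cached hash.

    The cache (hypothetical layout) is a JSON mapping of path -> digest
    and is rewritten with the current digests on every call.
    """
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    current = {str(p): file_digest(p) for p in paths}
    changed = [p for p in paths if cache.get(str(p)) != current[str(p)]]
    cache_file.write_text(json.dumps(current, indent=2))
    return changed
```

On a second run over unchanged files, `changed_files` returns an empty list, so the expensive embedding step could be skipped for them.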

🛠️ Technical Architecture

Components

  1. Document Processor:

    • Text cleaning and normalization
    • Intelligent chunking
    • Metadata extraction
  2. Vector Store:

    • ChromaDB integration
    • Efficient similarity search
    • Persistent storage
  3. LLM Integration:

    • Ollama model integration
    • Customizable prompts
    • Streaming support
  4. Embedding System:

    • HuggingFace embeddings
    • Model: "BAAI/bge-small-en-v1.5"
    • Device-specific optimization
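
As a rough sketch of what the Document Processor stage does, the snippet below cleans text and splits it into overlapping chunks. The function names mirror `clean_text()` and `chunk_text()` from the sequence diagram, but the bodies, the character-based window, and the default sizes are assumptions for illustration. The project's own "intelligent chunking" may work differently.

```python
import re


def clean_text(text: str) -> str:
    """Strip control characters and collapse runs of whitespace."""
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64):
    """Split text into character windows of chunk_size that share `overlap` chars."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the window reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

The overlap keeps a finding that straddles a chunk boundary visible in both neighbouring chunks, which tends to help retrieval.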

⚙️ Installation

# Clone the repository and enter it
git clone https://github.com/Abhinandan-Khurana/rag-based-ai-pentest-report-generator.git
cd rag-based-ai-pentest-report-generator

# Install dependencies
pip install -r requirements.txt

# Install Ollama (Mac/Linux)
curl https://ollama.ai/install.sh | sh

# Pull required model
ollama pull deepseek-r1:latest

🗂️ Usage

python pentest_analyzer.py

Running locally: when prompted, provide the path to your penetration testing data directory.

Running in Docker: first mount your data directory into the container (DockerSetup.md shows this with the demo_data directory), then provide the mounted path when prompted.

⚠️ Important Notes

  1. Variability in Results:

    • Due to the nature of RAG and LLM responses, results may vary between runs
    • The tool might generate slightly different analyses for the same input
    • It's recommended to run multiple analyses for critical assessments
  2. Supported File Types:

    • .txt
    • .json
    • .md
    • .xml
    • .csv
    • .log
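
The following sketch shows one way to collect only the supported file types and read them in parallel. It is a hedged illustration: `find_supported_files`, `load_documents`, and the thread-pool approach are assumptions for this example and are not taken from `pentest_analyzer.py`.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Extensions listed under "Supported File Types".
SUPPORTED_EXTENSIONS = {".txt", ".json", ".md", ".xml", ".csv", ".log"}


def find_supported_files(directory):
    """Recursively list files whose extension is in SUPPORTED_EXTENSIONS."""
    root = Path(directory)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


def load_documents(directory, max_workers: int = 4):
    """Read all supported files concurrently and return {path: text}."""
    files = find_supported_files(directory)

    def read(path):
        # Undecodable bytes are replaced so one bad file does not abort the run.
        return path.read_text(encoding="utf-8", errors="replace")

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        texts = list(pool.map(read, files))
    return dict(zip(files, texts))
```

Threads suit this step because reading files is I/O-bound. CPU-heavy parsing would call for a different strategy.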

🔍 Example Output Structure

📑 Generated Report
├── 1. Executive Summary
├── 2. Methodology
├── 3. Findings and Vulnerabilities
├── 4. Risk Analysis
├── 5. Detailed Technical Analysis
└── 6. Remediation Roadmap
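
To make the structure above concrete, here is a minimal sketch of assembling a report section by section. The `build_report` function and the `generate_section` callable are hypothetical. In the real tool, each section body would come from the LLM with retrieved context, whereas here any callable that maps a title to text will do.

```python
# Section titles as listed in the example output structure.
REPORT_SECTIONS = [
    "Executive Summary",
    "Methodology",
    "Findings and Vulnerabilities",
    "Risk Analysis",
    "Detailed Technical Analysis",
    "Remediation Roadmap",
]


def build_report(generate_section):
    """Assemble a numbered plain-text report from a per-section generator."""
    parts = []
    for number, title in enumerate(REPORT_SECTIONS, start=1):
        body = generate_section(title).strip()
        parts.append(f"{number}. {title}\n\n{body}")
    return "\n\n".join(parts) + "\n"


# Usage with a stub in place of the LLM call:
report = build_report(lambda title: f"(placeholder text for {title})")
```

Keeping the section list in one place means every report has the same sections in the same order, whatever text the model produces.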

Sequence Diagram of the Project Flow

sequenceDiagram
    participant User
    participant Main
    participant PentestAnalyzer
    participant DocumentProcessor
    participant ChromaDB
    participant LLM
    participant FileSystem

    User->>Main: Start Program
    Main->>PentestAnalyzer: Initialize(config)
    PentestAnalyzer->>PentestAnalyzer: setup_logging()
    PentestAnalyzer->>PentestAnalyzer: setup_device()
    PentestAnalyzer->>PentestAnalyzer: initialize_models()

    Main->>User: Request Input Directory
    User->>Main: Provide Directory Path
    Main->>PentestAnalyzer: process_and_analyze(directory)

    PentestAnalyzer->>FileSystem: parallel_document_loading()
    activate FileSystem
    FileSystem-->>DocumentProcessor: Process Files
    DocumentProcessor->>DocumentProcessor: clean_text()
    DocumentProcessor->>DocumentProcessor: chunk_text()
    FileSystem-->>PentestAnalyzer: Return Documents
    deactivate FileSystem

    PentestAnalyzer->>ChromaDB: create_or_load_index()
    activate ChromaDB
    ChromaDB->>ChromaDB: get_or_create_collection()
    ChromaDB-->>PentestAnalyzer: Return Index
    deactivate ChromaDB

    PentestAnalyzer->>LLM: generate_report()
    activate LLM
    LLM->>LLM: Process Sections
    Note over LLM: Executive Summary
    Note over LLM: Methodology
    Note over LLM: Findings
    Note over LLM: Risk Analysis
    Note over LLM: Technical Analysis
    Note over LLM: Remediation
    LLM-->>PentestAnalyzer: Return Report
    deactivate LLM

    PentestAnalyzer->>FileSystem: Save Report
    PentestAnalyzer-->>Main: Return Status
    Main->>User: Display Completion Message
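
The `setup_device()` step in the diagram corresponds to the Device Adaptation feature (CUDA, MPS, or CPU). A plausible sketch is shown below. It assumes the embeddings run on PyTorch, and the function name `select_device` is made up for this example. If PyTorch is not installed, it falls back to CPU.

```python
def select_device() -> str:
    """Return "cuda", "mps", or "cpu", preferring an accelerator when available."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch there is nothing to accelerate.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is absent in older PyTorch builds, hence the getattr guard.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

The returned string can then be passed as the device argument when the embedding model is loaded.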

Important Notes

  • Reports may contain redundant content and may vary between runs because of the nature of RAG-based LLM responses
  • Local LLMs may produce different results from OpenAI's models

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

👨‍💻 Author

Abhinandan Khurana