Automated Epistemic Governance & Intelligence System
A multi-session, multi-agent codebase audit system built on Claude Code. AEGIS deploys 12 senior engineering personas across 14 audit domains to produce epistemically rigorous, adversarially reviewed findings on any codebase.
curl -sSL https://raw.githubusercontent.com/ChristopherKahler/aegis/main/install.sh | bash
Then run /aegis:audit in Claude Code.
Requires: Claude Code installed (the ~/.claude/ directory must exist).
- Identity
- What AEGIS Is
- Core Philosophy
- The 14 Audit Domains
- The Agent Team
- Execution Phases
- The Three Output Layers
- AEGIS Transform
- Intervention Levels
- The Transformation Model
- Transform Agent Team
- Transform Execution Phases
- Commands
- Change Risk in Remediation
- Safety & Liability Framework
- PAUL Integration
- Pattern Corpus & Feedback Loop
- The Formal Epistemic Schema
- Disagreement Resolution System
- Disagreement Visualization Model
- Reality Gap Framework
- Language-Specific Failure Models
- Tooling Stack
- Output Format
- Signal Categories
- Installation & Runtime
- Ecosystem
AEGIS is part of a broader Claude Code extension ecosystem:
| System | What It Does | Link |
|---|---|---|
| AEGIS | Multi-agent codebase auditing — diagnosis + controlled evolution | You are here |
| BASE | Builder's Automated State Engine — workspace lifecycle, health tracking, drift prevention | GitHub |
| CARL | Context Augmentation & Reinforcement Layer — dynamic rules loaded JIT by intent | GitHub |
| PAUL | Project orchestration — Plan, Apply, Unify Loop | GitHub |
| CC Strategic AI | Skool community — courses, community, live support | Skool |
AEGIS (/ee-jis/) — In Greek mythology, the aegis was the divine shield carried by Zeus and Athena. It represented protection under authority — not passive defense, but the active, authoritative safeguarding of what matters.
AEGIS is that shield for codebases.
The name encodes what the system actually does:
- Automated — Multi-agent, multi-phase, tool-augmented analysis requiring no manual orchestration
- Epistemic — Built on a formal schema for how knowledge is structured, challenged, and trusted under uncertainty
- Governance — Compliance, oversight, and the Principal Engineer's role as epistemic governor of the entire audit
- Intelligence — AI-powered domain expert agents producing findings no single human could match in breadth
- System — Not a tool, not a script. A coordinated system of systems with defined phases, roles, and accountability
AEGIS is a multi-session, multi-agent codebase audit system built on Claude Code. It deploys a team of senior engineering personas — each an expert in a specific domain — to conduct a comprehensive analysis of any application codebase.
It is not a linter. It is not a static analyzer. It is not a report generator.
It is an AI Principal Engineer — a machine that performs disciplined doubt.
AEGIS answers five fundamental questions that senior engineers ask when they walk into an unknown codebase:
- Can this system be trusted?
- Can it survive change?
- Can it scale?
- Can it be operated safely?
- Can new engineers understand it?
Every audit produces findings across all domains, uses structured epistemic reasoning, cross-validates through adversarial review, and synthesizes into actionable, severity-ranked reports.
Senior engineers don't just find bugs — they find future failures.
Three principles distinguish AEGIS from conventional code analysis:
- Disciplined Doubt Over Coherent Confidence: Most AI systems optimize for helpfulness and clean narratives. AEGIS optimizes for correctness under uncertainty, asymmetric risk detection, and institutional memory of failure patterns.
- The Principal Builds the Story. The Devil's Advocate Breaks It. Truth Lives in the Tension. No finding survives without challenge. No conclusion is trusted without adversarial review. Disagreement is signal, not noise.
- Evidence > Assumptions > Code > Documentation: The epistemic schema enforces a strict separation between observations, interpretations, and judgments. No shortcuts. No opinion soup.
AEGIS audits across 14 domains (0-13). Every audit produces findings in each domain — even if the finding is "no major issues found." Nothing is optional.
Domain 0 must come first: without knowing what the system is for and what constrains it, every other audit is partially blind.
| Aspect | Details |
|---|---|
| Domain | 0 |
| Owner | Principal Engineer |
| Purpose | Establish what the system does, who uses it, what constraints exist |
What to audit:
- What problem is this system solving?
- Who are the users?
- What are success/failure criteria?
- What data does it handle?
- What are its failure modes?
- What is the business criticality?
- What constraints exist (regulatory, cost, latency, team size)?
- What is explicitly OUT of scope?
| Aspect | Details |
|---|---|
| Domain | 1 (Architecture & System Design) |
| Owner | Architect Agent |
| Why it matters | Architecture determines how easy bugs are to introduce, how expensive changes are, and whether scale is possible without rewrites |
What to audit:
- System boundaries and module responsibilities
- Dependency direction and layering consistency
- Coupling vs cohesion
- Domain modeling quality
- Data flow clarity
- Presence of clear architectural pattern (hexagonal, layered, microservices, etc.)
- Boundary enforcement vs violation
- "God modules" or "utility dumping grounds"
- Domain logic mixed with infrastructure
Key questions:
- Is there a clear architectural pattern?
- Are boundaries enforced or violated?
- Are "god modules" present?
- Is domain logic mixed with infrastructure?
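Dependency direction can be partly checked mechanically. A minimal sketch, assuming a hypothetical four-layer convention in which a module's top-level package names its layer and import edges have already been extracted (for example, from a dependency-graph tool); it illustrates the idea and is not AEGIS's own check:

```python
from typing import Dict, List, Tuple

# Hypothetical layering: lower rank = closer to the domain core.
# A module may depend on its own layer or an inner one, never an outer one.
LAYER_RANK: Dict[str, int] = {
    "domain": 0,
    "application": 1,
    "infrastructure": 2,
    "interface": 3,
}


def layer_of(module: str) -> str:
    """Assume the top-level package names the layer: 'domain.orders' -> 'domain'."""
    return module.split(".", 1)[0]


def find_violations(imports: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return (importer, imported) edges that point outward, from an inner layer to an outer one."""
    return [
        (importer, imported)
        for importer, imported in imports
        if LAYER_RANK[layer_of(importer)] < LAYER_RANK[layer_of(imported)]
    ]


edges = [
    ("application.checkout", "domain.orders"),     # inward: allowed
    ("interface.http", "application.checkout"),    # inward: allowed
    ("domain.orders", "infrastructure.postgres"),  # outward: domain depends on infrastructure
]
print(find_violations(edges))  # [('domain.orders', 'infrastructure.postgres')]
```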
| Aspect | Details |
|---|---|
| Domain | 2 (Data & State Integrity) |
| Owner | Data Engineer Agent |
| Why it matters | Most catastrophic failures are data bugs, not code bugs — corrupt state, irreversible migrations, silent data loss, inconsistent derived data |
What to audit:
- Data models and schemas
- Schema evolution and migrations
- Backward compatibility
- Referential integrity
- Eventual consistency guarantees
- State transitions and invariants
- Data loss risks
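State-transition invariants are easiest to audit when every change of state passes through one guarded function. A sketch using a hypothetical order lifecycle; the states and the allowed-transition table are illustrative assumptions:

```python
from enum import Enum


class OrderState(Enum):
    PENDING = "pending"
    PAID = "paid"
    SHIPPED = "shipped"
    CANCELLED = "cancelled"


# Hypothetical transition table: any pair not listed here is illegal.
ALLOWED = {
    OrderState.PENDING: {OrderState.PAID, OrderState.CANCELLED},
    OrderState.PAID: {OrderState.SHIPPED, OrderState.CANCELLED},
    OrderState.SHIPPED: set(),    # terminal
    OrderState.CANCELLED: set(),  # terminal
}


class IllegalTransition(Exception):
    pass


def transition(current: OrderState, target: OrderState) -> OrderState:
    """Single choke point for state changes, so the invariant is enforced in one place."""
    if target not in ALLOWED[current]:
        raise IllegalTransition(f"{current.value} -> {target.value}")
    return target


state = transition(OrderState.PENDING, OrderState.PAID)
try:
    transition(OrderState.SHIPPED, OrderState.PENDING)
except IllegalTransition as exc:
    print("rejected:", exc)  # rejected: shipped -> pending
```

An auditor would then look for code that assigns state directly and bypasses the guard.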
| Aspect | Details |
|---|---|
| Domain | 3 (Correctness & Logic) |
| Owner | Senior Application Engineer Agent |
| Why it matters | Most production incidents are boring logic bugs, not exotic failures |
What to audit:
- Logic errors and edge case handling
- Error propagation (swallowed, ignored, or properly handled)
- Concurrency correctness
- Input validation
- Data consistency
- Assumption documentation (documented or implicit)
- Invariant enforcement
- Retry safety (idempotency)
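Retry safety usually depends on idempotency. A minimal sketch of the idempotency-key pattern, using a hypothetical IdempotentProcessor class backed by an in-memory dict; a real system would need a durable, shared store with an atomic insert-if-absent:

```python
from typing import Callable, Dict


class IdempotentProcessor:
    """Replays the stored result for a repeated key instead of re-running the side effect.

    In-memory and single-process, for illustration only.
    """

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}

    def handle(self, key: str, operation: Callable[[], str]) -> str:
        if key in self._results:
            return self._results[key]  # retry: no second side effect
        result = operation()
        self._results[key] = result
        return result


charges = []


def charge_card() -> str:
    charges.append(100)  # the side effect that must not be duplicated
    return f"charge-{len(charges)}"


processor = IdempotentProcessor()
first = processor.handle("req-123", charge_card)
retried = processor.handle("req-123", charge_card)  # client retried after a timeout
print(first, retried, len(charges))  # charge-1 charge-1 1
```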
| Aspect | Details |
|---|---|
| Domain | 4 (Security) |
| Owner | Security Engineer Agent |
| Why it matters | Security failures are existential risks |
What to audit:
- Authentication and authorization (AuthN/AuthZ)
- Secrets handling
- Input sanitization
- Injection risks (SQL, XSS, command, etc.)
- Dependency vulnerabilities
- Cryptography misuse
- Supply chain risk
- Trust boundaries
- Least-privilege enforcement
- Sensitive data in logs
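A minimal sketch of the SQL injection risk, using Python's built-in sqlite3 module with an in-memory database and a hypothetical users table. String-built SQL lets the input rewrite the query; a bound parameter does not:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("alice", 1), ("bob", 0)])

user_input = "nobody' OR '1'='1"  # attacker-controlled value

# Vulnerable: the value is spliced into the SQL text, so its OR clause becomes part of the query.
unsafe = conn.execute(f"SELECT name FROM users WHERE name = '{user_input}'").fetchall()

# Safe: the driver binds the value as data, so it cannot change the query's structure.
safe = conn.execute("SELECT name FROM users WHERE name = ?", (user_input,)).fetchall()

print(len(unsafe), len(safe))  # 2 0
```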
| Aspect | Details |
|---|---|
| Domain | 5 (Compliance, Privacy & Governance) |
| Owner | Compliance Officer Agent |
| Why it matters | Compliance failures lead to fines, lawsuits, and shutdowns |
What to audit:
- PII handling and data classification
- Data retention policies
- Encryption at rest and in transit
- Audit logging
- Consent tracking
- Where personal data is stored
- How personal data is deleted
- Whether access is auditable
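One control an auditor looks for is redaction at the logging boundary. A sketch using Python's standard logging module; the email regex, the logger name, and the in-memory ListHandler are illustrative assumptions, and real PII detection needs far more than one pattern:

```python
import logging
import re

# Illustrative pattern only; real PII detection needs more than one regex.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class RedactEmails(logging.Filter):
    """Masks email addresses in the fully formatted message before handlers see it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = EMAIL.sub("[REDACTED]", record.getMessage())
        record.args = None  # arguments were already merged into msg above
        return True


class ListHandler(logging.Handler):
    """Collects formatted records in memory so the result is easy to inspect."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(self.format(record))


logger = logging.getLogger("aegis-demo")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)
logger.addFilter(RedactEmails())

logger.info("password reset requested by %s", "jane.doe@example.com")
print(handler.messages)  # ['password reset requested by [REDACTED]']
```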
| Aspect | Details |
|---|---|
| Domain | 6 (Testing Strategy & Verification) |
| Owner | Test Engineer Agent |
| Why it matters | Senior engineers don't ask "Do you have tests?" — they ask "Would these tests catch the most expensive failures?" |
What to audit:
- Test pyramid shape (unit vs integration vs e2e balance)
- Determinism vs flakiness
- What ISN'T tested (gaps)
- Mutation resistance
- Contract testing presence
- Tests as documentation
- Failure coverage (do tests cover failure paths, not just happy paths?)
| Aspect | Details |
|---|---|
| Domain | 7 (Reliability & Resilience) |
| Owner | SRE Agent |
| Why it matters | Production systems fail constantly — good ones degrade gracefully |
What to audit:
- Failure handling patterns
- Retry strategies (bounded? with backoff?)
- Timeouts
- Circuit breakers
- Startup/shutdown safety
- What happens when dependencies fail
- Whether failures are noisy or silent
- State recoverability
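A minimal circuit breaker sketch, to make the pattern concrete. The thresholds are hypothetical, the clock is injectable so the behavior can be checked without sleeping, and the class is not thread-safe:

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised instead of calling a dependency that is currently marked unhealthy."""


class CircuitBreaker:
    """Closed -> open after N consecutive failures; half-open trial after a timeout."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None and self.clock() - self.opened_at < self.reset_timeout:
            raise CircuitOpenError("failing fast; dependency marked unhealthy")
        # Either closed, or the timeout has elapsed and this call acts as the half-open trial.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result


now = [0.0]  # fake clock for the demo
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])


def upstream_down() -> str:
    raise ConnectionError("upstream down")


for _ in range(2):
    try:
        breaker.call(upstream_down)
    except ConnectionError:
        pass

try:
    breaker.call(lambda: "ok")
except CircuitOpenError:
    print("open: call rejected without touching the dependency")

now[0] = 11.0
print(breaker.call(lambda: "ok"))  # half-open trial succeeds, circuit closes
```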
| Aspect | Details |
|---|---|
| Domain | 8 (Scalability & Performance) |
| Owner | Performance Engineer Agent |
| Why it matters | Scaling failures are often design bugs, not hardware limits |
What to audit:
- Algorithmic complexity
- N+1 queries
- Caching strategy
- Resource usage patterns
- Async vs blocking behavior
- What grows with user count
- What grows with data size
- Where bottlenecks exist
- Whether backpressure is implemented
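The N+1 pattern is easy to demonstrate. This sketch uses sqlite3's set_trace_callback to count executed statements against a hypothetical authors/posts schema: the per-row loop issues one extra query per post, while the JOIN issues one in total:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'ada'), (2, 'linus'), (3, 'grace');
    INSERT INTO posts VALUES (1, 1, 'a'), (2, 2, 'b'), (3, 3, 'c');
""")

queries = []
conn.set_trace_callback(queries.append)  # record every statement the driver runs

# N+1: one query for the posts, then one more per post for its author.
posts = conn.execute("SELECT id, author_id FROM posts").fetchall()
for _, author_id in posts:
    conn.execute("SELECT name FROM authors WHERE id = ?", (author_id,)).fetchone()
n_plus_one = len(queries)

# Batched: a single JOIN returns the same data.
queries.clear()
conn.execute("SELECT p.id, a.name FROM posts p JOIN authors a ON a.id = p.author_id").fetchall()
batched = len(queries)

print(n_plus_one, batched)  # 4 1
```

The query count of the first version grows with the number of rows, which is exactly the "what grows with data size" question above.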
| Aspect | Details |
|---|---|
| Domain | 9 (Maintainability & Code Health) |
| Owner | Senior Application Engineer Agent |
| Why it matters | Maintenance cost dominates total lifecycle cost |
What to audit:
- Code smells and duplication
- Naming clarity
- Documentation accuracy
- Whether intent is obvious
- Whether tests are meaningful or superficial
- Tech debt interest accruing
| Aspect | Details |
|---|---|
| Domain | 10 (Operability & DevEx) |
| Owner | SRE Agent |
| Why it matters | Many "great codebases" fail in production because ops was ignored |
What to audit:
- CI/CD pipelines
- Rollback safety
- Feature flags
- Observability (logging, metrics, tracing)
- Ownership clarity
- Debuggability
- Local dev friction
- Can this be safely deployed on Friday?
- Can incidents be diagnosed quickly?
- Who owns what?
| Aspect | Details |
|---|---|
| Domain | 11 (Change Risk & Evolvability) |
| Owner | Staff Engineer Agent |
| Why it matters | "How dangerous is it to touch this code?" predicts velocity decay, team burnout, and rewrite pressure |
What to audit:
- Change amplification (how many files does one change touch?)
- Refactor safety
- Blast radius analysis
- Modularity health
- How risky future changes are
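Change amplification can be estimated from history. A sketch, assuming log text captured from git log --name-only --pretty=format:"commit %H"; it reports the average number of files per commit and the pair of files most often changed together. The sample log is made up:

```python
from collections import Counter
from itertools import combinations
from statistics import mean
from typing import Dict, List


def files_per_commit(log_text: str) -> Dict[str, List[str]]:
    """Parse `git log --name-only --pretty=format:"commit %H"` output into {sha: [files]}."""
    commits: Dict[str, List[str]] = {}
    current = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("commit "):
            current = line.split(maxsplit=1)[1]
            commits[current] = []
        elif current is not None:
            commits[current].append(line)
    return commits


sample_log = """\
commit a1
app/Services/PaymentGateway.php
app/Jobs/SyncInventory.php
app/Http/Controllers/WebhookController.php

commit b2
app/Services/PaymentGateway.php
app/Jobs/SyncInventory.php

commit c3
README.md
"""

commits = files_per_commit(sample_log)
avg_files = mean(len(files) for files in commits.values())

# Files that keep changing together are a coupling signal even when the code looks modular.
co_changes = Counter(
    pair for files in commits.values() for pair in combinations(sorted(files), 2)
)
print(avg_files)
print(co_changes.most_common(1))
```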
| Aspect | Details |
|---|---|
| Domain | 12 (Team Ownership & Knowledge Risk) |
| Owner | Staff Engineer Agent |
| Why it matters | Systems fail socially before they fail technically |
What to audit:
- Code authorship concentration
- Bus factor per module
- Tribal knowledge hotspots
- Documentation debt
- Review culture artifacts
- Knowledge silos
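A sketch of the authorship-concentration signal. It assumes (module, author) pairs have already been extracted from git history and uses one common definition of bus factor (the fewest authors who together made more than half of a module's commits); the sample data is hypothetical:

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def bus_factor(authors: Counter, threshold: float = 0.5) -> int:
    """Fewest authors who together made more than `threshold` of the commits."""
    total = sum(authors.values())
    covered = 0
    for count, (_, commits) in enumerate(authors.most_common(), start=1):
        covered += commits
        if covered / total > threshold:
            return count
    return len(authors)


def per_module(touches: Iterable[Tuple[str, str]]) -> Dict[str, Tuple[int, float]]:
    """Map module -> (bus factor, top author's share of commits)."""
    by_module: Dict[str, Counter] = defaultdict(Counter)
    for module, author in touches:
        by_module[module][author] += 1
    report = {}
    for module, authors in by_module.items():
        top_share = authors.most_common(1)[0][1] / sum(authors.values())
        report[module] = (bus_factor(authors), round(top_share, 2))
    return report


touches = (
    [("billing", "dana")] * 9
    + [("billing", "eli")]
    + [("search", "fay"), ("search", "gus"), ("search", "hal"), ("search", "fay")]
)
print(per_module(touches))  # {'billing': (1, 0.9), 'search': (2, 0.5)}
```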
| Aspect | Details |
|---|---|
| Domain | 13 |
| Owner | Principal Engineer Agent |
| Why it matters | Senior engineers think in predictions: "This will break in 3 months," "This will fail at 10x traffic," "This is safe unless compliance changes" |
What to synthesize:
- Likelihood x impact for all findings
- Time-to-failure predictions
- "What breaks first" analysis
- Risk acceptance vs remediation recommendations
- Cross-domain emergent risks
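A minimal sketch of likelihood x impact ranking on hypothetical 1-5 ordinal scales (the real scales are whatever the Principal calibrates). Ties are broken by impact so that rare but catastrophic risks are not buried:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Risk:
    finding: str
    likelihood: int  # hypothetical scale: 1 (rare) .. 5 (almost certain)
    impact: int      # hypothetical scale: 1 (negligible) .. 5 (catastrophic)

    @property
    def score(self) -> int:
        return self.likelihood * self.impact


def rank(risks: List[Risk]) -> List[Risk]:
    """Highest likelihood x impact first; on a tie, the higher-impact risk wins."""
    return sorted(risks, key=lambda r: (r.score, r.impact), reverse=True)


risks = [
    Risk("N+1 query on order list", likelihood=4, impact=2),
    Risk("Irreversible migration without backup", likelihood=2, impact=5),
    Risk("Unbounded retry on payment webhook", likelihood=3, impact=4),
]
for r in rank(risks):
    print(r.score, r.finding)
```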
AEGIS deploys a minimal-complete set of agent personas. Each exists because removing it would leave a blind spot. No overlap, no bloat.
Principal Engineer
Role: Epistemic governor of the entire audit
The Principal Engineer is NOT the best coder, the best architect, or the most knowledgeable person in every domain. They are the person accountable for the correctness of collective reasoning.
Core mental models:
- Thinks in systems of systems — "What behavior emerges from these interactions?"
- Separates facts, interpretations, and judgments (enforces the epistemic schema)
- Actively manages uncertainty — budgets it rather than eliminating it
- Thinks in time horizons (immediate, near-term, long-term, hypothetical)
- Optimizes organizational attention, not code
Responsibilities:
- Define audit scope and non-goals (Phase 0)
- Calibrate severity scales
- Resolve cross-domain conflicts
- Synthesize narrative
- Forecast future failure
- Translate findings for multiple audiences
- Own the final call
- Explicitly respond to every Devil's Advocate critique
Must never:
- Introduce new raw findings late
- Re-run tools
- Argue minutiae
They reason, arbitrate, and narrate.
Active in: Phase 0 (Context), Phase 5 (Synthesis)
Architect Agent
Domains: 1 (Architecture & System Design)
Evaluates structural patterns, boundaries, dependency direction, coupling, cohesion, and whether the architecture can support the system's actual requirements.
Data Engineer Agent
Domains: 2 (Data & State Integrity)
Evaluates data models, schema evolution, migrations, referential integrity, consistency guarantees, and state transition safety.
Security Engineer Agent
Domains: 4 (Security)
Evaluates AuthN/AuthZ, secrets handling, injection risks, supply chain, cryptography, trust boundaries, and attacker models.
Compliance Officer Agent
Domains: 5 (Compliance, Privacy & Governance)
Evaluates PII handling, data retention, encryption, audit logging, consent tracking, and regulatory exposure.
Senior Application Engineer Agent
Domains: 3 (Correctness & Logic), 9 (Maintainability & Code Health)
Evaluates logic correctness, error handling, edge cases, idempotency, code smells, naming, duplication, and intent clarity.
SRE Agent
Domains: 7 (Reliability & Resilience), 10 (Operability & DevEx)
Evaluates failure handling, retries, timeouts, circuit breakers, CI/CD, rollback safety, observability, and operational readiness.
Performance Engineer Agent
Domains: 8 (Scalability & Performance)
Evaluates algorithmic complexity, N+1 queries, caching, resource usage, async behavior, bottlenecks, and backpressure.
Test Engineer Agent
Domains: 6 (Testing Strategy & Verification)
Evaluates test pyramid, determinism, mutation resistance, contract tests, failure coverage, and test-as-documentation quality.
Staff Engineer Agent
Domains: 11 (Change Risk & Evolvability), 12 (Team Ownership & Knowledge Risk)
Evaluates change amplification, refactor safety, blast radius, bus factor, knowledge silos, and documentation debt. This is a synthesis-heavy role that draws on git history and social signals.
Reality Gap Analyst
Purpose: Detect divergence between "code as written" and "system as run"
This agent often finds: "The audit is technically correct but operationally wrong." Audits config files, environment-specific behavior, feature flags, deployment manifests, runtime overrides, and kill switches.
See Reality Gap Framework for full details.
Devil's Advocate Reviewer
Purpose: Hunt collective blind spots
The Devil's Advocate is NOT a contrarian, NOT "the negative one," NOT "the security pessimist." They are the agent that hunts collective blind spots.
Why this role exists: Every audit naturally develops consensus gravity, optimism bias, tool bias ("the scanner didn't find anything"), and authority bias. The Devil's Advocate exists to break coherence.
Core mental models:
- Assume the model is wrong: "If this report is wrong, how would it be wrong?"
- Attack confidence, not just conclusions — target high-confidence claims, clean narratives, and areas with little disagreement
- Seek asymmetric failure: "What failure would be disproportionately damaging relative to how little we're talking about it?"
- Use inversion relentlessly: "Under what conditions does this become unsafe?"
- Do NOT propose solutions — solutions dilute the critique
Outputs:
- Most confident claim I distrust
- Least discussed but highest-impact risk
- Assumptions that must hold for conclusions to be true
- Evidence that was overweighted
- Evidence that was ignored or unavailable
- Alternate narrative that fits the data
Critical rule: If the Devil's Advocate panel is empty, the system is broken.
AEGIS Core executes in six diagnostic phases (0-5). Order matters. Phases 6-8 (the Transform pipeline) extend diagnosis into remediation — see Transform Execution Phases.
Phase 0 (Context)
Agent: Principal Engineer
Establish intent, constraints, risk profile, and non-goals. Without this phase, all audits are shallow.
Inputs: Repository, documentation, README, deployment configs, any available business context.
Outputs: Audit scope document, threat model, risk profile, explicit non-goals.
Phase 1
Agents: Tool runners (light on reasoning)
Run automated tools. Gather signals across six orthogonal dimensions. No opinions yet. Just evidence.
Signal dimensions:
- Structure — SonarQube, Semgrep, dependency graphs
- Behavior — Profilers, async analysis
- History — Git churn, file age vs modification frequency, author concentration
- Dependencies — Trivy, Syft, Grype, OpenSSF Scorecard
- Policy posture — Checkov, Gitleaks, Semgrep
- Runtime contracts — OpenAPI/gRPC schema validation, backward compatibility
Outputs: Normalized signal data tagged with severity, confidence, blast radius, and domain relevance.
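A sketch of what "normalized signal data" might look like. The Signal fields mirror the tags listed above; the severity mapping (Trivy-style CRITICAL/HIGH/MEDIUM/LOW and Semgrep-style ERROR/WARNING/INFO labels) and the raw dict shape are assumptions. Unrecognized labels map to "unknown" rather than silently to "low":

```python
from dataclasses import dataclass
from typing import Any, Dict

# Hypothetical mapping of tool-native severity labels onto one shared scale.
SEVERITY_MAP = {
    "CRITICAL": "critical", "HIGH": "high", "MEDIUM": "medium", "LOW": "low",  # Trivy-style
    "ERROR": "high", "WARNING": "medium", "INFO": "low",                        # Semgrep-style
}


@dataclass(frozen=True)
class Signal:
    tool: str
    domain: int        # AEGIS domain number, 0-13
    severity: str      # critical | high | medium | low | unknown
    confidence: str    # trust in the signal itself, independent of severity
    blast_radius: str  # e.g. "function", "module", "system"
    message: str


def normalize(tool: str, raw: Dict[str, Any], domain: int,
              confidence: str, blast_radius: str) -> Signal:
    label = str(raw.get("severity", "")).upper()
    # Unrecognized labels stay "unknown" instead of being silently downgraded.
    severity = SEVERITY_MAP.get(label, "unknown")
    return Signal(tool, domain, severity, confidence, blast_radius, str(raw.get("message", "")))


sig = normalize("semgrep", {"severity": "ERROR", "message": "possible SQL injection"},
                domain=4, confidence="medium", blast_radius="module")
print(sig.severity)  # high
```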
Phase 2
Agents: Architect, Data Engineer, Security Engineer, Compliance Officer, Senior Application Engineer, SRE, Performance Engineer, Test Engineer
Each agent audits ONLY their assigned domains. Each receives the same Phase 1 evidence. Each produces independent findings using the formal epistemic schema.
Sessions can run in parallel. Each produces a structured findings file.
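A sketch of a structured findings file that keeps observation, interpretation, and judgment in separate fields. The field names, the finding ID, and the "no finding without evidence" rule are hypothetical; the authoritative structure is the one defined in The Formal Epistemic Schema:

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Finding:
    id: str
    domain: int
    observation: str     # what was seen; checkable by anyone with repo access
    interpretation: str  # what the agent believes the observation means
    judgment: str        # the severity call or recommendation
    confidence: str      # low | medium | high
    evidence: List[str] = field(default_factory=list)  # e.g. file:line refs, tool output IDs

    def validate(self) -> None:
        # Assumed rule for illustration: no finding without at least one piece of evidence.
        if not self.evidence:
            raise ValueError(f"{self.id}: observation has no supporting evidence")


def write_findings(findings: List[Finding]) -> str:
    for f in findings:
        f.validate()
    return json.dumps([asdict(f) for f in findings], indent=2)


doc = write_findings([
    Finding(
        id="REL-007",
        domain=7,
        observation="retry() in app/Jobs/SyncInventory.php has no attempt limit",
        interpretation="A slow upstream will be hit repeatedly during an outage",
        judgment="High severity: likely to amplify partial outages",
        confidence="medium",
        evidence=["app/Jobs/SyncInventory.php:112"],
    )
])
```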
Phase 3
Agents: Staff Engineer, Reality Gap Analyst
Synthesis-heavy roles that draw on Phase 2 findings + git history + configuration analysis.
The Staff Engineer evaluates change risk and ownership risk. The Reality Gap Analyst checks for divergence between code-as-written and system-as-run.
Phase 4
Agent: Devil's Advocate Reviewer
Goal: Invalidate conclusions, not agree. A fresh session whose only job is to challenge assumptions, attack confidence, and surface what was missed.
The Devil's Advocate reads ALL domain findings and produces their critique.
Phase 5 (Synthesis)
Agent: Principal Engineer
The Principal reads all domain findings + Devil's Advocate critique. For every disagreement, they must explicitly respond. Silence is not allowed.
Produces the final AEGIS report:
- Executive Risk Summary
- Architecture Narrative
- Findings by Domain (severity-ranked)
- Cross-Validation Notes (disagreements and resolutions)
- Remediation Roadmap
- Long-Term Structural Risks
- "What Would Break First at 10x Scale"
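The synthesis rule that every Devil's Advocate critique gets an explicit response ("silence is not allowed") is mechanically checkable. A sketch with hypothetical critique IDs; it also fails loudly on an empty Devil's Advocate panel:

```python
from typing import Dict, List


def unanswered(critiques: List[str], responses: Dict[str, str]) -> List[str]:
    """Return IDs of critiques that have no explicit, non-empty response from the Principal."""
    if not critiques:
        # An empty panel means the adversarial review did not really happen.
        raise ValueError("Devil's Advocate panel is empty")
    return [c for c in critiques if not responses.get(c, "").strip()]


critiques = ["DA-1", "DA-2", "DA-3"]
responses = {
    "DA-1": "Accepted: severity of REL-007 raised to high",
    "DA-2": "   ",  # whitespace is not a response
}
print(unanswered(critiques, responses))  # ['DA-2', 'DA-3']
```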
AEGIS produces three distinct output layers. Each has a different purpose, different mutability, and different consumer.
Layer A — Diagnostic Artifact (Truth Layer)
The audit itself. Immutable, reproducible, epistemically versioned. Phases 0-5 produce Layer A. It is the forensic record of what was found, how it was found, and how confident the system is.
Layer A is never mutated. Once a finding is produced, it exists permanently in the audit record. Subsequent analysis may reinterpret it, but the original observation stands.
Layer B — Remediation Knowledge (Instruction Layer)
Derived from Layer A. Framework-specific. Pattern-based. Educational. This is where playbooks live.
Layer B answers: "Given what was found, how should it be fixed?" But not generically — parametrically. Every remediation is expressed at four layers of specificity (see The Transformation Model) and carries both human-readable markdown and machine-consumable structured data.
Layer B is derived. It cannot exist without Layer A. It cannot contradict Layer A. If a finding changes, Layer B artifacts that reference it must be regenerated.
Layer C — Change Orchestration (Execution Layer)
Dependency-aware execution plans. Change graphs. Risk scoring. Verification gates. This is operational sequencing, not documentation.
Layer C answers: "In what order should fixes be applied, with what safety checks, and what happens if something goes wrong?"
Layer C is the PAUL integration point. AEGIS Transform produces PAUL-ready project artifacts — phased remediation plans with dependency ordering, risk-scored task definitions, and verification gates. AEGIS does not execute changes. PAUL does.
The Pipeline:
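Layer A (Diagnosis) → Layer B (Knowledge) → Layer C (Orchestration)

| Layer | Character | Produced by |
|---|---|---|
| Layer A (Diagnosis) | Immutable truth | Phases 0-5 |
| Layer B (Knowledge) | Derived instruction | Phases 6-7 |
| Layer C (Orchestration) | Operational planning | Phase 8 |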
The Controlled Evolution Engine
AEGIS Core is a diagnosis system. It finds problems with epistemic rigor and forensic-grade traceability. But diagnosis alone is incomplete. A report that says "you have 47 findings" without actionable, risk-scored, dependency-ordered remediation is a paper tiger.
AEGIS Transform is the second-order reasoning system that converts diagnostic findings into AI-consumable transformation artifacts — playbooks, remediation plans, guardrails, and PAUL-integrated execution plans.
Core Principle: Diagnosis is decentralized. Intervention is centralized.
In the diagnostic pipeline, 12 agents work independently across 14 domains. Each agent audits its own territory. Independence prevents groupthink.
In the Transform pipeline, 5 agents coordinate intervention. Remediation cannot be done in isolation — a fix to the authentication system affects security, architecture, testing, and deployment. Transform agents see the full picture and orchestrate change holistically.
What Transform Is Not:
- It is not bolt-on post-processing. It is a second-order reasoning system with its own agents, workflows, and risk modeling.
- It is not auto-remediation. Transform proposes changes; execution happens through PAUL, with a human approving every step.
- It is not generic advice. Transform produces parametric, framework-specific, project-contextualized remediation.
What Transform Consumes:
All Layer A outputs — findings, domain knowledge, confidence scores, disagreements, cross-validation notes. Transform agents have full visibility into the diagnostic record.
What Transform Produces:
- Layer B: Remediation knowledge (playbooks, patterns, guardrails, educational context)
- Layer C: Change orchestration (dependency graphs, risk scores, verification plans, PAUL projects)
Every Transform output is explicitly classified by intervention level. This is not optional. Intervention levels gate what the system is allowed to produce and how much confidence is required.
| Level | Name | Definition | Confidence Required |
|---|---|---|---|
| 1 | Suggesting | "Consider this pattern." Informational. No action implied. | Any |
| 2 | Planning | "Here's how to fix this." Structured plan. Human decides whether to act. | Medium+ |
| 3 | Authorizing | "This change is recommended with confidence X." Risk-scored, gate-checked. | High |
| 4 | Executing | "Apply this change." Only via PAUL with verification gates. Never auto-applied by AEGIS. | High + Low change risk |
Safety Principle: Default to the lowest intervention level that serves the user. Escalation requires evidence.
A finding with medium confidence gets Suggesting or Planning. Never Authorizing. A change with high blast radius stays at Planning even with high confidence. Intervention levels are a ratchet — evidence raises them, uncertainty lowers them.
Why This Matters:
Without explicit intervention levels, every AI code tool produces the same thing: confident-sounding suggestions with no accountability framework. AEGIS Transform distinguishes between "you might want to look at this" and "we are confident this change should be made, here is the risk assessment, and here is how to verify it worked."
Remediation knowledge is parametric, not generic. Generic advice ("use parameterized queries") is useless to a developer staring at a specific codebase. The 4-layer transformation model ensures every piece of remediation is grounded at every level of specificity.
Layer 1: The universal principle. Language-agnostic and framework-agnostic.
Example: "Unbounded retries are dangerous. They amplify load during partial outages and can cause cascading failures."
Layer 2: The principle applied to a specific framework.
Example: "In Laravel, use retryUntil() with a deadline. In Express, use the p-retry library with exponential backoff. In Spring Boot, use @Retryable with maxAttempts and backoff."
Layer 3: The implementation pattern in the project's language.
Example: "PHP retry with exponential backoff: $delay = min($baseDelay * (2 ** $attempt), $maxDelay)"
Layer 4: The specific files, functions, and call sites in this codebase.
Example: "Files app/Services/PaymentGateway.php (line 47), app/Jobs/SyncInventory.php (line 112), and app/Http/Controllers/WebhookController.php (line 89) — these 3 call sites use unbounded retry() loops. Each needs exponential backoff with a circuit breaker."
A playbook that only contains Layer 1 is an essay. A playbook that contains all four layers is a work order. The difference between "you should fix your retries" and "here are the 3 files, here is the pattern for your framework, and here is how to verify it works" is the difference between a blog post and engineering guidance.
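The PHP backoff formula above, delay = min(baseDelay * 2^attempt, maxDelay), sketched as a bounded retry helper in Python. The function name and parameters are hypothetical; sleep is injectable so the delay schedule can be checked without waiting:

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying transient errors with capped exponential backoff; never loops forever."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # bounded: surface the last error instead of retrying again
            sleep(min(base_delay * (2 ** attempt), max_delay))  # same formula as the PHP line
    raise RuntimeError("unreachable")


delays = []
calls = [0]


def flaky_gateway() -> str:
    calls[0] += 1
    if calls[0] < 4:
        raise TimeoutError("gateway timeout")
    return "ok"


result = retry_with_backoff(flaky_gateway, base_delay=1.0, max_delay=3.0, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0, 3.0]
```

Production code often adds random jitter to each delay so that many clients recovering at once do not retry in lockstep.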
Transform deploys 5 specialized agents. Each exists because removing it would leave a gap in the remediation pipeline.
Design Principle: Remediation must be centralized. Domain agents diagnose independently. Transform agents coordinate intervention.
Remediation Architect
Role: Translates diagnosis into structured change plans.
Consumes: All findings + domain knowledge + framework context.
Produces: Remediation playbooks at all 4 transformation layers. Change dependency graphs. Sequenced fix orders.
Why they exist: Someone has to synthesize 47 findings into "fix these 5 things in this order and the other 42 findings improve as side effects." The Remediation Architect identifies root causes, groups related findings, and sequences changes by dependency and impact.
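Sequencing by dependency is a topological sort. A minimal sketch using Python's standard graphlib module (Python 3.9+): each batch returned by get_ready() has no unfinished prerequisites, so it can form one remediation phase. The change IDs are hypothetical:

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def plan_phases(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Group changes into phases; every change lands after all of its prerequisites."""
    sorter = TopologicalSorter(depends_on)  # maps change -> changes it depends on
    sorter.prepare()  # raises graphlib.CycleError if the plan contains a cycle
    phases = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make output deterministic
        phases.append(ready)
        sorter.done(*ready)
    return phases


plan = {
    "extract-http-client": set(),
    "bounded-retries": {"extract-http-client"},
    "add-circuit-breaker": {"extract-http-client"},
    "retry-regression-tests": {"bounded-retries"},
}
for number, batch in enumerate(plan_phases(plan), start=1):
    print(f"Phase {number}: {batch}")
```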
Change Risk Modeler
Role: Scores blast radius, coupling, regression probability, and architectural tension for every proposed change.
Consumes: Change plan + codebase structure + git history + test coverage.
Produces: Per-change risk scores. Risk-adjusted priority ordering. Change risk assessment reports.
Why they exist: Every fix introduces new risk. A change that fixes a security vulnerability but breaks 14 integration tests is not a net improvement. The Change Risk Modeler ensures that remediation doesn't create more problems than it solves.
Pedagogy Agent
Role: Explains fixes for AI-assisted developers.
Consumes: Remediation plan + framework context + project patterns.
Produces: Educational context at all 4 transformation layers. Before/after examples. "Why this matters" explanations. Best-practice rationale.
Why they exist: A growing share of developers use AI assistants for code generation. They can implement a fix but may not understand why it is needed. Without pedagogical context, fixes get applied without understanding, and the same patterns recur. The Pedagogy Agent ensures that every fix teaches something.
Guardrail Generator
Role: Writes project rules for future AI usage.
Consumes: Patterns + findings + project conventions.
Produces: .claude/CLAUDE.md rules. .cursorrules files. Linter configurations. Pre-commit hooks. Custom Semgrep rules.
Why they exist: The highest-leverage output of an audit isn't a report — it's a set of rules that prevent the same problems from recurring. The Guardrail Generator translates audit findings into machine-enforceable constraints.
Execution Validator
Role: Defines verification plans: how to prove fixes work.
Consumes: Change plan + test infrastructure + deployment configuration.
Produces: Per-change verification steps. Expected outcomes. Rollback criteria. Test specifications.
Why they exist: A fix without a verification plan is faith-based engineering. The Execution Validator ensures that every change can be proven correct before it's considered complete.
AEGIS Transform executes in three phases (6-8), extending the Core diagnostic pipeline (0-5).
Phase 6
Agents: Remediation Architect, Pedagogy Agent
Input: Complete Layer A diagnostic record (all findings, domain knowledge, disagreement resolutions, confidence scores).
Process:
- Remediation Architect groups findings by root cause and dependency
- Remediation Architect produces playbooks at all 4 transformation layers
- Pedagogy Agent enriches playbooks with educational context, before/after examples, and best-practice rationale
- Each playbook is classified by intervention level
Output: Layer B remediation playbooks (human-readable markdown + machine-consumable YAML). Pattern library updates.
Phase 7
Agents: Change Risk Modeler, Guardrail Generator
Input: Phase 6 playbooks + codebase structure + git history + test coverage.
Process:
- Change Risk Modeler scores every proposed change across 4 dimensions (blast radius, coupling risk, regression probability, architectural tension)
- Changes exceeding risk thresholds are flagged for downgrade (Authorizing → Planning) or rejection
- Guardrail Generator produces project rules from pattern analysis
- Final risk-adjusted priority ordering is established
Output: Risk-scored change plan. Generated guardrail files. Risk assessment report.
Phase 8
Agents: Execution Validator
Input: Risk-scored change plan from Phase 7 + test infrastructure + deployment configuration.
Process:
- Execution Validator defines verification steps for every proposed change
- System generates PAUL-compatible project artifacts:
  - PROJECT.md: project definition with audit reference
  - ROADMAP.md: phased remediation plan with risk ordering
  - Phased plans with dependency sequencing, verification gates, and rollback criteria
- Risk scores and intervention levels are embedded in PAUL task definitions
Output: Layer C execution artifacts. Complete PAUL project ready for user's AI assistant to execute.
Critical: AEGIS Transform does NOT execute changes. Phase 8 produces a plan. The user's AI assistant, operating through PAUL, executes the plan with human oversight at every gate.
AEGIS is invoked through slash commands — guided wizard experiences that delegate to workflows, present options, and manage the full audit-to-remediation pipeline.
| Command | Purpose |
|---|---|
| /aegis:audit | Initiate a full diagnostic audit: guided wizard that configures scope, runs tools, orchestrates agents, and produces findings |
| /aegis:resume | Resume an interrupted audit from the last completed phase |
| /aegis:status | Show current audit position, phase progress, and next action |
| /aegis:report | Generate or regenerate the final diagnostic report from completed findings |
| Command | Purpose |
|---|---|
| /aegis:remediate | Generate remediation knowledge (Layer B) from diagnostic findings |
| /aegis:transform | Generate execution plans (Layer C) from remediation knowledge |
| /aegis:playbook | View or regenerate remediation playbooks for specific findings |
| /aegis:guardrails | Generate project rules (.claude/CLAUDE.md, linter configs) from audit findings |
All commands use a wizard UX pattern: numbered options, cancel/back at every decision point, and clear confirmation before executing phases that consume significant resources.
Remediation introduces new risk. This is the fundamental tension of automated code evolution: the system that finds problems must not create worse ones.
Technical debt introduced by automated changes is worse than manually introduced debt: it accumulates faster and with less understanding.
Every proposed change is scored across four dimensions:
Blast radius: how much breaks if the fix is wrong.
A change to a utility function called from 200 locations has extreme blast radius. A change to a leaf function called from one test has minimal blast radius. Blast radius is not severity — a low-severity change can have massive blast radius.
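One way to approximate blast radius is to count everything that transitively calls the changed function. A sketch over a hypothetical reverse call graph (function -> its direct callers), using breadth-first search:

```python
from collections import deque
from typing import Dict, List, Set


def blast_radius(callers: Dict[str, List[str]], changed: str) -> Set[str]:
    """Every function that directly or transitively calls `changed` (BFS over callers)."""
    affected: Set[str] = set()
    queue = deque([changed])
    while queue:
        fn = queue.popleft()
        for caller in callers.get(fn, []):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    affected.discard(changed)  # a recursive function is not its own blast radius
    return affected


# Hypothetical reverse call graph: function -> its direct callers.
callers = {
    "format_money": ["render_invoice", "render_receipt", "export_csv"],
    "render_invoice": ["send_invoice_email"],
    "render_receipt": ["checkout"],
    "leaf_helper": ["test_leaf_helper"],
}
print(len(blast_radius(callers, "format_money")))  # 5
print(len(blast_radius(callers, "leaf_helper")))   # 1
```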
Coupling risk: does the fix create new dependencies?
Moving from inline SQL to an ORM introduces a dependency on the ORM. Extracting a function into a shared utility creates coupling between previously independent modules. Coupling risk asks: "Does this fix make the system harder to change in the future?"
Regression probability: does the fix break existing behavior?
Measured by: test coverage of affected code paths, complexity of the change, number of implicit contracts that might be violated. A well-tested function with a simple change has low regression probability. An untested function with a complex refactor has high regression probability.
Architectural tension: does the fix fight the existing design?
Introducing a message queue into a synchronous request/response system creates architectural tension. The fix may be correct in isolation but inappropriate for the system as currently designed. Architectural tension asks: "Does this fix require changing the system's fundamental assumptions?"
Moving from Advisor to Architectural Actor requires a formal safety framework. AEGIS Core is an advisor — it reports findings. AEGIS Transform is an architectural actor — it proposes specific changes to codebases. The liability profile is fundamentally different.
Default to the lowest intervention level that serves the user.
When uncertain, suggest. Don't plan. When the plan is uncertain, don't authorize. When authorization is uncertain, don't execute. The cost of under-intervening (user applies a fix manually) is low. The cost of over-intervening (system proposes a change that breaks production) is catastrophic.
Do not generate remediation if finding confidence is below threshold.
| Intervention Level | Minimum Finding Confidence | Minimum Evidence Sources |
|---|---|---|
| Suggesting | Low | 1 |
| Planning | Medium | 2 |
| Authorizing | High | 3+ |
| Executing (via PAUL) | High | 3+ with cross-validation |
A finding with low confidence and a single evidence source produces, at most, a suggestion. Not a plan. Not a change. A suggestion.
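Read as code, the threshold table caps how far a finding may travel. The following Python sketch is illustrative only. It assumes confidence arrives as `low`/`medium`/`high` and cross-validation as a boolean. The names `Level` and `max_level` are hypothetical and are not part of the AEGIS specifications.

```python
from enum import IntEnum

class Level(IntEnum):
    """Intervention levels, ordered from least to most invasive (hypothetical encoding)."""
    NONE = 0          # no remediation is generated at all
    SUGGESTING = 1
    PLANNING = 2
    AUTHORIZING = 3
    EXECUTING = 4     # execution itself is always delegated to PAUL

CONFIDENCE_RANK = {"low": 1, "medium": 2, "high": 3}

def max_level(confidence: str, evidence_sources: int, cross_validated: bool = False) -> Level:
    """Return the highest intervention level the finding's evidence permits."""
    rank = CONFIDENCE_RANK[confidence]
    if rank >= 3 and evidence_sources >= 3 and cross_validated:
        return Level.EXECUTING
    if rank >= 3 and evidence_sources >= 3:
        return Level.AUTHORIZING
    if rank >= 2 and evidence_sources >= 2:
        return Level.PLANNING
    if rank >= 1 and evidence_sources >= 1:
        return Level.SUGGESTING
    return Level.NONE
```

Because `Level` is an `IntEnum`, ceilings produced by different checks can be combined with `min()`.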
Flag when change risk exceeds acceptable bounds.
If any change risk dimension (blast radius, coupling, regression probability, architectural tension) exceeds the "high" threshold, the system must:
- Flag the change as unsafe
- Downgrade intervention level to Suggesting (regardless of confidence)
- Explain why the change is risky
- Recommend human architectural review before proceeding
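One way to express the downgrade rule is the hedged Python sketch below. The `ChangeRisk`, `Decision`, and `gate` names are hypothetical. The sketch assumes each dimension is scored `low`/`medium`/`high` and treats a score of `high` as exceeding the bound.

```python
from dataclasses import dataclass, field
from typing import List

RISK_RANK = {"low": 1, "medium": 2, "high": 3}
LEVELS = ["suggesting", "planning", "authorizing", "executing"]

@dataclass
class ChangeRisk:
    blast_radius: str
    coupling: str
    regression: str
    architectural_tension: str

    def exceeded(self) -> List[str]:
        """Names of the dimensions scored at the 'high' threshold."""
        return [name for name, value in vars(self).items()
                if RISK_RANK[value] >= RISK_RANK["high"]]

@dataclass
class Decision:
    level: str
    unsafe: bool
    reasons: List[str] = field(default_factory=list)
    needs_human_review: bool = False

def gate(requested_level: str, risk: ChangeRisk) -> Decision:
    """Downgrade to 'suggesting' whenever any change-risk dimension is high."""
    if requested_level not in LEVELS:
        raise ValueError(f"unknown intervention level: {requested_level}")
    hot = risk.exceeded()
    if hot:
        # Flag as unsafe, downgrade regardless of confidence, explain, and ask for review.
        return Decision(
            level="suggesting",
            unsafe=True,
            reasons=[f"{name} risk is high" for name in hot],
            needs_human_review=True,
        )
    return Decision(level=requested_level, unsafe=False)
```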
AEGIS Transform NEVER applies changes. Ever.
Transform produces plans. PAUL executes plans. The user approves every execution step. There is no bypass, no override, no "trusted mode" that allows AEGIS to modify a codebase directly. This is a hard architectural boundary, not a configuration option.
AEGIS Transform refuses to generate remediation when:
- Finding confidence is insufficient for the requested intervention level
- Change risk exceeds acceptable bounds and no downgrade is possible
- The codebase lacks sufficient test coverage to verify proposed changes
- Multiple high-severity disagreements remain unresolved for the affected findings
- The remediation would require changes to systems outside the audit scope
Refusal is a feature. A system that always produces output, regardless of certainty, is not safe — it is reckless.
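The refusal conditions can be checked mechanically before any remediation is drafted. This Python sketch assumes each condition has already been evaluated upstream into a hypothetical `RemediationContext`. It collects every applicable reason rather than stopping at the first, so a refusal explains itself.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RemediationContext:
    confidence_sufficient: bool          # confidence meets the requested intervention level
    risk_exceeded: bool                  # some change-risk dimension is out of bounds
    downgrade_possible: bool             # can the request fall back to a lower level?
    test_coverage_adequate: bool         # can the proposed change be verified?
    unresolved_high_severity_disagreements: int
    touches_out_of_scope_systems: bool

def refusal_reasons(ctx: RemediationContext) -> List[str]:
    """Return every reason to refuse; an empty list means remediation may proceed."""
    reasons = []
    if not ctx.confidence_sufficient:
        reasons.append("finding confidence insufficient for requested level")
    if ctx.risk_exceeded and not ctx.downgrade_possible:
        reasons.append("change risk exceeds bounds and no downgrade is possible")
    if not ctx.test_coverage_adequate:
        reasons.append("insufficient test coverage to verify changes")
    if ctx.unresolved_high_severity_disagreements > 1:
        reasons.append("multiple high-severity disagreements unresolved")
    if ctx.touches_out_of_scope_systems:
        reasons.append("remediation reaches outside the audit scope")
    return reasons
```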
Layer C generates PAUL-compatible project artifacts. This is how AEGIS hands off to the user's AI assistant for execution.
A complete PAUL project, ready to execute:
| Artifact | Contents |
|---|---|
| `PROJECT.md` | Project definition referencing the AEGIS audit, codebase target, and remediation scope |
| `ROADMAP.md` | Phased remediation plan with dependency ordering and verification gates |
| Phase plans | Per-phase `PLAN.md` files with tasks, acceptance criteria, and risk metadata |
Changes are sequenced by dependency, not severity. A critical finding whose fix depends on a medium finding being fixed first cannot be scheduled ahead of that prerequisite. AEGIS Transform produces dependency-aware phases:
- Foundation changes — Shared utilities, configurations, and infrastructure that other fixes depend on
- High-impact, low-risk changes — Quick wins that reduce the finding count and validate the remediation pipeline
- High-impact, high-risk changes — Major structural changes with verification gates at every step
- Cleanup and hardening — Guardrail installation, documentation updates, and monitoring configuration
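Dependency-first sequencing is a topological sort. The sketch below uses Python's standard `graphlib.TopologicalSorter` (Python 3.9+). The change IDs are hypothetical, and using severity only as a tie-breaker within a batch is an assumption layered on the rule above.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}

def order_changes(depends_on: Dict[str, Set[str]], severity: Dict[str, str]) -> List[List[str]]:
    """Return batches of change IDs; every change lands after all of its prerequisites.

    depends_on maps a change ID to the IDs that must land first. Within a batch
    (changes whose prerequisites are all satisfied), severity only breaks ties.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready(), key=lambda c: (SEVERITY_RANK[severity[c]], c))
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

For example, with `{"fix-sqli": {"shared-db-helper"}}`, the critical `fix-sqli` change lands in a later batch than the medium-severity `shared-db-helper` it depends on. A circular dependency surfaces as `graphlib.CycleError` instead of a silently wrong order.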
Every PAUL phase includes verification criteria:
- Pre-change verification: Confirm the codebase is in the expected state before applying changes
- Post-change verification: Run tests, check behavior, validate that the fix actually fixes the finding
- Regression verification: Confirm that unrelated functionality still works
- Rollback criteria: Define when and how to undo a change that fails verification
Every PAUL task carries AEGIS risk metadata:
- Intervention level (suggesting/planning/authorizing/executing)
- Change risk scores (blast radius, coupling, regression, architectural tension)
- Finding confidence and evidence sources
- Verification plan reference
The user's AI assistant (operating through PAUL) can use this metadata to calibrate its own behavior — being more careful with high-risk tasks and more autonomous with low-risk ones.
AEGIS proposes. PAUL disposes.
AEGIS Transform produces the most informed, risk-scored, dependency-ordered remediation plan it can. Then it stops. The user's AI assistant, with human oversight through PAUL's checkpoint system, executes. The separation is absolute. AEGIS never crosses into execution territory.
The long-term value of AEGIS is not any single audit. It is the accumulation of verified architectural knowledge over time.
Anti-Pattern (found) → Correct Pattern (prescribed) → Verified Improvement (confirmed)
Every time an AEGIS audit finds a problem and Transform produces a remediation that is successfully applied and verified, the system accumulates:
- A confirmed anti-pattern — with real-world evidence, not textbook examples
- A proven remediation — with framework-specific implementation, not generic advice
- A verification methodology — how to prove the fix works, not just that it compiles
Over time, this accumulates into a proprietary failure-pattern corpus:
- Cross-project patterns — "This authentication anti-pattern appears in 40% of Laravel codebases we've audited"
- Framework-specific remediations — "This is how to fix unbounded retries in Express.js, validated across 12 projects"
- Risk calibration data — "Changes to authentication middleware have a 23% regression rate without dedicated test coverage"
- Verified migrations — "Moving from inline SQL to Eloquent ORM: these are the 7 things that break"
Most AI systems analyze code in isolation. Every audit starts from zero. AEGIS rehabilitates — and each rehabilitation makes the next one better.
The pattern corpus is the moat. The difference between a generic "you have SQL injection" and "we've fixed this exact pattern 47 times, here is the framework-specific playbook with a 94% first-attempt success rate" is the difference between a tool and an institution.
This is the intellectual core of AEGIS — the formal spine that prevents the system from becoming a pile of clever prompts.
All findings must be decomposed into epistemic layers. No agent is allowed to output a conclusion without explicitly passing through these layers.
Every finding is a structured object with seven layers:
What exists independently of interpretation.
- "Function `retryRequest()` retries on HTTP 500"
- "Config flag `ENABLE_LEGACY_FLOW=true` in prod"
- "Table `users` lacks a unique constraint on email"
Rules: No adjectives. No risk language. Tool outputs live here.
Why we believe the observation is real.
Fields:
- Source type (static analysis, config file, runtime metric, log, commit history)
- Tool or artifact name
- Location (file, line, environment)
- Freshness (static / historical / live)
Purpose: Prevents tool bias, hallucinated certainty, and overweighting single sources.
What this observation means in context.
- "Unbounded retries can amplify load during partial outages"
- "Legacy flow bypasses new validation logic"
- "Duplicate emails can be created under race conditions"
Rules: Must explain causal mechanism. No value judgment yet. Multiple interpretations allowed.
What must be true for the interpretation to hold.
- "Service receives concurrent requests"
- "Flag is enabled in all regions"
- "Email uniqueness is required by business logic"
This is where the Devil's Advocate attacks.
What could go wrong if the interpretation is correct.
Format: If [interpretation], then [failure mode], impacting [asset]
Example: "If unbounded retries amplify load during partial outages, then retry storms could overwhelm downstream services and cascade, impacting the availability of dependent services"
Severity modeling, not vibes.
Fields:
- Impact domain: security, data integrity, availability, compliance, velocity
- Impact magnitude: low, moderate, high, critical, existential
- Likelihood: rare, unlikely, possible, likely, frequent
- Time horizon: immediate, near-term, long-term, hypothetical
- Blast radius: localized, service-level, systemic, org-wide/legal/existential
What should be done about it.
Options: Must fix | Should fix | Accept risk | Monitor | Out of scope
Rules:
- Judgment is explicitly separated from facts
- Principal Engineer owns this layer
- Devil's Advocate may challenge but not decide
Each finding carries a confidence vector, not a scalar.
Confidence dimensions:
- Evidence diversity (1 tool vs many)
- Signal freshness (static vs runtime)
- Assumption fragility
- Historical precedent (known failure pattern?)
This enables statements like: "High-impact, low-confidence risk — validate before remediation." That's senior-level nuance.
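A confidence vector keeps its dimensions separate, so the synthesis can name which one is weak. In this Python sketch, the field encodings, the `is_low` rule, and the `recommendation` helper are hypothetical. AEGIS does not prescribe them.

```python
from dataclasses import dataclass

@dataclass
class ConfidenceVector:
    evidence_diversity: int       # number of independent tools/sources
    signal_freshness: str         # "static", "historical", or "live"
    assumption_fragility: str     # "low", "medium", or "high"
    historical_precedent: bool    # matches a known failure pattern?

    def is_low(self) -> bool:
        """Hypothetical rule: a single source, fragile assumptions, or static-only without precedent."""
        return (
            self.evidence_diversity < 2
            or self.assumption_fragility == "high"
            or (self.signal_freshness == "static" and not self.historical_precedent)
        )

def recommendation(impact_magnitude: str, conf: ConfidenceVector) -> str:
    high_impact = impact_magnitude in ("high", "critical", "existential")
    if high_impact and conf.is_low():
        return "High-impact, low-confidence risk: validate before remediation"
    if high_impact:
        return "High-impact, well-supported risk: prioritize remediation"
    return "Lower impact: schedule normally"
```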
These are non-negotiable:
- No risk statements without observations
- No judgments without risk modeling
- No confidence without evidence
- No synthesis without acknowledging uncertainty
- No "clean narrative" without Devil's Advocate response
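The non-negotiable rules lend themselves to a publish-time check over the seven layers. The `Finding` shape and the `violations` function below are a hypothetical Python sketch, not the AEGIS schema files. Applying the uncertainty and Devil's Advocate rules per finding rather than per synthesis is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Finding:
    # The seven epistemic layers, in order.
    observations: List[str] = field(default_factory=list)         # facts only, no risk language
    evidence: List[Dict[str, str]] = field(default_factory=list)  # source type, tool, location, freshness
    interpretations: List[str] = field(default_factory=list)      # causal mechanism, no value judgment
    assumptions: List[str] = field(default_factory=list)          # what must hold; Devil's Advocate target
    risks: List[str] = field(default_factory=list)                # "If X, then Y, impacting Z"
    impact: Optional[Dict[str, str]] = None                       # domain, magnitude, likelihood, horizon, blast radius
    judgment: Optional[str] = None                                # owned by the Principal Engineer
    # Cross-cutting metadata.
    confidence: Optional[Dict[str, object]] = None                # a vector, not a scalar
    uncertainties: List[str] = field(default_factory=list)
    devils_advocate_response: Optional[str] = None

def violations(f: Finding) -> List[str]:
    """Return every broken invariant; an empty list means the finding may be published."""
    problems = []
    if f.risks and not f.observations:
        problems.append("risk statement without observations")
    if f.judgment is not None and f.impact is None:
        problems.append("judgment without risk modeling")
    if f.confidence is not None and not f.evidence:
        problems.append("confidence without evidence")
    if not f.uncertainties:
        problems.append("uncertainty not acknowledged")
    if not f.devils_advocate_response:
        problems.append("no Devil's Advocate response")
    return problems
```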
Senior engineers usually disagree because of:
- Different threat models
- Different time horizons
- Different failure memories
- Different tolerance for risk
AEGIS surfaces these differences. It does not hide them.
Senior engineers do not vote. They reason under uncertainty using five canonical models:
Model 1 — Evidence Dominance: "Which claim is better supported by independent signals?" Weight: number of tools, signal diversity, historical precedent.
Model 2 — Risk Asymmetry: "If we're wrong, who pays and how badly?" Security and data risks often override performance disagreements.
Model 3 — Reversibility: "How hard is it to undo this decision?" Irreversible decisions get stricter scrutiny.
Model 4 — Time-to-Failure: "Which concern manifests first?" Near-term risks outrank theoretical long-term ones.
Model 5 — Blast Radius: "How much breaks if this is wrong?" Localized risk < systemic risk.
Every disagreement is a structured record:
Disagreement {
id
finding_id
epistemic_layer_disputed // interpretation, assumptions, impact, likelihood, judgment
agents_involved
positions[] // one per agent, each with claim + evidence + assumptions + confidence
root_cause // from closed set (see below)
resolution_model_applied
principal_response // REQUIRED - silence is not allowed
status // open, mitigated, accepted_risk, deferred, out_of_scope
}
Root cause taxonomy (closed set):
- Threat model mismatch
- Time horizon mismatch
- Evidence availability mismatch
- Risk tolerance mismatch
- Domain boundary mismatch
- Optimism vs pessimism bias
- Tool trust bias
The Principal must explicitly respond to every Devil's Advocate critique and every unresolved disagreement:
- Acknowledge it explicitly
- Choose a resolution model (evidence dominance, risk asymmetry, reversibility, time-to-failure, blast radius) or explicitly accept the risk
- Record rationale
- Assign follow-up if needed
Status states: Open | Mitigated by evidence | Accepted risk | Deferred pending validation | Out of scope
No silent disappearance allowed.
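The disagreement record and its no-silent-disappearance rule might be modeled as in the Python sketch below. The enum values mirror the closed root-cause set and the status list above. The `resolve` method, the `unaddressed` helper, and deriving `agents_involved` from the positions are illustrative choices, not part of the specification.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional

class RootCause(Enum):
    """Closed set: a disagreement must name exactly one of these."""
    THREAT_MODEL = "threat model mismatch"
    TIME_HORIZON = "time horizon mismatch"
    EVIDENCE_AVAILABILITY = "evidence availability mismatch"
    RISK_TOLERANCE = "risk tolerance mismatch"
    DOMAIN_BOUNDARY = "domain boundary mismatch"
    OPTIMISM_PESSIMISM = "optimism vs pessimism bias"
    TOOL_TRUST = "tool trust bias"

class Status(Enum):
    OPEN = "open"
    MITIGATED = "mitigated"
    ACCEPTED_RISK = "accepted_risk"
    DEFERRED = "deferred"
    OUT_OF_SCOPE = "out_of_scope"

@dataclass
class Position:
    agent: str
    claim: str
    evidence: List[str]
    assumptions: List[str]
    confidence: str

@dataclass
class Disagreement:
    id: str
    finding_id: str
    epistemic_layer_disputed: str   # interpretation, assumptions, impact, likelihood, judgment
    positions: List[Position]       # one per agent
    root_cause: RootCause
    resolution_model_applied: Optional[str] = None
    principal_response: Optional[str] = None
    status: Status = Status.OPEN

    @property
    def agents_involved(self) -> List[str]:
        return [p.agent for p in self.positions]

    def resolve(self, status: Status, model: str, rationale: str) -> None:
        """Leave OPEN only with an explicit, recorded Principal response."""
        if not rationale.strip():
            raise ValueError("principal_response is required; silence is not allowed")
        if status is Status.OPEN:
            raise ValueError("resolution must set a non-open status")
        self.status = status
        self.resolution_model_applied = model
        self.principal_response = rationale

def unaddressed(records: List[Disagreement]) -> List[str]:
    """IDs of disagreements the Principal has not yet responded to."""
    return [d.id for d in records if not d.principal_response]
```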
AEGIS explicitly forbids:
- Auto-resolving disagreements
- Averaging opinions
- Forcing consensus language
- Hiding disagreement in footnotes
- Treating Devil's Advocate as optional
These destroy trust.
Agreement is cheap. Disagreement is where risk hides.
The goal is not convergence. The goal is epistemic transparency — showing leadership where understanding is weakest relative to risk.
Axis 1 — Severity vs Disagreement Intensity
| | Low Disagreement | High Disagreement |
|---|---|---|
| High Severity | ACT | CRITICAL ATTENTION |
| Low Severity | Ignore | Investigate lightly |
Axis 2 — Confidence Asymmetry: shows when Agent A has high confidence and Agent B has low confidence, revealing overconfidence risk and evidence imbalance.
Axis 3 — Evidence Diversity: how many independent evidence sources support each side. Senior engineers often weight one runtime log plus a historical incident above three static tool outputs.
Axis 4 — Time Horizon: disagreements often aren't about if, but when. Visualizes immediate, near-term, long-term, and hypothetical horizons.
Axis 5 — Blast Radius: localized, service-level, systemic, org-wide/legal/existential. This axis often breaks ties.
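Axis 1 can be computed from the agents' own ratings. In this hypothetical Python sketch, disagreement intensity is the spread between the highest and lowest impact-magnitude rating, and a spread of two or more steps counts as high. Both choices are assumptions.

```python
from typing import List

MAGNITUDE = ["low", "moderate", "high", "critical", "existential"]

def disagreement_spread(ratings: List[str]) -> int:
    """Distance, in magnitude steps, between the most and least alarmed agent."""
    ranks = [MAGNITUDE.index(r) for r in ratings]
    return max(ranks) - min(ranks)

def quadrant(consensus_severity: str, agent_ratings: List[str], spread_threshold: int = 2) -> str:
    """Place a finding in the Severity vs Disagreement Intensity matrix."""
    severe = MAGNITUDE.index(consensus_severity) >= MAGNITUDE.index("high")
    contested = disagreement_spread(agent_ratings) >= spread_threshold
    return {
        (True, False): "ACT",
        (True, True): "CRITICAL ATTENTION",
        (False, False): "Ignore",
        (False, True): "Investigate lightly",
    }[(severe, contested)]
```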
1. Disagreement Heatmap (Executive View): rows = findings; color intensity = severity × disagreement × confidence gap. Tells leadership: "Where are we least sure about the most important things?"
2. Epistemic Stack Diff (Per Finding): shows which layers are agreed vs disputed:
   - Observation [agreed]
   - Interpretation [disputed]
   - Assumptions [disputed]
   - Impact [disputed]
   - Judgment [deferred]
3. Agent Position Overlay: for high-risk findings, plots each agent's position across impact, likelihood, and time horizon. Clusters vs outliers; outliers matter.
4. Devil's Advocate Focus Panel: a dedicated view of findings where the Devil's Advocate dissents, the confidence vs evidence delta, and the Principal's response. If this panel is empty, the system is broken.
5. Assumption Fragility Graph: shows which assumptions multiple conclusions rely on and which are weakest or unverified. Identifies single-point epistemic failures.
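The heatmap's color intensity is a plain product. The Python sketch below assumes every factor is normalized to 0..1 and takes the confidence gap as the difference between the most and least confident agent. `FindingSignal`, `heat`, and `heatmap_rows` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class FindingSignal:
    finding_id: str
    severity: float                      # 0..1, e.g. normalized impact magnitude
    disagreement: float                  # 0..1, e.g. share of agents dissenting from the majority
    agent_confidence: Dict[str, float]   # agent -> confidence in 0..1

def heat(f: FindingSignal) -> float:
    """Heat = severity x disagreement x confidence gap (max minus min agent confidence)."""
    values = list(f.agent_confidence.values())
    gap = max(values) - min(values) if values else 0.0
    return f.severity * f.disagreement * gap

def heatmap_rows(findings: List[FindingSignal]) -> List[Tuple[str, float]]:
    """Rows sorted hottest first: least sure about the most important things."""
    rows = [(f.finding_id, round(heat(f), 3)) for f in findings]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Using a product means a finding every agent agrees on stays cold no matter how severe it is. That is intentional: the view highlights where understanding is weakest relative to risk.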
Reality Gap = Difference between system behavior as inferred from code and behavior as it actually executes in production.
Most incidents live here. Most audits miss it entirely.
- Configuration — what's configured differently than the code assumes
- Environment — what's different between dev/staging/prod
- Runtime Control Planes — what external systems alter behavior
- Human Intervention — what manual processes bypass safeguards
RG-1: Configuration Drift
- Environment variables, YAML/JSON/HCL configs
- Default vs overridden values
- Secrets managers, region-specific configs
- Failure patterns: safe defaults overridden unsafely, flags enabled in prod only, test env != prod env
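A first pass at RG-1 is a diff of the effective configuration per environment. This Python sketch assumes configs have already been flattened into string key/value pairs. `prod_only_differences` is a hypothetical helper that flags keys whose production value differs from every other environment, such as a flag enabled in prod only.

```python
from typing import Dict, List

Config = Dict[str, str]

def effective_config(defaults: Config, overrides: Config) -> Config:
    """Code defaults with environment overrides applied on top."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged

def prod_only_differences(defaults: Config, envs: Dict[str, Config], prod: str = "prod") -> List[str]:
    """Keys whose effective production value differs from every non-production environment."""
    effective = {env: effective_config(defaults, o) for env, o in envs.items()}
    others = [env for env in effective if env != prod]
    if not others:
        raise ValueError("need at least one non-production environment to compare against")
    keys = set().union(*effective.values())
    return sorted(
        key for key in keys
        if all(effective[prod].get(key) != effective[e].get(key) for e in others)
    )
```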
RG-2: Feature Flags & Kill Switches
- Flag inventory, ownership, lifetimes
- Conditional code paths
- Failure patterns: permanent "temporary" flags, untested flag combinations, flag-dependent logic bypassing invariants
- Senior question: "What code runs only when things go wrong?"
RG-3: Deployment & Infrastructure Overlay
- Kubernetes manifests, Terraform/CloudFormation
- Sidecars, proxies, service meshes, init containers, CronJobs
- Failure patterns: resource limits different from assumptions, hidden retries in proxies, timeouts enforced outside app code
RG-4: Runtime Behavior vs Static Intent
- Logs vs code paths, metrics vs expectations
- Observability coverage gaps, disabled instrumentation
- Failure patterns: dead code paths that are actually live, code never exercised in tests but hot in prod, silent failure paths
RG-5: Operational Overrides & Human Actions
- Hotfix mechanisms, manual scripts, admin endpoints
- One-off migrations, emergency patches
- Failure patterns: undocumented operational workflows, scripts with production authority, manual fixes that bypass safeguards
- This is where tribal knowledge hides
- Assumed vs Actual Behavior Table
- Code Paths Active Only in Production
- Flags & Configs That Change Control Flow
- Invisible Dependencies (proxies, retries, meshes)
- Highest-Risk Mismatches
Reality Gap findings are encoded using the same 7-layer epistemic schema, so they can be cross-validated, disputed, and synthesized alongside findings from every other domain.
Each language/runtime has unique ways to fail that generic analysis will never catch. AEGIS detects the project's language(s) and applies ecosystem-specific failure pattern catalogs.
Core principle: Every runtime lies to you in a different way.
Hidden failures (JVM): GC pressure from object churn, thread pool exhaustion, blocking calls inside async/reactive flows, memory leaks via static references, classloader leaks, poor `equals`/`hashCode` implementations, overuse of `synchronized` vs fine-grained locks.
Audit questions: Are allocations proportional to request volume? Are thread pools bounded and observable? Are blocking I/O calls hidden in async paths? Are caches unbounded?
Hidden failures (Python): GIL-induced throughput collapse, async code that isn't actually async, silent exception swallowing, mutable default arguments, heavy reliance on global state, memory leaks via reference cycles, CPU-bound work in request threads.
Audit questions: Is concurrency real or illusory? Are asyncio boundaries respected? Is CPU work isolated? Are retries idempotent?
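Two of the Python hidden failures are easy to show directly. The functions below are illustrative and not taken from any audited codebase: a mutable default argument shared across calls, and an exception swallowed without a trace, each paired with a safer variant.

```python
import logging

def append_tag_buggy(tag, tags=[]):
    # The default list is created once, when the function is defined,
    # so every call that omits `tags` mutates the same shared list.
    tags.append(tag)
    return tags

def append_tag_fixed(tag, tags=None):
    # Use None as a sentinel and allocate a fresh list per call.
    if tags is None:
        tags = []
    tags.append(tag)
    return tags

def parse_port_swallowed(raw):
    try:
        return int(raw)
    except Exception:
        return None  # the failure vanishes: no log, no metric, and the caller can't tell why

def parse_port_surfaced(raw):
    try:
        return int(raw)
    except ValueError:
        # Narrow the exception type, record the failure, and let it propagate.
        logging.getLogger(__name__).warning("invalid port %r", raw)
        raise
```

Calling `append_tag_buggy("a")` and then `append_tag_buggy("b")` returns `["a", "b"]` on the second call because both calls share one list. The fixed version returns `["b"]`.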
Hidden failures (Node.js/TypeScript): Event loop blocking, unhandled promise rejections, inconsistent async error handling, memory leaks via closures, excessive JSON serialization, dependency bloat (supply chain risk), TypeScript `any` erosion.
Audit questions: Can one slow request stall all others? Are async errors centrally handled? Is type safety enforced or aspirational? Are libraries pinned and audited?
Hidden failures (Go): Goroutine leaks, context cancellation ignored, channel deadlocks, unbounded fan-out, hidden blocking syscalls, overuse of global state.
Audit questions: Are all goroutines bounded? Is context propagated everywhere? Are channels closed correctly? Is backpressure implemented?
Hidden failures (Rust): `unsafe` blocks without justification, overly complex lifetimes (maintainability risk), panic paths in prod, blocking calls in async runtimes, premature optimization.
Audit questions: Why is `unsafe` needed here? Are panics recoverable? Is the async runtime respected? Is complexity justified?
Hidden failures (databases): N+1 queries, lock contention, missing indexes, overloaded migrations, weak isolation assumptions.
Hidden failures (distributed systems): Clock skew, partial failures, retry storms, cascading timeouts, eventual consistency violations.
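Retry storms come from retries that are unbounded or synchronized across clients. The sketch below shows one common mitigation, capped exponential backoff with full jitter. `RetryableError` and `call_with_retries` are hypothetical names, and the sketch assumes the wrapped operation is idempotent.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

class RetryableError(Exception):
    """Hypothetical marker for failures that are safe to retry (idempotent operations only)."""

def call_with_retries(
    op: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Bounded retries with capped exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return op()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # bounded: give up instead of retrying forever
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))  # jitter spreads retries from many clients apart
    raise AssertionError("unreachable: the loop always returns or raises")
```

The `sleep` parameter is injectable so tests can record delays instead of waiting.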
| Tool | What It Does | Domains Served | Cost |
|---|---|---|---|
| SonarQube | Code smells, bugs, maintainability, duplication, complexity analysis | 1, 3, 6, 9 | Free (Community Edition) |
| Semgrep | Security-focused SAST — XSS, SQL injection, IDOR, hardcoded secrets, business logic vulnerabilities. 20,000+ rules, 30+ languages, 10-second median scan | 1, 3, 4, 5, 6, 9 | Free (OSS) / Paid (Pro) |
| Trivy | All-in-one security scanner — OS packages, app dependencies, IaC files, license compliance | 4, 5 | Free |
| Gitleaks | Secrets detection — scans git history for API keys, passwords, tokens. 160+ secret patterns | 4, 5 | Free |
| Tool | What It Does | Domains Served | Cost |
|---|---|---|---|
| Checkov | IaC security scanner — Terraform, CloudFormation, K8s, Helm, Dockerfiles. 3,000+ policies covering CIS benchmarks | 4, 5 | Free |
| Syft | SBOM generation — complete package inventory across all ecosystems (containers, filesystems, archives) | 4, 5 | Free |
| Grype | Vulnerability scanning — matches SBOM inventory against CVE databases. Paired with Syft for full supply chain analysis | 4, 5 | Free |
| Git History Miner | Git log mining — file churn rates, author concentration, change coupling, file age vs modification frequency | 11, 12 | Free (built-in) |
| Tool | What It Does | Domains Served | Cost |
|---|---|---|---|
| CodeScene | Hotspot analysis, code churn + complexity correlation, author concentration, change coupling, knowledge distribution, CodeHealth score (25+ factors) | 1, 9, 11, 12 | ~EUR 18/mo/author |
| Tool | What It Does | Domains Served | Cost |
|---|---|---|---|
| OpenSSF Scorecard | Scores open source projects 0-10 on security heuristics (branch protection, dependency pinning, CI tests, vulnerability disclosure). Scans top 1M projects weekly | 4, 12 | Free |
| depcruise / Madge | JavaScript/TypeScript dependency graph visualization and validation | 1 | Free |
| Language linters | ESLint (JS/TS), Pylint (Python), RuboCop (Ruby), Clippy (Rust), staticcheck (Go), SpotBugs (JVM) | 3, 9 | Free |
| Tool | What It Does | Notes |
|---|---|---|
| Structure101 / Lattix | Architecture visualization and dependency management | Enterprise-licensed, expensive. Claude can do much of this analysis directly for codebases under 100k LOC |
| CodeClimate | Automated code review for maintainability | Overlaps significantly with SonarQube |
| Snyk | Full platform (SAST, SCA, container, IaC) | Enterprise SaaS. Trivy covers most use cases for free |
Signal Normalization Layer: tools speak different languages. The signal schema (already specified) converts all findings into a common form: severity, confidence, blast radius, and domain relevance.
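A normalizer can be built as one adapter per tool, each emitting a shared record. This Python sketch reads Semgrep's `--json` report (`results[].check_id`, `path`, `start.line`, `extra.severity`) and a Gitleaks JSON report (`RuleID`, `File`, `StartLine`); verify those field names against your installed versions. The `Signal` shape, the severity mapping, the confidence and blast-radius defaults, and copying domain relevance from the Domains Served column are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Signal:
    tool: str
    rule: str
    location: str              # "path:line"
    severity: str              # normalized: low / moderate / high / critical
    confidence: str            # assumed per-tool prior, refined later by correlation
    blast_radius: str          # localized / service-level / systemic / org-wide
    domains: Tuple[int, ...]   # coarse default: the tool's Domains Served

# Semgrep rule severities; anything unrecognized is treated as "moderate" pending review.
SEMGREP_SEVERITY = {"ERROR": "high", "WARNING": "moderate", "INFO": "low"}

def from_semgrep(report: dict) -> List[Signal]:
    """Normalize a Semgrep --json report."""
    signals = []
    for r in report.get("results", []):
        raw = r.get("extra", {}).get("severity", "")
        signals.append(Signal(
            tool="semgrep",
            rule=r["check_id"],
            location=f'{r["path"]}:{r["start"]["line"]}',
            severity=SEMGREP_SEVERITY.get(raw, "moderate"),
            confidence="medium",
            blast_radius="localized",
            domains=(1, 3, 4, 5, 6, 9),
        ))
    return signals

def from_gitleaks(report: list) -> List[Signal]:
    """Normalize a Gitleaks JSON report (a list of leak objects)."""
    return [
        Signal(
            tool="gitleaks",
            rule=leak["RuleID"],
            location=f'{leak["File"]}:{leak["StartLine"]}',
            severity="critical",       # assumption: any committed secret is critical
            confidence="high",
            blast_radius="systemic",   # assumption: a leaked credential reaches beyond one file
            domains=(4, 5),
        )
        for leak in report
    ]
```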
Cross-Signal Correlation Engine. This is where the system becomes exceptional:
- High churn + low tests = change risk
- Async code + blocking calls = latent perf bug
- PII fields + logs = compliance risk
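The correlation rules above can be written as data, where each rule fires only when both of its signals appear on the same file. The per-file fact keys (`churn`, `test_coverage`, `is_async`, `blocking_calls`, `pii_fields`, `log_statements`) and the thresholds in this Python sketch are hypothetical. Real values would come from the normalized signals.

```python
from typing import Callable, Dict, List, Tuple

Facts = Dict[str, object]

# Each rule: (risk label, explanation, predicate over per-file facts).
# Thresholds (20 changes, 30% coverage) are illustrative assumptions.
RULES: List[Tuple[str, str, Callable[[Facts], bool]]] = [
    ("change risk", "high churn + low tests",
     lambda f: f.get("churn", 0) >= 20 and f.get("test_coverage", 1.0) < 0.3),
    ("latent perf bug", "async code + blocking calls",
     lambda f: bool(f.get("is_async")) and f.get("blocking_calls", 0) > 0),
    ("compliance risk", "PII fields + logs",
     lambda f: bool(f.get("pii_fields")) and f.get("log_statements", 0) > 0),
]

def correlate(files: Dict[str, Facts]) -> List[Tuple[str, str, str]]:
    """Return (path, risk, reason) for every rule whose combined signals all fire."""
    return [
        (path, risk, reason)
        for path, facts in sorted(files.items())
        for risk, reason, predicate in RULES
        if predicate(facts)
    ]
```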
AEGIS output spans three layers (see The Three Output Layers). The Core diagnostic pipeline (Phases 0-5) produces Layer A. The Transform pipeline (Phases 6-8) produces Layers B and C.
Every AEGIS Core run produces:
- Executive Risk Summary — One-page overview for leadership
- Architecture Narrative — How the system is built and why
- Findings by Domain (Severity-Ranked) — All 14 domains, each with epistemic-schema-structured findings
- Cross-Validation Notes — All disagreements, their root causes, and the Principal's resolutions
- Remediation Roadmap — Prioritized action plan
- Long-Term Structural Risks — What degrades over time
- "What Would Break First at 10x Scale" — Predictive failure analysis
When the Transform pipeline runs, it additionally produces:
- Remediation Playbooks — Per-finding playbooks at all 4 transformation layers (abstract → framework → language → project)
- Best-Practice Patterns — Correct patterns paired with detected anti-patterns
- Educational Context — Before/after examples and rationale for AI-assisted developers
- Generated Guardrails — Project rules (`.claude/CLAUDE.md`, `.cursorrules`, linter configs)
The execution planning phase produces:
- Change Dependency Graph — Dependency-ordered sequence of all proposed changes
- Risk Assessment Report — Per-change scores across blast radius, coupling, regression, and architectural tension
- Verification Plan — Per-change verification steps with expected outcomes and rollback criteria
- PAUL Project — Complete, ready-to-execute PAUL project with phased plans, risk metadata, and verification gates
Phase 1 gathers evidence across six orthogonal dimensions:
- Structure — Static code analysis, architecture patterns, dependency graphs
- Behavior — Performance profiles, async analysis, flamegraphs
- History — Git churn, file age, author concentration, bug density over time
- Dependencies — Vulnerability scans, SBOM, supply chain health scores
- Policy Posture — IaC compliance, secrets exposure, governance policies
- Runtime Contracts — OpenAPI/gRPC schemas, backward compatibility checks
AEGIS v0.1.0 is a validated specification set — 90 markdown files across 8 component types that define the complete multi-agent audit system.
| Component Type | Count | Location |
|---|---|---|
| Domains | 14 | src/domains/ |
| Schemas | 9 (5 shared + 4 transform) | src/schemas/ + src/transform/schemas/ |
| Rules | 5 (3 shared + 2 transform) | src/rules/ + src/transform/rules/ |
| Tools | 8 | src/tools/ |
| Personas | 17 (12 core + 5 transform) | src/core/personas/ + src/transform/personas/ |
| Agents | 17 (12 core + 5 transform) | src/core/agents/ + src/transform/agents/ |
| Workflows | 12 (8 core + 4 transform) | src/core/workflows/ + src/transform/workflows/ |
| Commands | 8 (4 core + 4 transform) | commands/ |
| Total | 90 files | ~15,000 lines |
The specifications are the blueprints. They define every agent's identity, every domain's knowledge, every schema's contract, every tool's integration, every workflow's orchestration, and every command's UX. All cross-references have been validated (310 references, 0 broken). All files conform to their component type conventions. A version-lock manifest (SHA-256 content hashes) provides traceability for reproducible audit compositions.
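A version-lock manifest of this kind can be sketched with Python's standard `hashlib`. The function names and the flat path-to-digest layout are assumptions, not the AEGIS manifest format. Re-hashing the specification tree and comparing it with the locked manifest shows exactly which files changed between two audit compositions.

```python
import hashlib
from pathlib import Path
from typing import Dict

def build_manifest(root: Path) -> Dict[str, str]:
    """Map each markdown file (relative POSIX path) to the SHA-256 of its bytes."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*.md"))
    }

def changed_files(locked: Dict[str, str], current: Dict[str, str]) -> Dict[str, str]:
    """Classify every path that was added, removed, or modified since the lock."""
    diff = {}
    for path in sorted(set(locked) | set(current)):
        if path not in current:
            diff[path] = "removed"
        elif path not in locked:
            diff[path] = "added"
        elif locked[path] != current[path]:
            diff[path] = "modified"
    return diff
```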
v0.1 does not include runtime execution — actually orchestrating Claude Code sessions to run audits on real codebases. That requires a session orchestration layer, tool execution runtime, artifact persistence layer, and report generation engine. The specifications tell the runtime what to build. The runtime is the next milestone.
v0.2 makes AEGIS installable and runnable as a Claude Code extension. The framework installs to ~/.claude/aegis/ and commands install to ~/.claude/commands/aegis/, mirroring the PAUL framework installation pattern.
~/.claude/aegis/ # Framework files (installed)
├── core/
│ ├── agents/ # 12 Core agent manifests
│ ├── personas/ # 12 Core persona specs
│ └── workflows/ # 8 Core orchestration workflows
├── transform/
│ ├── agents/ # 5 Transform agent manifests
│ ├── personas/ # 5 Transform persona specs
│ ├── workflows/ # 4 Transform workflows
│ ├── schemas/ # 4 Transform schemas
│ └── rules/ # 2 Transform rules
├── domains/ # 14 audit domain knowledge modules
├── schemas/ # 5 shared schemas
├── rules/ # 3 shared rules
└── tools/ # 8 tool adapter specifications
~/.claude/commands/aegis/ # Slash commands (installed)
├── audit.md # /aegis:audit
├── resume.md # /aegis:resume
├── status.md # /aegis:status
├── report.md # /aegis:report
├── transform.md # /aegis:transform
├── remediate.md # /aegis:remediate
├── playbook.md # /aegis:playbook
└── guardrails.md # /aegis:guardrails
Prerequisites: Claude Code installed (~/.claude/ directory must exist)
One-command install (recommended):
`curl -sSL https://raw.githubusercontent.com/ChristopherKahler/aegis/main/install.sh | bash`
Or clone and install locally:
`git clone https://github.com/ChristopherKahler/aegis.git`
`cd aegis`
`bash install.sh`
The installer:
- Copies 82 framework files to `~/.claude/aegis/`
- Installs 8 slash commands to `~/.claude/commands/aegis/`
- Walks you through installing each OSS analysis tool (Y/N per tool)
- Verifies each installed tool works
- Auto-skips tools already installed on re-run
Tool installation methods:
- Python tools (Semgrep, Checkov): Installed via venv to `~/.local/share/aegis/venvs/` (no sudo, PEP 668 compatible)
- Binary tools (Trivy, Gitleaks, Syft, Grype): Downloaded via curl to `~/.local/bin/`
- SonarQube: Guided setup via Docker (local server) or SonarQube Cloud, with smart detection of existing Docker installations
- Git history: Always available (uses built-in git commands)
After installation, run /aegis:audit in Claude Code to start your first audit.
For a complete walkthrough, see Getting Started.
| Phase | Name | Status |
|---|---|---|
| 9 | Command Conversion | Complete — commands relocated, @ references rewritten, allowed-tools added |
| 10 | Install System | Complete — interactive installer with dual-mode (local + `curl \| bash`), venv-based Python tools, smart SonarQube detection |
| 11 | Project Init & Validation | Complete — /aegis:init + /aegis:validate commands, .aegis/ project setup |
| 12 | Multi-Session UX | Complete — phase checkpoints (continue/pause/abort), session tracking, estimated remaining work |
| 13 | Getting Started | Complete — Getting Started guide, command reference |