Feedback on your operating-production-services skill #24

@RichardHightower

Found your operating-production-services skill while browsing the registry—the way you've structured the progressive disclosure for such a dense topic (97/100 for a reason) makes me curious how you'd handle even more edge cases around observability and incident response.

The TL;DR

You're at 97/100, solidly in A-grade territory. This is based on Anthropic's skill best practices rubric. Your strongest area is Writing Style (10/10)—the skill reads like documentation written by someone who actually runs production systems, not a marketing pamphlet. Weakest spot is Spec Compliance (12/15), mostly because you're leaving discoverability points on the table with trigger phrases.

What's Working Well

  • Blameless postmortem framework - The 5 Whys template and postmortem meeting checklist give Claude concrete structure for handling incidents. That's the kind of thing teams actually need.
  • Token economy is chef's kiss - slo-alerting.md delegates heavy technical details while SKILL.md stays lean. You're not dumping a 200-line reference file on someone; you're layering it thoughtfully.
  • Practical burn rate guidance - The multi-window alerting patterns with specific Prometheus queries and Grafana dashboard structure mean Claude can actually implement this, not just read philosophy.
  • Clear scope boundaries - Your description explicitly calls out SLO alerting and postmortems while noting what you don't cover (deployment strategies, team structure). That's rare and helpful.
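For anyone reading this thread without the skill open, here is a minimal Python sketch of the multi-window burn rate idea praised above. It is not taken from slo-alerting.md: the names `BurnWindow`, `burn_rate`, and `should_page` are hypothetical, and the 14.4 threshold with 1h/5m windows is the commonly cited fast-burn pairing from the Google SRE Workbook, assumed here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnWindow:
    """Observed error ratios (errors / requests) over two lookback windows."""
    long_error_ratio: float   # e.g. measured over 1h (assumed window)
    short_error_ratio: float  # e.g. measured over 5m (assumed window)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def should_page(window: BurnWindow, slo_target: float, threshold: float = 14.4) -> bool:
    """Fire only when BOTH windows exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert resets quickly after recovery.
    """
    return (
        burn_rate(window.long_error_ratio, slo_target) > threshold
        and burn_rate(window.short_error_ratio, slo_target) > threshold
    )
```

With a 99.9% SLO, a sustained 2% error ratio in both windows would page, while a 2% ratio in the long window paired with a recovered short window would not.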

The Big One

slo-alerting.md (189 lines) is missing a table of contents. This hurts your navigation score because at 100+ lines, readers need an anchor point. Right now someone has to scroll through Prometheus rules, Grafana templates, and example YAMLs without knowing what's coming.

Add this at the top:

## Contents
- [Prometheus Recording Rules](#prometheus-recording-rules)
- [Multi-Window Burn Rate Alerts](#multi-window-burn-rate-alerts)
- [Burn Rate Reference](#burn-rate-reference)
- [Grafana Dashboard](#grafana-dashboard)
- [SLO Definition Template](#slo-definition-template)
- [Common Mistakes](#common-mistakes)

Impact: +1 point to PDA (gets you to 28/30).
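If you would rather generate the contents block than maintain it by hand, something like the following sketch could work. It assumes the sections in slo-alerting.md are level-2 (`##`) headings and only approximates GitHub's anchor rule (lowercase, punctuation dropped, spaces to hyphens); `github_anchor` and `build_toc` are hypothetical helper names, not part of any existing tool.

```python
import re

FENCE = "`" * 3  # fenced-code delimiter, built this way to keep the example readable


def github_anchor(heading: str) -> str:
    """Approximate GitHub's heading anchor: lowercase, drop punctuation, spaces to hyphens."""
    slug = heading.strip().lower()
    slug = re.sub(r"[^\w\- ]", "", slug)
    return slug.replace(" ", "-")


def build_toc(markdown: str, level: int = 2) -> str:
    """Return a '## Contents' block linking every heading of the given level."""
    prefix = "#" * level + " "
    lines = ["## Contents"]
    in_fence = False
    for line in markdown.splitlines():
        # Skip anything inside fenced code so YAML comments are not mistaken for headings.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence or not line.startswith(prefix):
            continue
        title = line[len(prefix):].strip()
        if title == "Contents":
            continue  # do not list the table of contents in itself
        lines.append(f"- [{title}](#{github_anchor(title)})")
    return "\n".join(lines)
```

Running `build_toc` over the file contents and pasting the result at the top should produce a list in the same shape as the one suggested above, though the anchors are worth checking against the rendered page.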

Other Things Worth Fixing

  1. Expand trigger phrases in your frontmatter description - You're only hitting 1-2 right now. Add "error budget", "incident response", and "reliability metrics" to catch more discovery queries. This accounts for the 3 points you're missing on Spec Compliance, so it's an easy recovery.

  2. Add one more example template - You've got postmortem templates and SLO YAML. A quick Alertmanager config snippet showing how to route burn rate alerts would give Claude another angle to work from.

  3. Name the reference file explicitly in SKILL.md - slo-alerting.md is good, but the pointer to it is generic. A line in SKILL.md like "See references/slo-alerting.md for Prometheus query patterns and Grafana dashboard templates" would make the connection explicit.

Quick Wins

  • Add TOC to slo-alerting.md → +1 point
  • Expand trigger phrases (error budget, incident response, reliability) → +2-3 points
  • One more config example (Alertmanager routing) → +0-1 point

These three things could realistically push you to 99-100.


Check out your skill here: [SkillzWave.ai](https://skillzwave.ai) | [SpillWave](https://spillwave.com). We have an agentic skill installer that installs skills in 14+ coding agent platforms. Check out this guide on how to improve your agentic skills.
