Update azure-diagnostics Kubernetes references and snapshots #1146
base: main
Changes from 2 commits
a018b1f
428ee22
320821c
ec403cf
7e26bc9
@@ -0,0 +1,103 @@

# Azure Kubernetes Service (AKS) Troubleshooting

> **AUTHORITATIVE GUIDANCE - MANDATORY COMPLIANCE**
>
> This document is the **official source** for debugging and troubleshooting Azure Kubernetes Service (AKS) production issues. Follow these instructions to diagnose and resolve common AKS problems systematically.

## Overview

AKS troubleshooting covers pod failures, node issues, networking problems, and cluster-level failures. This guide provides systematic diagnosis flows and remediation steps for the most common issues.

## Quick Diagnosis Flow

1. **Identify symptoms** - What's failing? (Pods, nodes, networking, services?)
2. **Check cluster health** - Is AKS control plane healthy?
3. **Review events and logs** - What do Kubernetes events show?
4. **Isolate the issue** - Pod-level, node-level, or cluster-level?
5. **Apply targeted fixes** - Use the appropriate troubleshooting section

## Troubleshooting Sections

Reviewer comment: Can you add the reference files directly in this section - which one aligns with which troubleshooting section?

### Pod Failures & Application Issues

- CrashLoopBackOff, ImagePullBackOff, Pending pods
- Readiness/liveness probe failures
- Resource constraints (CPU/memory limits)

### Node & Cluster Issues
- Node NotReady conditions
- Autoscaling failures
- Resource pressure and capacity planning
- Upgrade problems

### Networking Problems
- Service unreachable/connection refused
- DNS resolution failures
- Load balancer issues
- Ingress routing failures
- Network policy blocking

## References

- [Networking Troubleshooting](networking.md)
- [Node & Cluster Troubleshooting](node-issues.md)

## Common Diagnostic Commands

```bash
# Cluster overview
kubectl get nodes -o wide
kubectl get pods -A -o wide
kubectl get events -A --sort-by='.lastTimestamp'

# Pod diagnostics
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous

# Node diagnostics
kubectl describe node <node-name>
kubectl get pods -n kube-system -o wide

# Networking diagnostics
kubectl get svc -A
kubectl get endpoints -A
kubectl get networkpolicy -A
```

## AKS-Specific Tools

### Azure CLI Diagnostics
```bash
# Check cluster status
az aks show -g <rg> -n <cluster> --query "provisioningState"

# Get cluster credentials
az aks get-credentials -g <rg> -n <cluster>

# View node pools
az aks nodepool list -g <rg> --cluster-name <cluster> -o table
```

### AppLens (MCP) for AKS
For AI-powered diagnostics:
```
mcp_azure_mcp_applens
  intent: "diagnose AKS cluster issues"
  command: "diagnose"
  parameters:
    resourceId: "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.ContainerService/managedClusters/<cluster>"
```

## Best Practices

1. **Start with kubectl get/describe** - Always check basic status first
2. **Check events** - `kubectl get events -A` reveals recent issues
3. **Use systematic isolation** - Pod → Node → Cluster → Network
4. **Document changes** - Note what you tried and what worked
5. **Escalate when needed** - For control plane issues, contact Azure support

## Related Skills

- **azure-diagnostics** - General Azure resource troubleshooting
- **azure-deploy** - Deployment and configuration issues
- **azure-observability** - Monitoring and logging setup

Suggested change: the file as submitted ends with a stray closing code fence (three backticks) after the last list item. The suggestion removes it so that the file ends with:

- **azure-observability** - Monitoring and logging setup

@@ -0,0 +1,153 @@

# Networking Troubleshooting

> For CNI-specific issues (IP exhaustion, Azure CNI Overlay, eBPF/Cilium, egress/UDR, private cluster egress), see `references/networking-cni.md`.

Suggested change, replacing the line above:

> For CNI-specific issues (IP exhaustion, Azure CNI Overlay, eBPF/Cilium, egress/UDR, private cluster egress), see [networking CNI troubleshooting](./networking-cni.md).

Copilot
AI
Mar 5, 2026
The examples use an unpinned image (curlimages/curl) and a very old BusyBox tag (busybox:1.28). For safer, reproducible troubleshooting commands, pin to a specific, current tag (and consider a more security-focused debug image) to reduce the risk of pulling vulnerable or behavior-changing images.
Copilot
AI
Mar 5, 2026
Same table formatting issue here: the double leading pipe (||) creates an extra empty column. Use single | for proper GitHub Markdown table rendering.
Copilot
AI
Mar 5, 2026
The examples use an unpinned image (curlimages/curl) and a very old BusyBox tag (busybox:1.28). For safer, reproducible troubleshooting commands, pin to a specific, current tag (and consider a more security-focused debug image) to reduce the risk of pulling vulnerable or behavior-changing images.
Copilot
AI
Mar 5, 2026
This guidance is technically incorrect for Kubernetes Ingress `pathType: Prefix` semantics: `/api` should match `/api` as well as `/api/...` (and not match `/apix`). Please correct the statement so users don’t apply unnecessary or incorrect ingress rules.
Suggested change, before:

| 404 for some paths | Trailing slash mismatch | `Prefix /api` matches `/api/foo` not `/api` - add both |

Suggested change, after:

| 404 for some paths | Path / pathType mismatch | With `pathType: Prefix`, `/api` matches `/api` and `/api/...` but not `/apix`; with `Exact`, `/api` and `/api/` are different |

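
To make the `pathType: Prefix` rule from this comment concrete, here is a minimal shell sketch of element-wise prefix matching. `prefix_matches` is a hypothetical helper written only for this illustration; it is not part of kubectl or any ingress controller, and it only approximates what a real controller does.

```bash
#!/usr/bin/env bash
# Hypothetical helper for illustration only: approximates Kubernetes Ingress
# `pathType: Prefix` matching, which compares paths element by element.
prefix_matches() {
  local rule="${1%/}"   # rule path with one trailing slash removed
  local path="${2%/}"   # request path with one trailing slash removed
  # Same path after trimming: /api matches /api and /api/
  [ "$path" = "$rule" ] && return 0
  # Request path continues with a further element: /api matches /api/foo
  case "$path" in
    "$rule"/*) return 0 ;;
  esac
  # Anything else (for example /apix) shares characters but not a whole element
  return 1
}

for p in /api /api/ /api/foo /apix; do
  if prefix_matches /api "$p"; then
    echo "$p: match"
  else
    echo "$p: no match"
  fi
done
```

Under these assumptions the loop reports a match for `/api`, `/api/`, and `/api/foo`, and no match for `/apix`, which is the behavior the corrected table row describes.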
Copilot
AI
Mar 5, 2026
The JSONPath example includes an embedded newline inside single quotes, which is easy to copy/paste incorrectly and can be confusing to readers. Consider rewriting this command as a single-line JSONPath (or use an explicit line continuation) to make it more robust for documentation consumers.
Suggested change, before (the JSONPath expression is split across two lines):

`kubectl get networkpolicy -n <ns> -o jsonpath='{range .items[?(@.spec.podSelector=={})]}`
`{.metadata.name}{"\n"}{end}'`

Suggested change, after:

`kubectl get networkpolicy -n <ns> -o jsonpath='{range .items[?(@.spec.podSelector=={})]}{.metadata.name}{"\n"}{end}'`

The updated trigger snapshot now includes `deploy` as an extracted keyword, even though deployment requests are explicitly in the “DO NOT USE” section. This can increase false-positive routing of deployment-related prompts to `azure-diagnostics`. Consider rephrasing the “DO NOT USE” section (or formatting skill names) to avoid emitting bare deployment keywords, or adjust the keyword extraction to ignore the negative-use section.