Update azure-diagnostics Kubernetes references and snapshots #1146

# Azure Kubernetes Service (AKS) Troubleshooting

> **AUTHORITATIVE GUIDANCE — MANDATORY COMPLIANCE**
>
> This document is the **official source** for debugging and troubleshooting Azure Kubernetes Service (AKS) production issues. Follow these instructions to diagnose and resolve common AKS problems systematically.

## Overview

AKS troubleshooting covers pod failures, node issues, networking problems, and cluster-level failures. This guide provides systematic diagnosis flows and remediation steps for the most common issues.

## Quick Diagnosis Flow

1. **Identify symptoms** - What's failing? (Pods, nodes, networking, services?)
2. **Check cluster health** - Is the AKS control plane healthy?
3. **Review events and logs** - What do Kubernetes events show?
4. **Isolate the issue** - Pod-level, node-level, or cluster-level?
5. **Apply targeted fixes** - Use the appropriate troubleshooting section

## Troubleshooting Sections

### Pod Failures & Application Issues

- CrashLoopBackOff, ImagePullBackOff, Pending pods
- Readiness/liveness probe failures
- Resource constraints (CPU/memory limits)

### Node & Cluster Issues

- Node NotReady conditions
- Autoscaling failures
- Resource pressure and capacity planning
- Upgrade problems

### Networking Problems

- Service unreachable/connection refused
- DNS resolution failures
- Load balancer issues
- Ingress routing failures
- Network policy blocking

## References

- [Networking Troubleshooting](networking.md)
- [Node & Cluster Troubleshooting](node-issues.md)

## General Investigation — "What happened in my cluster?"

When a user asks a broad question like "what happened in my AKS cluster?" or "check my AKS status", follow this flow to surface recent activity and issues:

```
1. Cluster health  → az aks show -g <rg> -n <cluster> --query "provisioningState"
2. Recent events   → kubectl get events -A --sort-by='.lastTimestamp' | head -40
3. Node status     → kubectl get nodes -o wide
4. Unhealthy pods  → kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
5. Activity log    → az monitor activity-log list -g <rg> --max-events 20 -o table
```
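The five-step sweep above can be wrapped in a small helper for repeatable triage. This is only a sketch, not part of any shipped tooling: `aks_triage` and its `DRY_RUN` switch are hypothetical names, and the live branch assumes a working `az login` and kubeconfig.

```shell
# Hypothetical wrapper for the five-step sweep; DRY_RUN=1 prints the
# commands instead of executing them (useful for review before running).
aks_triage() {
  rg="$1"; cluster="$2"
  set -- \
    "az aks show -g $rg -n $cluster --query provisioningState" \
    "kubectl get events -A --sort-by=.lastTimestamp" \
    "kubectl get nodes -o wide" \
    "kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded" \
    "az monitor activity-log list -g $rg --max-events 20 -o table"
  for cmd in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "$cmd"
    else
      eval "$cmd"   # runs against the current kubeconfig / az login
    fi
  done
}

# Dry run: list what would be executed, without touching the cluster
DRY_RUN=1 aks_triage my-rg my-cluster
```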

| Customer Question | Maps To |
|-------------------|---------|
| "What happened in my AKS cluster?" | Events + Activity log + Node status |
| "Is my cluster healthy?" | Cluster provisioning state + Node conditions |
| "Why are my pods failing?" | Unhealthy pods → `kubectl describe pod` → see Pod Failures section |
| "My app is unreachable" | See [Networking Troubleshooting](networking.md) |
| "Nodes are having issues" | See [Node & Cluster Troubleshooting](node-issues.md) |

> 💡 **Tip:** For AI-powered diagnostics, use AppLens MCP with the cluster resource ID — it automatically detects common issues and provides remediation recommendations.

---

## Common Diagnostic Commands

```bash
# Cluster overview
kubectl get nodes -o wide
kubectl get pods -A -o wide
kubectl get events -A --sort-by='.lastTimestamp'

# Pod diagnostics
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous

# Node diagnostics
kubectl describe node <node-name>
kubectl get pods -n kube-system -o wide

# Networking diagnostics
kubectl get svc -A
kubectl get endpoints -A
kubectl get networkpolicy -A
```

## AKS-Specific Tools

### Azure CLI Diagnostics

```bash
# Check cluster status
az aks show -g <rg> -n <cluster> --query "provisioningState"

# Get cluster credentials
az aks get-credentials -g <rg> -n <cluster>

# View node pools
az aks nodepool list -g <rg> --cluster-name <cluster> -o table
```

### AppLens (MCP) for AKS

For AI-powered diagnostics:

```
mcp_azure_mcp_applens
  intent: "diagnose AKS cluster issues"
  command: "diagnose"
  parameters:
    resourceId: "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.ContainerService/managedClusters/<cluster>"
```

## Best Practices

1. **Start with kubectl get/describe** - Always check basic status first
2. **Check events** - `kubectl get events -A` reveals recent issues
3. **Use systematic isolation** - Pod → Node → Cluster → Network
4. **Document changes** - Note what you tried and what worked
5. **Escalate when needed** - For control plane issues, contact Azure support

## Related Skills

- **azure-diagnostics** - General Azure resource troubleshooting
- **azure-deploy** - Deployment and configuration issues
- **azure-observability** - Monitoring and logging setup
# Networking Troubleshooting

> For CNI-specific issues (IP exhaustion, Azure CNI Overlay, eBPF/Cilium, egress/UDR, private cluster egress), check CNI pod health: `kubectl get pods -n kube-system -l k8s-app=azure-cni` and review [AKS networking concepts](https://learn.microsoft.com/azure/aks/concepts-network).

## Service Unreachable / Connection Refused

**Diagnostics — always start here:**

```bash
# 1. Verify service exists and has endpoints
kubectl get svc <service-name> -n <ns>
kubectl get endpoints <service-name> -n <ns>

# 2. Test connectivity from inside the namespace
kubectl run netdebug --image=curlimages/curl -it --rm -n <ns> -- \
  curl -sv http://<service>.<ns>.svc.cluster.local:<port>/healthz
```

**Decision tree:**

| Observation | Cause | Fix |
|-------------|-------|-----|
| Endpoints shows `<none>` | Label selector mismatch | Align selector with pod labels; check for typos |
| Endpoints has IPs but unreachable | Port mismatch or app not listening | Confirm `targetPort` = actual container port |
| Works from some pods, fails from others | Network policy blocking | See Network Policy section |
| Works inside cluster, fails externally | Load balancer issue | See Load Balancer section |
| `ECONNREFUSED` immediately | App not listening on that port | `kubectl exec <pod> -- netstat -tlnp` |

**Running but not Ready = removed from Endpoints silently.** Check `kubectl get pod <pod> -n <ns>` — READY must show `n/n`. If not, the readiness probe is failing; fix the probe or the app health endpoint.
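A quick way to spot such pods. The awk filter below is illustrative and runs here over a captured sample; in a live cluster, pipe `kubectl get pods -n <ns> --no-headers` into the same awk.

```shell
# Sample `kubectl get pods --no-headers` output (NAME READY STATUS RESTARTS AGE)
sample='web-7d4b9c    1/1   Running     0    3h
api-5f6d8a    0/1   Running     12   3h
job-abc123    0/1   Completed   0    5h'

# Print pods that are Running but not fully Ready (ready count != total)
printf '%s\n' "$sample" |
  awk '$3 == "Running" { split($2, r, "/"); if (r[1] != r[2]) print $1 }'
# -> api-5f6d8a  (the only Running pod whose READY column is not n/n)
```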

---

## DNS Resolution Failures

**Diagnostics:**

```bash
# Confirm CoreDNS is running and healthy
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
kubectl top pod -n kube-system -l k8s-app=kube-dns  # Check if CPU-throttled

# Live DNS test from the same namespace as the failing pod
kubectl run dnstest --image=busybox:1.28 -it --rm -n <ns> -- \
  nslookup <service-name>.<ns>.svc.cluster.local

# CoreDNS logs — errors show here first
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100
```

**CoreDNS configmap:** `kubectl get configmap coredns -n kube-system -o yaml` — check the `forward` plugin (upstream DNS), `cache` TTL, and any custom rewrites.

**AKS DNS failure patterns:**

| Symptom | Cause | Fix |
|---------|-------|-----|
| `NXDOMAIN` for `svc.cluster.local` | CoreDNS down or pod network broken | Restart CoreDNS pods; check CNI |
| Internal resolves, external NXDOMAIN | Custom DNS not forwarding to `168.63.129.16` | Fix upstream forwarder |
| Intermittent SERVFAIL under load | CoreDNS CPU throttled | Remove CPU limits or add replicas |
| Private cluster — external names fail | Custom DNS missing privatelink forwarder | Add conditional forwarder to Azure DNS |
| `i/o timeout` not `NXDOMAIN` | Port 53 blocked by NetworkPolicy or NSG | Allow UDP/TCP 53 from pods to kube-dns ClusterIP |
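Several rows of this table hinge on reading the resolver error correctly. That triage can be sketched as a shell helper — `classify_dns` is a hypothetical name, and the match strings are the typical `nslookup`/`dig` error phrases:

```shell
# Map resolver output to the failure class used in the table above
classify_dns() {
  case "$1" in
    *NXDOMAIN*)                               echo "name-not-found" ;;    # CoreDNS/records/forwarders
    *"i/o timeout"*|*"connection timed out"*) echo "port-53-blocked" ;;   # NetworkPolicy/NSG
    *SERVFAIL*)                               echo "upstream-failure" ;;  # CoreDNS load/forwarders
    *)                                        echo "resolved-or-other" ;;
  esac
}

classify_dns ";; connection timed out; no servers could be reached"
# prints: port-53-blocked
```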

**Custom DNS on VNet — the most common AKS DNS trap:**
Custom VNet DNS servers must forward `.cluster.local` to the CoreDNS ClusterIP and everything else to `168.63.129.16`. Breaking either path causes split DNS failures.

```bash
kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'
# This IP must be the forward target for cluster.local in your custom DNS
```

**CoreDNS under load:** Check `kubectl get hpa coredns -n kube-system` and `kubectl top pod -n kube-system -l k8s-app=kube-dns`. If CPU-throttled and no HPA, manually scale: `kubectl scale deployment coredns -n kube-system --replicas=3`.

---

## Load Balancer Stuck in Pending

**Diagnostics:**

```bash
kubectl describe svc <svc> -n <ns>
# Events section reveals the actual Azure error

kubectl logs -n kube-system -l component=cloud-controller-manager --tail=100
# Azure cloud provider logs — more detail than kubectl events
```

**Error decision table:**

| Error in Events / CCM Logs | Cause | Fix |
|----------------------------|-------|-----|
| `InsufficientFreeAddresses` | Subnet has no free IPs | Expand subnet CIDR; use Azure CNI Overlay; use NAT gateway instead |
| `ensure(default/svc): failed... PublicIPAddress quota` | Public IP quota exhausted | Request quota increase for Public IP Addresses in the region |
| `cannot find NSG` | NSG name changed or detached | Re-associate NSG to the AKS subnet; check `az aks show` for NSG name |
| `reconciling NSG rules: failed` | NSG is locked or has conflicting rules | Remove resource lock; check for deny-all rules above AKS-managed rules |
| `subnet not found` | Wrong subnet name in annotation | Verify subnet name: `az network vnet subnet list -g <rg> --vnet-name <vnet>` |
| No events, stuck Pending | CCM can't authenticate to Azure | Check cluster managed identity has `Network Contributor` on the VNet resource group |
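For scripted triage, the same table can be encoded as a small lookup. This is a sketch: `lb_hint` is a hypothetical helper, and the match strings are exactly the ones listed above.

```shell
# Map a CCM/event error string to the Fix column of the table above
lb_hint() {
  case "$1" in
    *InsufficientFreeAddresses*) echo "expand subnet, use CNI Overlay, or NAT gateway" ;;
    *"PublicIPAddress quota"*)   echo "request Public IP Address quota increase" ;;
    *"cannot find NSG"*)         echo "re-associate the NSG to the AKS subnet" ;;
    *"subnet not found"*)        echo "verify the subnet name in the service annotation" ;;
    *)                           echo "check managed identity role assignments on the VNet RG" ;;
  esac
}

lb_hint "ensure(default/svc): failed... PublicIPAddress quota"
# prints: request Public IP Address quota increase
```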

**Internal LB annotations:** Set `service.beta.kubernetes.io/azure-load-balancer-internal: "true"` and `azure-load-balancer-internal-subnet: "<subnet-name>"`. Add `azure-load-balancer-ipv4: "10.x.x.x"` for a static private IP.
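Put together, a minimal internal-LB Service manifest might look like this sketch (the name, subnet, IP, and ports are placeholders for your environment):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: internal-app                    # placeholder name
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/azure-load-balancer-internal-subnet: "apps-subnet"  # placeholder subnet
    service.beta.kubernetes.io/azure-load-balancer-ipv4: "10.240.0.25"             # optional static private IP
spec:
  type: LoadBalancer
  selector:
    app: internal-app                   # must match your pod labels
  ports:
    - port: 80
      targetPort: 8080                  # must match the container's listening port
```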

**CCM identity check:** If no events and the LB is stuck, verify the cluster's managed identity has `Network Contributor` on the VNet resource group: `az aks show -g <rg> -n <cluster> --query "identity.principalId" -o tsv` then check role assignments.

---

## Ingress Not Routing Traffic

**Diagnostics:**

```bash
# Confirm controller is running
kubectl get pods -n <ingress-ns> -l 'app.kubernetes.io/name in (ingress-nginx,nginx-ingress)'
kubectl logs -n <ingress-ns> -l app.kubernetes.io/name=ingress-nginx --tail=100

# Check the ingress resource state
kubectl describe ingress <name> -n <ns>
kubectl get ingress <name> -n <ns>  # ADDRESS must be populated

# Check backend
kubectl get endpoints <backend-svc> -n <ns>
```

**Ingress failure patterns:**

| Symptom | Cause | Fix |
|---------|-------|-----|
| ADDRESS empty | LB not provisioned or wrong `ingressClassName` | Check controller service; set correct `ingressClassName` |
| 404 for all paths | No matching host rule | Check `host` field; `pathType: Prefix` vs `Exact` |
| 404 for some paths | Path / pathType mismatch | With `pathType: Prefix`, `/api` matches `/api` and `/api/...` but not `/apix`; with `Exact`, `/api` and `/api/` are different |
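The failure points above can be cross-checked against a minimal Ingress sketch (host, class, and service names are placeholders):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress                  # placeholder name
spec:
  ingressClassName: nginx            # must match the installed controller's class
  rules:
    - host: api.example.com          # requests without this Host header will 404
      http:
        paths:
          - path: /api               # Prefix: matches /api and /api/..., not /apix
            pathType: Prefix
            backend:
              service:
                name: api-svc        # must exist and have ready endpoints
                port:
                  number: 80
```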
```bash
# List NetworkPolicies with an empty podSelector (they select every pod in the namespace)
kubectl get networkpolicy -n <ns> -o jsonpath='{range .items[?(@.spec.podSelector=={})]}{.metadata.name}{"\n"}{end}'
```