From a018b1f7f011a49029f248913dff44383a826f63 Mon Sep 17 00:00:00 2001 From: sammsft1 Date: Thu, 5 Mar 2026 16:30:04 -0500 Subject: [PATCH 1/4] Update azure-diagnostics Kubernetes references and snapshots --- plugin/skills/azure-diagnostics/SKILL.md | 6 +- .../references/azure-kubernetes/README.md | 103 ++++++++ .../references/azure-kubernetes/networking.md | 153 ++++++++++++ .../azure-kubernetes/node-issues.md | 225 ++++++++++++++++++ .../__snapshots__/triggers.test.ts.snap | 16 +- 5 files changed, 500 insertions(+), 3 deletions(-) create mode 100644 plugin/skills/azure-diagnostics/references/azure-kubernetes/README.md create mode 100644 plugin/skills/azure-diagnostics/references/azure-kubernetes/networking.md create mode 100644 plugin/skills/azure-diagnostics/references/azure-kubernetes/node-issues.md diff --git a/plugin/skills/azure-diagnostics/SKILL.md b/plugin/skills/azure-diagnostics/SKILL.md index 6d84e271d..f1d970719 100644 --- a/plugin/skills/azure-diagnostics/SKILL.md +++ b/plugin/skills/azure-diagnostics/SKILL.md @@ -1,6 +1,6 @@ --- name: azure-diagnostics -description: "Debug and troubleshoot production issues on Azure. Covers Container Apps and Function Apps diagnostics, log analysis with KQL, health checks, and common issue resolution for image pulls, cold starts, health probes, and function invocation failures. USE FOR: debug production issues, troubleshoot container apps, troubleshoot function apps, troubleshoot Azure Functions, analyze logs with KQL, fix image pull failures, resolve cold start issues, investigate health probe failures, check resource health, view application logs, find root cause of errors, function app not working, function invocation failures DO NOT USE FOR: deploying applications (use azure-deploy), creating new resources (use azure-prepare), setting up monitoring (use azure-observability), cost optimization (use azure-cost-optimization)" +description: "Debug and troubleshoot production issues on Azure. 
Covers Container Apps, Function Apps, and AKS diagnostics, log analysis with KQL, health checks, and common issue resolution for image pulls, cold starts, health probes, pod failures, node issues, and networking problems. USE FOR: debug production issues, troubleshoot container apps, troubleshoot function apps, troubleshoot Azure Functions, troubleshoot AKS, troubleshoot kubernetes, analyze logs with KQL, fix image pull failures, resolve cold start issues, investigate health probe failures, check resource health, view application logs, find root cause of errors, function app not working, function invocation failures, pod crashing, node not ready, kubernetes networking DO NOT USE FOR: deploying applications (use azure-deploy), creating new resources (use azure-prepare), setting up monitoring (use azure-observability), cost optimization (use azure-cost-optimization)" license: MIT metadata: author: Microsoft @@ -51,6 +51,7 @@ Activate this skill when user wants to: |---------|---------------|-----------| | **Container Apps** | Image pull failures, cold starts, health probes, port mismatches | [container-apps/](references/container-apps/README.md) | | **Function Apps** | App details, invocation failures, timeouts, binding errors, cold starts, missing app settings | [functions/](references/functions/README.md) | +| **Azure Kubernetes** | Pod failures, node issues, networking problems, Helm deployment errors | [azure-kubernetes/](references/azure-kubernetes/README.md) | --- @@ -133,4 +134,5 @@ az monitor activity-log list -g RG --max-events 20 - [KQL Query Library](references/kql-queries.md) - [Azure Resource Graph Queries](references/azure-resource-graph.md) -- [Function Apps Troubleshooting](references/functions/README.md) \ No newline at end of file +- [Function Apps Troubleshooting](references/functions/README.md) +- [Azure Kubernetes Troubleshooting](references/azure-kubernetes/README.md) \ No newline at end of file diff --git 
a/plugin/skills/azure-diagnostics/references/azure-kubernetes/README.md b/plugin/skills/azure-diagnostics/references/azure-kubernetes/README.md new file mode 100644 index 000000000..ca9510dd6 --- /dev/null +++ b/plugin/skills/azure-diagnostics/references/azure-kubernetes/README.md @@ -0,0 +1,103 @@ +# Azure Kubernetes Service (AKS) Troubleshooting + +> **AUTHORITATIVE GUIDANCE — MANDATORY COMPLIANCE** +> +> This document is the **official source** for debugging and troubleshooting Azure Kubernetes Service (AKS) production issues. Follow these instructions to diagnose and resolve common AKS problems systematically. + +## Overview + +AKS troubleshooting covers pod failures, node issues, networking problems, and cluster-level failures. This guide provides systematic diagnosis flows and remediation steps for the most common issues. + +## Quick Diagnosis Flow + +1. **Identify symptoms** - What's failing? (Pods, nodes, networking, services?) +2. **Check cluster health** - Is AKS control plane healthy? +3. **Review events and logs** - What do Kubernetes events show? +4. **Isolate the issue** - Pod-level, node-level, or cluster-level? +5. 
**Apply targeted fixes** - Use the appropriate troubleshooting section
+
+## Troubleshooting Sections
+
+### Pod Failures & Application Issues
+- CrashLoopBackOff, ImagePullBackOff, Pending pods
+- Readiness/liveness probe failures
+- Resource constraints (CPU/memory limits)
+
+### Node & Cluster Issues
+- Node NotReady conditions
+- Autoscaling failures
+- Resource pressure and capacity planning
+- Upgrade problems
+
+### Networking Problems
+- Service unreachable/connection refused
+- DNS resolution failures
+- Load balancer issues
+- Ingress routing failures
+- Network policy blocking
+
+## References
+
+- [Networking Troubleshooting](networking.md)
+- [Node & Cluster Troubleshooting](node-issues.md)
+
+## Common Diagnostic Commands
+
+```bash
+# Cluster overview
+kubectl get nodes -o wide
+kubectl get pods -A -o wide
+kubectl get events -A --sort-by='.lastTimestamp'
+
+# Pod diagnostics
+kubectl describe pod <pod-name> -n <namespace>
+kubectl logs <pod-name> -n <namespace> --previous
+
+# Node diagnostics
+kubectl describe node <node-name>
+kubectl get pods -n kube-system -o wide
+
+# Networking diagnostics
+kubectl get svc -A
+kubectl get endpoints -A
+kubectl get networkpolicy -A
+```
+
+## AKS-Specific Tools
+
+### Azure CLI Diagnostics
+```bash
+# Check cluster status
+az aks show -g <resource-group> -n <cluster-name> --query "provisioningState"
+
+# Get cluster credentials
+az aks get-credentials -g <resource-group> -n <cluster-name>
+
+# View node pools
+az aks nodepool list -g <resource-group> --cluster-name <cluster-name> -o table
+```
+
+### AppLens (MCP) for AKS
+For AI-powered diagnostics:
+```
+mcp_azure_mcp_applens
+  intent: "diagnose AKS cluster issues"
+  command: "diagnose"
+  parameters:
+    resourceId: "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.ContainerService/managedClusters/<cluster-name>"
+```
+
+## Best Practices
+
+1. **Start with kubectl get/describe** - Always check basic status first
+2. **Check events** - `kubectl get events -A` reveals recent issues
+3. **Use systematic isolation** - Pod → Node → Cluster → Network
+4. **Document changes** - Note what you tried and what worked
+5.
**Escalate when needed** - For control plane issues, contact Azure support
+
+## Related Skills
+
+- **azure-diagnostics** - General Azure resource troubleshooting
+- **azure-deploy** - Deployment and configuration issues
+- **azure-observability** - Monitoring and logging setup
\ No newline at end of file
diff --git a/plugin/skills/azure-diagnostics/references/azure-kubernetes/networking.md b/plugin/skills/azure-diagnostics/references/azure-kubernetes/networking.md
new file mode 100644
index 000000000..37102ed11
--- /dev/null
+++ b/plugin/skills/azure-diagnostics/references/azure-kubernetes/networking.md
@@ -0,0 +1,153 @@
+# Networking Troubleshooting
+
+> For CNI-specific issues (IP exhaustion, Azure CNI Overlay, eBPF/Cilium, egress/UDR, private cluster egress), see `references/networking-cni.md`.
+
+## Service Unreachable / Connection Refused
+
+**Diagnostics — always start here:**
+```bash
+# 1. Verify service exists and has endpoints
+kubectl get svc <service-name> -n <namespace>
+kubectl get endpoints <service-name> -n <namespace>
+
+# 2. Test connectivity from inside the namespace
+kubectl run netdebug --image=curlimages/curl -it --rm -n <namespace> -- \
+  curl -sv http://<service-name>.<namespace>.svc.cluster.local:<port>/healthz
+```
+
+**Decision tree:**
+
+| Observation | Cause | Fix |
+|-------------|-------|-----|
+| Endpoints shows `<none>` | Label selector mismatch | Align selector with pod labels; check for typos |
+| Endpoints has IPs but unreachable | Port mismatch or app not listening | Confirm `targetPort` = actual container port |
+| Works from some pods, fails from others | Network policy blocking | See Network Policy section |
+| Works inside cluster, fails externally | Load balancer issue | See Load Balancer section |
+| `ECONNREFUSED` immediately | App not listening on that port | `kubectl exec <pod-name> -- netstat -tlnp` |
+
+**Running but not Ready = removed from Endpoints silently.** Check `kubectl get pod <pod-name> -n <namespace>` — READY must show `n/n`. If not, the readiness probe is failing; fix the probe or the app health endpoint.
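+The first two rows of the decision tree above, selector mismatch and `targetPort` mismatch, both come down to keeping three fields aligned between the Deployment and the Service. A minimal sketch, where all names, labels, images, and ports are illustrative placeholders:
+
+```yaml
+# Hypothetical workload: the Service selector must match the pod template
+# labels, and targetPort must match the port the container listens on.
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: web
+spec:
+  replicas: 2
+  selector:
+    matchLabels:
+      app: web
+  template:
+    metadata:
+      labels:
+        app: web               # must match the Service selector below
+    spec:
+      containers:
+      - name: web
+        image: myregistry.azurecr.io/web:1.0   # placeholder image
+        ports:
+        - containerPort: 8080  # the port the app listens on
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: web
+spec:
+  selector:
+    app: web                   # must match the pod labels above
+  ports:
+  - port: 80                   # what clients connect to
+    targetPort: 8080           # must equal containerPort
+```
+
+If Endpoints still shows `<none>` after aligning these, compare the live state: `kubectl get svc web -o jsonpath='{.spec.selector}'` against `kubectl get pods --show-labels`.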
+
+---
+
+## DNS Resolution Failures
+
+**Diagnostics:**
+```bash
+# Confirm CoreDNS is running and healthy
+kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
+kubectl top pod -n kube-system -l k8s-app=kube-dns  # Check if CPU-throttled
+
+# Live DNS test from the same namespace as the failing pod
+kubectl run dnstest --image=busybox:1.28 -it --rm -n <namespace> -- \
+  nslookup <service-name>.<namespace>.svc.cluster.local
+
+# CoreDNS logs — errors show here first
+kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100
+```
+
+**CoreDNS configmap:** `kubectl get configmap coredns -n kube-system -o yaml` — check the `forward` plugin (upstream DNS), `cache` TTL, and any custom rewrites.
+
+**AKS DNS failure patterns:**
+
+| Symptom | Cause | Fix |
+|---------|-------|-----|
+| `NXDOMAIN` for `svc.cluster.local` | CoreDNS down or pod network broken | Restart CoreDNS pods; check CNI |
+| Internal resolves, external NXDOMAIN | Custom DNS not forwarding to `168.63.129.16` | Fix upstream forwarder |
+| Intermittent SERVFAIL under load | CoreDNS CPU throttled | Remove CPU limits or add replicas |
+| Private cluster — external names fail | Custom DNS missing privatelink forwarder | Add conditional forwarder to Azure DNS |
+| `i/o timeout` not `NXDOMAIN` | Port 53 blocked by NetworkPolicy or NSG | Allow UDP/TCP 53 from pods to kube-dns ClusterIP |
+
+**Custom DNS on VNet — the most common AKS DNS trap:**
+Custom VNet DNS servers must forward `cluster.local` to the CoreDNS ClusterIP and everything else to `168.63.129.16`. Breaking either path causes split DNS failures.
+```bash
+kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'
+# This IP must be the forward target for cluster.local in your custom DNS
+```
+
+**CoreDNS under load:** Check `kubectl get hpa coredns -n kube-system` and `kubectl top pod -n kube-system -l k8s-app=kube-dns`. If CPU-throttled and no HPA, manually scale: `kubectl scale deployment coredns -n kube-system --replicas=3`.
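+On AKS the base CoreDNS Corefile is managed and reconciled, so custom forwarders belong in the `coredns-custom` ConfigMap rather than in edits to the `coredns` ConfigMap itself. A sketch of a conditional forwarder for a private zone, where the zone name and forwarder IP are placeholders:
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: coredns-custom        # merged into the managed CoreDNS config by AKS
+  namespace: kube-system
+data:
+  corp.server: |              # keys ending in .server load as extra server blocks
+    corp.internal:53 {
+        errors
+        cache 30
+        forward . 10.0.0.4    # placeholder: your private DNS server
+    }
+```
+
+After applying, reload CoreDNS so it picks up the change: `kubectl delete pod -n kube-system -l k8s-app=kube-dns`.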
+
+---
+
+## Load Balancer Stuck in Pending
+
+**Diagnostics:**
+```bash
+kubectl describe svc <service-name> -n <namespace>
+# Events section reveals the actual Azure error
+
+kubectl logs -n kube-system -l component=cloud-controller-manager --tail=100
+# Azure cloud provider logs — more detail than kubectl events
+```
+
+**Error decision table:**
+
+| Error in Events / CCM Logs | Cause | Fix |
+|----------------------------|-------|-----|
+| `InsufficientFreeAddresses` | Subnet has no free IPs | Expand subnet CIDR; use Azure CNI Overlay; use NAT gateway instead |
+| `ensure(default/svc): failed... PublicIPAddress quota` | Public IP quota exhausted | Request quota increase for Public IP Addresses in the region |
+| `cannot find NSG` | NSG name changed or detached | Re-associate NSG to the AKS subnet; check `az aks show` for the NSG name |
+| `reconciling NSG rules: failed` | NSG is locked or has conflicting rules | Remove resource lock; check for deny-all rules above AKS-managed rules |
+| `subnet not found` | Wrong subnet name in annotation | Verify subnet name: `az network vnet subnet list -g <resource-group> --vnet-name <vnet-name>` |
+| No events, stuck Pending | CCM can't authenticate to Azure | Check cluster managed identity has `Network Contributor` on the VNet resource group |
+
+**Internal LB annotations:** Set `service.beta.kubernetes.io/azure-load-balancer-internal: "true"` and `service.beta.kubernetes.io/azure-load-balancer-internal-subnet: "<subnet-name>"`. Add `service.beta.kubernetes.io/azure-load-balancer-ipv4: "10.x.x.x"` for a static private IP.
+
+**CCM identity check:** If no events and the LB is stuck, verify the cluster's managed identity has `Network Contributor` on the VNet resource group: `az aks show -g <resource-group> -n <cluster-name> --query "identity.principalId" -o tsv`, then check role assignments.
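+The internal LB annotations above, assembled into a complete Service manifest; the name, subnet, and IP are placeholders:
+
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: web-internal
+  annotations:
+    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
+    service.beta.kubernetes.io/azure-load-balancer-internal-subnet: "apps-subnet"
+    service.beta.kubernetes.io/azure-load-balancer-ipv4: "10.240.4.25"   # optional static private IP
+spec:
+  type: LoadBalancer
+  selector:
+    app: web
+  ports:
+  - port: 80
+    targetPort: 8080
+```
+
+The subnet must exist in the cluster VNet and the requested IP must be free in it; otherwise the Service stays Pending with one of the errors in the table above.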
+
+---
+
+## Ingress Not Routing Traffic
+
+**Diagnostics:**
+```bash
+# Confirm controller is running
+kubectl get pods -n <ingress-namespace> -l 'app.kubernetes.io/name in (ingress-nginx,nginx-ingress)'
+kubectl logs -n <ingress-namespace> -l app.kubernetes.io/name=ingress-nginx --tail=100
+
+# Check the ingress resource state
+kubectl describe ingress <ingress-name> -n <namespace>
+kubectl get ingress -n <namespace>  # ADDRESS must be populated
+
+# Check backend
+kubectl get endpoints <backend-service> -n <namespace>
+```
+
+**Ingress failure patterns:**
+
+| Symptom | Cause | Fix |
+|---------|-------|-----|
+| ADDRESS empty | LB not provisioned or wrong `ingressClassName` | Check controller service; set correct `ingressClassName` |
+| 404 for all paths | No matching host rule | Check `host` field; `pathType: Prefix` vs `Exact` |
+| 404 for some paths | Trailing slash mismatch | `Prefix` path `/api/` matches `/api/foo` but not `/api` — use `/api` or add both rules |
+| 502 Bad Gateway | Backend pods unhealthy or wrong port | Verify Endpoints has IPs; confirm `targetPort` and readiness |
+| 503 Service Unavailable | All backend pods down | Check pod restarts and readiness probe |
+| TLS handshake fail | cert-manager not issuing | `kubectl describe certificate <cert-name> -n <namespace>`; check ACME challenge |
+| Works for host-a, 404 for host-b | DNS not pointing to ingress IP | `nslookup <hostname>` must resolve to the ingress ADDRESS |
+
+**Application Routing add-on:** `az aks show -g <resource-group> -n <cluster-name> --query "ingressProfile"` — if enabled, use `ingressClassName: webapprouting.kubernetes.azure.com`.
+
+---
+
+## Network Policy Blocking Traffic
+
+**Finding which policy is blocking (the hard part):**
+```bash
+# List all policies in the namespace — check both ingress and egress
+kubectl get networkpolicy -n <namespace> -o yaml
+
+# Check for a default-deny policy: an empty podSelector ({}) selects every pod
+kubectl get networkpolicy -n <namespace> \
+  -o jsonpath='{range .items[*]}{.metadata.name}{" podSelector="}{.spec.podSelector}{"\n"}{end}'
+
+# Simulate traffic to identify the block
+kubectl run probe --image=curlimages/curl -n <namespace> -it --rm -- \
+  curl -v --connect-timeout 3 http://<service-name>:<port>
+# Timeout = network policy blocking. Connection refused = reached pod but app issue.
+```
+
+**Policy audit checklist:** (1) Get source pod labels. (2) Get destination pod labels. (3) Check destination namespace for ingress policy — does it allow from source labels? (4) Check source namespace for egress policy — does it allow to destination labels? Both directions need explicit allow rules if default-deny exists.
+
+**AKS network policy engine check:** Azure NPM (Azure CNI): `kubectl get pods -n kube-system -l k8s-app=azure-npm`. Calico: `kubectl get pods -n calico-system`.
+
+**Common default-deny escape:** Always add an egress policy allowing UDP/TCP port 53 to the kube-dns service IP — this is the most frequently forgotten rule when adding a default-deny NetworkPolicy.
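+A sketch of that DNS-allow egress rule as a NetworkPolicy; the namespace is a placeholder, and the `kubernetes.io/metadata.name` label is set automatically on namespaces in current Kubernetes versions:
+
+```yaml
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata:
+  name: allow-dns-egress
+  namespace: my-app           # placeholder: the namespace with default-deny
+spec:
+  podSelector: {}             # applies to every pod in the namespace
+  policyTypes:
+  - Egress
+  egress:
+  - to:
+    - namespaceSelector:
+        matchLabels:
+          kubernetes.io/metadata.name: kube-system
+      podSelector:
+        matchLabels:
+          k8s-app: kube-dns
+    ports:
+    - protocol: UDP
+      port: 53
+    - protocol: TCP
+      port: 53
+```
+
+NetworkPolicies are additive, so this can live alongside a default-deny policy in the same namespace.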
\ No newline at end of file
diff --git a/plugin/skills/azure-diagnostics/references/azure-kubernetes/node-issues.md b/plugin/skills/azure-diagnostics/references/azure-kubernetes/node-issues.md
new file mode 100644
index 000000000..d310c231f
--- /dev/null
+++ b/plugin/skills/azure-diagnostics/references/azure-kubernetes/node-issues.md
@@ -0,0 +1,225 @@
+# Node & Cluster Troubleshooting
+
+## Node NotReady
+
+**Diagnostics:**
+```bash
+kubectl get nodes -o wide
+kubectl describe node <node-name>
+# Look for: Conditions, Taints, Events, resource usage, kubelet status
+```
+
+**Condition decision tree:**
+
+| Condition | Value | Meaning | Fix Path |
+|-----------|-------|---------|----------|
+| `Ready` | `False` | kubelet stopped reporting | SSH to node or cordon + drain + delete |
+| `MemoryPressure` | `True` | Node running out of memory | Evict pods; scale out pool; reduce pod density |
+| `DiskPressure` | `True` | OS disk or ephemeral storage full | Check logs and images; clean up or increase disk |
+| `PIDPressure` | `True` | Too many processes | App spawning excessive threads/processes |
+| `NetworkUnavailable` | `True` | CNI plugin issue | Check CNI pods in kube-system; node network config |
+
+**AKS-specific — SSH to a node:**
+```bash
+# Create a privileged debug pod on the node
+kubectl debug node/<node-name> -it --image=mcr.microsoft.com/cbl-mariner/base/core:2.0
+
+# Check kubelet status inside the node
+chroot /host systemctl status kubelet
+chroot /host journalctl -u kubelet -n 50
+```
+
+**If kubelet can't recover:** cordon → drain → delete. AKS auto-replaces via node pool VMSS.
+```bash
+kubectl cordon <node-name>
+kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
+kubectl delete node <node-name>
+```
+
+---
+
+## Node Pool Not Scaling
+
+### Cluster Autoscaler Not Triggering
+
+**Diagnostics:**
+```bash
+# Autoscaler logs
+kubectl logs -n kube-system -l app=cluster-autoscaler --tail=100
+
+# Autoscaler status
+kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
+
+# Verify autoscaler is enabled on the node pool
+az aks nodepool show -g <resource-group> --cluster-name <cluster-name> -n <nodepool-name> \
+  --query "{autoscaleEnabled:enableAutoScaling, min:minCount, max:maxCount}"
+```
+
+**Autoscaler won't scale up — common reasons:**
+- Node pool already at `maxCount`
+- VM quota exhausted: `az vm list-usage -l <region> -o table | grep -i "DSv3\|quota"`
+- Pod `nodeAffinity` is unsatisfiable on any new node template
+- 10-minute cooldown period still active after last scale event
+
+**Autoscaler won't scale down — common reasons:**
+- Pods with `emptyDir` local storage (configure `--skip-nodes-with-local-storage=false` if safe)
+- Standalone pods with no controller (not in a ReplicaSet)
+- `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` annotation on a pod
+
+### Manual Scaling
+
+```bash
+az aks nodepool scale -g <resource-group> --cluster-name <cluster-name> -n <nodepool-name> --node-count <count>
+```
+
+---
+
+## Resource Pressure & Capacity Planning
+
+**Check actual vs allocatable:**
+```bash
+kubectl describe node <node-name> | grep -A6 "Allocated resources:"
+# Shows: CPU requests/limits, Memory requests/limits, Ephemeral storage
+```
+
+**AKS resource reservation table (approximate):**
+
+| VM Size | Total Memory | AKS Reserved | Allocatable |
+|---------|-------------|--------------|-------------|
+| Standard_D2s_v3 | 8 GB | ~1.7 GB | ~6.3 GB |
+| Standard_D4s_v3 | 16 GB | ~2.3 GB | ~13.7 GB |
+| Standard_D8s_v3 | 32 GB | ~3.5 GB | ~28.5 GB |
+
+See [AKS resource reservations](https://learn.microsoft.com/azure/aks/concepts-clusters-workloads#resource-reservations) for the CPU formula.
+
+**Ephemeral storage pressure:**
+```bash
+# Check what's consuming ephemeral storage on a node
+kubectl debug node/<node-name> -it --image=mcr.microsoft.com/cbl-mariner/base/core:2.0
+# Inside: df -h /host/var/lib/docker or /host/var/lib/containerd
+```
+
+Common culprit: containers writing to stdout at high volume — logs accumulate in `/var/log/containers`. AKS log rotation can still be overwhelmed by aggressive logging.
+
+---
+
+## Node Image / OS Upgrade Issues
+
+```bash
+# Check current node image versions
+az aks nodepool show -g <resource-group> --cluster-name <cluster-name> -n <nodepool-name> \
+  --query "{nodeImageVersion:nodeImageVersion, osType:osType}"
+
+# Check available upgrades
+az aks nodepool get-upgrades -g <resource-group> --cluster-name <cluster-name> --nodepool-name <nodepool-name>
+
+# Upgrade node image (non-disruptive with surge)
+az aks nodepool upgrade -g <resource-group> --cluster-name <cluster-name> -n <nodepool-name> --node-image-only
+```
+
+---
+
+## Kubernetes Version Upgrade Failures
+
+**Pre-upgrade check:**
+```bash
+# Check for deprecated API usage before upgrading
+kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis
+
+# Verify available upgrade paths (can only skip one minor version)
+az aks get-upgrades -g <resource-group> -n <cluster-name> -o table
+```
+
+**Upgrade stuck or failed:**
+```bash
+# Check control plane provisioning state
+az aks show -g <resource-group> -n <cluster-name> --query "provisioningState"
+
+# If stuck: check AKS diagnostics blade in portal
+# Azure Portal → AKS cluster → Diagnose and solve problems → Upgrade
+```
+
+Common causes: PDB blocking drain (`kubectl get pdb -A`), deprecated APIs in use, custom admission webhooks failing (`kubectl get validatingwebhookconfiguration`).
+
+---
+
+## Spot Node Pool Evictions
+
+AKS spot nodes use Azure Spot VMs — they can be evicted with 30 seconds notice when Azure needs capacity.
+
+**Diagnose spot eviction:**
+```bash
+# Spot nodes carry this taint automatically
+kubectl describe node <node-name> | grep "Taint"
+# kubernetes.azure.com/scalesetpriority=spot:NoSchedule
+
+# Check eviction events
+kubectl get events -A --field-selector reason=SpotEviction
+kubectl get events -A | grep -i "evict\|spot\|preempt"
+```
+
+**Spot workload requirements:** pods must tolerate the spot taint; use PDBs; avoid stateful PVC workloads on spot (disk detach can lag eviction). Pattern — toleration + preferred affinity:
+```yaml
+tolerations:
+- key: "kubernetes.azure.com/scalesetpriority"
+  operator: Equal
+  value: spot
+  effect: NoSchedule
+affinity:
+  nodeAffinity:
+    preferredDuringSchedulingIgnoredDuringExecution:
+    - weight: 1
+      preference:
+        matchExpressions:
+        - key: kubernetes.azure.com/scalesetpriority
+          operator: In
+          values: ["spot"]
+```
+
+---
+
+## Multi-AZ Node Pool & Zone-Related Failures
+
+AKS supports zone-redundant node pools that spread nodes across Availability Zones. Zone awareness affects scheduling, storage, and upgrade behavior.
+
+**Check zone distribution:**
+```bash
+kubectl get nodes -L topology.kubernetes.io/zone
+# Nodes should distribute across zones 1, 2, 3
+```
+
+**Zone-related failure patterns:**
+
+| Symptom | Cause | Fix |
+|---------|-------|-----|
+| Pods stack on one zone after node failures | Scheduling imbalance after zone failure | `kubectl rollout restart deployment/<name>` to rebalance |
+| PVC pending with `volume node affinity conflict` | Azure Disk is zonal; pod scheduled in different zone | Use ZRS storage class or ensure PVC and pod are in same zone |
+| Service endpoints unreachable from one zone | Topology-aware routing misconfigured | Check `service.spec.trafficDistribution` or TopologyKeys |
+| Upgrade causing zone imbalance | Surge nodes in one zone | Configure `maxSurge` in node pool upgrade settings |
+
+**ZRS storage for multi-AZ (recommended):** Prevents zone affinity conflicts on disk PVCs.
Use `Premium_ZRS` or `StandardSSD_ZRS` as the `skuname` in a custom StorageClass. See `references/storage.md` for the ZRS StorageClass spec.
+
+---
+
+## Zero-Downtime Node Pool Upgrades
+
+The `maxSurge` setting controls how many extra nodes are provisioned during upgrade. Default is 1, which means nodes are upgraded one at a time.
+
+```bash
+# Check current maxSurge
+az aks nodepool show -g <resource-group> --cluster-name <cluster-name> -n <nodepool-name> \
+  --query "upgradeSettings.maxSurge"
+
+# Set maxSurge to 33% for faster, safer upgrades (provisions 1/3 extra nodes first)
+az aks nodepool update -g <resource-group> --cluster-name <cluster-name> -n <nodepool-name> \
+  --max-surge 33%
+```
+
+**Upgrade stuck / nodes not draining:**
+```bash
+kubectl get pdb -A
+kubectl describe pdb <pdb-name> -n <namespace>
+# "DisruptionsAllowed: 0" → no pods can be evicted → upgrade hangs
+```
+
+Fix: scale up the deployment so `DisruptionsAllowed` becomes ≥ 1, or temporarily relax `minAvailable`. Restore after upgrade.
\ No newline at end of file
diff --git a/tests/azure-diagnostics/__snapshots__/triggers.test.ts.snap b/tests/azure-diagnostics/__snapshots__/triggers.test.ts.snap
index c626bf2ca..45a5cd900 100644
--- a/tests/azure-diagnostics/__snapshots__/triggers.test.ts.snap
+++ b/tests/azure-diagnostics/__snapshots__/triggers.test.ts.snap
@@ -2,7 +2,7 @@
 exports[`azure-diagnostics - Trigger Tests Trigger Keywords Snapshot skill description triggers match snapshot 1`] = `
 {
-  "description": "Debug and troubleshoot production issues on Azure. Covers Container Apps and Function Apps diagnostics, log analysis with KQL, health checks, and common issue resolution for image pulls, cold starts, health probes, and function invocation failures.
USE FOR: debug production issues, troubleshoot container apps, troubleshoot function apps, troubleshoot Azure Functions, analyze logs with KQL, fix image pull failures, resolve cold start issues, investigate health probe failures, check resource health, view application logs, find root cause of errors, function app not working, function invocation failures DO NOT USE FOR: deploying applications (use azure-deploy), creating new resources (use azure-prepare), setting up monitoring (use azure-observability), cost optimization (use azure-cost-optimization)", + "description": "Debug and troubleshoot production issues on Azure. Covers Container Apps, Function Apps, and AKS diagnostics, log analysis with KQL, health checks, and common issue resolution for image pulls, cold starts, health probes, pod failures, node issues, and networking problems. USE FOR: debug production issues, troubleshoot container apps, troubleshoot function apps, troubleshoot Azure Functions, troubleshoot AKS, troubleshoot kubernetes, analyze logs with KQL, fix image pull failures, resolve cold start issues, investigate health probe failures, check resource health, view application logs, find root cause of errors, function app not working, function invocation failures, pod crashing, node not ready, kubernetes networking DO NOT USE FOR: deploying applications (use azure-deploy), creating new resources (use azure-prepare), setting up monitoring (use azure-observability), cost optimization (use azure-cost-optimization)", "extractedKeywords": [ "analysis", "analyze", @@ -23,8 +23,10 @@ exports[`azure-diagnostics - Trigger Tests Trigger Keywords Snapshot skill descr "container", "cost", "covers", + "crashing", "creating", "debug", + "deploy", "deploying", "diagnostic", "diagnostics", @@ -39,16 +41,21 @@ exports[`azure-diagnostics - Trigger Tests Trigger Keywords Snapshot skill descr "invocation", "issue", "issues", + "kubernetes", "logs", "mcp", "monitor", "monitoring", + "networking", + "node", 
"optimization", "probe", "probes", + "problems", "production", "pull", "pulls", + "ready", "resolution", "resolve", "resource", @@ -87,8 +94,10 @@ exports[`azure-diagnostics - Trigger Tests Trigger Keywords Snapshot skill keywo "container", "cost", "covers", + "crashing", "creating", "debug", + "deploy", "deploying", "diagnostic", "diagnostics", @@ -103,16 +112,21 @@ exports[`azure-diagnostics - Trigger Tests Trigger Keywords Snapshot skill keywo "invocation", "issue", "issues", + "kubernetes", "logs", "mcp", "monitor", "monitoring", + "networking", + "node", "optimization", "probe", "probes", + "problems", "production", "pull", "pulls", + "ready", "resolution", "resolve", "resource", From 428ee22ad43d44046ff88bf5a1463994166051a8 Mon Sep 17 00:00:00 2001 From: Samantha Fernandez Date: Thu, 5 Mar 2026 17:51:24 -0500 Subject: [PATCH 2/4] Update plugin/skills/azure-diagnostics/SKILL.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- plugin/skills/azure-diagnostics/SKILL.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/plugin/skills/azure-diagnostics/SKILL.md b/plugin/skills/azure-diagnostics/SKILL.md index f1d970719..807386b97 100644 --- a/plugin/skills/azure-diagnostics/SKILL.md +++ b/plugin/skills/azure-diagnostics/SKILL.md @@ -51,7 +51,7 @@ Activate this skill when user wants to: |---------|---------------|-----------| | **Container Apps** | Image pull failures, cold starts, health probes, port mismatches | [container-apps/](references/container-apps/README.md) | | **Function Apps** | App details, invocation failures, timeouts, binding errors, cold starts, missing app settings | [functions/](references/functions/README.md) | -| **Azure Kubernetes** | Pod failures, node issues, networking problems, Helm deployment errors | [azure-kubernetes/](references/azure-kubernetes/README.md) | +| **Azure Kubernetes** | Pod failures, node issues, networking problems | [azure-kubernetes/](references/azure-kubernetes/README.md) | --- 
From 320821c42f5e312acf77eb8c9a7e30399621f74e Mon Sep 17 00:00:00 2001 From: sammsft1 Date: Fri, 6 Mar 2026 02:01:37 -0500 Subject: [PATCH 3/4] Editing SKILL.md file description (token limit) Also, updating the networking and node issues references to provide more comprehensive troubleshooting steps for Kubernetes on Azure. --- plugin/skills/azure-diagnostics/SKILL.md | 2 +- .../references/azure-kubernetes/README.md | 24 +++++++++++++++++++ .../references/azure-kubernetes/networking.md | 2 +- .../azure-kubernetes/node-issues.md | 2 +- 4 files changed, 27 insertions(+), 3 deletions(-) diff --git a/plugin/skills/azure-diagnostics/SKILL.md b/plugin/skills/azure-diagnostics/SKILL.md index f1d970719..9e6f7b58f 100644 --- a/plugin/skills/azure-diagnostics/SKILL.md +++ b/plugin/skills/azure-diagnostics/SKILL.md @@ -1,6 +1,6 @@ --- name: azure-diagnostics -description: "Debug and troubleshoot production issues on Azure. Covers Container Apps, Function Apps, and AKS diagnostics, log analysis with KQL, health checks, and common issue resolution for image pulls, cold starts, health probes, pod failures, node issues, and networking problems. USE FOR: debug production issues, troubleshoot container apps, troubleshoot function apps, troubleshoot Azure Functions, troubleshoot AKS, troubleshoot kubernetes, analyze logs with KQL, fix image pull failures, resolve cold start issues, investigate health probe failures, check resource health, view application logs, find root cause of errors, function app not working, function invocation failures, pod crashing, node not ready, kubernetes networking DO NOT USE FOR: deploying applications (use azure-deploy), creating new resources (use azure-prepare), setting up monitoring (use azure-observability), cost optimization (use azure-cost-optimization)" +description: "Debug and troubleshoot production issues on Azure. 
Covers Container Apps, Function Apps, and Azure Kubernetes Service diagnostics, log analysis with KQL, health checks, and common issue resolution for image pulls, cold starts, health probes, pod failures, node issues, and networking problems. USE FOR: debug production issues, troubleshoot container apps, troubleshoot function apps, troubleshoot Azure Functions, troubleshoot AKS, troubleshoot kubernetes, analyze logs with KQL, fix image pull failures, resolve cold start issues, investigate health probe failures, check resource health, view application logs, find root cause of errors, function app not working, function invocation failures, pod crashing, node not ready, kubernetes networking DO NOT USE FOR: deploying applications (use azure-deploy), creating new resources (use azure-prepare), setting up monitoring (use azure-observability), cost optimization (use azure-cost-optimization)" license: MIT metadata: author: Microsoft diff --git a/plugin/skills/azure-diagnostics/references/azure-kubernetes/README.md b/plugin/skills/azure-diagnostics/references/azure-kubernetes/README.md index ca9510dd6..82c4371e3 100644 --- a/plugin/skills/azure-diagnostics/references/azure-kubernetes/README.md +++ b/plugin/skills/azure-diagnostics/references/azure-kubernetes/README.md @@ -41,6 +41,30 @@ AKS troubleshooting covers pod failures, node issues, networking problems, and c - [Networking Troubleshooting](networking.md) - [Node & Cluster Troubleshooting](node-issues.md) +## General Investigation — "What happened in my cluster?" + +When a user asks a broad question like "what happened in my AKS cluster?" or "check my AKS status", follow this flow to surface recent activity and issues: + +``` +1. Cluster health → az aks show -g -n --query "provisioningState" +2. Recent events → kubectl get events -A --sort-by='.lastTimestamp' | head -40 +3. Node status → kubectl get nodes -o wide +4. 
Unhealthy pods → kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded +5. Activity log → az monitor activity-log list -g <resource-group> --max-events 20 -o table +``` + +| Customer Question | Maps To | +|-------------------|---------| +| "What happened in my AKS cluster?" | Events + Activity log + Node status | +| "Is my cluster healthy?" | Cluster provisioning state + Node conditions | +| "Why are my pods failing?" | Unhealthy pods → `kubectl describe pod` → see Pod Failures section | +| "My app is unreachable" | See [Networking Troubleshooting](networking.md) | +| "Nodes are having issues" | See [Node & Cluster Troubleshooting](node-issues.md) | + +> 💡 **Tip:** For AI-powered diagnostics, use AppLens MCP with the cluster resource ID — it automatically detects common issues and provides remediation recommendations. + +--- + ## Common Diagnostic Commands ```bash diff --git a/plugin/skills/azure-diagnostics/references/azure-kubernetes/networking.md b/plugin/skills/azure-diagnostics/references/azure-kubernetes/networking.md index 37102ed11..c142501a1 100644 --- a/plugin/skills/azure-diagnostics/references/azure-kubernetes/networking.md +++ b/plugin/skills/azure-diagnostics/references/azure-kubernetes/networking.md @@ -1,6 +1,6 @@ # Networking Troubleshooting -> For CNI-specific issues (IP exhaustion, Azure CNI Overlay, eBPF/Cilium, egress/UDR, private cluster egress), see `references/networking-cni.md`. +> For CNI-specific issues (IP exhaustion, Azure CNI Overlay, eBPF/Cilium, egress/UDR, private cluster egress), check CNI pod health: `kubectl get pods -n kube-system -l k8s-app=azure-cni` and review [AKS networking concepts](https://learn.microsoft.com/azure/aks/concepts-network). 
## Service Unreachable / Connection Refused diff --git a/plugin/skills/azure-diagnostics/references/azure-kubernetes/node-issues.md b/plugin/skills/azure-diagnostics/references/azure-kubernetes/node-issues.md index d310c231f..4e0d18822 100644 --- a/plugin/skills/azure-diagnostics/references/azure-kubernetes/node-issues.md +++ b/plugin/skills/azure-diagnostics/references/azure-kubernetes/node-issues.md @@ -197,7 +197,7 @@ kubectl get nodes -L topology.kubernetes.io/zone | Service endpoints unreachable from one zone | Topology-aware routing misconfigured | Check `service.spec.trafficDistribution` or TopologyKeys | | Upgrade causing zone imbalance | Surge nodes in one zone | Configure `maxSurge` in node pool upgrade settings | -**ZRS storage for multi-AZ (recommended):** Prevents zone affinity conflicts on disk PVCs. Use `Premium_ZRS` or `StandardSSD_ZRS` as the `skuname` in a custom StorageClass. See `references/storage.md` for the ZRS StorageClass spec. +**ZRS storage for multi-AZ (recommended):** Prevents zone affinity conflicts on disk PVCs. Use `Premium_ZRS` or `StandardSSD_ZRS` as the `skuname` in a custom StorageClass. See [AKS storage best practices](https://learn.microsoft.com/azure/aks/operator-best-practices-storage) for the ZRS StorageClass spec. 
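The ZRS recommendation above can be sketched as a StorageClass manifest. This is an illustrative example rather than the spec from the linked best-practices page; the class name `managed-csi-zrs` and the non-SKU parameters are assumptions:

```yaml
# Illustrative StorageClass for zone-redundant Azure managed disks.
# skuName accepts Premium_ZRS or StandardSSD_ZRS; the name is a placeholder.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-csi-zrs
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_ZRS
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
```

`WaitForFirstConsumer` delays disk provisioning until the pod is scheduled, which together with ZRS avoids the zone affinity conflicts on disk PVCs described above.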
--- From 7e26bc9e6592b209abe51f45a23e487a79e31d26 Mon Sep 17 00:00:00 2001 From: sammsft1 Date: Tue, 10 Mar 2026 16:03:33 -0400 Subject: [PATCH 4/4] Updating the original skills file with the right references --- .../references/azure-kubernetes/README.md | 135 +++-------------- .../azure-kubernetes/general-diagnostics.md | 63 ++++++++ .../azure-kubernetes/pod-failures.md | 140 ++++++++++++++++++ 3 files changed, 227 insertions(+), 111 deletions(-) create mode 100644 plugin/skills/azure-diagnostics/references/azure-kubernetes/general-diagnostics.md create mode 100644 plugin/skills/azure-diagnostics/references/azure-kubernetes/pod-failures.md diff --git a/plugin/skills/azure-diagnostics/references/azure-kubernetes/README.md b/plugin/skills/azure-diagnostics/references/azure-kubernetes/README.md index 82c4371e3..20a35cd48 100644 --- a/plugin/skills/azure-diagnostics/references/azure-kubernetes/README.md +++ b/plugin/skills/azure-diagnostics/references/azure-kubernetes/README.md @@ -2,126 +2,39 @@ > **AUTHORITATIVE GUIDANCE — MANDATORY COMPLIANCE** > -> This document is the **official source** for debugging and troubleshooting Azure Kubernetes Service (AKS) production issues. Follow these instructions to diagnose and resolve common AKS problems systematically. - -## Overview - -AKS troubleshooting covers pod failures, node issues, networking problems, and cluster-level failures. This guide provides systematic diagnosis flows and remediation steps for the most common issues. +> This document is the **official source** for debugging and troubleshooting Azure Kubernetes Service (AKS) production issues. Use the reference files below for detailed diagnosis and remediation steps. ## Quick Diagnosis Flow -1. **Identify symptoms** - What's failing? (Pods, nodes, networking, services?) -2. **Check cluster health** - Is AKS control plane healthy? -3. **Review events and logs** - What do Kubernetes events show? -4. **Isolate the issue** - Pod-level, node-level, or cluster-level? 
-5. **Apply targeted fixes** - Use the appropriate troubleshooting section +1. **Identify symptoms** — What's failing? (Pods, nodes, networking, services?) +2. **Check cluster health** — Is AKS control plane healthy? +3. **Review events and logs** — What do Kubernetes events show? +4. **Isolate the issue** — Pod-level, node-level, or cluster-level? +5. **Apply targeted fixes** — Use the appropriate reference file below ## Troubleshooting Sections -### Pod Failures & Application Issues -- CrashLoopBackOff, ImagePullBackOff, Pending pods -- Readiness/liveness probe failures -- Resource constraints (CPU/memory limits) - -### Node & Cluster Issues -- Node NotReady conditions -- Autoscaling failures -- Resource pressure and capacity planning -- Upgrade problems - -### Networking Problems -- Service unreachable/connection refused -- DNS resolution failures -- Load balancer issues -- Ingress routing failures -- Network policy blocking - -## References - -- [Networking Troubleshooting](networking.md) -- [Node & Cluster Troubleshooting](node-issues.md) - -## General Investigation — "What happened in my cluster?" - -When a user asks a broad question like "what happened in my AKS cluster?" or "check my AKS status", follow this flow to surface recent activity and issues: - -``` -1. Cluster health → az aks show -g -n --query "provisioningState" -2. Recent events → kubectl get events -A --sort-by='.lastTimestamp' | head -40 -3. Node status → kubectl get nodes -o wide -4. Unhealthy pods → kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded -5. Activity log → az monitor activity-log list -g --max-events 20 -o table -``` - -| Customer Question | Maps To | -|-------------------|---------| -| "What happened in my AKS cluster?" | Events + Activity log + Node status | -| "Is my cluster healthy?" | Cluster provisioning state + Node conditions | -| "Why are my pods failing?" 
| Unhealthy pods → `kubectl describe pod` → see Pod Failures section | -| "My app is unreachable" | See [Networking Troubleshooting](networking.md) | -| "Nodes are having issues" | See [Node & Cluster Troubleshooting](node-issues.md) | - -> 💡 **Tip:** For AI-powered diagnostics, use AppLens MCP with the cluster resource ID — it automatically detects common issues and provides remediation recommendations. - ---- - -## Common Diagnostic Commands - -```bash -# Cluster overview -kubectl get nodes -o wide -kubectl get pods -A -o wide -kubectl get events -A --sort-by='.lastTimestamp' - -# Pod diagnostics -kubectl describe pod -n -kubectl logs -n --previous - -# Node diagnostics -kubectl describe node -kubectl get pods -n kube-system -o wide - -# Networking diagnostics -kubectl get svc -A -kubectl get endpoints -A -kubectl get networkpolicy -A -``` - -## AKS-Specific Tools - -### Azure CLI Diagnostics -```bash -# Check cluster status -az aks show -g -n --query "provisioningState" - -# Get cluster credentials -az aks get-credentials -g -n - -# View node pools -az aks nodepool list -g --cluster-name -o table -``` +| Scenario | Reference File | Covers | +|----------|---------------|--------| +| Pod Failures & Application Issues | [pod-failures.md](pod-failures.md) | CrashLoopBackOff, ImagePullBackOff, Pending pods, readiness/liveness probe failures, resource constraints (CPU/memory) | +| Node & Cluster Issues | [node-issues.md](node-issues.md) | Node NotReady, autoscaling failures, resource pressure, upgrade problems, spot evictions, multi-AZ, zero-downtime upgrades | +| Networking Problems | [networking.md](networking.md) | Service unreachable, DNS resolution failures, load balancer issues, ingress routing, network policy blocking | +| General Investigation | [general-diagnostics.md](general-diagnostics.md) | Cluster health checks, AKS CLI tools, AppLens diagnostics, best practices | -### AppLens (MCP) for AKS -For AI-powered diagnostics: -``` -mcp_azure_mcp_applens - 
intent: "diagnose AKS cluster issues" - command: "diagnose" - parameters: - resourceId: "/subscriptions//resourceGroups//providers/Microsoft.ContainerService/managedClusters/" -``` +## Quick Question Router -## Best Practices +| Customer Question | Start Here | +|-------------------|------------| +| "What happened in my AKS cluster?" | [General Diagnostics](general-diagnostics.md) | +| "Is my cluster healthy?" | [General Diagnostics](general-diagnostics.md) | +| "Why are my pods failing?" | [Pod Failures](pod-failures.md) | +| "My app is unreachable" | [Networking](networking.md) | +| "Nodes are having issues" | [Node Issues](node-issues.md) | -1. **Start with kubectl get/describe** - Always check basic status first -2. **Check events** - `kubectl get events -A` reveals recent issues -3. **Use systematic isolation** - Pod → Node → Cluster → Network -4. **Document changes** - Note what you tried and what worked -5. **Escalate when needed** - For control plane issues, contact Azure support +> 💡 **Tip:** For AI-powered diagnostics, use AppLens MCP with the cluster resource ID — it automatically detects common issues and provides remediation recommendations. See [General Diagnostics](general-diagnostics.md) for usage details. 
## Related Skills -- **azure-diagnostics** - General Azure resource troubleshooting -- **azure-deploy** - Deployment and configuration issues -- **azure-observability** - Monitoring and logging setup -``` \ No newline at end of file +- **azure-diagnostics** — General Azure resource troubleshooting +- **azure-deploy** — Deployment and configuration issues +- **azure-observability** — Monitoring and logging setup \ No newline at end of file diff --git a/plugin/skills/azure-diagnostics/references/azure-kubernetes/general-diagnostics.md b/plugin/skills/azure-diagnostics/references/azure-kubernetes/general-diagnostics.md new file mode 100644 index 000000000..7533e9033 --- /dev/null +++ b/plugin/skills/azure-diagnostics/references/azure-kubernetes/general-diagnostics.md @@ -0,0 +1,63 @@ +# General AKS Investigation & Diagnostics + +## "What happened in my cluster?" + +When a user asks a broad question like "what happened in my AKS cluster?" or "check my AKS status", follow this systematic flow: + +```bash +# 1. Cluster health +az aks show -g <resource-group> -n <cluster-name> --query "provisioningState" + +# 2. Recent events +kubectl get events -A --sort-by='.lastTimestamp' | head -40 + +# 3. Node status +kubectl get nodes -o wide + +# 4. Unhealthy pods +kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded + +# 5. All pods overview +kubectl get pods -A -o wide + +# 6. System pods health +kubectl get pods -n kube-system -o wide + +# 7. 
Activity log +az monitor activity-log list -g <resource-group> --max-events 20 -o table +``` + +--- + +## AKS CLI Tools + +```bash +# Get cluster credentials (required before kubectl commands) +az aks get-credentials -g <resource-group> -n <cluster-name> + +# View node pools +az aks nodepool list -g <resource-group> --cluster-name <cluster-name> -o table +``` + +### AppLens (MCP) for AKS + +For AI-powered diagnostics: +``` +mcp_azure_mcp_applens + intent: "diagnose AKS cluster issues" + command: "diagnose" + parameters: + resourceId: "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.ContainerService/managedClusters/<cluster-name>" +``` + +> 💡 **Tip:** AppLens automatically detects common issues and provides remediation recommendations using the cluster resource ID. + +--- + +## Best Practices + +1. **Start with kubectl get/describe** — Always check basic status first +2. **Check events** — `kubectl get events -A` reveals recent issues +3. **Use systematic isolation** — Pod → Node → Cluster → Network +4. **Document changes** — Note what you tried and what worked +5. **Escalate when needed** — For control plane issues, contact Azure support diff --git a/plugin/skills/azure-diagnostics/references/azure-kubernetes/pod-failures.md b/plugin/skills/azure-diagnostics/references/azure-kubernetes/pod-failures.md new file mode 100644 index 000000000..b382b48d4 --- /dev/null +++ b/plugin/skills/azure-diagnostics/references/azure-kubernetes/pod-failures.md @@ -0,0 +1,140 @@ +# Pod Failures & Application Issues + +## Common Pod Diagnostic Commands + +```bash +# List unhealthy pods across all namespaces +kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded + +# All pods wide view +kubectl get pods -A -o wide + +# Detailed pod status — events section is critical +kubectl describe pod <pod-name> -n <namespace> + +# Pod logs (current and previous crash) +kubectl logs <pod-name> -n <namespace> +kubectl logs <pod-name> -n <namespace> --previous +``` + +--- + +## CrashLoopBackOff + +Pod starts, crashes, restarts with exponential backoff (10s, 20s, 40s… up to 5m). 
+ +**Diagnostics:** +```bash +kubectl describe pod <pod-name> -n <namespace> +# Check: Exit Code, Reason, Last State, Events + +kubectl logs <pod-name> -n <namespace> --previous +# Shows stdout/stderr from the last crashed container +``` + +**Decision tree:** + +| Exit Code | Meaning | Fix Path | +|-----------|---------|----------| +| `0` | App exited successfully (unexpected for long-running) | Check if entrypoint/command is correct; app may be a one-shot | +| `1` | Application error | Read logs — unhandled exception, missing config, bad startup | +| `137` | OOMKilled (SIGKILL) | Increase `resources.limits.memory`; check for memory leaks | +| `139` | Segfault (SIGSEGV) | Binary compatibility issue or native code bug | +| `143` | SIGTERM — graceful shutdown | Pod was terminated; check if liveness probe killed it | + +**OOMKilled specifically:** +```bash +kubectl describe pod <pod-name> -n <namespace> | grep -A2 "Last State" +# Reason: OOMKilled → container exceeded memory limit +``` + +Fix: increase `resources.limits.memory` or optimize application memory usage. Check `kubectl top pod <pod-name> -n <namespace>` for actual usage. + +--- + +## ImagePullBackOff + +Pod can't pull the container image. + +**Diagnostics:** +```bash +kubectl describe pod <pod-name> -n <namespace> +# Events section shows the exact pull error +``` + +| Error Message | Cause | Fix | +|---------------|-------|-----| +| `ErrImagePull` / `ImagePullBackOff` | Image name or tag is wrong | Verify image name and tag exist in the registry | +| `unauthorized: authentication required` | Missing or wrong pull secret | Create/update `imagePullSecrets` on the pod or service account | +| `manifest unknown` | Tag doesn't exist | Check available tags in the registry | +| `context deadline exceeded` | Registry unreachable | Check network/firewall; for ACR, verify AKS → ACR integration | +**ACR integration check:** +```bash +# Verify AKS is attached to ACR +az aks check-acr -g <resource-group> -n <cluster-name> --acr <acr-name>.azurecr.io +``` + +--- + +## Pending Pods + +Pod stays in `Pending` — scheduler can't place it. 
+ +**Diagnostics:** +```bash +kubectl describe pod <pod-name> -n <namespace> +# Events section shows why scheduling failed +``` + +| Event Message | Cause | Fix | +|---------------|-------|-----| +| `Insufficient cpu` / `Insufficient memory` | No node has enough resources | Scale node pool; reduce resource requests; check for overcommit | +| `node(s) had taint ... that the pod didn't tolerate` | Taint/toleration mismatch | Add matching toleration or use a different node pool | +| `node(s) didn't match Pod's node affinity/selector` | Affinity rule unsatisfiable | Check `nodeSelector` or `nodeAffinity` rules | +| `persistentvolumeclaim ... not found` / `unbound` | PVC not ready | Check PVC status; verify storage class exists | +| `0/N nodes are available: N node(s) had volume node affinity conflict` | Zonal disk vs pod in different zone | Use ZRS storage class or ensure same zone | + +--- + +## Readiness & Liveness Probe Failures + +**Readiness probe failure** → pod removed from Service endpoints (no traffic). **Liveness probe failure** → pod killed and restarted. + +**Diagnostics:** +```bash +kubectl describe pod <pod-name> -n <namespace> +# Look for: "Readiness probe failed" or "Liveness probe failed" in Events + +# Check the pod's READY column — must show n/n +kubectl get pod <pod-name> -n <namespace> +``` + +| Symptom | Cause | Fix | +|---------|-------|-----| +| READY shows `0/1` but pod is Running | Readiness probe failing | Check probe path, port, and app health endpoint | +| Pod restarts repeatedly | Liveness probe failing | Increase `initialDelaySeconds`; check if app starts slowly | +| Probe timeout errors | App responds too slowly | Increase `timeoutSeconds`; check app performance | + +> 💡 **Tip:** Set `initialDelaySeconds` on liveness probes to be longer than your app's startup time. A common mistake is killing pods before they finish initializing. 
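The probe guidance above can be illustrated with a container spec fragment. All values here are assumptions for a hypothetical app that needs about 30 seconds to start; tune paths, port, and timings to your own startup profile:

```yaml
# Illustrative probe configuration (endpoints, port, and timings are assumed)
livenessProbe:
  httpGet:
    path: /healthz           # assumed health endpoint
    port: 8080
  initialDelaySeconds: 45    # longer than worst-case app startup
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3        # ~30s of consecutive failures before restart
readinessProbe:
  httpGet:
    path: /ready             # assumed readiness endpoint
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 3
```

A failing readiness probe only removes the pod from Service endpoints, so readiness checks can safely be stricter and faster than liveness checks, which kill the container.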
+ +--- + +## Resource Constraints (CPU/Memory) + +**Check actual usage vs limits:** +```bash +kubectl top pod -n <namespace> +kubectl top pod -n <namespace> --sort-by=memory + +# Compare with requests/limits +kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}' +``` + +| Symptom | Cause | Fix | +|---------|-------|-----| +| OOMKilled (exit code 137) | Container exceeded memory limit | Increase `limits.memory` or fix memory leak | +| CPU throttling (slow responses) | Container hitting CPU limit | Increase `limits.cpu` or remove CPU limits | +| Pending — insufficient resources | Requests exceed available node capacity | Lower requests, scale nodes, or use larger VM sizes | + +> ⚠️ **Warning:** Setting CPU limits can cause unnecessary throttling even when the node has spare capacity. Many teams set CPU requests but not limits. Memory limits should always be set.
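Following the warning above, a resources block that sets a CPU request without a CPU limit, and pins the memory request to the memory limit, might look like this (all values are illustrative):

```yaml
# Illustrative resources block: CPU request only (no CFS throttling),
# memory request equal to memory limit (predictable OOM behavior).
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    memory: "256Mi"   # exceeding this → OOMKilled (exit code 137)
```

The request still guarantees scheduling capacity; omitting the CPU limit lets the container burst into idle node CPU instead of being throttled at an arbitrary ceiling.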