6 changes: 4 additions & 2 deletions plugin/skills/azure-diagnostics/SKILL.md
@@ -1,6 +1,6 @@
---
name: azure-diagnostics
description: "Debug and troubleshoot production issues on Azure. Covers Container Apps and Function Apps diagnostics, log analysis with KQL, health checks, and common issue resolution for image pulls, cold starts, health probes, and function invocation failures. USE FOR: debug production issues, troubleshoot container apps, troubleshoot function apps, troubleshoot Azure Functions, analyze logs with KQL, fix image pull failures, resolve cold start issues, investigate health probe failures, check resource health, view application logs, find root cause of errors, function app not working, function invocation failures DO NOT USE FOR: deploying applications (use azure-deploy), creating new resources (use azure-prepare), setting up monitoring (use azure-observability), cost optimization (use azure-cost-optimization)"
description: "Debug and troubleshoot production issues on Azure. Covers Container Apps, Function Apps, and AKS diagnostics, log analysis with KQL, health checks, and common issue resolution for image pulls, cold starts, health probes, pod failures, node issues, and networking problems. USE FOR: debug production issues, troubleshoot container apps, troubleshoot function apps, troubleshoot Azure Functions, troubleshoot AKS, troubleshoot kubernetes, analyze logs with KQL, fix image pull failures, resolve cold start issues, investigate health probe failures, check resource health, view application logs, find root cause of errors, function app not working, function invocation failures, pod crashing, node not ready, kubernetes networking DO NOT USE FOR: deploying applications (use azure-deploy), creating new resources (use azure-prepare), setting up monitoring (use azure-observability), cost optimization (use azure-cost-optimization)"
Copilot AI Mar 5, 2026
The updated trigger snapshot now includes deploy as an extracted keyword, even though deployment requests are explicitly in the “DO NOT USE” section. This can increase false-positive routing of deployment-related prompts to azure-diagnostics. Consider rephrasing the “DO NOT USE” section (or formatting skill names) to avoid emitting bare deployment keywords, or adjust the keyword extraction to ignore the negative-use section.

Suggested change
description: "Debug and troubleshoot production issues on Azure. Covers Container Apps, Function Apps, and AKS diagnostics, log analysis with KQL, health checks, and common issue resolution for image pulls, cold starts, health probes, pod failures, node issues, and networking problems. USE FOR: debug production issues, troubleshoot container apps, troubleshoot function apps, troubleshoot Azure Functions, troubleshoot AKS, troubleshoot kubernetes, analyze logs with KQL, fix image pull failures, resolve cold start issues, investigate health probe failures, check resource health, view application logs, find root cause of errors, function app not working, function invocation failures, pod crashing, node not ready, kubernetes networking DO NOT USE FOR: deploying applications (use azure-deploy), creating new resources (use azure-prepare), setting up monitoring (use azure-observability), cost optimization (use azure-cost-optimization)"
description: "Debug and troubleshoot production issues on Azure. Covers Container Apps, Function Apps, and AKS diagnostics, log analysis with KQL, health checks, and common issue resolution for image pulls, cold starts, health probes, pod failures, node issues, and networking problems. USE FOR: debug production issues, troubleshoot container apps, troubleshoot function apps, troubleshoot Azure Functions, troubleshoot AKS, troubleshoot kubernetes, analyze logs with KQL, fix image pull failures, resolve cold start issues, investigate health probe failures, check resource health, view application logs, find root cause of errors, function app not working, function invocation failures, pod crashing, node not ready, kubernetes networking. AVOID USING FOR: application rollouts or releases (instead, use the `azure-deploy` skill), creating new resources or environments (use the `azure-prepare` skill), configuring or tuning monitoring (use the `azure-observability` skill), or cost analysis and optimization (use the `azure-cost-optimization` skill)."

license: MIT
metadata:
author: Microsoft
@@ -51,6 +51,7 @@ Activate this skill when user wants to:
|---------|---------------|-----------|
| **Container Apps** | Image pull failures, cold starts, health probes, port mismatches | [container-apps/](references/container-apps/README.md) |
| **Function Apps** | App details, invocation failures, timeouts, binding errors, cold starts, missing app settings | [functions/](references/functions/README.md) |
| **Azure Kubernetes** | Pod failures, node issues, networking problems, Helm deployment errors | [azure-kubernetes/](references/azure-kubernetes/README.md) |

---

@@ -133,4 +134,5 @@ az monitor activity-log list -g RG --max-events 20

- [KQL Query Library](references/kql-queries.md)
- [Azure Resource Graph Queries](references/azure-resource-graph.md)
- [Function Apps Troubleshooting](references/functions/README.md)
- [Function Apps Troubleshooting](references/functions/README.md)
- [Azure Kubernetes Troubleshooting](references/azure-kubernetes/README.md)
@@ -0,0 +1,103 @@
# Azure Kubernetes Service (AKS) Troubleshooting

> **AUTHORITATIVE GUIDANCE — MANDATORY COMPLIANCE**
>
> This document is the **official source** for debugging and troubleshooting Azure Kubernetes Service (AKS) production issues. Follow these instructions to diagnose and resolve common AKS problems systematically.

## Overview

AKS troubleshooting covers pod failures, node issues, networking problems, and cluster-level failures. This guide provides systematic diagnosis flows and remediation steps for the most common issues.

## Quick Diagnosis Flow

1. **Identify symptoms** - What's failing? (Pods, nodes, networking, services?)
2. **Check cluster health** - Is AKS control plane healthy?
3. **Review events and logs** - What do Kubernetes events show?
4. **Isolate the issue** - Pod-level, node-level, or cluster-level?
5. **Apply targeted fixes** - Use the appropriate troubleshooting section

## Troubleshooting Sections


Can you add the reference files directly in this section - which one aligns with which troubleshooting section?


### Pod Failures & Application Issues


Can we temporarily take out any categories we don't have reference files for? We can add them back when we create the reference files. So pod issues and cluster issues should be removed for now.

- CrashLoopBackOff, ImagePullBackOff, Pending pods
- Readiness/liveness probe failures
- Resource constraints (CPU/memory limits)

### Node & Cluster Issues
- Node NotReady conditions
- Autoscaling failures
- Resource pressure and capacity planning
- Upgrade problems

### Networking Problems
- Service unreachable/connection refused
- DNS resolution failures
- Load balancer issues
- Ingress routing failures
- Network policy blocking

## References

- [Networking Troubleshooting](networking.md)
- [Node & Cluster Troubleshooting](node-issues.md)

## Common Diagnostic Commands


We should move the commands and any general advice into the actual reference files. The README should be very concise and serve as a redirect to the right reference file for each troubleshooting scenario.


```bash
# Cluster overview
kubectl get nodes -o wide
kubectl get pods -A -o wide
kubectl get events -A --sort-by='.lastTimestamp'

# Pod diagnostics
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous

# Node diagnostics
kubectl describe node <node-name>
kubectl get pods -n kube-system -o wide

# Networking diagnostics
kubectl get svc -A
kubectl get endpoints -A
kubectl get networkpolicy -A
```

## AKS-Specific Tools

### Azure CLI Diagnostics
```bash
# Check cluster status
az aks show -g <rg> -n <cluster> --query "provisioningState"

# Get cluster credentials
az aks get-credentials -g <rg> -n <cluster>

# View node pools
az aks nodepool list -g <rg> --cluster-name <cluster> -o table
```

### AppLens (MCP) for AKS
For AI-powered diagnostics:
```
mcp_azure_mcp_applens
intent: "diagnose AKS cluster issues"
command: "diagnose"
parameters:
  resourceId: "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.ContainerService/managedClusters/<cluster>"
```

## Best Practices

1. **Start with kubectl get/describe** - Always check basic status first
2. **Check events** - `kubectl get events -A` reveals recent issues
3. **Use systematic isolation** - Pod → Node → Cluster → Network
4. **Document changes** - Note what you tried and what worked
5. **Escalate when needed** - For control plane issues, contact Azure support

## Related Skills

- **azure-diagnostics** - General Azure resource troubleshooting
- **azure-deploy** - Deployment and configuration issues
- **azure-observability** - Monitoring and logging setup
```
Copilot AI Mar 5, 2026

There is a trailing triple-backtick fence at the end of the file without a corresponding opening fence. This will break Markdown rendering for the tail of the document; remove the dangling fence.

Suggested change
- **azure-observability** - Monitoring and logging setup
```
- **azure-observability** - Monitoring and logging setup

@@ -0,0 +1,153 @@
# Networking Troubleshooting

> For CNI-specific issues (IP exhaustion, Azure CNI Overlay, eBPF/Cilium, egress/UDR, private cluster egress), see `references/networking-cni.md`.
Copilot AI Mar 5, 2026

This path is likely incorrect relative to references/azure-kubernetes/ (it would resolve to references/azure-kubernetes/references/networking-cni.md). Use a correct relative path (or a proper Markdown link) so readers can navigate to the intended doc.

Suggested change
> For CNI-specific issues (IP exhaustion, Azure CNI Overlay, eBPF/Cilium, egress/UDR, private cluster egress), see `references/networking-cni.md`.
> For CNI-specific issues (IP exhaustion, Azure CNI Overlay, eBPF/Cilium, egress/UDR, private cluster egress), see [networking CNI troubleshooting](./networking-cni.md).


## Service Unreachable / Connection Refused

**Diagnostics — always start here:**
```bash
# 1. Verify service exists and has endpoints
kubectl get svc <service-name> -n <ns>
kubectl get endpoints <service-name> -n <ns>

# 2. Test connectivity from inside the namespace
kubectl run netdebug --image=curlimages/curl -it --rm -n <ns> -- \
  curl -sv http://<service>.<ns>.svc.cluster.local:<port>/healthz
Comment on lines +14 to +15
Copilot AI Mar 5, 2026

The examples use an unpinned image (curlimages/curl) and a very old BusyBox tag (busybox:1.28). For safer, reproducible troubleshooting commands, pin to a specific, current tag (and consider a more security-focused debug image) to reduce the risk of pulling vulnerable or behavior-changing images.

```

**Decision tree:**

| Observation | Cause | Fix |
|-------------|-------|-----|
| Endpoints shows `<none>` | Label selector mismatch | Align selector with pod labels; check for typos |
Comment on lines +20 to +22
Copilot AI Mar 5, 2026

Same table formatting issue here: the double leading pipe (||) creates an extra empty column. Use single | for proper GitHub Markdown table rendering.

| Endpoints has IPs but unreachable | Port mismatch or app not listening | Confirm `targetPort` = actual container port |
| Works from some pods, fails from others | Network policy blocking | See Network Policy section |
| Works inside cluster, fails externally | Load balancer issue | See Load Balancer section |
| `ECONNREFUSED` immediately | App not listening on that port | `kubectl exec <pod> -- netstat -tlnp` |

**Running but not Ready = removed from Endpoints silently.** Check `kubectl get pod <pod> -n <ns>` — READY must show `n/n`. If not, readiness probe is failing; fix probe or app health endpoint.
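The selector and port alignment rules above can be sketched as a minimal Service/Deployment pair. This is an illustrative example — the names, labels, image, and ports are placeholders, not part of the original docs:

```yaml
# Hypothetical example: the Service selector must match the pod template's
# labels, and targetPort must match the port the container actually listens on.
apiVersion: v1
kind: Service
metadata:
  name: web                # placeholder name
spec:
  selector:
    app: web               # must equal the pod template's labels.app
  ports:
    - port: 80             # port clients reach via the Service
      targetPort: 8080     # must equal containerPort below
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web           # a mismatch here yields Endpoints = <none>
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.0   # placeholder image
          ports:
            - containerPort: 8080               # the app must listen here
```

If `kubectl get endpoints web` shows `<none>`, diff the Service's `spec.selector` against the pods' labels first; if endpoints exist but connections are refused, compare `targetPort` against what the container binds.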

---

## DNS Resolution Failures

**Diagnostics:**
```bash
# Confirm CoreDNS is running and healthy
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
kubectl top pod -n kube-system -l k8s-app=kube-dns # Check if CPU-throttled

# Live DNS test from the same namespace as the failing pod
kubectl run dnstest --image=busybox:1.28 -it --rm -n <ns> -- \
  nslookup <service-name>.<ns>.svc.cluster.local
Comment on lines +41 to +42
Copilot AI Mar 5, 2026

The examples use an unpinned image (curlimages/curl) and a very old BusyBox tag (busybox:1.28). For safer, reproducible troubleshooting commands, pin to a specific, current tag (and consider a more security-focused debug image) to reduce the risk of pulling vulnerable or behavior-changing images.



# CoreDNS logs — errors show here first
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100
```

**CoreDNS configmap:** `kubectl get configmap coredns -n kube-system -o yaml` — check `forward` plugin (upstream DNS), `cache` TTL, and any custom rewrites.
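For orientation when reading that configmap, a default-like Corefile has roughly this shape. Treat it as a sketch — exact plugins and ordering vary by AKS and CoreDNS version, and your cluster may carry custom rewrites:

```
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf   # upstream DNS; on AKS this path reaches 168.63.129.16
    cache 30                     # response cache TTL in seconds
    loop                         # detects forwarding loops
    reload                       # picks up configmap changes without restart
    loadbalance
}
```

The `forward` line is the one to check when internal names resolve but external ones fail, and `cache` explains why a fix can take up to the TTL to become visible.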

**AKS DNS failure patterns:**

| Symptom | Cause | Fix |
|---------|-------|-----|
| `NXDOMAIN` for `svc.cluster.local` | CoreDNS down or pod network broken | Restart CoreDNS pods; check CNI |
| Internal resolves, external NXDOMAIN | Custom DNS not forwarding to `168.63.129.16` | Fix upstream forwarder |
| Intermittent SERVFAIL under load | CoreDNS CPU throttled | Remove CPU limits or add replicas |
| Private cluster — external names fail | Custom DNS missing privatelink forwarder | Add conditional forwarder to Azure DNS |
| `i/o timeout` not `NXDOMAIN` | Port 53 blocked by NetworkPolicy or NSG | Allow UDP/TCP 53 from pods to kube-dns ClusterIP |

**Custom DNS on VNet — the most common AKS DNS trap:**
Custom VNet DNS servers must forward `.cluster.local` to the CoreDNS ClusterIP and everything else to `168.63.129.16`. Breaking either path causes split DNS failures.
```bash
kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'
# This IP must be the forward target for cluster.local in your custom DNS
```

**CoreDNS under load:** Check `kubectl get hpa coredns -n kube-system` and `kubectl top pod -n kube-system -l k8s-app=kube-dns`. If CPU-throttled and no HPA, manually scale: `kubectl scale deployment coredns -n kube-system --replicas=3`.

---

## Load Balancer Stuck in Pending

**Diagnostics:**
```bash
kubectl describe svc <svc> -n <ns>
# Events section reveals the actual Azure error

kubectl logs -n kube-system -l component=cloud-controller-manager --tail=100
# Azure cloud provider logs — more detail than kubectl events
```

**Error decision table:**

| Error in Events / CCM Logs | Cause | Fix |
|----------------------------|-------|-----|
| `InsufficientFreeAddresses` | Subnet has no free IPs | Expand subnet CIDR; use Azure CNI Overlay; use NAT gateway instead |
| `ensure(default/svc): failed... PublicIPAddress quota` | Public IP quota exhausted | Request quota increase for Public IP Addresses in the region |
| `cannot find NSG` | NSG name changed or detached | Re-associate NSG to the AKS subnet; check `az aks show` for NSG name |
| `reconciling NSG rules: failed` | NSG is locked or has conflicting rules | Remove resource lock; check for deny-all rules above AKS-managed rules |
| `subnet not found` | Wrong subnet name in annotation | Verify subnet name: `az network vnet subnet list -g <rg> --vnet-name <vnet>` |
| No events, stuck Pending | CCM can't authenticate to Azure | Check cluster managed identity has `Network Contributor` on the VNet resource group |

**Internal LB annotations:** Set `service.beta.kubernetes.io/azure-load-balancer-internal: "true"` and `azure-load-balancer-internal-subnet: "<subnet-name>"`. Add `azure-load-balancer-ipv4: "10.x.x.x"` for a static private IP.

**CCM identity check:** If no events and LB is stuck, verify the cluster's managed identity has `Network Contributor` on the VNet resource group: `az aks show -g <rg> -n <cluster> --query "identity.principalId" -o tsv` then check role assignments.
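Putting the internal LB annotations above together, a minimal Service sketch looks like this — the name, subnet, IP, and ports are placeholders for illustration:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: internal-app        # placeholder name
  annotations:
    # Provision an internal (private) Azure load balancer instead of a public one
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    # Place the frontend IP in a specific subnet (placeholder subnet name)
    service.beta.kubernetes.io/azure-load-balancer-internal-subnet: "apps-subnet"
    # Optional: pin a static private IP from that subnet (placeholder address)
    service.beta.kubernetes.io/azure-load-balancer-ipv4: "10.240.0.25"
spec:
  type: LoadBalancer
  selector:
    app: internal-app
  ports:
    - port: 443
      targetPort: 8443
```

If this Service sticks in Pending, the Events section of `kubectl describe svc internal-app` plus the CCM logs above will usually name the subnet, quota, or identity problem from the decision table.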

---

## Ingress Not Routing Traffic

**Diagnostics:**
```bash
# Confirm controller is running
kubectl get pods -n <ingress-ns> -l 'app.kubernetes.io/name in (ingress-nginx,nginx-ingress)'
kubectl logs -n <ingress-ns> -l app.kubernetes.io/name=ingress-nginx --tail=100

# Check the ingress resource state
kubectl describe ingress <name> -n <ns>
kubectl get ingress <name> -n <ns> # ADDRESS must be populated

# Check backend
kubectl get endpoints <backend-svc> -n <ns>
```

**Ingress failure patterns:**

| Symptom | Cause | Fix |
|---------|-------|-----|
| ADDRESS empty | LB not provisioned or wrong `ingressClassName` | Check controller service; set correct `ingressClassName` |
| 404 for all paths | No matching host rule | Check `host` field; `pathType: Prefix` vs `Exact` |
| 404 for some paths | Trailing slash mismatch | `Prefix /api` matches `/api/foo` not `/api` — add both |
Copilot AI Mar 5, 2026

This guidance is technically incorrect for Kubernetes Ingress pathType: Prefix semantics—/api should match /api as well as /api/... (and not match /apix). Please correct the statement so users don’t apply unnecessary or incorrect ingress rules.

Suggested change
| 404 for some paths | Trailing slash mismatch | `Prefix /api` matches `/api/foo` not `/api` — add both |
| 404 for some paths | Path / pathType mismatch | With `pathType: Prefix`, `/api` matches `/api` and `/api/...` but not `/apix`; with `Exact`, `/api` and `/api/` are different |

| 502 Bad Gateway | Backend pods unhealthy or wrong port | Verify Endpoints has IPs; confirm `targetPort` and readiness |
| 503 Service Unavailable | All backend pods down | Check pod restarts and readiness probe |
| TLS handshake fail | cert-manager not issuing | `kubectl describe certificate -n <ns>`; check ACME challenge |
| Works for host-a, 404 for host-b | DNS not pointing to ingress IP | `nslookup <host>` must resolve to ingress ADDRESS |

**Application Routing add-on:** `az aks show -g <rg> -n <cluster> --query "ingressProfile"` — if enabled, use `ingressClassName: webapprouting.kubernetes.azure.com`.
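The `ingressClassName` and `pathType` points above can be combined into one minimal Ingress sketch — host, backend service, and port are illustrative placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web                # placeholder name
spec:
  # Use webapprouting.kubernetes.azure.com for the AKS Application Routing
  # add-on, or your controller's class (e.g. nginx) otherwise
  ingressClassName: webapprouting.kubernetes.azure.com
  rules:
    - host: app.example.com          # must resolve (DNS) to the ingress ADDRESS
      http:
        paths:
          - path: /api               # with Prefix, matches /api and /api/...
            pathType: Prefix
            backend:
              service:
                name: api-svc        # placeholder backend Service
                port:
                  number: 8080       # must be a port the Service exposes
```

An empty ADDRESS on `kubectl get ingress web` points back at the controller or class name; a populated ADDRESS with 404s points at the `host`/`path` rules or backend endpoints.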

---

## Network Policy Blocking Traffic

**Finding which policy is blocking (the hard part):**
```bash
# List all policies in the namespace — check both ingress and egress
kubectl get networkpolicy -n <ns> -o yaml

# Check for a default-deny policy (blocks everything unless explicitly allowed)
kubectl get networkpolicy -n <ns> -o jsonpath='{range .items[?(@.spec.podSelector=={})]}
{.metadata.name}{"\n"}{end}'
Comment on lines +140 to +141
Copilot AI Mar 5, 2026

The JSONPath example includes an embedded newline inside single quotes, which is easy to copy/paste incorrectly and can be confusing to readers. Consider rewriting this command as a single-line JSONPath (or use an explicit line continuation) to make it more robust for documentation consumers.

Suggested change
kubectl get networkpolicy -n <ns> -o jsonpath='{range .items[?(@.spec.podSelector=={})]}
{.metadata.name}{"\n"}{end}'
kubectl get networkpolicy -n <ns> -o jsonpath='{range .items[?(@.spec.podSelector=={})]}{.metadata.name}{"\n"}{end}'


# Simulate traffic to identify the block
kubectl run probe --image=curlimages/curl -n <source-ns> -it --rm -- \
  curl -v --connect-timeout 3 http://<target-pod-ip>:<port>
# Timeout = network policy blocking. Connection refused = reached pod but app issue.
```

**Policy audit checklist:** (1) Get source pod labels. (2) Get destination pod labels. (3) Check destination namespace for ingress policy — does it allow from source labels? (4) Check source namespace for egress policy — does it allow to destination labels? Both directions need explicit allow rules if default-deny exists.

**AKS network policy engine check:** Azure NPM (Azure CNI): `kubectl get pods -n kube-system -l k8s-app=azure-npm`. Calico: `kubectl get pods -n calico-system`.

**Common default-deny escape:** Always add an egress policy allowing UDP/TCP port 53 to the kube-dns service IP — this is the most frequently forgotten rule when adding a default-deny NetworkPolicy.
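That DNS-allow rule can be sketched as the following NetworkPolicy. The namespace is a placeholder, and the `kube-system`/`k8s-app: kube-dns` labels are the common defaults — verify them on your cluster before applying:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: my-app              # placeholder application namespace
spec:
  podSelector: {}                # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        # namespaceSelector + podSelector in the SAME element means AND:
        # only kube-dns pods in kube-system, not all of kube-system
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Apply this alongside (not instead of) a default-deny egress policy; without it, pods time out on every lookup and the failure looks like a DNS outage rather than a policy block.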