Update azure-diagnostics Kubernetes references and snapshots #1146
sammsft1 wants to merge 5 commits into microsoft:main
@sammsft1 please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.
Contribution License Agreement
This Contribution License Agreement (“Agreement”) is agreed to by the party signing below (“You”),
Pull request overview
Enhances the azure-diagnostics skill’s AKS troubleshooting coverage by adding Kubernetes reference docs and updating the trigger snapshot to reflect the expanded scope.
Changes:
- Expanded `azure-diagnostics` skill description and added AKS as a supported troubleshooting area.
- Added AKS reference documentation for networking and node/cluster issue diagnosis.
- Updated trigger snapshot keywords to align with the new description/content.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/azure-diagnostics/__snapshots__/triggers.test.ts.snap | Updates trigger description/keyword snapshot to include AKS-related terms |
| plugin/skills/azure-diagnostics/SKILL.md | Extends skill scope to AKS and links to new AKS references |
| plugin/skills/azure-diagnostics/references/azure-kubernetes/README.md | Adds a top-level AKS troubleshooting landing page and common commands |
| plugin/skills/azure-diagnostics/references/azure-kubernetes/networking.md | Adds networking troubleshooting decision trees and diagnostic commands |
| plugin/skills/azure-diagnostics/references/azure-kubernetes/node-issues.md | Adds node/cluster troubleshooting guidance (NotReady, autoscaler, upgrades, etc.) |
| "crashing", | ||
| "creating", | ||
| "debug", | ||
| "deploy", | ||
| "deploying", |
The trigger keywords now include "deploy", which conflicts with the skill description’s “DO NOT USE FOR: deploying applications (use azure-deploy)”. This can cause mis-routing by matching user prompts about deploying. Consider adjusting the keyword extraction / trigger config so terms from the “DO NOT USE” section (or deploy-related stems like deploy) are excluded, then update the snapshot accordingly.
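One way to read the suggestion above is to cut the description at the negative-use marker before extracting keywords. The following is a minimal Python sketch under that assumption; `extract_keywords` and its word pattern are hypothetical and are not the repository's actual keyword extraction code.

```python
import re

# Marker text copied from the skill description quoted in this thread.
NEGATIVE_MARKER = "DO NOT USE FOR:"


def extract_keywords(description, min_len=4):
    """Hypothetical extractor: lowercase words of at least min_len characters,
    taken only from the text that precedes the negative-use marker."""
    positive_part = description.split(NEGATIVE_MARKER, 1)[0]
    words = re.findall(r"[a-z][a-z-]*", positive_part.lower())
    return sorted({w for w in words if len(w) >= min_len})


# Shortened description, for illustration only.
description = (
    "Debug and troubleshoot production issues on Azure. "
    "USE FOR: debug production issues, troubleshoot AKS "
    "DO NOT USE FOR: deploying applications (use azure-deploy)"
)
keywords = extract_keywords(description)
print(keywords)
```

With this approach, deploy-related words that only occur in the negative-use section never reach the keyword list, so the snapshot would no longer contain them.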
| |---------|-------|-----| | ||
| | ADDRESS empty | LB not provisioned or wrong `ingressClassName` | Check controller service; set correct `ingressClassName` | | ||
| | 404 for all paths | No matching host rule | Check `host` field; `pathType: Prefix` vs `Exact` | | ||
| | 404 for some paths | Trailing slash mismatch | `Prefix /api` matches `/api/foo` not `/api` — add both | |
This guidance is technically incorrect for Kubernetes Ingress pathType: Prefix semantics—/api should match /api as well as /api/... (and not match /apix). Please correct the statement so users don’t apply unnecessary or incorrect ingress rules.
| | 404 for some paths | Trailing slash mismatch | `Prefix /api` matches `/api/foo` not `/api` — add both | | |
| | 404 for some paths | Path / pathType mismatch | With `pathType: Prefix`, `/api` matches `/api` and `/api/...` but not `/apix`; with `Exact`, `/api` and `/api/` are different | |
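The matching behavior described in the suggested row can be sketched in a few lines of Python. This is a simplified illustration of element-wise `Prefix` matching and literal `Exact` matching, not an ingress controller's implementation; the function names are hypothetical.

```python
def _elements(path):
    """Split a URL path into its non-empty '/'-separated elements."""
    return [element for element in path.split("/") if element]


def prefix_matches(rule_path, request_path):
    """pathType: Prefix, compared element by element rather than by substring."""
    rule = _elements(rule_path)
    request = _elements(request_path)
    return request[:len(rule)] == rule


def exact_matches(rule_path, request_path):
    """pathType: Exact, a literal comparison in which a trailing slash matters."""
    return rule_path == request_path


for candidate in ("/api", "/api/", "/api/foo", "/apix"):
    print(candidate, prefix_matches("/api", candidate), exact_matches("/api", candidate))
```

Because the comparison is per element, `/apix` does not match a `Prefix /api` rule even though it starts with the same characters, while `/api` itself does match.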
| kubectl get networkpolicy -n <ns> -o jsonpath='{range .items[?(@.spec.podSelector=={})]} | ||
| {.metadata.name}{"\n"}{end}' |
The JSONPath example includes an embedded newline inside single quotes, which is easy to copy/paste incorrectly and can be confusing to readers. Consider rewriting this command as a single-line JSONPath (or use an explicit line continuation) to make it more robust for documentation consumers.
| kubectl get networkpolicy -n <ns> -o jsonpath='{range .items[?(@.spec.podSelector=={})]} | |
| {.metadata.name}{"\n"}{end}' | |
| kubectl get networkpolicy -n <ns> -o jsonpath='{range .items[?(@.spec.podSelector=={})]}{.metadata.name}{"\n"}{end}' |
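As an alternative that avoids shell quoting inside JSONPath altogether, the same filter can be applied to the JSON output of `kubectl get networkpolicy -n <ns> -o json`. The sketch below is a minimal Python example; the function name and the sample data are hypothetical.

```python
def policies_selecting_all_pods(policy_list):
    """Return the names of NetworkPolicies whose podSelector is explicitly empty ({})."""
    names = []
    for item in policy_list.get("items", []):
        selector = item.get("spec", {}).get("podSelector")
        if selector == {}:
            names.append(item["metadata"]["name"])
    return names


# Hand-written sample in the shape of a NetworkPolicyList, for illustration only.
sample = {
    "items": [
        {"metadata": {"name": "default-deny"}, "spec": {"podSelector": {}}},
        {
            "metadata": {"name": "allow-web"},
            "spec": {"podSelector": {"matchLabels": {"app": "web"}}},
        },
    ]
}
print(policies_selecting_all_pods(sample))
```

In practice the input would come from `json.load(sys.stdin)` with the kubectl output piped in, rather than from a literal dictionary.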
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
| - **azure-observability** - Monitoring and logging setup | ||
| ``` No newline at end of file |
There is a trailing triple-backtick fence at the end of the file without a corresponding opening fence. This will break Markdown rendering for the tail of the document; remove the dangling fence.
| - **azure-observability** - Monitoring and logging setup | |
| ``` | |
| - **azure-observability** - Monitoring and logging setup |
| @@ -0,0 +1,153 @@ | |||
| # Networking Troubleshooting | |||
| > For CNI-specific issues (IP exhaustion, Azure CNI Overlay, eBPF/Cilium, egress/UDR, private cluster egress), see `references/networking-cni.md`. | |||
This path is likely incorrect relative to references/azure-kubernetes/ (it would resolve to references/azure-kubernetes/references/networking-cni.md). Use a correct relative path (or a proper Markdown link) so readers can navigate to the intended doc.
| > For CNI-specific issues (IP exhaustion, Azure CNI Overlay, eBPF/Cilium, egress/UDR, private cluster egress), see `references/networking-cni.md`. | |
| > For CNI-specific issues (IP exhaustion, Azure CNI Overlay, eBPF/Cilium, egress/UDR, private cluster egress), see [networking CNI troubleshooting](./networking-cni.md). |
| | Service endpoints unreachable from one zone | Topology-aware routing misconfigured | Check `service.spec.trafficDistribution` or TopologyKeys | | ||
| | Upgrade causing zone imbalance | Surge nodes in one zone | Configure `maxSurge` in node pool upgrade settings | | ||
| **ZRS storage for multi-AZ (recommended):** Prevents zone affinity conflicts on disk PVCs. Use `Premium_ZRS` or `StandardSSD_ZRS` as the `skuname` in a custom StorageClass. See `references/storage.md` for the ZRS StorageClass spec. |
The referenced path references/storage.md is likely incorrect relative to this directory (it would resolve under references/azure-kubernetes/references/). Update the reference to the correct relative path (and preferably make it a clickable Markdown link).
| **ZRS storage for multi-AZ (recommended):** Prevents zone affinity conflicts on disk PVCs. Use `Premium_ZRS` or `StandardSSD_ZRS` as the `skuname` in a custom StorageClass. See `references/storage.md` for the ZRS StorageClass spec. | |
| **ZRS storage for multi-AZ (recommended):** Prevents zone affinity conflicts on disk PVCs. Use `Premium_ZRS` or `StandardSSD_ZRS` as the `skuname` in a custom StorageClass. See [storage.md](../storage.md) for the ZRS StorageClass spec. |
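Both path comments above rest on how a relative reference resolves from a file inside `references/azure-kubernetes/`. A minimal Python sketch of that resolution follows; `resolve_reference` is a hypothetical helper, `networking.md` is used as the example source document, and whether the target files actually exist is not checked.

```python
import posixpath


def resolve_reference(doc_path, reference):
    """Resolve a relative reference against the directory that contains doc_path."""
    return posixpath.normpath(posixpath.join(posixpath.dirname(doc_path), reference))


DOC = "plugin/skills/azure-diagnostics/references/azure-kubernetes/networking.md"

# The original reference gains an extra "references/" level.
print(resolve_reference(DOC, "references/networking-cni.md"))
# The suggested replacements resolve to a sibling file and to the parent directory.
print(resolve_reference(DOC, "./networking-cni.md"))
print(resolve_reference(DOC, "../storage.md"))
```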
| kubectl run netdebug --image=curlimages/curl -it --rm -n <ns> -- \ | ||
| curl -sv http://<service>.<ns>.svc.cluster.local:<port>/healthz |
The examples use an unpinned image (curlimages/curl) and a very old BusyBox tag (busybox:1.28). For safer, reproducible troubleshooting commands, pin to a specific, current tag (and consider a more security-focused debug image) to reduce the risk of pulling vulnerable or behavior-changing images.
| kubectl run dnstest --image=busybox:1.28 -it --rm -n <ns> -- \ | ||
| nslookup <service-name>.<ns>.svc.cluster.local |
The examples use an unpinned image (curlimages/curl) and a very old BusyBox tag (busybox:1.28). For safer, reproducible troubleshooting commands, pin to a specific, current tag (and consider a more security-focused debug image) to reduce the risk of pulling vulnerable or behavior-changing images.
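A documentation check for unpinned images could look roughly like the Python sketch below. `is_pinned` is a hypothetical helper that only inspects the image reference string; it cannot tell whether a pinned tag such as `busybox:1.28` is current, which is the second half of the concern raised above.

```python
def is_pinned(image):
    """Return True if the image reference carries a digest or a tag other than 'latest'."""
    if "@" in image:
        # A digest reference such as name@sha256:... is always pinned.
        return True
    # Look only at the last path component so a registry port is not mistaken for a tag.
    last_component = image.rsplit("/", 1)[-1]
    if ":" not in last_component:
        return False
    tag = last_component.rsplit(":", 1)[1]
    return tag not in ("", "latest")


for ref in ("curlimages/curl", "busybox:1.28", "localhost:5000/curl"):
    print(ref, is_pinned(ref))
```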
| | Condition | Value | Meaning | Fix Path | | ||
| |-----------|-------|---------|----------| | ||
| | `Ready` | `False` | kubelet stopped reporting | SSH to node or cordon + drain + delete | |
Markdown tables are written with a double leading pipe (||), which renders as an extra empty first column in GitHub Markdown. Replace || with a single leading | across these tables so they render with the intended columns.
| | Observation | Cause | Fix | | ||
| |-------------|-------|-----| | ||
| | Endpoints shows `<none>` | Label selector mismatch | Align selector with pod labels; check for typos | |
Same table formatting issue here: the double leading pipe (||) creates an extra empty column. Use single | for proper GitHub Markdown table rendering.
| --- | ||
| name: azure-diagnostics | ||
| description: "Debug and troubleshoot production issues on Azure. Covers Container Apps and Function Apps diagnostics, log analysis with KQL, health checks, and common issue resolution for image pulls, cold starts, health probes, and function invocation failures. USE FOR: debug production issues, troubleshoot container apps, troubleshoot function apps, troubleshoot Azure Functions, analyze logs with KQL, fix image pull failures, resolve cold start issues, investigate health probe failures, check resource health, view application logs, find root cause of errors, function app not working, function invocation failures DO NOT USE FOR: deploying applications (use azure-deploy), creating new resources (use azure-prepare), setting up monitoring (use azure-observability), cost optimization (use azure-cost-optimization)" | ||
| description: "Debug and troubleshoot production issues on Azure. Covers Container Apps, Function Apps, and AKS diagnostics, log analysis with KQL, health checks, and common issue resolution for image pulls, cold starts, health probes, pod failures, node issues, and networking problems. USE FOR: debug production issues, troubleshoot container apps, troubleshoot function apps, troubleshoot Azure Functions, troubleshoot AKS, troubleshoot kubernetes, analyze logs with KQL, fix image pull failures, resolve cold start issues, investigate health probe failures, check resource health, view application logs, find root cause of errors, function app not working, function invocation failures, pod crashing, node not ready, kubernetes networking DO NOT USE FOR: deploying applications (use azure-deploy), creating new resources (use azure-prepare), setting up monitoring (use azure-observability), cost optimization (use azure-cost-optimization)" |
The updated trigger snapshot now includes deploy as an extracted keyword, even though deployment requests are explicitly in the “DO NOT USE” section. This can increase false-positive routing of deployment-related prompts to azure-diagnostics. Consider rephrasing the “DO NOT USE” section (or formatting skill names) to avoid emitting bare deployment keywords, or adjust the keyword extraction to ignore the negative-use section.
| description: "Debug and troubleshoot production issues on Azure. Covers Container Apps, Function Apps, and AKS diagnostics, log analysis with KQL, health checks, and common issue resolution for image pulls, cold starts, health probes, pod failures, node issues, and networking problems. USE FOR: debug production issues, troubleshoot container apps, troubleshoot function apps, troubleshoot Azure Functions, troubleshoot AKS, troubleshoot kubernetes, analyze logs with KQL, fix image pull failures, resolve cold start issues, investigate health probe failures, check resource health, view application logs, find root cause of errors, function app not working, function invocation failures, pod crashing, node not ready, kubernetes networking DO NOT USE FOR: deploying applications (use azure-deploy), creating new resources (use azure-prepare), setting up monitoring (use azure-observability), cost optimization (use azure-cost-optimization)" | |
| description: "Debug and troubleshoot production issues on Azure. Covers Container Apps, Function Apps, and AKS diagnostics, log analysis with KQL, health checks, and common issue resolution for image pulls, cold starts, health probes, pod failures, node issues, and networking problems. USE FOR: debug production issues, troubleshoot container apps, troubleshoot function apps, troubleshoot Azure Functions, troubleshoot AKS, troubleshoot kubernetes, analyze logs with KQL, fix image pull failures, resolve cold start issues, investigate health probe failures, check resource health, view application logs, find root cause of errors, function app not working, function invocation failures, pod crashing, node not ready, kubernetes networking. AVOID USING FOR: application rollouts or releases (instead, use the `azure-deploy` skill), creating new resources or environments (use the `azure-prepare` skill), configuring or tuning monitoring (use the `azure-observability` skill), or cost analysis and optimization (use the `azure-cost-optimization` skill)." |
Also, updating the networking and node issues references to provide more comprehensive troubleshooting steps for Kubernetes on Azure.
…sammsft1/GitHub-Copilot-for-Azure into feature/troubleshoot-kubernetes
| 4. **Isolate the issue** - Pod-level, node-level, or cluster-level? | ||
| 5. **Apply targeted fixes** - Use the appropriate troubleshooting section | ||
| ## Troubleshooting Sections |
Can you add the reference files directly in this section - which one aligns with which troubleshooting section?
| ## Troubleshooting Sections | ||
| ### Pod Failures & Application Issues |
Can we take out any categories which we don't have reference files for temporarily, we can add them back in when we create the reference files. So pod issues, cluster issues should be removed for now
| --- | ||
| ## Common Diagnostic Commands |
We should move the commands and any general advice to the actual reference files. The README should be very concise and serve as redirection to the right reference file depending on the troubleshooting scenario
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
plugin/skills/azure-diagnostics/SKILL.md:7
`metadata.version` is still `1.0.0` even though this skill was modified. Per skill authoring guidelines, bump the semver version in the frontmatter in the same PR whenever `SKILL.md` changes.
description: "Debug and troubleshoot production issues on Azure. Covers Container Apps, Function Apps, and Azure Kubernetes Service diagnostics, log analysis with KQL, health checks, and common issue resolution for image pulls, cold starts, health probes, pod failures, node issues, and networking problems. USE FOR: debug production issues, troubleshoot container apps, troubleshoot function apps, troubleshoot Azure Functions, troubleshoot AKS, troubleshoot kubernetes, analyze logs with KQL, fix image pull failures, resolve cold start issues, investigate health probe failures, check resource health, view application logs, find root cause of errors, function app not working, function invocation failures, pod crashing, node not ready, kubernetes networking DO NOT USE FOR: deploying applications (use azure-deploy), creating new resources (use azure-prepare), setting up monitoring (use azure-observability), cost optimization (use azure-cost-optimization)"
license: MIT
metadata:
author: Microsoft
version: "1.0.0"
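The version bump requested above is a plain semver increment. The comment does not say which part to bump, so the Python sketch below takes the part as a parameter; `bump_version` is a hypothetical helper, not a script from the repository.

```python
def bump_version(version, part="patch"):
    """Increment one part of a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part}")


print(bump_version("1.0.0"))
print(bump_version("1.0.0", "minor"))
```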
| ## Common Pod Diagnostic Commands | ||
| ```bash | ||
| # List unhealthy pods across all namespaces |
This command is labeled as listing “unhealthy pods”, but --field-selector=status.phase!=Running,... will miss many unhealthy pods that are still in phase Running (e.g., CrashLoopBackOff, Running but not Ready). Consider renaming the comment to “non-running pods” and/or adding an additional check for NotReady/CrashLoopBackOff pods (e.g., filter by READY column or status reason).
| # List unhealthy pods across all namespaces | |
| # List non-running pods across all namespaces |
| # 4. Unhealthy pods | ||
| kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded | ||
| # 5. All pods overview | ||
| kubectl get pods -A -o wide | ||
| # 6. System pods health | ||
| kubectl get pods -n kube-system -o wide | ||
| # 7. Activity log |
In the “What happened in my cluster?” flow, step 4 is labeled “Unhealthy pods” but uses a phase-based field selector that only catches non-Running/non-Succeeded pods and will miss common AKS issues where pods are Running but unhealthy (CrashLoopBackOff, readiness probe failures, etc.). Update the wording or add an additional command to surface Running-but-unhealthy pods.
| # 4. Unhealthy pods | |
| kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded | |
| # 5. All pods overview | |
| kubectl get pods -A -o wide | |
| # 6. System pods health | |
| kubectl get pods -n kube-system -o wide | |
| # 7. Activity log | |
| # 4. Non-running pods | |
| kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded | |
| # 5. Running but unhealthy pods (CrashLoopBackOff, failing probes, etc.) | |
| kubectl get pods -A | egrep 'CrashLoopBackOff|ImagePullBackOff|ErrImagePull|Error' | |
| # 6. All pods overview | |
| kubectl get pods -A -o wide | |
| # 7. System pods health | |
| kubectl get pods -n kube-system -o wide | |
| # 8. Activity log |
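The gap described above (pods in phase `Running` that are still unhealthy) can also be closed by inspecting container statuses in the output of `kubectl get pods -A -o json`, which catches failing readiness probes as well as waiting states. The following Python sketch assumes that JSON shape; `unhealthy_pods` and the sample data are hypothetical.

```python
def unhealthy_pods(pod_list):
    """Return (namespace/name, reason) for pods that are not running or not ready."""
    findings = []
    for pod in pod_list.get("items", []):
        meta = pod["metadata"]
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        name = f'{meta["namespace"]}/{meta["name"]}'
        if phase == "Succeeded":
            continue  # completed jobs are not a problem
        if phase != "Running":
            findings.append((name, f"phase={phase}"))
            continue
        # Phase is Running: look at each container for waiting states or failed readiness.
        for container in status.get("containerStatuses", []):
            waiting = container.get("state", {}).get("waiting")
            if waiting:
                findings.append((name, waiting.get("reason", "Waiting")))
                break
            if not container.get("ready", False):
                findings.append((name, "NotReady"))
                break
    return findings


# Hand-written sample in the shape of a PodList, for illustration only.
sample = {
    "items": [
        {"metadata": {"namespace": "default", "name": "ok"},
         "status": {"phase": "Running",
                    "containerStatuses": [{"ready": True, "state": {"running": {}}}]}},
        {"metadata": {"namespace": "default", "name": "crash"},
         "status": {"phase": "Running",
                    "containerStatuses": [{"ready": False,
                                           "state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
        {"metadata": {"namespace": "default", "name": "slow"},
         "status": {"phase": "Running",
                    "containerStatuses": [{"ready": False, "state": {"running": {}}}]}},
        {"metadata": {"namespace": "default", "name": "pending"},
         "status": {"phase": "Pending"}},
    ]
}
for entry in unhealthy_pods(sample):
    print(entry)
```

Unlike a text filter on the STATUS column, this reports the `slow` pod, which is Running but has a container that is not ready.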
| exports[`azure-diagnostics - Trigger Tests Trigger Keywords Snapshot skill description triggers match snapshot 1`] = ` | ||
| { | ||
| "description": "Debug and troubleshoot production issues on Azure. Covers Container Apps and Function Apps diagnostics, log analysis with KQL, health checks, and common issue resolution for image pulls, cold starts, health probes, and function invocation failures. USE FOR: debug production issues, troubleshoot container apps, troubleshoot function apps, troubleshoot Azure Functions, analyze logs with KQL, fix image pull failures, resolve cold start issues, investigate health probe failures, check resource health, view application logs, find root cause of errors, function app not working, function invocation failures DO NOT USE FOR: deploying applications (use azure-deploy), creating new resources (use azure-prepare), setting up monitoring (use azure-observability), cost optimization (use azure-cost-optimization)", | ||
| "description": "Debug and troubleshoot production issues on Azure. Covers Container Apps, Function Apps, and AKS diagnostics, log analysis with KQL, health checks, and common issue resolution for image pulls, cold starts, health probes, pod failures, node issues, and networking problems. USE FOR: debug production issues, troubleshoot container apps, troubleshoot function apps, troubleshoot Azure Functions, troubleshoot AKS, troubleshoot kubernetes, analyze logs with KQL, fix image pull failures, resolve cold start issues, investigate health probe failures, check resource health, view application logs, find root cause of errors, function app not working, function invocation failures, pod crashing, node not ready, kubernetes networking DO NOT USE FOR: deploying applications (use azure-deploy), creating new resources (use azure-prepare), setting up monitoring (use azure-observability), cost optimization (use azure-cost-optimization)", |
The snapshot description still says "AKS diagnostics", but plugin/skills/azure-diagnostics/SKILL.md frontmatter now says "Azure Kubernetes Service diagnostics". This looks like an out-of-date snapshot and will fail tests; regenerate the trigger snapshots from the updated skill content/metadata.
| "description": "Debug and troubleshoot production issues on Azure. Covers Container Apps, Function Apps, and AKS diagnostics, log analysis with KQL, health checks, and common issue resolution for image pulls, cold starts, health probes, pod failures, node issues, and networking problems. USE FOR: debug production issues, troubleshoot container apps, troubleshoot function apps, troubleshoot Azure Functions, troubleshoot AKS, troubleshoot kubernetes, analyze logs with KQL, fix image pull failures, resolve cold start issues, investigate health probe failures, check resource health, view application logs, find root cause of errors, function app not working, function invocation failures, pod crashing, node not ready, kubernetes networking DO NOT USE FOR: deploying applications (use azure-deploy), creating new resources (use azure-prepare), setting up monitoring (use azure-observability), cost optimization (use azure-cost-optimization)", | |
| "description": "Debug and troubleshoot production issues on Azure. Covers Container Apps, Function Apps, and Azure Kubernetes Service diagnostics, log analysis with KQL, health checks, and common issue resolution for image pulls, cold starts, health probes, pod failures, node issues, and networking problems. USE FOR: debug production issues, troubleshoot container apps, troubleshoot function apps, troubleshoot Azure Functions, troubleshoot AKS, troubleshoot kubernetes, analyze logs with KQL, fix image pull failures, resolve cold start issues, investigate health probe failures, check resource health, view application logs, find root cause of errors, function app not working, function invocation failures, pod crashing, node not ready, kubernetes networking DO NOT USE FOR: deploying applications (use azure-deploy), creating new resources (use azure-prepare), setting up monitoring (use azure-observability), cost optimization (use azure-cost-optimization)", |
| "debug", | ||
| "deploy", | ||
| "deploying", | ||
| "diagnostic", |
The snapshot includes the keyword deploy, but with the current TriggerMatcher implementation this keyword is only added if it appears in skill.content (SKILL.md body) or as a >3-char word in metadata.description. azure-diagnostics/SKILL.md body doesn't contain "deploy" and the description contains "deploying"/"azure-deploy" (not deploy), so this entry likely indicates the snapshot wasn’t regenerated from the current sources.
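Whether `deploy` counts as a word of more than three characters in the description depends on how the text is tokenized, which this thread does not show. The Python sketch below contrasts two hypothetical tokenizers to make that dependency concrete; neither is claimed to be what `TriggerMatcher` does.

```python
import re


def words_whitespace(text, min_len=4):
    """Split on whitespace and strip surrounding punctuation; hyphenated names stay whole."""
    tokens = (token.strip(".,:;()\"'") for token in text.lower().split())
    return {token for token in tokens if len(token) >= min_len}


def words_split_hyphens(text, min_len=4):
    """Split on every non-alphanumeric character, so 'azure-deploy' yields 'deploy'."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) >= min_len}


fragment = "DO NOT USE FOR: deploying applications (use azure-deploy)"
print("deploy" in words_whitespace(fragment))
print("deploy" in words_split_hyphens(fragment))
```

If the matcher keeps hyphenated names whole, the `deploy` entry points to a stale snapshot as the comment suggests; if it splits on hyphens, the entry could legitimately come from `azure-deploy`.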
@sammsft1 Please resolve the merge conflicts and address Copilot's comments. For each comment, explain what change you made or why you chose not to make one.
Summary
This PR enhances the `azure-diagnostics` skill with Azure Kubernetes Service (AKS) troubleshooting coverage by adding dedicated reference documentation and updating trigger snapshots.
What changed
- Updated `plugin/skills/azure-diagnostics/SKILL.md`
- Added AKS reference docs:
  - `plugin/skills/azure-diagnostics/references/azure-kubernetes/README.md`
  - `plugin/skills/azure-diagnostics/references/azure-kubernetes/networking.md`
  - `plugin/skills/azure-diagnostics/references/azure-kubernetes/node-issues.md`
- Updated snapshot: `tests/azure-diagnostics/__snapshots__/triggers.test.ts.snap`
Why
Users troubleshooting AKS workloads (pod failures, node issues, networking problems) should be guided by `azure-diagnostics` with clearer, service-specific references and expected trigger behavior.
Validation
Notes