Post-install customizations may be needed as systems scale. These customizations also need to persist across future installs or upgrades. Not all resources can be customized post-install; common scenarios are documented in the following sections.
The following is a guide for determining where issues may exist, how to adjust the resources, and how to ensure the changes will persist. Different values may be needed for systems as they scale.
- System domain name
- `kubectl` events - `OOMKilled`
- `CPUThrottlingHigh` alerts
- Grafana "Kubernetes / Compute Resources / Pod" dashboard
- Common customization scenarios
The `SYSTEM_DOMAIN_NAME` value found in some of the URLs on this page is expected to be the system's fully qualified domain name (FQDN).

(`ncn-mw#`) The FQDN can be found by running the following command on any Kubernetes NCN:

```bash
kubectl get secret site-init -n loftsman -o jsonpath='{.data.customizations\.yaml}' | base64 -d | yq r - spec.network.dns.external
```

Example output:

```text
system.hpc.amslabs.hpecorp.net
```

Be sure to modify the example URLs on this page by replacing `SYSTEM_DOMAIN_NAME` with the actual value found using the above command.
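For example, the Grafana URL used later on this page can be built from the FQDN. A minimal sketch, using the example FQDN from the output above:

```shell
# Substitute the FQDN (example value from the output above) into a URL template.
SYSTEM_DOMAIN_NAME="system.hpc.amslabs.hpecorp.net"
url="https://grafana.cmn.${SYSTEM_DOMAIN_NAME}/"
echo "$url"
```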
Check to see if there are any recent out-of-memory events.

- (`ncn-mw#`) Check `kubectl` events to see if there are any recent out-of-memory events.

    ```bash
    kubectl get event -A | grep OOM
    ```

- Log in to Grafana at the following URL: `https://grafana.cmn.SYSTEM_DOMAIN_NAME/`

- Search for the "Kubernetes / Compute Resources / Pod" dashboard to view the memory utilization graphs over time for any pod that has been `OOMKilled`.
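The `grep OOM` filter above matches event reasons such as `OOMKilling`. As a sketch of what the filtered output looks like, the same filter can be run against hypothetical event lines (real `kubectl get event -A` output will differ):

```shell
# Hypothetical `kubectl get event -A` lines (illustrative only, not from a real system).
events='services   5m   Warning   OOMKilling   pod/cray-cfs-api-5d9f7b   Memory cgroup out of memory
services   2m   Normal    Scheduled    pod/cray-bss-7jw2q        Successfully assigned'

# Apply the same filter as `kubectl get event -A | grep OOM`.
matches=$(printf '%s\n' "$events" | grep OOM)
echo "$matches"
```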
Check Prometheus for recent `CPUThrottlingHigh` alerts.

- Log in to vmalert at the following URL: `https://vmselect.cmn.SYSTEM_DOMAIN_NAME/select/0/prometheus/vmalert/api/v1/alerts`

- Scroll down to the alert for `CPUThrottlingHigh`.

- Log in to Grafana at the following URL: `https://grafana.cmn.SYSTEM_DOMAIN_NAME/`

- Search for the "Kubernetes / Compute Resources / Pod" dashboard to view the throttling graphs over time for any pod that is alerting.
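The `CPUThrottlingHigh` alert is typically based on the fraction of CFS scheduling periods in which a container was throttled. As a rough illustration of the ratio involved (the counter values below are hypothetical, not from a real system):

```shell
# Hypothetical per-container CFS counter deltas over a sampling window.
throttled_periods=240
total_periods=300

# Percentage of CFS periods in which the container was throttled.
pct=$(awk -v t="$throttled_periods" -v n="$total_periods" 'BEGIN { printf "%d", (100 * t) / n }')
echo "container throttled in ${pct}% of CFS periods"
```

A sustained value at or near 100% matches the "adjustments are likely needed" case described in the Grafana guidance below.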
Use Grafana to investigate and analyze CPU throttling and memory usage.

- Log in to Grafana at the following URL: `https://grafana.cmn.SYSTEM_DOMAIN_NAME/`

- Search for the "Kubernetes / Compute Resources / Pod" dashboard.

- Select the `datasource`, `namespace`, and `pod` based on the pod being examined. For example:

    ```text
    datasource: default
    namespace: sysmgmt-health
    pod: vmstorage-vms-0
    ```

- Select the CPU Throttling drop-down to see the CPU throttling graph for the pod during the selected time range (from the top right).

- Select the container (from the legend under the x-axis).

- Review the graph and adjust the `resources.limits.cpu` value as needed.

    The presence of CPU throttling does not always indicate a problem, but if a service is slow or experiencing latency issues, adjusting `resources.limits.cpu` may be beneficial. For example:

    - If the pod is being throttled at or near 100% for any period of time, then adjustments are likely needed.
    - If the service's response time is critical, then adjusting the pod's resources to greatly reduce or eliminate any CPU throttling may be required.

    **NOTE:** The `resources.requests.cpu` values are used by the Kubernetes scheduler to decide which node to place the pod on, and they do not affect CPU throttling. The value of `resources.limits.cpu` can never be lower than the value of `resources.requests.cpu`.

- Select the Memory Usage drop-down to see the memory usage graph for the pod during the selected time range (from the top right).

- Select the container (from the legend under the x-axis).

- Determine the steady-state memory usage by looking at the memory usage graph for the container.

    This is the minimum value at which `resources.requests.memory` should be set. More importantly, determine the spike usage for the container and set the `resources.limits.memory` value based on the spike values, with some additional headroom.
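The memory sizing guidance above can be sketched as a small calculation: `requests.memory` at the steady state, `limits.memory` at the spike plus headroom. The usage values and the 25% headroom factor below are illustrative assumptions, not recommendations:

```shell
# Values read off a Grafana memory usage graph (hypothetical).
steady_mib=4096    # steady-state usage
spike_mib=12288    # peak (spike) usage

# requests.memory: minimally the steady state.
request_mib=$steady_mib

# limits.memory: the spike plus ~25% headroom (assumed factor).
limit_mib=$(awk -v s="$spike_mib" 'BEGIN { printf "%d", s * 1.25 }')

echo "requests.memory=${request_mib}Mi limits.memory=${limit_mib}Mi"
```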
- Prerequisites
- Vmstorage pods are `OOMKilled` or CPU throttled
- Postgres pods are `OOMKilled` or CPU throttled
- Scale `cray-bss` service
- Scale `cray-dns-unbound` service
- Postgres PVC resize
- Vmstorage PVC resize
- `cray-hms-hmcollector` pods are `OOMKilled`
- `cray-cfs-api` pods are `OOMKilled`
- References
Most of these procedures instruct the administrator to perform the Redeploying a Chart procedure for a specific chart. In these cases, the section on this page provides the information necessary to carry out that procedure. It is recommended to keep both pages open in separate browser windows for easy reference.
Update resources associated with Prometheus in the `sysmgmt-health` namespace.
This example is based on what was needed for a system with 4000 compute nodes.
Trial and error may be needed to determine what is best for a given system at scale.
Follow the Redeploying a Chart procedure with the following specifications:

- Chart name: `cray-sysmgmt-health`

- Base manifest name: `platform`

- (`ncn-mw#`) When reaching the step to update the customizations, perform the following steps. Only follow these steps as part of the previously linked chart redeploy procedure.

    1. Edit the customizations by adding or updating `spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources`.

        - If the number of NCNs is less than 20, then:

            ```bash
            yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources.requests.cpu' --style=double '4'
            yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources.requests.memory' '8Gi'
            yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources.limits.cpu' --style=double '8'
            yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources.limits.memory' '16Gi'
            ```

        - If the number of NCNs is 20 or more, then:

            ```bash
            yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources.requests.cpu' --style=double '6'
            yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources.requests.memory' '16Gi'
            yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources.limits.cpu' --style=double '12'
            yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources.limits.memory' '32Gi'
            ```

    1. Check that the customization file has been updated.

        ```bash
        yq read customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources'
        ```

        Example output:

        ```yaml
        requests:
          cpu: "4"
          memory: 8Gi
        limits:
          cpu: "8"
          memory: 16Gi
        ```

- (`ncn-mw#`) When reaching the step to validate the redeployed chart, perform the following steps. Only follow these steps as part of the previously linked chart redeploy procedure.

    1. Verify that the pods restart and that the desired resources have been applied.

        Watch the `vmstorage-vms` pods restart:

        ```bash
        watch "kubectl get pods -n sysmgmt-health -l app.kubernetes.io/name=vmstorage"
        ```

        It may take about 10 minutes for the `vmstorage-vms-*` pods to terminate. A pod can be force deleted if it remains in the terminating state:

        ```bash
        kubectl delete pod vmstorage-vms-0 --force --grace-period=0 -n sysmgmt-health
        kubectl delete pod vmstorage-vms-1 --force --grace-period=0 -n sysmgmt-health
        ```

    1. Verify that the resource changes are in place.

        ```bash
        kubectl get pod vmstorage-vms-0 -n sysmgmt-health -o json | jq -r '.spec.containers[] | select(.name == "vmstorage").resources'
        ```

- Make sure to perform the entire linked procedure, including the step to save the updated customizations.
Update resources associated with `cray-spire-postgres` in the `spire` namespace.

This example is based on what was needed for a system with 4000 compute nodes. Trial and error may be needed to determine what is best for a given system at scale.

A similar flow can be used to update the resources for `cray-sls-postgres`, `cray-smd-postgres`, or `gitea-vcs-postgres`.
The following table provides values the administrator will need based on which pods are experiencing problems.

| Chart name | Base manifest name | Resource path name | Kubernetes namespace |
|---|---|---|---|
| `cray-sls-postgres` | `core-services` | `cray-hms-sls` | `services` |
| `cray-smd-postgres` | `core-services` | `cray-hms-smd` | `services` |
| `gitea-vcs-postgres` | `sysmgmt` | `gitea` | `services` |
| `cray-spire-postgres` | `sysmgmt` | `cray-spire` | `spire` |
Using the values from the above table, follow the Redeploying a Chart procedure with the following specifications:
- (`ncn-mw#`) When reaching the step to update the customizations, perform the following steps. Only follow these steps as part of the previously linked chart redeploy procedure.

    1. Set the `rpname` variable to the appropriate resource path name from the table above.

        ```bash
        rpname=<put resource path name from table here>
        ```

    1. Edit the customizations by adding or updating `spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.resources`.

        ```bash
        yq write -i customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.resources.requests.cpu" --style=double '4'
        yq write -i customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.resources.requests.memory" '4Gi'
        yq write -i customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.resources.limits.cpu" --style=double '8'
        yq write -i customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.resources.limits.memory" '8Gi'
        ```

    1. Check that the customization file has been updated.

        ```bash
        yq read customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.resources"
        ```

        Example output:

        ```yaml
        requests:
          cpu: "4"
          memory: 4Gi
        limits:
          cpu: "8"
          memory: 8Gi
        ```

- (`ncn-mw#`) When reaching the step to validate the redeployed chart, perform the following steps. Only follow these steps as part of the previously linked chart redeploy procedure.

    Verify that the pods restart and that the desired resources have been applied. Commands in this section use the `$CHART_NAME` variable, which should have been set as part of the Redeploying a Chart procedure.

    1. Set the `ns` variable to the name of the appropriate Kubernetes namespace from the earlier table.

        ```bash
        ns=<put kubernetes namespace here>
        ```

    1. Watch the pods restart.

        ```bash
        watch "kubectl get pods -n ${ns} -l application=spilo,cluster-name=${CHART_NAME}"
        ```

    1. Verify that the desired resources have been applied.

        ```bash
        kubectl get pod ${CHART_NAME}-0 -n "${ns}" -o json | jq -r '.spec.containers[] | select(.name == "postgres").resources'
        ```

        Example output:

        ```json
        {
          "limits": {
            "cpu": "8",
            "memory": "8Gi"
          },
          "requests": {
            "cpu": "4",
            "memory": "4Gi"
          }
        }
        ```

- Make sure to perform the entire linked procedure, including the step to save the updated customizations.
Scale the replica count associated with the `cray-bss` service in the `services` namespace.
This example is based on what was needed for a system with 4000 compute nodes.
Trial and error may be needed to determine what is best for a given system at scale.
Follow the Redeploying a Chart procedure with the following specifications:

- Chart name: `cray-hms-bss`

- Base manifest name: `sysmgmt`

- (`ncn-mw#`) When reaching the step to update the customizations, perform the following steps. Only follow these steps as part of the previously linked chart redeploy procedure.

    1. Edit the customizations by adding or updating `spec.kubernetes.services.cray-hms-bss.cray-service.replicaCount`.

        ```bash
        yq write -i customizations.yaml 'spec.kubernetes.services.cray-hms-bss.cray-service.replicaCount' '5'
        ```

    1. Check that the customization file has been updated.

        ```bash
        yq read customizations.yaml 'spec.kubernetes.services.cray-hms-bss.cray-service.replicaCount'
        ```

        Example output:

        ```text
        5
        ```

- (`ncn-mw#`) When reaching the step to validate the redeployed chart, perform the following steps. Only follow these steps as part of the previously linked chart redeploy procedure.

    Verify that the `cray-bss` pods scale.

    1. Watch the `cray-bss` pods scale to the desired number (in this example, 5), with each pod reaching a `2/2` ready state.

        ```bash
        watch "kubectl get pods -l app.kubernetes.io/instance=cray-hms-bss -n services"
        ```

        Example output:

        ```text
        NAME                       READY   STATUS    RESTARTS   AGE
        cray-bss-fccbc9f7d-7jw2q   2/2     Running   0          82m
        cray-bss-fccbc9f7d-l524g   2/2     Running   0          93s
        cray-bss-fccbc9f7d-qwzst   2/2     Running   0          93s
        cray-bss-fccbc9f7d-sw48b   2/2     Running   0          82m
        cray-bss-fccbc9f7d-xr26l   2/2     Running   0          82m
        ```

    1. Verify that the replicas change is present in the Kubernetes `cray-bss` deployment.

        ```bash
        kubectl get deployment cray-bss -n services -o json | jq -r '.spec.replicas'
        ```

        In this example, `5` will be the returned value.

- Make sure to perform the entire linked procedure, including the step to save the updated customizations.
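The readiness check above can be reduced to a one-line count. A sketch against the example `kubectl get pods` listing from this section (headerless sample, columns: NAME READY STATUS RESTARTS AGE):

```shell
# Example pod listing copied from the output above.
pods='cray-bss-fccbc9f7d-7jw2q 2/2 Running 0 82m
cray-bss-fccbc9f7d-l524g 2/2 Running 0 93s
cray-bss-fccbc9f7d-qwzst 2/2 Running 0 93s
cray-bss-fccbc9f7d-sw48b 2/2 Running 0 82m
cray-bss-fccbc9f7d-xr26l 2/2 Running 0 82m'

# Count pods that are fully ready (2/2) and Running; expect the replica count (5).
ready=$(printf '%s\n' "$pods" | awk '$2 == "2/2" && $3 == "Running" { n++ } END { print n }')
echo "${ready} ready"
```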
Scale the replica count associated with the `cray-dns-unbound` service in the `services` namespace.
Trial and error may be needed to determine what is best for a given system at scale.
Follow the Redeploying a Chart procedure with the following specifications:

- Chart name: `cray-dns-unbound`

- Base manifest name: `core-services`

- (`ncn-mw#`) When reaching the step to update the customizations, perform the following steps. Only follow these steps as part of the previously linked chart redeploy procedure.

    1. Edit the customizations by adding or updating `spec.kubernetes.services.cray-dns-unbound.cray-service.replicaCount`.

        ```bash
        yq write -i customizations.yaml 'spec.kubernetes.services.cray-dns-unbound.cray-service.replicaCount' '5'
        ```

    1. Check that the customization file has been updated.

        ```bash
        yq read customizations.yaml 'spec.kubernetes.services.cray-dns-unbound.cray-service.replicaCount'
        ```

        Example output:

        ```text
        5
        ```

- (`ncn-mw#`) When reaching the step to validate the redeployed chart, perform the following steps. Only follow these steps as part of the previously linked chart redeploy procedure.

    Verify that the `cray-dns-unbound` pods scale.

    1. Watch the `cray-dns-unbound` pods scale to the desired number (in this example, 5), with each pod reaching a `3/3` ready state.

        ```bash
        watch "kubectl get pods -l app.kubernetes.io/instance=cray-dns-unbound -n services"
        ```

        Example output:

        ```text
        NAME                                READY   STATUS    RESTARTS   AGE
        cray-dns-unbound-58b5cfdb4d-6vwrx   3/3     Running   0          88s
        cray-dns-unbound-58b5cfdb4d-6wrpr   3/3     Running   0          87s
        cray-dns-unbound-58b5cfdb4d-7ndhg   3/3     Running   0          70m
        cray-dns-unbound-58b5cfdb4d-n498k   3/3     Running   0          70m
        cray-dns-unbound-58b5cfdb4d-w2tq9   3/3     Running   0          70m
        ```

    1. Verify that the replicas change is present in the Kubernetes `cray-dns-unbound` deployment.

        ```bash
        kubectl get deployment cray-dns-unbound -n services -o json | jq -r '.spec.replicas'
        ```

        In this example, `5` will be the returned value.

- Make sure to perform the entire linked procedure, including the step to save the updated customizations.
Increase the PVC volume size associated with the `cray-smd-postgres` cluster in the `services` namespace.
This example is based on what was needed for a system with 4000 compute nodes.
Trial and error may be needed to determine what is best for a given system at scale. The PVC size can only ever be increased.
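Because the PVC size can only grow, a simple guard before editing the customizations can catch an accidental shrink. A sketch with hypothetical sizes (assumes whole-`Gi` values):

```shell
# Hypothetical current and requested PVC sizes, in Gi.
current_gi=100
requested_gi=80

# A PVC can be grown but never shrunk, so reject a size that is not larger.
if [ "$requested_gi" -le "$current_gi" ]; then
  msg="refusing: PVC size can only be increased (current ${current_gi}Gi)"
else
  msg="resizing to ${requested_gi}Gi"
fi
echo "$msg"
```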
A similar flow can be used to update the resources for `cray-sls-postgres`, `gitea-vcs-postgres`, or `cray-spire-postgres`.
The following table provides values the administrator will need based on which pods are experiencing problems.
| Chart name | Base manifest name | Resource path name | Kubernetes namespace |
|---|---|---|---|
| `cray-sls-postgres` | `core-services` | `cray-hms-sls` | `services` |
| `cray-smd-postgres` | `core-services` | `cray-hms-smd` | `services` |
| `gitea-vcs-postgres` | `sysmgmt` | `gitea` | `services` |
| `cray-spire-postgres` | `sysmgmt` | `cray-spire` | `spire` |
Using the values from the above table, follow the Redeploying a Chart procedure with the following specifications:
- (`ncn-mw#`) When reaching the step to update the customizations, perform the following steps. Only follow these steps as part of the previously linked chart redeploy procedure.

    1. Set the `rpname` variable to the appropriate resource path name from the table above.

        ```bash
        rpname=<put resource path name from table here>
        ```

    1. Edit the customizations by adding or updating `spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.volumeSize`.

        ```bash
        yq write -i customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.volumeSize" '100Gi'
        ```

    1. Check that the customization file has been updated.

        ```bash
        yq read customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.volumeSize"
        ```

        Example output:

        ```text
        100Gi
        ```

- (`ncn-mw#`) When reaching the step to validate the redeployed chart, perform the following steps. Only follow these steps as part of the previously linked chart redeploy procedure.

    Verify that the pods restart and that the desired resources have been applied. Commands in this section use the `$CHART_NAME` variable, which should have been set as part of the Redeploying a Chart procedure.

    1. Set the `ns` variable to the name of the appropriate Kubernetes namespace from the earlier table.

        ```bash
        ns=<put kubernetes namespace here>
        ```

    1. Verify that the increased volume size has been applied.

        ```bash
        watch "kubectl get postgresql ${CHART_NAME} -n $ns"
        ```

        Example output:

        ```text
        NAME                TEAM       VERSION   PODS   VOLUME   CPU-REQUEST   MEMORY-REQUEST   AGE   STATUS
        cray-smd-postgres   cray-smd   11        3      100Gi    500m          8Gi              45m   Running
        ```

        If the status in the above output is `SyncFailed` instead of `Running`, refer to Case 1 in the `SyncFailed` section of Troubleshoot Postgres Database.

        At this point the Postgres cluster is healthy, but additional steps are required to complete the resize of the Postgres PVCs.

- Make sure to perform the entire linked procedure, including the step to save the updated customizations.
Increase the PVC volume size associated with the `vmstorage` cluster in the `sysmgmt-health` namespace.

This example is based on what was needed for a system with more than 20 non-compute nodes (NCNs). The PVC size can only ever be increased.
Follow the Redeploying a Chart procedure with the following specifications:
- Chart name: `cray-sysmgmt-health`

- Base manifest name: `platform`

- (`ncn-mw#`) When reaching the step to update the customizations, perform the following steps. Only follow these steps as part of the previously linked chart redeploy procedure.

    1. Edit the customizations by adding or updating `spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.storage.volumeClaimTemplate.spec.resources.requests.storage`.

        ```bash
        yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.storage.volumeClaimTemplate.spec.resources.requests.storage' '300Gi'
        ```

    1. Check that the customization file has been updated.

        ```bash
        yq read customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.storage.volumeClaimTemplate.spec.resources.requests.storage'
        ```

        Example output:

        ```text
        300Gi
        ```

- (`ncn-mw#`) When reaching the step to validate the redeployed chart, perform the following step. Only follow this step as part of the previously linked chart redeploy procedure.

    Verify that the increased volume size has been applied.

    ```bash
    watch "kubectl get pvc -n sysmgmt-health vmstorage-db-vmstorage-vms-0"
    ```

    Example output:

    ```text
    NAME                           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS           AGE
    vmstorage-db-vmstorage-vms-0   Bound    pvc-092805e3-ac92-438e-b77b-a0639096f5f0   200Gi      RWO            k8s-block-replicated   3d2h
    ```

    At this point the Prometheus cluster is healthy, but additional steps are required to complete the resize of the Prometheus PVCs.

- Make sure to perform the entire linked procedure, including the step to save the updated customizations.
Update resources associated with `cray-hms-hmcollector` in the `services` namespace.
Trial and error may be needed to determine what is best for a given system at scale.
See Adjust HM Collector Ingress Replicas and Resource Limits.
Increase the memory requests and limits associated with the `cray-cfs-api` deployment in the `services` namespace.
Follow the Redeploying a Chart procedure with the following specifications:
- Chart name: `cray-cfs-api`

- Base manifest name: `sysmgmt`

- (`ncn-mw#`) When reaching the step to update the customizations, perform the following steps. Only follow these steps as part of the previously linked chart redeploy procedure.

    1. Edit the customizations by adding or updating `spec.kubernetes.services.cray-cfs-api.cray-service.containers.cray-cfs-api.resources`.

        ```bash
        yq4 -i '.spec.kubernetes.services.cray-cfs-api.cray-service.containers.cray-cfs-api.resources.requests.memory="200Mi"' customizations.yaml
        yq4 -i '.spec.kubernetes.services.cray-cfs-api.cray-service.containers.cray-cfs-api.resources.limits.memory="500Mi"' customizations.yaml
        ```

    1. Check that the customization file has been updated.

        1. Check the memory request value.

            ```bash
            yq4 '.spec.kubernetes.services.cray-cfs-api.cray-service.containers.cray-cfs-api.resources.requests.memory' customizations.yaml
            ```

            Expected output:

            ```text
            200Mi
            ```

        1. Check the memory limit value.

            ```bash
            yq4 '.spec.kubernetes.services.cray-cfs-api.cray-service.containers.cray-cfs-api.resources.limits.memory' customizations.yaml
            ```

            Expected output:

            ```text
            500Mi
            ```

- (`ncn-mw#`) When reaching the step to validate the redeployed chart, perform the following steps. Only follow these steps as part of the previously linked chart redeploy procedure.

    1. Verify that the increased memory request and limit have been applied.

        ```bash
        kubectl get deployment -n services cray-cfs-api -o json | jq .spec.template.spec.containers[0].resources
        ```

        Example output:

        ```json
        {
          "limits": {
            "cpu": "500m",
            "memory": "500Mi"
          },
          "requests": {
            "cpu": "150m",
            "memory": "200Mi"
          }
        }
        ```

    1. Run a CFS health check.

        ```bash
        /usr/local/bin/cmsdev test -q cfs
        ```

        For more details on this test, including known issues and other command line options, see Software Management Services health checks.

- Make sure to perform the entire linked procedure, including the step to save the updated customizations.
To make changes that will not persist across installs or upgrades, see the following references. These procedures can also help to verify and rule out issues in the short term. If other resource customizations are needed, contact support to request the feature.