
Update the AWS node generate and setup scripts to support Kubernetes 1.26 #2


Open: wants to merge 2 commits into base: main

dan-tw commented May 16, 2023

Relevant components:

  • AWS Scripts

Problem statement:

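In the AWS ImageBuilder component, the DownloadKubernetes step changed its action from ExecutePowerShell to WebDownload. Its inputs also changed, from a list of PowerShell commands to a list of source/destination pairs, and the step now uses maxAttempts instead of onFailure and timeoutSeconds.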

Kubernetes 1.24

- name: DownloadKubernetes
  action: ExecutePowerShell
  onFailure: Abort
  timeoutSeconds: 300
  inputs:
    commands:
      - $webClient = New-Object System.Net.WebClient
      - $webClient.DownloadFile('{{ KubernetesDownload }}/kubelet.exe', '{{ KubernetesPath }}\kubelet.exe')
      - $webClient.DownloadFile('{{ KubernetesDownload }}/kube-proxy.exe', '{{ KubernetesPath }}\kube-proxy.exe')
      - $webClient.DownloadFile('{{ KubernetesDownload }}/aws-iam-authenticator.exe', '{{ EKSPath }}\aws-iam-authenticator.exe')

Kubernetes 1.26

- name: DownloadKubernetes
  action: WebDownload
  maxAttempts: 3
  inputs:
    - source: '{{ KubernetesDownload }}/kubelet.exe'
      destination: '{{ KubernetesPath }}\kubelet.exe'
    - source: '{{ KubernetesDownload }}/kube-proxy.exe'
      destination: '{{ KubernetesPath }}\kube-proxy.exe'
    - source: '{{ KubernetesDownload }}/aws-iam-authenticator.exe'
      destination: '{{ EKSPath }}\aws-iam-authenticator.exe'

This caused cloud/aws/node/generate-setup-script.py to fail with the following error: RuntimeError: Unknown build step action: WebDownload
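The error comes from the generator dispatching on each step's action name. The following is a hypothetical sketch to illustrate the failure mode: HANDLERS, render_step and render_execute_powershell are assumed names, not the actual contents of generate-setup-script.py.

```python
from __future__ import annotations

# Hypothetical sketch: HANDLERS, render_step and render_execute_powershell are
# illustrative names, not taken from generate-setup-script.py.


def render_execute_powershell(step: dict) -> list[str]:
    # ExecutePowerShell steps already carry PowerShell, so emit them verbatim.
    return list(step["inputs"]["commands"])


# Only actions listed here are understood by the generator.
HANDLERS = {
    "ExecutePowerShell": render_execute_powershell,
}


def render_step(step: dict) -> list[str]:
    handler = HANDLERS.get(step["action"])
    if handler is None:
        # A step whose action is missing from HANDLERS (e.g. WebDownload) ends up here.
        raise RuntimeError(f"Unknown build step action: {step['action']}")
    return handler(step)
```

With only ExecutePowerShell registered, passing the 1.26 DownloadKubernetes step raises exactly this RuntimeError; supporting the new step means registering a handler for WebDownload.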

Solution

The script now handles the new WebDownload action by rewriting the modified YAML step into a form that produces the same result as the previous ExecutePowerShell step.
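One way to get the same result as the old step is to translate each source/destination pair back into the WebClient.DownloadFile calls used by the Kubernetes 1.24 step. This is a minimal sketch under stated assumptions, not necessarily what this PR does: the step has already been parsed from YAML into a dict, {{ ... }} placeholders are expanded elsewhere, and render_web_download is a hypothetical name.

```python
from __future__ import annotations


def render_web_download(step: dict) -> list[str]:
    """Translate a WebDownload step into ExecutePowerShell-style commands.

    Hypothetical helper: assumes step["inputs"] is a list of
    {"source": ..., "destination": ...} dicts, as in the 1.26 YAML.
    """
    # Same WebClient setup line the 1.24 ExecutePowerShell step used.
    commands = ["$webClient = New-Object System.Net.WebClient"]
    for item in step["inputs"]:
        # Inside PowerShell single-quoted strings, a literal ' is written as ''.
        # {{ ... }} placeholders are left as-is for later substitution (assumption).
        source = item["source"].replace("'", "''")
        destination = item["destination"].replace("'", "''")
        commands.append(f"$webClient.DownloadFile('{source}', '{destination}')")
    return commands
```

The sketch ignores maxAttempts, matching the old step, which had no retry; a generator that wants to honour it could wrap each DownloadFile call in a retry loop.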

Documentation

N/A

Test Plan and Compatibility

Successfully built the AMI and ran a test using Scalable Pixel Streaming. The new Windows AMI, using Kubernetes 1.26, joined the EKS cluster and had access to GPU devices.

Device List

2023-05-16T06:37:51.923Z	INFO	plugin/device_plugin.go:309	Received new device list	{"devices": [{"ID":"PCI\\VEN_10DE&DEV_1EB8&SUBSYS_12A210DE&REV_A1\\3&13C0B0C5&1&F0","Description":"NVIDIA Tesla T4","DriverRegistryKey":"HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Control\\Class\\{4d36e968-e325-11ce-bfc1-08002be10318}\\0003","DriverStorePath":"C:\\Windows\\System32\\DriverStore\\FileRepository\\nvgridsw_aws.inf_amd64_d77246250b1996fe","LocationPath":"PCIROOT(0)#PCI(1E00)","RuntimeFiles":[{"SourcePath":"nvcudadebugger.dll","DestinationFilename":"nvcudadebugger.dll"},{"SourcePath":"nvcuda_loader64.dll","DestinationFilename":"nvcuda.dll"},{"SourcePath":"nvcuvid64.dll","DestinationFilename":"nvcuvid.dll"},{"SourcePath":"nvEncodeAPI64.dll","DestinationFilename":"nvEncodeAPI64.dll"},{"SourcePath":"nvapi64.dll","DestinationFilename":"nvapi64.dll"},{"SourcePath":"nvml_loader.dll","DestinationFilename":"nvml.dll"},{"SourcePath":"OpenCL64.dll","DestinationFilename":"OpenCL.dll"},{"SourcePath":"vulkan-1-x64.dll","DestinationFilename":"vulkan-1.dll"},{"SourcePath":"nvidia-smi.exe","DestinationFilename":"nvidia-smi.exe"},{"SourcePath":"vulkaninfo-x64.exe","DestinationFilename":"vulkaninfo.exe"}],"RuntimeFilesWow64":[{"SourcePath":"nvcuda_loader32.dll","DestinationFilename":"nvcuda.dll"},{"SourcePath":"nvcuvid32.dll","DestinationFilename":"nvcuvid.dll"},{"SourcePath":"nvEncodeAPI.dll","DestinationFilename":"nvEncodeAPI.dll"},{"SourcePath":"nvapi.dll","DestinationFilename":"nvapi.dll"},{"SourcePath":"OpenCL32.dll","DestinationFilename":"OpenCL.dll"},{"SourcePath":"vulkan-1-x86.dll","DestinationFilename":"vulkan-1.dll"},{"SourcePath":"vulkaninfo-x86.exe","DestinationFilename":"vulkaninfo.exe"}],"Vendor":"NVIDIA","AdapterLUID":22515,"IsIntegrated":false,"IsDetachable":false,"SupportsDisplay":true,"SupportsCompute":true}]}

Windows Node

The Windows node is ip-192-168-62-122.ap-southeast-2.compute.internal.

$ kubectl get nodes
NAME                                                STATUS   ROLES    AGE   VERSION
ip-192-168-3-191.ap-southeast-2.compute.internal    Ready    <none>   52m   v1.26.2-eks-a59e1f0
ip-192-168-57-218.ap-southeast-2.compute.internal   Ready    <none>   52m   v1.26.2-eks-a59e1f0
ip-192-168-62-122.ap-southeast-2.compute.internal   Ready    <none>   36m   v1.26.2-eks-a59e1f0
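To script this readiness check instead of reading the table by eye, a small parser over the kubectl get nodes output is enough. This is a sketch assuming the default column layout shown above (NAME STATUS ROLES AGE VERSION); nodes_ready is a hypothetical helper, and kubectl get nodes -o json is the more robust source if the columns may change.

```python
from __future__ import annotations


def nodes_ready(output: str, version_prefix: str = "v1.26.") -> dict[str, bool]:
    """Map each node name to True if it is Ready and runs the expected kubelet version.

    Hypothetical helper; assumes the default `kubectl get nodes` columns:
    NAME STATUS ROLES AGE VERSION.
    """
    result: dict[str, bool] = {}
    rows = output.strip().splitlines()
    for row in rows[1:]:  # skip the header row
        fields = row.split()
        if len(fields) < 5:
            continue  # ignore blank or malformed rows
        name, status, version = fields[0], fields[1], fields[4]
        # Statuses such as "Ready,SchedulingDisabled" are treated as not ready.
        result[name] = status == "Ready" and version.startswith(version_prefix)
    return result
```

Calling all(nodes_ready(text).values()) on captured output gives a single pass/fail for the whole cluster.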

Node Description

$ kubectl describe node ip-192-168-62-122.ap-southeast-2.compute.internal
Name:               ip-192-168-62-122.ap-southeast-2.compute.internal
Roles:              <none>
Labels:             alpha.service-controller.kubernetes.io/exclude-balancer=true
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=g4dn.2xlarge
                    beta.kubernetes.io/os=windows
                    eks.amazonaws.com/capacityType=ON_DEMAND
                    eks.amazonaws.com/nodegroup=gpu-win
                    eks.amazonaws.com/nodegroup-image=ami-08588c41a0509f0b5
                    eks.amazonaws.com/sourceLaunchTemplateId=lt-0be69fb76e0657f62
                    eks.amazonaws.com/sourceLaunchTemplateVersion=1
                    failure-domain.beta.kubernetes.io/region=ap-southeast-2
                    failure-domain.beta.kubernetes.io/zone=ap-southeast-2a
                    k8s.io/cloud-provider-aws=994e18435abc6e423a9f3b1c02b25d24
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-192-168-62-122.ap-southeast-2.compute.internal
                    kubernetes.io/os=windows
                    node.kubernetes.io/instance-type=g4dn.2xlarge
                    node.kubernetes.io/windows-build=10.0.20348
                    sps.tensorworks.com.au/gpu=true
                    topology.ebs.csi.aws.com/zone=ap-southeast-2a
                    topology.kubernetes.io/region=ap-southeast-2
                    topology.kubernetes.io/zone=ap-southeast-2a
Annotations:        alpha.kubernetes.io/provided-node-ip: 192.168.62.122
                    csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-037c67e7a38bee2fb"}
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 16 May 2023 16:37:09 +1000
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ip-192-168-62-122.ap-southeast-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Tue, 16 May 2023 17:14:23 +1000
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 16 May 2023 17:12:01 +1000   Tue, 16 May 2023 16:37:04 +1000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 16 May 2023 17:12:01 +1000   Tue, 16 May 2023 16:37:04 +1000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 16 May 2023 17:12:01 +1000   Tue, 16 May 2023 16:37:04 +1000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Tue, 16 May 2023 17:12:01 +1000   Tue, 16 May 2023 16:37:09 +1000   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   192.168.62.122
  ExternalIP:   54.252.175.199
  InternalDNS:  ip-192-168-62-122.ap-southeast-2.compute.internal
  Hostname:     ip-192-168-62-122.ap-southeast-2.compute.internal
  ExternalDNS:  ec2-54-252-175-199.ap-southeast-2.compute.amazonaws.com
Capacity:
  cpu:                                   8
  directx.microsoft.com/compute:         0
  directx.microsoft.com/display:         1
  ephemeral-storage:                     209713148Ki
  memory:                                33072664Ki
  pods:                                  110
  vpc.amazonaws.com/PrivateIPv4Address:  9
Allocatable:
  cpu:                                   8
  directx.microsoft.com/compute:         0
  directx.microsoft.com/display:         1
  ephemeral-storage:                     193271636877
  memory:                                32970264Ki
  pods:                                  110
  vpc.amazonaws.com/PrivateIPv4Address:  9
System Info:
  Machine ID:                 EC2AMAZ-MQ0OFI3
  System UUID:                EC28BAFE-B5AD-0D55-4EE1-A904A846A233
  Boot ID:                    296
  Kernel Version:             10.0.20348.1726
  OS Image:                   Windows Server 2022 Datacenter
  Operating System:           windows
  Architecture:               amd64
  Container Runtime Version:  containerd://1.7.0
  Kubelet Version:            v1.26.2-eks-a59e1f0
  Kube-Proxy Version:         v1.26.2-eks-a59e1f0
ProviderID:                   aws:///ap-southeast-2a/i-037c67e7a38bee2fb
Non-terminated Pods:          (4 in total)
  Namespace                   Name                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                        ------------  ----------  ---------------  -------------  ---
  default                     overprovisioning-windows-8d7b87867-46zfw    0 (0%)        0 (0%)      0 (0%)           0 (0%)         18m
  kube-system                 device-plugin-mcdm-blfsn                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         37m
  kube-system                 device-plugin-wddm-pbjnz                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         37m
  kube-system                 ebs-csi-node-windows-bcdtj                  30m (0%)      300m (3%)   120Mi (0%)       768Mi (2%)     37m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                              Requests    Limits
  --------                              --------    ------
  cpu                                   30m (0%)    300m (3%)
  memory                                120Mi (0%)  768Mi (2%)
  ephemeral-storage                     0 (0%)      0 (0%)
  directx.microsoft.com/compute         0           0
  directx.microsoft.com/display         1           1
  vpc.amazonaws.com/PrivateIPv4Address  2           2
Events:
  Type    Reason                   Age                From        Message
  ----    ------                   ----               ----        -------
  Normal  Starting                 37m                kube-proxy  
  Normal  Starting                 37m                kubelet     Starting kubelet.
  Normal  NodeHasSufficientMemory  37m (x2 over 37m)  kubelet     Node ip-192-168-62-122.ap-southeast-2.compute.internal status is now: NodeHasSufficientPID
  Normal  NodeReady                37m                kubelet     Node ip-192-168-62-122.ap-southeast-2.compute.internal status is now: NodeReady
