
Commit cad1b61

fiber -> fabric

Signed-off-by: vsoch <[email protected]>
1 parent a122381 commit cad1b61

File tree

1 file changed: +3 -3 lines changed


_posts/2024/2024-06-11-usernetes.md (+3 -3)
@@ -11,15 +11,15 @@ But there were some weaknesses in the setup. Specifically, <a href="https://arxiv.

### Moving Earth with Terraform

- The setup (using Terraform) had many gotchas, specifically related to ensuring that we deployed with 2 or more availability zones, but only gave one to the managed node group to use, since the Usernetes nodes needed the internal IP addresses to connect. They could not see one another between different availability zones, but the deployment wouldn't work if you only asked for one. The next issue was the Flux broker needing predictable hostnames, and AWS not having any reliable way (aside from a robust setup with Route 53, which was too much for a "bring up and throw away" cluster) to create them. The headless Service in Kubernetes solves this for us (we can predict them), but this setup is running Flux directly on the VM, and the instance hostnames are garbled strings and numbers. So instead, we have to query the API to derive them on startup using a selector with name and instance type, wait for the expected number of instances (from the Terraform config), and write the Flux broker file dynamically. That also handles variables for the cluster from the <a href="https://github.com/converged-computing/flux-usernetes/blob/4790b59b81e7350094f58a1e14243eb3a904b015/aws/tf/main.tf" target="_blank">terraform main file</a> such as the network adapter, number of instances, and selectors. Next, the security group needs ingress and egress to itself, and the health check needs to be set to a huge value so your entire cluster isn't brought down by one node deemed unhealthy right in the middle (this happened to me in prototyping, and since I hadn't automated everything yet I'd fall to my hands and knees in anguish, just kidding). The elastic fiber adapter (EFA) and placement group are fairly important for good networking, but in practice I've seen quite variable performance regardless. In other words, you can easily get a bad node, and more work is needed to understand this.
+ The setup (using Terraform) had many gotchas, specifically related to ensuring that we deployed with 2 or more availability zones, but only gave one to the managed node group to use, since the Usernetes nodes needed the internal IP addresses to connect. They could not see one another between different availability zones, but the deployment wouldn't work if you only asked for one. The next issue was the Flux broker needing predictable hostnames, and AWS not having any reliable way (aside from a robust setup with Route 53, which was too much for a "bring up and throw away" cluster) to create them. The headless Service in Kubernetes solves this for us (we can predict them), but this setup is running Flux directly on the VM, and the instance hostnames are garbled strings and numbers. So instead, we have to query the API to derive them on startup using a selector with name and instance type, wait for the expected number of instances (from the Terraform config), and write the Flux broker file dynamically. That also handles variables for the cluster from the <a href="https://github.com/converged-computing/flux-usernetes/blob/4790b59b81e7350094f58a1e14243eb3a904b015/aws/tf/main.tf" target="_blank">terraform main file</a> such as the network adapter, number of instances, and selectors. Next, the security group needs ingress and egress to itself, and the health check needs to be set to a huge value so your entire cluster isn't brought down by one node deemed unhealthy right in the middle (this happened to me in prototyping, and since I hadn't automated everything yet I'd fall to my hands and knees in anguish, just kidding). The elastic fabric adapter (EFA) and placement group are fairly important for good networking, but in practice I've seen quite variable performance regardless. In other words, you can easily get a bad node, and more work is needed to understand this.
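
As an illustration of that "query the API, wait for the expected count, derive hostnames" step, a minimal sketch in Python with boto3 might look like the following. This is not the repository's actual startup script; the Name tag selector, instance type, and expected instance count are placeholder assumptions.

```python
# Sketch: derive predictable hostnames for the Flux broker config by querying
# EC2 with the same kind of name + instance-type selector described above.
import time

import boto3

EXPECTED_COUNT = 4                  # number of instances in the Terraform config (assumption)
NAME_SELECTOR = "flux-usernetes*"   # Name tag selector (assumption)
INSTANCE_TYPE = "hpc6a.48xlarge"    # instance type selector (assumption)

ec2 = boto3.client("ec2")


def cluster_hostnames():
    """Return sorted private DNS names of running instances matching the selectors."""
    names = []
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[
            {"Name": "tag:Name", "Values": [NAME_SELECTOR]},
            {"Name": "instance-type", "Values": [INSTANCE_TYPE]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                names.append(instance["PrivateDnsName"])
    return sorted(names)


# Wait until the expected number of instances is visible, then emit the
# hostnames that the dynamically written broker file would reference.
hosts = cluster_hostnames()
while len(hosts) < EXPECTED_COUNT:
    time.sleep(10)
    hosts = cluster_hostnames()
print("\n".join(hosts))
```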

### Memory Limits

The next problem is about memory, and perhaps this was the most challenging thing to figure out. It took me upwards of a month to get our applications running, because MPI continually threw up on me with an error about memory. It didn't make any sense - the hpc family of instances on AWS were configured not to set any limits on memory, and we had lots of it. I had tried every MPI variant and install under the sun. It wasn't until I talked with one of our brilliant Flux developers that he mentioned I might test what Flux sees - running a command to see the memory limit as a flux job. And what did we see? It was limited! It turns out that starting Flux as a service (as this systemd setup did) was setting a limit. Why? Because <a href="https://www.freedesktop.org/software/systemd/man/latest/systemd.exec.html#Process%20Properties" target="_blank">systemd is a jerk</a>. We never saw this in the Flux Operator because we don't use systemd - we just start the brokers directly. The fix was to add "LimitMEMLOCK=infinity" to the flux service file, and then our HPC application ran! It was an amazing moment, because it meant we were a tiny bit closer to using this setup. I pinged a lot of AWS folks for help, and although we figured it out separately, I want to say a huge thank you for looking into possible issues for this bug! I hope we helped uncover this insight for future folks.
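
A quick way to reproduce that diagnostic is to check RLIMIT_MEMLOCK from inside a Flux job (e.g. submitted with flux run) and compare it to a plain shell on the same node. This is an illustrative check in Python, not the exact command used at the time.

```python
# Sketch: print the memlock limit the current process sees. Run it once
# directly on the node and once as a Flux job and compare the two.
import resource


def show(limit):
    """Render a limit value; RLIM_INFINITY means no limit is being imposed."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
print(f"RLIMIT_MEMLOCK soft={show(soft)} hard={show(hard)}")

# If this reports a small finite value under Flux but "unlimited" outside of
# it, the limit is coming from the service manager, which is exactly the
# LimitMEMLOCK=infinity situation described above.
```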

- ### Elastic Fiber Adapter
+ ### Elastic Fabric Adapter

- With the above, we had Flux running on bare metal, and running performantly with the elastic fiber adapter. But now - how to spin up a user-space Kubernetes cluster that is running inside of Flux, and in that cluster, bring up the Flux Operator (another Flux cluster) that runs the same application? I first needed an automated way to set the entire thing up - and ssh'ing into many nodes wasn't something I wanted to do for more than 2-3 nodes. I was able to bring up the control plane on the lead broker node, then use flux exec and flux archive to send over the join command to the workers and issue a command to run a script to start and connect. After installing the Flux Operator we then needed to expose the EFA driver to the container. I needed to customize the daemonset installer to work in my Usernetes cluster, and I was able to do that by removing a lot of the selectors that expected AWS instances for nodes. And then - everything worked! Almost...
+ With the above, we had Flux running on bare metal, and running performantly with the elastic fabric adapter. But now - how to spin up a user-space Kubernetes cluster that is running inside of Flux, and in that cluster, bring up the Flux Operator (another Flux cluster) that runs the same application? I first needed an automated way to set the entire thing up - and ssh'ing into many nodes wasn't something I wanted to do for more than 2-3 nodes. I was able to bring up the control plane on the lead broker node, then use flux exec and flux archive to send over the join command to the workers and issue a command to run a script to start and connect. After installing the Flux Operator we then needed to expose the EFA driver to the container. I needed to customize the daemonset installer to work in my Usernetes cluster, and I was able to do that by removing a lot of the selectors that expected AWS instances for nodes. And then - everything worked! Almost...
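
As a rough sketch of that "no ssh" distribution step, something along these lines could stage the join command with flux archive and start the workers with flux exec. The file paths, archive name, worker script, and exact flag spellings are assumptions, not the post's actual automation.

```python
# Sketch: from the lead broker (rank 0, where the control plane runs), share
# the join command with the other ranks and kick off a worker start script.
import subprocess


def run(*cmd):
    """Echo a command and run it, failing loudly if it errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


# Stage the join command written by the control plane so worker ranks can
# fetch it through Flux instead of scp/ssh (path is hypothetical).
run("flux", "archive", "create", "--name=join", "./join-command")

# On every rank except the lead broker, extract the file and run the
# (hypothetical) script that starts the Usernetes worker and joins it.
run("flux", "exec", "-r", "all", "-x", "0", "flux", "archive", "extract", "--name=join")
run("flux", "exec", "-r", "all", "-x", "0", "bash", "./start-worker.sh")
```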

### It's that slirpy guy, or is it?