Commit 760ffbb

Merge branch 'feat/tf-multiple-networks' into feat/k3s-node-ip
2 parents e687740 + cd423b5 commit 760ffbb

18 files changed, +152 -38 lines changed

Diff for: README.md (+7 -5)

@@ -82,33 +82,35 @@ And generate secrets for it:

 Create an OpenTofu variables file to define the required infrastructure, e.g.:

-# environments/$ENV/terraform/terraform.tfvars:
+# environments/$ENV/tofu/tofu.tfvars:

 cluster_name = "mycluster"
 cluster_net = "some_network" # *
 cluster_subnet = "some_subnet" # *
 key_pair = "my_key" # *
 control_node_flavor = "some_flavor_name"
 login = {
+# Arbitrary group name for these login nodes
 interactive = {
 nodes: ["login-0"]
-flavor: "login_flavor_name"
+flavor: "login_flavor_name" # *
 }
 }
 cluster_image_id = "rocky_linux_9_image_uuid"
 compute = {
+# Group name used for compute node partition definition
 general = {
 nodes: ["compute-0", "compute-1"]
-flavor: "compute_flavor_name"
+flavor: "compute_flavor_name" # *
 }
 }

-Variables marked `*` refer to OpenStack resources which must already exist. The above is a minimal configuration - for all variables and descriptions see `environments/$ENV/terraform/terraform.tfvars`.
+Variables marked `*` refer to OpenStack resources which must already exist. The above is a minimal configuration - for all variables and descriptions see `environments/$ENV/tofu/tofu.tfvars`.

 To deploy this infrastructure, ensure the venv and the environment are [activated](#create-a-new-environment) and run:

 export OS_CLOUD=openstack
-cd environments/$ENV/terraform/
+cd environments/$ENV/tofu/
 tofu init
 tofu apply

Diff for: ansible/ci/retrieve_inventory.yml (+1 -1)

@@ -7,7 +7,7 @@
 gather_facts: no
 vars:
 cluster_prefix: "{{ undef(hint='cluster_prefix must be defined') }}" # e.g. ci4005969475
-ci_vars_file: "{{ appliances_environment_root + '/terraform/' + lookup('env', 'CI_CLOUD') }}.tfvars"
+ci_vars_file: "{{ appliances_environment_root + '/tofu/' + lookup('env', 'CI_CLOUD') }}.tfvars"
 cluster_network: "{{ lookup('ansible.builtin.ini', 'cluster_net', file=ci_vars_file, type='properties') | trim('\"') }}"
 tasks:
 - name: Get control host IP

Diff for: ansible/roles/block_devices/README.md (+1 -1)

@@ -11,7 +11,7 @@ This is a convenience wrapper around the ansible modules:

 To avoid issues with device names changing after e.g. reboots, devices are identified by serial number and mounted by filesystem UUID.

-**NB:** This role is ignored[^1] during Packer builds as block devices will not be attached to the Packer build VMs. This role is therefore deprecated and it is suggested that `cloud-init` is used instead. See e.g. `environments/skeleton/{{cookiecutter.environment}}/terraform/control.userdata.tpl`.
+**NB:** This role is ignored[^1] during Packer builds as block devices will not be attached to the Packer build VMs. This role is therefore deprecated and it is suggested that `cloud-init` is used instead. See e.g. `environments/skeleton/{{cookiecutter.environment}}/tofu/control.userdata.tpl`.

 [^1]: See `environments/common/inventory/group_vars/builder/defaults.yml`

Diff for: ansible/roles/compute_init/README.md (+1 -1)

@@ -43,7 +43,7 @@ The following roles/groups are currently fully functional:
 - `openhpc`: all functionality

 The above may be enabled by setting the compute_init_enable property on the
-terraform compute variable.
+tofu compute variable.

 # Development/debugging

Diff for: ansible/roles/freeipa/README.md (+2 -2)

@@ -7,7 +7,7 @@ Support FreeIPA in the appliance. In production use it is expected the FreeIPA s

 ## Usage
 - Add hosts to the `freeipa_client` group and run (at a minimum) the `ansible/iam.yml` playbook.
-- Host names must match the domain name. By default (using the skeleton Terraform) hostnames are of the form `nodename.cluster_name.cluster_domain_suffix` where `cluster_name` and `cluster_domain_suffix` are Terraform variables.
+- Host names must match the domain name. By default (using the skeleton OpenTofu) hostnames are of the form `nodename.cluster_name.cluster_domain_suffix` where `cluster_name` and `cluster_domain_suffix` are OpenTofu variables.
 - Hosts discover the FreeIPA server FQDN (and their own domain) from DNS records. If DNS servers are not set this is not set from DHCP, then use the `resolv_conf` role to configure this. For example when using the in-appliance FreeIPA development server:

 ```ini
@@ -28,7 +28,7 @@ Support FreeIPA in the appliance. In production use it is expected the FreeIPA s
 - For production use with an external FreeIPA server, a random one-time password (OTP) must be generated when adding hosts to FreeIPA (e.g. using `ipa host-add --random ...`). This password should be set as a hostvar `freeipa_host_password`. Initial host enrolment will use this OTP to enrol the host. After this it becomes irrelevant so it does not need to be committed to git. This approach means the appliance does not require the FreeIPA administrator password.
 - For development use with the in-appliance FreeIPA server, `freeipa_host_password` will be automatically generated in memory.
 - The `control` host must define `appliances_state_dir` (on persistent storage). This is used to back-up keytabs to allow FreeIPA clients to automatically re-enrol after e.g. reimaging. Note that:
-- This is implemented when using the skeleton Terraform; on the control node `appliances_state_dir` defaults to `/var/lib/state` which is mounted from a volume.
+- This is implemented when using the skeleton OpenTofu; on the control node `appliances_state_dir` defaults to `/var/lib/state` which is mounted from a volume.
 - Nodes are not re-enroled by a [Slurm-driven reimage](../../collections/ansible_collections/stackhpc/slurm_openstack_tools/roles/rebuild/README.md) (as that does not run this role).
 - If both a backed-up keytab and `freeipa_host_password` exist, the former is used.

Diff for: docs/networks.md (+102)

@@ -0,0 +1,102 @@
+# Networking
+
+The default OpenTofu configurations in the appliance do not provision networks,
+subnets or associated infrastructure such as routers. The requirements are that:
+1. At least one network exists.
+2. The first network defined, referred to as the "access network", spans all nodes.
+3. Only one subnet per network is attached to nodes.
+4. At least one network on each node provides outbound internet access (either
+directly, or via a proxy).
+
+Furthermore, it is recommended that the deploy host has an interface on the
+access network. While it is possible to e.g. use a floating IP on a login node
+as an SSH proxy to access the other nodes, this can create problems in recovering
+the cluster if the login node is unavailable and can make Ansible problems harder
+to debug.
+
+This page describes supported configurations and how to implement them using
+the OpenTofu variables. These will normally be set in
+`environments/site/tofu/terraform.tfvars` for the site base environment. If they
+need to be overridden for specific environments, this can be done via an OpenTofu
+module as discussed [here](./production.md).
+
+Note that if an OpenStack subnet has a gateway IP defined then nodes with ports
+attached to that subnet will get a default route set via that gateway.
+
+## Single network
+This is the simplest possible configuration. A single network and subnet are
+used for all nodes. The subnet provides outbound internet access via the default
+route defined by the subnet gateway (often an OpenStack router to an external
+network).
+
+```terraform
+cluster_networks = [
+{
+network = "netA"
+subnet = "subnetA"
+}
+]
+...
+```
+
+## Multiple homogeneous networks
+This is similar to the above, except each node has multiple networks. The first
+network, "netA", is the access network. Note that only one subnet may have a
+gateway defined, otherwise default routes via both subnets will be present,
+causing routing problems. The example also shows the second network (netB) using
+direct-type vNICs for RDMA.
+
+```terraform
+cluster_networks = [
+{
+network = "netA"
+subnet = "subnetA"
+},
+{
+network = "netB"
+subnet = "subnetB"
+},
+]
+
+vnic_types = {
+netB = "direct"
+}
+...
+```
+
+
+## Additional networks on some nodes
+
+This example shows how to modify variables for specific node groups. In this
+case a baremetal node group has a second network attached. As above, only a
+single subnet can have a gateway IP.
+
+```terraform
+cluster_networks = [
+{
+network = "netA"
+subnet = "subnetA"
+}
+]
+
+compute = {
+general = {
+nodes = ["general-0", "general-1"]
+}
+baremetal = {
+nodes = ["baremetal-0", "baremetal-1"]
+extra_networks = [
+{
+network = "netB"
+subnet = "subnetB"
+}
+]
+vnic_types = {
+netA = "baremetal"
+netB = "baremetal"
+...
+}
+}
+}
+...
+```

Diff for: docs/openondemand.md (+1 -1)

@@ -33,7 +33,7 @@ See the [ansible/roles/openondemand/README.md](../ansible/roles/openondemand/REA
 The following variables have been given default values to allow Open OnDemand to work in a newly created environment without additional configuration, but generally should be overridden in `environment/site/inventory/group_vars/all/` with site-specific values:
 - `openondemand_servername` - this must be defined for both `openondemand` and `grafana` hosts (when Grafana is enabled). Default is `ansible_host` (i.e. the IP address) of the first host in the `openondemand` group.
 - `openondemand_auth` and any corresponding options. Defaults to `basic_pam`.
-- `openondemand_desktop_partition` and `openondemand_jupyter_partition` if the corresponding inventory groups are defined. Defaults to the first compute group defined in the `compute` Terraform variable in `environments/$ENV/terraform`.
+- `openondemand_desktop_partition` and `openondemand_jupyter_partition` if the corresponding inventory groups are defined. Defaults to the first compute group defined in the `compute` OpenTofu variable in `environments/$ENV/tofu`.

 It is also recommended to set:
 - `openondemand_dashboard_support_url`

Diff for: docs/operations.md (+2 -2)

@@ -57,10 +57,10 @@ This is a usually a two-step process:

 - If new nodes are required, define a new node group by adding an entry to the `compute` mapping in `environments/$ENV/tofu/main.tf` assuming the default OpenTofu configuration:
 - The key is the partition name.
-- The value should be a mapping, with the parameters defined in `environments/$SITE_ENV/terraform/compute/variables.tf`, but in brief will need at least `flavor` (name) and `nodes` (a list of node name suffixes).
+- The value should be a mapping, with the parameters defined in `environments/$SITE_ENV/tofu/compute/variables.tf`, but in brief will need at least `flavor` (name) and `nodes` (a list of node name suffixes).
 - Add a new partition to the partition configuration as described under [Modifying Slurm Partition-specific Configuration](#Modifying-Slurm-Partition-specific-Configuration).

-Deploying the additional nodes and applying these changes requires rerunning both Terraform and the Ansible site.yml playbook - follow [Deploying a Cluster](#Deploying-a-Cluster).
+Deploying the additional nodes and applying these changes requires rerunning both OpenTofu and the Ansible site.yml playbook - follow [Deploying a Cluster](#Deploying-a-Cluster).

 # Adding Additional Packages
 By default, the following utility packages are installed during the StackHPC image build:

Diff for: docs/persistent-state.md (+4 -4)

@@ -13,14 +13,14 @@ If using the `environments/common/layout/everything` Ansible groups template (wh

 Note that if `appliances_state_dir` is defined, the path it gives must exist and should be owned by root. Directories will be created within this with appropriate permissions for each item of state defined above. Additionally, the systemd units for the services listed above will be modified to require `appliances_state_dir` to be mounted before service start (via the `systemd` role).

-A new cookiecutter-produced environment supports persistent state in the default Terraform (see `environments/skeleton/{{cookiecutter.environment}}/terraform/`) by:
+A new cookiecutter-produced environment supports persistent state in the default OpenTofu (see `environments/skeleton/{{cookiecutter.environment}}/tofu/`) by:

-- Defining a volume with a default size of 150GB - this can be controlled by the Terraform variable `state_volume_size`.
+- Defining a volume with a default size of 150GB - this can be controlled by the OpenTofu variable `state_volume_size`.
 - Attaching it to the control node.
 - Defining cloud-init userdata for the control node which formats and mounts this volume at `/var/lib/state`.
-- Defining `appliances_state_dir: /var/lib/state` for the control node in the (Terraform-templated) `inventory/hosts` file.
+- Defining `appliances_state_dir: /var/lib/state` for the control node in the (OpenTofu-templated) `inventory/hosts` file.

-**NB: The default Terraform is provided as a working example and for internal CI use - therefore this volume is deleted when running `terraform destroy` - this may not be appropriate for a production environment.**
+**NB: The default OpenTofu is provided as a working example and for internal CI use - therefore this volume is deleted when running `tofu destroy` - this may not be appropriate for a production environment.**

 In general, the Prometheus data is likely to be the only sizeable state stored. The size of this can be influenced through [Prometheus role variables](https://github.com/cloudalchemy/ansible-prometheus#role-variables), e.g.:
 - `prometheus_storage_retention` - [default](../environments/common/inventory/group_vars/all/prometheus.yml) 31d

Diff for: docs/production.md (+5 -5)

@@ -41,15 +41,15 @@ and referenced from the `site` and `production` environments, e.g.:
 - OpenTofu configurations should be defined in the `site` environment and used
 as a module from the other environments. This can be done with the
 cookie-cutter generated configurations:
-- Delete the *contents* of the cookie-cutter generated `terraform/` directories
+- Delete the *contents* of the cookie-cutter generated `tofu/` directories
 from the `production` and `staging` environments.
-- Create a `main.tf` in those directories which uses `site/terraform/` as a
+- Create a `main.tf` in those directories which uses `site/tofu/` as a
 [module](https://opentofu.org/docs/language/modules/), e.g. :

 ```
 ...
 module "cluster" {
-source = "../../site/terraform/"
+source = "../../site/tofu/"

 cluster_name = "foo"
 ...
@@ -61,7 +61,7 @@ and referenced from the `site` and `production` environments, e.g.:
 into the module block.
 - Environment-independent variables (e.g. maybe `cluster_net` if the
 same is used for staging and production) should be set as *defaults*
-in `environments/site/terraform/variables.tf`, and then don't need to
+in `environments/site/tofu/variables.tf`, and then don't need to
 be passed in to the module.

 - Vault-encrypt secrets. Running the `generate-passwords.yml` playbook creates
@@ -102,7 +102,7 @@ and referenced from the `site` and `production` environments, e.g.:

 - Consider whether having (read-only) access to Grafana without login is OK. If not, remove `grafana_auth_anonymous` in `environments/$ENV/inventory/group_vars/all/grafana.yml`

-- Modify `environments/site/terraform/nodes.tf` to provide fixed IPs for at least
+- Modify `environments/site/tofu/nodes.tf` to provide fixed IPs for at least
 the control node, and (if not using FIPs) the login node(s):

 ```

Diff for: docs/upgrades.md (+1 -1)

@@ -62,7 +62,7 @@ All other commands should be run on the Ansible deploy host.

 1. If required, build an "extra" image with local modifications, see [docs/image-build.md](./image-build.md).

-1. Modify your site-specific environment to use this image, e.g. via `cluster_image_id` in `environments/$SITE_ENV/terraform/variables.tf`.
+1. Modify your site-specific environment to use this image, e.g. via `cluster_image_id` in `environments/$SITE_ENV/tofu/variables.tf`.

 1. Test this in your staging cluster.

Diff for: environments/README.md (+1 -1)

@@ -7,7 +7,7 @@ typically contains all the environment specific config. It must output an ansibl
 that conforms to the structure we expect. Providing that the inventory conforms to this
 structure, the ansible code will still be able to interface with that inventory.
 This allows the ansible code to be decoupled from the code that deployed the infrastructure
-and can therefore be tool and cloud agnostic i.e we don't care if you use terraform or ansible.
+and can therefore be tool and cloud agnostic.

 A pattern we use is to chain multiple ansible inventories to provide a crude form of inheritance. e.g

Diff for: environments/common/README.md (+1 -1)

@@ -2,7 +2,7 @@

 This contains an inventory that defines variables which are common between the
 `production` and `development` environments. It is not intended to be used in
-a standalone fashion to deploy infrastructure (i.e no terraform), but is instead
+a standalone fashion to deploy infrastructure, but is instead
 referenced in `ansible.cfg` from the `production` and `development` configurations.

 The pattern we use is that all resources referenced in the inventory

Diff for: environments/skeleton/{{cookiecutter.environment}}/tofu/compute.tf (+3)

@@ -20,6 +20,9 @@ module "compute" {
 vnic_profiles = lookup(each.value, "vnic_profiles", var.vnic_profiles)
 volume_backed_instances = lookup(each.value, "volume_backed_instances", var.volume_backed_instances)
 root_volume_size = lookup(each.value, "root_volume_size", var.root_volume_size)
+
+# optionally set for group
+networks = concat(var.cluster_networks, lookup(each.value, "extra_networks", []))
 extra_volumes = lookup(each.value, "extra_volumes", {})
 compute_init_enable = lookup(each.value, "compute_init_enable", [])
 ignore_image_changes = lookup(each.value, "ignore_image_changes", false)
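
The `networks` expression added in this hunk can be tried in isolation. The following is a minimal sketch, not part of the commit: it assumes hypothetical values borrowed from the `docs/networks.md` example (`netA`, `netB`, the `general` and `baremetal` groups) and uses `locals` as stand-ins for the real `var.cluster_networks` and `var.compute` inputs. The local name `group_networks` is also hypothetical.

```terraform
# Standalone sketch: needs no providers or cloud credentials.
locals {
  # Stand-in for var.cluster_networks (hypothetical values)
  cluster_networks = [
    { network = "netA", subnet = "subnetA" },
  ]

  # Stand-in for var.compute (hypothetical values)
  compute = {
    general = {
      nodes = ["general-0", "general-1"]
    }
    baremetal = {
      nodes = ["baremetal-0", "baremetal-1"]
      extra_networks = [
        { network = "netB", subnet = "subnetB" },
      ]
    }
  }

  # Same expression as in the diff, evaluated once per group:
  # cluster-wide networks first, then any group-specific extras.
  group_networks = {
    for name, group in local.compute :
    name => concat(local.cluster_networks, lookup(group, "extra_networks", []))
  }
}

output "group_networks" {
  value = local.group_networks
}
```

Evaluating `local.group_networks` (for example in `tofu console` after `tofu init`) should show only `netA` for `general`, and `netA` followed by `netB` for `baremetal`. Because the cluster networks always come first, a group can append networks but cannot drop or reorder the shared ones.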

Diff for: environments/skeleton/{{cookiecutter.environment}}/tofu/control.tf (+2 -4)

@@ -1,7 +1,5 @@
 locals {
 control_volumes = concat([openstack_blockstorage_volume_v3.state], var.home_volume_size > 0 ? [openstack_blockstorage_volume_v3.home][0] : [])
-
-access_network_name = length(var.cluster_networks) == 1 ? var.cluster_networks[0].network : [for n in var.cluster_networks: n if lookup(n, "access_network", false)][0].network
 }

 resource "openstack_networking_port_v2" "control" {
@@ -55,14 +53,14 @@ resource "openstack_compute_instance_v2" "control" {
 for_each = {for net in var.cluster_networks: net.network => net}
 content {
 port = openstack_networking_port_v2.control[network.key].id
-access_network = network.key == local.access_network_name
+access_network = network.key == var.cluster_networks[0].network
 }
 }

 metadata = {
 environment_root = var.environment_root
 k3s_token = local.k3s_token
-access_ip = openstack_networking_port_v2.control[local.access_network_name].all_fixed_ips[0]
+access_ip = openstack_networking_port_v2.control[var.cluster_networks[0].network].all_fixed_ips[0]
 }

 user_data = <<-EOF
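
With this change the access network is chosen by its position in `cluster_networks` rather than by an `access_network` flag on an entry. A minimal sketch of the new comparison follows, not part of the commit: the network names are hypothetical, `locals` stands in for `var.cluster_networks`, and `access_flags` is an illustrative name only.

```terraform
# Standalone sketch: needs no providers or cloud credentials.
locals {
  # Stand-in for var.cluster_networks (hypothetical values)
  cluster_networks = [
    { network = "netA", subnet = "subnetA" },
    { network = "netB", subnet = "subnetB" },
  ]

  # Mirrors `network.key == var.cluster_networks[0].network` from the diff:
  # only the first list entry is treated as the access network.
  access_flags = {
    for net in local.cluster_networks :
    net.network => net.network == local.cluster_networks[0].network
  }
}

output "access_flags" {
  value = local.access_flags
}
```

Here `netA` maps to `true` and `netB` to `false`. Swapping the two entries would make `netB` the access network, so the ordering of `cluster_networks` is now significant.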

Diff for: environments/skeleton/{{cookiecutter.environment}}/tofu/login.tf (+4 -1)

@@ -11,14 +11,17 @@ module "login" {
 cluster_domain_suffix = var.cluster_domain_suffix

 # can be set for group, defaults to top-level value:
-networks = lookup(each.value, "networks", var.cluster_networks)
 image_id = lookup(each.value, "image_id", var.cluster_image_id)
 vnic_types = lookup(each.value, "vnic_types", var.vnic_types)
 vnic_profiles = lookup(each.value, "vnic_profiles", var.vnic_profiles)
 volume_backed_instances = lookup(each.value, "volume_backed_instances", var.volume_backed_instances)
 root_volume_size = lookup(each.value, "root_volume_size", var.root_volume_size)
+
+# optionally set for group
+networks = concat(var.cluster_networks, lookup(each.value, "extra_networks", []))
 extra_volumes = lookup(each.value, "extra_volumes", {})

+# can't be set for login
 compute_init_enable = []
 ignore_image_changes = false
