metal3 pod crash on baremetal 4.16.0-0.okd-scos-2024-08-21-155613 #2030

Open
snehring opened this issue Sep 17, 2024 · 4 comments

snehring commented Sep 17, 2024

Describe the bug
The metal3 pod is in CrashLoopBackOff due to a failure in metal3-ironic-inspector. This seems very similar to the bug described in OCPBUGS-32304.

Version
4.16.0-0.okd-scos-2024-08-21-155613, baremetal IPI

How reproducible
It's happened on two clusters I've set up since 4.16.0-0.okd-scos-2024-08-21-155613 became available.

Log bundle

+ CONFIG=/etc/ironic-inspector/ironic-inspector.conf
+ export IRONIC_INSPECTOR_ENABLE_DISCOVERY=false
+ IRONIC_INSPECTOR_ENABLE_DISCOVERY=false
+ export INSPECTOR_REVERSE_PROXY_SETUP=true
+ INSPECTOR_REVERSE_PROXY_SETUP=true
+ . /bin/tls-common.sh
++ export IRONIC_CERT_FILE=/certs/ironic/tls.crt
++ IRONIC_CERT_FILE=/certs/ironic/tls.crt
++ export IRONIC_KEY_FILE=/certs/ironic/tls.key
++ IRONIC_KEY_FILE=/certs/ironic/tls.key
++ export IRONIC_CACERT_FILE=/certs/ca/ironic/tls.crt
++ IRONIC_CACERT_FILE=/certs/ca/ironic/tls.crt
++ export IRONIC_INSECURE=true
++ IRONIC_INSECURE=true
++ export 'IRONIC_SSL_PROTOCOL=-ALL +TLSv1.2 +TLSv1.3'
++ IRONIC_SSL_PROTOCOL='-ALL +TLSv1.2 +TLSv1.3'
++ export 'IPXE_SSL_PROTOCOL=-ALL +TLSv1.2 +TLSv1.3'
++ IPXE_SSL_PROTOCOL='-ALL +TLSv1.2 +TLSv1.3'
++ export IRONIC_VMEDIA_SSL_PROTOCOL=ALL
++ IRONIC_VMEDIA_SSL_PROTOCOL=ALL
++ export IRONIC_INSPECTOR_CERT_FILE=/certs/ironic-inspector/tls.crt
++ IRONIC_INSPECTOR_CERT_FILE=/certs/ironic-inspector/tls.crt
++ export IRONIC_INSPECTOR_KEY_FILE=/certs/ironic-inspector/tls.key
++ IRONIC_INSPECTOR_KEY_FILE=/certs/ironic-inspector/tls.key
++ export IRONIC_INSPECTOR_CACERT_FILE=/certs/ca/ironic-inspector/tls.crt
++ IRONIC_INSPECTOR_CACERT_FILE=/certs/ca/ironic-inspector/tls.crt
++ export IRONIC_INSPECTOR_INSECURE=true
++ IRONIC_INSPECTOR_INSECURE=true
++ export IRONIC_VMEDIA_CERT_FILE=/certs/vmedia/tls.crt
++ IRONIC_VMEDIA_CERT_FILE=/certs/vmedia/tls.crt
++ export IRONIC_VMEDIA_KEY_FILE=/certs/vmedia/tls.key
++ IRONIC_VMEDIA_KEY_FILE=/certs/vmedia/tls.key
++ export IPXE_CERT_FILE=/certs/ipxe/tls.crt
++ IPXE_CERT_FILE=/certs/ipxe/tls.crt
++ export IPXE_KEY_FILE=/certs/ipxe/tls.key
++ IPXE_KEY_FILE=/certs/ipxe/tls.key
++ export RESTART_CONTAINER_CERTIFICATE_UPDATED=false
++ RESTART_CONTAINER_CERTIFICATE_UPDATED=false
++ export MARIADB_CACERT_FILE=/certs/ca/mariadb/tls.crt
++ MARIADB_CACERT_FILE=/certs/ca/mariadb/tls.crt
++ export IPXE_TLS_PORT=8084
++ IPXE_TLS_PORT=8084
++ mkdir -p /certs/ironic
++ mkdir -p /certs/ironic-inspector
++ mkdir -p /certs/ca/ironic
mkdir: cannot create directory '/certs/ca/ironic': Permission denied

snehring changed the title from "metal3 pod crash in on baremetal 4.16.0-0.okd-scos-2024-08-21-155613" to "metal3 pod crash on baremetal 4.16.0-0.okd-scos-2024-08-21-155613" on Sep 17, 2024

@snehring

Actually, the issue seems to be a little different. 1002 and 1004, the uid and gid the container is running as, aren't the uid and gid of ironic or ironic-inspector in the container image:

sh-5.1$ id ironic
uid=997(ironic) gid=995(ironic) groups=995(ironic)
sh-5.1$ id ironic-inspector
uid=996(ironic-inspector) gid=994(ironic-inspector) groups=994(ironic-inspector)
sh-5.1$ ls -lan /certs/ca
total 0
drwxrwsr-x. 2 997 994  6 Jun 11 10:39 .
drwxrwsr-x. 1 997 994 44 Sep 17 19:50 ..
sh-5.1$ id
uid=1002(1002) gid=1004 groups=1004

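So the permission errors make sense: the process runs as uid 1002 and gid 1004, which match neither the owner (997) nor the group (994) of /certs/ca, so only the "other" bits (r-x) of drwxrwsr-x apply and mkdir has no write access.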
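
For anyone who wants to double-check that reasoning, here is a minimal Python sketch of the classic POSIX owner/group/other check, fed with the numbers from the `id` and `ls -lan` output above. The function name `can_create_in` is made up for illustration, and the sketch deliberately ignores root, capabilities, ACLs and SELinux, so treat it as an approximation rather than what the kernel actually does in full.

```python
import stat
from typing import Iterable


def can_create_in(mode: int, owner_uid: int, owner_gid: int,
                  uid: int, gids: Iterable[int]) -> bool:
    """Approximate whether (uid, gids) may create an entry in a directory.

    Creating an entry needs both write and search (execute) permission on
    the directory. Exactly one permission class applies: owner if the uid
    matches, else group if any gid matches, else other.
    Hypothetical helper: ignores root, capabilities, ACLs and SELinux.
    """
    if uid == owner_uid:
        needed = stat.S_IWUSR | stat.S_IXUSR
    elif owner_gid in set(gids):
        needed = stat.S_IWGRP | stat.S_IXGRP
    else:
        needed = stat.S_IWOTH | stat.S_IXOTH
    return (mode & needed) == needed


# /certs/ca as listed above: drwxrwsr-x, owner uid 997, group gid 994.
CERTS_CA_MODE = 0o2775

# The container's runtime identity (uid=1002, gid=1004) only gets "other" bits.
runtime_ok = can_create_in(CERTS_CA_MODE, 997, 994, uid=1002, gids=[1004])

# The image's ironic user (uid=997, gid=995) matches as owner.
ironic_ok = can_create_in(CERTS_CA_MODE, 997, 994, uid=997, gids=[995])

# The image's ironic-inspector user (uid=996, gid=994) matches via the group.
inspector_ok = can_create_in(CERTS_CA_MODE, 997, 994, uid=996, gids=[994])

print(runtime_ok, ironic_ok, inspector_ok)
```

Under those assumptions, only the runtime identity 1002:1004 is denied, which lines up with the `mkdir: cannot create directory '/certs/ca/ironic': Permission denied` error in the log.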

@snehring

I think the issue lies with prepare-image.sh. Per the image manifest, the package list file created for OKD is called main-packages-list.okd instead of main-packages-list.ocp.

@snehring

I think I've got the root of the problem figured out and put it in openshift/ironic-image#581.

@snehring

snehring commented Oct 1, 2024

A more correct fix would be to backport openshift/cluster-baremetal-operator#430 to 4.16, or at least the part that removes the uid/gid customization. There's no rationale in that PR for the change, and I have no access to the Red Hat Jira, which strongly limits my ability to contribute.
