47 changes: 37 additions & 10 deletions scripts/rhel-ha-aws-check.sh
@@ -53,9 +53,23 @@ for HAPKG in ${HAPKGS}; do
fi
fi

yum -y install ${HAPKG}
if [ $? -ne 0 ]; then
echo "Install of ${HAPKG} failed."
# Retry logic for yum lock errors
MAX_RETRIES=5
RETRY_COUNT=0
SUCCESS=0
while [ ${RETRY_COUNT} -lt ${MAX_RETRIES} ] && [ ${SUCCESS} -eq 0 ]; do
yum -y install ${HAPKG}
if [ $? -eq 0 ]; then
SUCCESS=1
else
echo "Attempt $((RETRY_COUNT+1)) to install ${HAPKG} failed. Retrying in 5 seconds..."
sleep 5
RETRY_COUNT=$((RETRY_COUNT+1))
fi
done

if [ ${SUCCESS} -ne 1 ]; then
echo "Install of ${HAPKG} failed after multiple retries."
exit 1
else
rpm -q ${HAPKG}
@@ -252,6 +266,9 @@ for R in ${RAS}; do
exit 1
fi
Comment on lines 266 to 267

suggestion: Consider logging the specific error encountered after retries fail.

Including the last error message will make it easier to diagnose why the installation failed.

Suggested change
exit 1
fi
INSTALL_OUTPUT=$(yum -y install ${HAPKG} 2>&1)
if [ $? -eq 0 ]; then
SUCCESS=1
else
LAST_ERROR="${INSTALL_OUTPUT}"
echo "Attempt $((RETRY_COUNT+1)) to install ${HAPKG} failed. Retrying in 5 seconds..."
sleep 5
RETRY_COUNT=$((RETRY_COUNT+1))
fi
done
if [ ${SUCCESS} -ne 1 ]; then
echo "Install of ${HAPKG} failed after multiple retries."
echo "Last error encountered:"
echo "${LAST_ERROR}"
exit 1
else
rpm -q ${HAPKG}
exit 1
fi
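The retry-and-report idea in this suggestion could also be factored into a small helper. The following is only a sketch under stated assumptions: `retry_cmd` and its arguments are hypothetical names that do not appear in the PR, and `LAST_ERROR` is reused from the suggestion above. It captures the combined stdout/stderr of each attempt so the last failure can be printed once the retries are exhausted.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): retry a command and keep the
# output of the most recent failed attempt for the final error message.

# retry_cmd MAX_RETRIES DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as one attempt succeeds, 1 if every attempt fails.
# LAST_ERROR holds the combined stdout/stderr of the last failed attempt.
retry_cmd() {
    local max_retries="$1" delay="$2"
    shift 2
    local attempt=1 output
    LAST_ERROR=""
    while [ "${attempt}" -le "${max_retries}" ]; do
        # The exit status of the assignment is that of the substituted command.
        if output=$("$@" 2>&1); then
            printf '%s\n' "${output}"
            return 0
        fi
        LAST_ERROR="${output}"
        echo "Attempt ${attempt} of ${max_retries} failed: $*" >&2
        attempt=$((attempt + 1))
        # Sleep only if another attempt follows.
        if [ "${attempt}" -le "${max_retries}" ]; then
            sleep "${delay}"
        fi
    done
    return 1
}

# Possible use inside the package loop (left commented out here):
# if ! retry_cmd 5 5 yum -y install "${HAPKG}"; then
#     echo "Install of ${HAPKG} failed after multiple retries."
#     echo "Last error encountered:"
#     echo "${LAST_ERROR}"
#     exit 1
# fi
```

One trade-off of this approach: because the output is captured, yum's progress is only shown after an attempt finishes rather than as it runs.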


# Here the script can exit with a "Warning: required resource options..."
# The --force flag is intended to bypass this, but it still prints the warning
# and the resource is created successfully, so the script continues.
pcs resource create --force ${RA} ocf:heartbeat:${RA}
if [ $? -ne 0 ]; then
echo "Cannot create resource agent ${RA} resource."
@@ -308,35 +325,45 @@ fi
for R in ${RAS}; do
RA=$(basename ${R})

# Disable the resource if it's currently running or failed
pcs resource disable --wait=5 ${RA}
# This command can return a non-zero exit code if the resource is already stopped.
# The script handles this by checking for a return code of 0 or 1.
if [ $? -ne 0 -a $? -ne 1 ]; then
echo "Cannot disable resource agent ${RA}."
exit 1
fi

# Clean up any failed operations for the resource
pcs resource cleanup ${RA}
if [ $? -ne 0 -a $? -ne 1 ]; then
echo "Cannot cleanup resource agent ${RA} resource."
exit 1
fi

pcs resource delete ${RA}
if [ $? -ne 0 ]; then
echo "Cannot delete resource agent ${RA} resource."
exit 1
# Check if the resource exists before attempting to delete it
# This prevents the script from failing if the resource is already gone.
pcs resource config ${RA} &> /dev/null
Collaborator


I would prefer that we move toward reusable functions and more readable error logging.
A simple encapsulation into a delete-resource function would improve readability and maintainability.
It would also avoid having to add explanatory comments to our code, which to me is a code smell in itself.
As an example:

Suggested change
pcs resource config ${RA} &> /dev/null
delete_resource() {
local RA="$1"
if pcs resource show "$RA" &> /dev/null; then
if pcs resource delete "$RA"; then
echo "Resource '$RA' deleted successfully."
else
echo "failed to delete resource '$RA'."
exit 1
fi
else
echo "resource '$RA' not found. Skipping deletion."
fi
}
delete_resource "${RA}"
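A possible variant of the suggested function, offered only as a sketch: it picks `pcs resource show` or `pcs resource config` based on `RHELMAJOR`, mirroring the version check the script already performs further down in this hunk, and it returns a status instead of calling `exit` so the caller decides how to react. `resource_exists` is a hypothetical helper name, and the sketch assumes `RHELMAJOR` is set by the calling script and that both pcs subcommands exit non-zero for an unknown resource, as the existing script relies on.

```shell
#!/usr/bin/env bash
# Sketch only: a variant of the suggested delete_resource() function.
# Assumes RHELMAJOR is set by the calling script, as in the diff.

# resource_exists RA
# Succeeds if pcs can print the configuration of resource RA.
resource_exists() {
    local ra="$1"
    if [ "${RHELMAJOR}" -lt 8 ]; then
        pcs resource show "${ra}" &> /dev/null
    else
        pcs resource config "${ra}" &> /dev/null
    fi
}

# delete_resource RA
# Returns 0 if RA was deleted or did not exist, 1 if deletion failed.
delete_resource() {
    local ra="$1"
    if ! resource_exists "${ra}"; then
        echo "Resource '${ra}' not found. Skipping deletion."
        return 0
    fi
    if pcs resource delete "${ra}"; then
        echo "Resource '${ra}' deleted successfully."
    else
        echo "Failed to delete resource '${ra}'." >&2
        return 1
    fi
}
```

In the loop this could be called as `delete_resource "${RA}" || exit 1`, which keeps the exit decision visible at the call site.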

if [ $? -eq 0 ]; then
pcs resource delete ${RA}
if [ $? -ne 0 ]; then
echo "Cannot delete resource agent ${RA} resource."
exit 1
fi
fi

if [ ${RHELMAJOR} -lt 8 ]; then
pcs resource show ${RA}
pcs resource show ${RA}
else
pcs resource config ${RA}
pcs resource config ${RA}
fi
if [ $? -eq 0 ]; then
echo "Got resource configuration for ${RA}."
echo "Removal of ${RA} failed."
exit 1
fi
done


# Make the cluster forget failed operations from history of the resource
# and re-detect its current state
pcs resource cleanup
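
The disable and cleanup steps in this hunk accept exit codes 0 and 1 through `[ $? -ne 0 -a $? -ne 1 ]`. That works because both `$?` expansions happen before `[` runs, but the intent is easy to misread, and POSIX marks the `-a` operator of `test` as obsolescent. A minimal sketch of a more explicit form follows; `run_allowing` is a hypothetical helper name that is not part of the PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): run a command and treat a
# chosen set of exit codes as success.

# run_allowing CODES COMMAND [ARGS...]
# Runs COMMAND and returns 0 if its exit status is one of the
# space-separated CODES; otherwise returns the command's own status.
run_allowing() {
    local allowed="$1" rc=0 code
    shift
    "$@" || rc=$?
    for code in ${allowed}; do
        if [ "${rc}" -eq "${code}" ]; then
            return 0
        fi
    done
    return "${rc}"
}

# Possible use in the cleanup loop (left commented out here):
# if ! run_allowing "0 1" pcs resource disable --wait=5 "${RA}"; then
#     echo "Cannot disable resource agent ${RA}."
#     exit 1
# fi
```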