-
Notifications
You must be signed in to change notification settings - Fork 22
Ha script #464
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Ha script #464
Conversation
checking for the existence of the lock file
/var/lib/rpm/.rpm.lock and waiting for it to be released
before proceeding
Reviewer's GuideThis PR bolsters the RHEL HA AWS check script by introducing retry logic for package installations to work around transient yum lock errors and hardens the resource cleanup workflow with improved exit-code handling, existence checks, and explanatory comments. Flow diagram for improved yum install retry logicflowchart TD
A["Start package install loop"] --> B["Attempt yum install ${HAPKG}"]
B --> C{Install succeeded?}
C -- Yes --> D["Set SUCCESS=1"]
C -- No --> E["Print attempt failed message"]
E --> F["Sleep 5 seconds"]
F --> G["Increment RETRY_COUNT"]
G --> B
D --> H{SUCCESS == 1?}
H -- Yes --> I["rpm -q ${HAPKG}"]
H -- No --> J["Print install failed after retries"]
J --> K["Exit 1"]
I --> L["Continue script"]
Flow diagram for improved resource deletion logicflowchart TD
A["For each resource agent"] --> B["Disable resource"]
B --> C{Exit code 0 or 1?}
C -- No --> D["Print error and exit"]
C -- Yes --> E["Cleanup resource"]
E --> F{Exit code 0 or 1?}
F -- No --> D
F -- Yes --> G["Check if resource exists (pcs resource config)"]
G --> H{Resource exists?}
H -- Yes --> I["Delete resource"]
I --> J{Delete succeeded?}
J -- No --> D
J -- Yes --> K["Show/config resource"]
H -- No --> K
K --> L{Resource still present?}
L -- Yes --> D["Print removal failed and exit"]
L -- No --> M["Continue"]
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes - here's some feedback:
- Avoid calling
$?multiple times in the same conditional—capture the exit code into a variable before testing to prevent mismatched results. - The final check after deletion prints “Removal of ${RA} failed.” on a successful
pcs resource config(i.e. the resource still exists), which is confusing; revise the conditional and message to clearly reflect a deletion failure. - Consider refactoring the yum retry loop into a reusable function to reduce duplication and improve readability.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Avoid calling `$?` multiple times in the same conditional—capture the exit code into a variable before testing to prevent mismatched results.
- The final check after deletion prints “Removal of ${RA} failed.” on a successful `pcs resource config` (i.e. the resource still exists), which is confusing; revise the conditional and message to clearly reflect a deletion failure.
- Consider refactoring the yum retry loop into a reusable function to reduce duplication and improve readability.
## Individual Comments
### Comment 1
<location> `scripts/rhel-ha-aws-check.sh:266-267` </location>
<code_context>
+ fi
+ done
+
+ if [ ${SUCCESS} -ne 1 ]; then
+ echo "Install of ${HAPKG} failed after multiple retries."
exit 1
else
rpm -q ${HAPKG}
</code_context>
<issue_to_address>
**suggestion:** Consider logging the specific error encountered after retries fail.
Including the last error message will make it easier to diagnose why the installation failed.
```suggestion
INSTALL_OUTPUT=$(yum -y install ${HAPKG} 2>&1)
if [ $? -eq 0 ]; then
SUCCESS=1
else
LAST_ERROR="${INSTALL_OUTPUT}"
echo "Attempt $((RETRY_COUNT+1)) to install ${HAPKG} failed. Retrying in 5 seconds..."
sleep 5
RETRY_COUNT=$((RETRY_COUNT+1))
fi
done
if [ ${SUCCESS} -ne 1 ]; then
echo "Install of ${HAPKG} failed after multiple retries."
echo "Last error encountered:"
echo "${LAST_ERROR}"
exit 1
else
rpm -q ${HAPKG}
exit 1
fi
```
</issue_to_address>
### Comment 2
<location> `scripts/rhel-ha-aws-check.sh:346` </location>
<code_context>
- exit 1
+ # Check if the resource exists before attempting to delete it
+ # This prevents the script from failing if the resource is already gone.
+ pcs resource config ${RA} &> /dev/null
+ if [ $? -eq 0 ]; then
+ pcs resource delete ${RA}
+ if [ $? -ne 0 ]; then
</code_context>
<issue_to_address>
**suggestion:** Suppressing all output may hide useful error information.
Suppressing stderr may hinder troubleshooting. Consider allowing error messages to be logged or only redirecting stdout.
```suggestion
pcs resource config ${RA} > /dev/null
```
</issue_to_address>
### Comment 3
<location> `scripts/rhel-ha-aws-check.sh:360-362` </location>
<code_context>
fi
if [ $? -eq 0 ]; then
- echo "Got resource configuration for ${RA}."
+ echo "Removal of ${RA} failed."
exit 1
fi
done
</code_context>
<issue_to_address>
**suggestion:** Message may be misleading if resource was never present.
Clarify the message to distinguish between a resource not found and a removal failure.
```suggestion
if [ $? -eq 0 ]; then
echo "Removal of ${RA} failed: resource still present."
exit 1
else
echo "Resource ${RA} not found. Nothing to remove."
exit 1
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| exit 1 | ||
| fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Consider logging the specific error encountered after retries fail.
Including the last error message will make it easier to diagnose why the installation failed.
| exit 1 | |
| fi | |
| INSTALL_OUTPUT=$(yum -y install ${HAPKG} 2>&1) | |
| if [ $? -eq 0 ]; then | |
| SUCCESS=1 | |
| else | |
| LAST_ERROR="${INSTALL_OUTPUT}" | |
| echo "Attempt $((RETRY_COUNT+1)) to install ${HAPKG} failed. Retrying in 5 seconds..." | |
| sleep 5 | |
| RETRY_COUNT=$((RETRY_COUNT+1)) | |
| fi | |
| done | |
| if [ ${SUCCESS} -ne 1 ]; then | |
| echo "Install of ${HAPKG} failed after multiple retries." | |
| echo "Last error encountered:" | |
| echo "${LAST_ERROR}" | |
| exit 1 | |
| else | |
| rpm -q ${HAPKG} | |
| exit 1 | |
| fi |
F-X64
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are too many quick fixes in place and I'm afraid we will make the code harder to maintain long term. Added a suggestion for using functions instead (that's just a quick example).
| exit 1 | ||
| # Check if the resource exists before attempting to delete it | ||
| # This prevents the script from failing if the resource is already gone. | ||
| pcs resource config ${RA} &> /dev/null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer if we moved forward reusable functions and more readable error logging.
A simple encapsulation into a delete resource function would increase readability and maintainability.
It would also avoid having to add explaining comments to our code which in itself is a bad code smell to me.
As an example:
| pcs resource config ${RA} &> /dev/null | |
| delete_resource() { | |
| local RA="$1" | |
| if pcs resource show "$RA" &> /dev/null; then | |
| if pcs resource delete "$RA"; then | |
| echo "Resource '$RA' deleted successfully." | |
| else | |
| echo "failed to delete resource '$RA'." | |
| exit 1 | |
| fi | |
| else | |
| echo "resource '$RA' not found. Skipping deletion." | |
| fi | |
| } | |
| delete_resource "${RA}" |
Transaction test succeeded.\nRunning transaction\n Preparing : 1/1 \n Erasing : bind-utils-32:9.16.23-1.el9_0.10.x86_64 ...e 0 -a 0 -ne 1 ']'\n+ pcs resource cleanup aws-vpc-move-ip\nCleaned up aws-vpc-move-ip on localhost\nConfiguration specifies 'aws-vpc-move-ip' should remain stopped\nWaiting for 1 reply from the controller\n... got reply (done)\n+ '[' 0 -ne 0 -a 0 -ne 1 ']'\n+ pcs resource delete aws-vpc-move-ip\nDeleting Resource - aws-vpc-move-ip\n+ '[' 0 -ne 0 ']'\n+ '[' 9 -lt 8 ']'\n+ pcs resource config aws-vpc-move-ip\nError: unable to find resource 'aws-vpc-move-ip'\n+ '[' 1 -eq 0 ']'\n+ for R in ${RAS}\n++ basename /usr/lib/ocf/resource.d/heartbeat/aws-vpc-route53\n+ RA=aws-vpc-route53\n+ pcs resource disable --wait=5 aws-vpc-route53\nWaiting for the cluster to apply configuration changes (timeout: 5 seconds)...\nresource 'aws-vpc-route53' is not running on any node\n+ '[' 0 -ne 0 -a 0 -ne 1 ']'\n+ pcs resource cleanup aws-vpc-route53\nCleaned up aws-vpc-route53 on localhost\nConfiguration specifies 'aws-vpc-route53' should remain stopped\nWaiting for 1 reply from the controller\n... got reply (done)\n+ '[' 0 -ne 0 -a 0 -ne 1 ']'\n+ pcs resource delete aws-vpc-route53\nError: unable to get cib\n+ '[' 1 -ne 0 ']'\n+ echo 'Cannot delete resource agent aws-vpc-route53 resource.'\n+ exit 1\n").rcyum --setopt=clean_requirements_on_remove=True -y remove bind-utils\n+ '[' 0 -ne 0 ']'\n+ yum -y install bind-utils\n+ '[' 0 -ne 0 ']'\n+ rpm -q bind-utils\n+ '[' 0 -ne 0 ']'\n+ for HAPKG in ${HAPKGS}\n+ rpm -q fence-agents-aws\n+ '[' 1 -eq 0 ']'\n+ yum -y install fence-agents-aws\nRPM: error: can't create transaction lock on /var/lib/rpm/.rpm.lock (Resource temporarily unavailable)\nError: Could not run transaction.\n+ '[' 1 -ne 0 ']'\n+ echo 'Install of fence-agents-aws failed.'\n+ exit 1\n").rcSummary by Sourcery
Improve error handling and idempotency in the RHEL HA AWS check script by adding installation retries, robust resource cleanup logic, and clearer messaging.
Bug Fixes:
Enhancements: