Skip to content

[5pt] WIP PR: Logging and Error handling upgrades#1731

Draft
RobHanna-NOAA wants to merge 36 commits intodevfrom
dev-logging
Draft

[5pt] WIP PR: Logging and Error handling upgrades#1731
RobHanna-NOAA wants to merge 36 commits intodevfrom
dev-logging

Conversation

@RobHanna-NOAA
Copy link
Copy Markdown
Contributor

@RobHanna-NOAA RobHanna-NOAA commented Jan 13, 2026

Items fixed or adjusted:

  • Logging
    • Considerable changes to .sh files and a few .py files to handle logging and error trapping as a result of a discovery that some critical error information can be either lost entirely or buried in various log files.
    • Error handling at the "processing" level. This problem has been around for years but not previously discovered. Almost all files in the pipeline chain of files bubble up to a few file such as fim_process_huc.sh, process_branch.sh and fim_post_processing.sh. Those files capture all logging info and errors for all child scripts called under it. However, errors in the "process" scripts can occur and have never been managed. A standard bash "trap" error handling system has been added to those key files to help manage bugs in those scripts. When running those scripts on an EC2, they were often seen, but in AWS, those were completely lost loosing critical information.
  • Some of the bash .sh scripts have had their critical page header lines adjusted based on usage of the page. Most have and need "#!/bin/bash -e" with a few are deliberately missing the "-e" portion. The "-e" flag mean stop on error found (but it will still be caught in "traps" error handling if it exists.
  • Some of the bash .sh scripts have specific needs for the final exit status returned such as fim_process_huc.sh must return exit code 0 unless a major catestrophic error. AWS requires this. Some other files like the new process_rerun_calibration_huc.sh are allowed to re-raise or forward true exit codes from it or it's child scripts. These are set on a case-by-case basis depending on usage in other tools, who the parents scripts are and also AWS Step function considerations.
  • Some files gained various commands relating to file cleanup from previous runs and updates to manage permissions to files and folders as they are created. Addition of the "umask" command at key places, allows for full permissions for child folders that are being created by the code which cascades to child files and folders. This was noticed primarily in OWP servers with file and folders having invalid permissions in output files/folders.
  • In the "rerun calibraiton" system, a critical new file was added named "process_rerun_calibration_huc.sh". It was injected between the pre-existing "calibrate_rating_curves.sh" file and the rerun_calibration.py file. This removed a few holes for error trapping and lost logging data. It also allows for better compatiability of the "calibrate_rating_curves.sh" file which is used in both the standard "pipeline" mode and the "rerun calibration" mode.
  • Some files have had various text or notes included.
  • Some files have had their file permissions changed. ie) was 755 to 777 for permissions on file that are used in local EC2 (or similar system), files. Mismatched file permissions which are remembered in the Git repo, can result in file copied down to local development machines that are in read-only mode for some users or user groups. This has been seen mostly in usage of VSCode and OWP servers.
  • Some files have had changes to re-raise exceptions to be caught by higher level parent scripts.
  • Some .sh files that were calling the mpiexec command had an invalid argument submitted based on a variable value that did not exist. The error did not appear to affect the usage of the tool, but removal of the "-n" flag allow for better processing usage for those commands.

Expected results for logging:

  • fim_pipeline.sh: Depending on line of failure, there may be zero hucs, no post-processing attempts, a missing all_errors.log file. etc

  • FIM_process_huc.sh

    • Any and all errors triggered with in the "tee" line executing run_by_huc.sh should be caught. We should always get at least one return_code array item such as codes of 0 for success, or others. Return codes of 60 and 61, we know the specific reason and are considered "true" errors like all other errors.
    • This will cover all exceptions and errors of all types from any child .sh script or child .py downstream such as run_huc.sh, run_branch.sh, calibrate_rating_curves.sh, etc. All of those files should bubble up errors.
    • Some .py handle their own logging as many of the src_calibration.py scripts too. This is acceptable and problems will be found and scanned via the error scanning in fim_process_huc.sh for logging and management. However, each child file of any type must decide on a case-by-case basis if it wants to just record the error and allow the branch or huc to continue processing. Some src_calibration files will need to be reviewed and changed in the future as some are not handling error and logging correctly allowing sometimes for some log data to be lost and some exceptions failing to be re-throw to higher code to stop the branch or huc. These will be fixed at a later time.
    • The critical "trap" command, which is for error handling on the fim_process_huc.sh itself, will catch all errors in lines or functions, below that line. There are places the code that turn the "trap" on and off, but it can not be on for the two lines where we catch the "tee" and put the pipeline value into a variable named "return_codes"
    • fim_pipeline.sh does not have or need a trap system or a log as it is only used in EC2s where screen output is sufficient.
    • various type of word or phrase scanning is done in fim_process_huc.sh and any errors while that scanning is done is expected to be caught and logged including details of where it failed in the fim_process_huc.sh code.
    • It now searches the phrase "parallel" which was discovered could exist as a warning yet was really an error.
    • Other log search phrases have also been added. These might results in duplication of data in all error log files inculding the run level file or all_errors.log and this is considered acceptable.
  • src/process_rerun_calibration_huc.sh is a new file that operates in the same basic functionality as process_huc.sh and process_branch.sh which have "tee" commands and error handling. However, process_rerun_calibration.py is not part of the fim_pipeline.sh chain and will never be called when in pipeline mode. That file is only used in re-run mode which starts with rerun_calibraitonn.py which calls only this new process_rerun_calibration.sh file which handles trapping, logging and errors for calibrate_rating_curves.sh which is used in both pipeline and re-run mode.

  • fim_post_processing.sh has been upgraded to add a "trap" system for error handling with code directly inside the fim_post_processing.sh file. It will continue to do aggregation of all "error" files from each HUC folder, and will do a scan of all branch log files in each HUC directly looking for invalid exit codes. All of the data in the /logs/branch_non_zero_exit_codes.log file. This is the same functionality as before and only helps show branch error code primarily aiming showing if a large number of branch special exit codes such as 62-64 exit. All info in this branch_non_zero_exit_codes.log continues to also exists in the all_errors.log file and is expected to be duplicate information.

  • A new change in this PR, is the removal of a previous folder named /branch_errors/ which had the same list of branch non zero exit codes, which is now moved into the /logs/ folder. Prior to this release, in the /branch_errors/ folder included a copy of the branch log that included an non zero file. Those files are no longer copied as they can easily be traced via the branch_non_zero_exit_codes.log and the all_errors.log files.

  • Use of the l_echo tool should reallly only be used by the three "process..sh" files, fim_pipeline or fim_post_processing. Use of that command by any other child .sh file can create log info that is very out of order due to the nature of the "tee" commands and store up echos and prints to write out at the end of that processing block.

Fixes to be done in future PR's:

  • Review all .py files in the fim_pipeline chain, specifically the "src_calibration" system to keep its file level logging files but update how those are done to ensure log data is not lost. Also review all .py files with the word "except" in them to decided if the exception shoudl be further re-raised to parent scripts in order to stop downstream code processing if applicable. Notes: As it, a few errors in those files can still slip through, but almost all are caught, and the LOE is larger than time permits for this PR.
  • For unknown reasons, some logs are adding a list of python files being used. Those erroneous lines will be removed later. ie). Pipfile Pipfile.lock aws bin. A temp "sed" command has been added to remove those lines from error files in the interium.

More info to be added here.

Additions

Changes

Removals


Testing

Generally, you do not copy this part into the ChangeLog. These are some quick notes on what you did test and/or notes for the reviewer to help with their review testing.


Deployment Plan (For FIM developers use)

  • Does the change impact inputs, docker or python packages?

    • Yes
    • No (f no.. skip the rest of the Deployment Plan section)
  • If you are not a FIM dev team member: Please let us know what you need and we can help with it.

  • If you are a FIM Dev team member:

    • Please work with the DevOps team and do not just go ahead and do it without some co-ordination.

    • Copy where you can, assign where you can not, and it is your responsibility to ensure it is done. Please ensure it is completed before the PR is merged.

    • Has new or updated python packages, PipFile, Pipefile.lock or Dockerfile changes? DevOps can help or take care of it if you want. Just need to know if it is required.

      • Yes
      • No
    • Require new or adjusted data inputs? Does it have a way to version (folder or file dates)?

      • No
      • Yes
        • Require new pre-clip set or any other data reloads, such as DEMS, osm, etc. ie.. pre-requisite re-data upstream of your input changes.
          • Yes
          • No
        • Has the inputs been copied/exist in all four enviros:
          • FIM EFS
          • FIM S3
          • ESIP
          • Dev1
  • Please use caution in removing older version unless it is at least two versions ago. Confirm with DevOps if cleanup might be involved.

  • If new or updated data sets, has the FIM code, including running fim_pipeline.sh, been updated and tested with the new/adjusted data? You can dev test against subsets if you like.

    • Yes

Notes to DevOps Team or others:

Please add any notes that are helpful for us to make sure it is all done correctly. Do not put actual server names or full true paths, just shortcut paths like 'efs..../inputs/, or 'dev1....inputs', etc.


Issuer Checklist (For developer use)

You may update this checklist before and/or after creating the PR. If you're unsure about any of them, please ask, we're here to help! These items are what we are going to look for before merging your code.

  • Informative and human-readable title, using the format: [_pt] PR: <description>
  • Links are provided if this PR resolves an issue, or depends on another other PR
  • If submitting a PR to the dev branch (the default branch), you have a descriptive Feature Branch name using the format: dev-<description-of-change> (e.g. dev-revise-levee-masking)
  • Changes are limited to a single goal (no scope creep)
  • The feature branch you're submitting as a PR is up to date (merged) with the latest dev branch
  • pre-commit hooks were run locally
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future todos are captured in comments
  • CHANGELOG updated with template version number, e.g. 4.x.x.x
  • Add yourself as an assignee in the PR as well as the FIM Technical Lead

Merge Checklist (For Technical Lead use only)

  • Update CHANGELOG with latest version number and merge date
  • Update the Citation.cff file to reflect the latest version number in the CHANGELOG
  • If applicable, update README with major alterations

@RobHanna-NOAA RobHanna-NOAA self-assigned this Jan 13, 2026
@RobHanna-NOAA RobHanna-NOAA added bug Something isn't working enhancement New feature or request High Priority labels Jan 13, 2026
@RobHanna-NOAA RobHanna-NOAA linked an issue Jan 15, 2026 that may be closed by this pull request
@mluck
Copy link
Copy Markdown
Contributor

mluck commented Feb 3, 2026

This is what was written in all_errors.log as a result of inserting exit 2 in the following files at the line indicated below. Note that there were some cases where all_errors.log was not created and others where all_errors.log was empty. Also, the errors recorded for process_branch.sh returned an exit code of 6 (supposed to be 2).

calibrate_rating_curves:44

Pipfile Pipfile.lock aws bin boot data dependencies dev entrypoint.sh etc fim-home fim_temp foss_fim home lib lib32 lib64 libx32 media mnt opt outputs proc root run sbin srv sys tmp usr var An Error has occurred - Exit Code is 2 Pipfile Pipfile.lock aws bin boot data dependencies dev entrypoint.sh etc fim-home fim_temp foss_fim home lib lib32 lib64 libx32 media mnt opt outputs proc root run sbin srv sys tmp usr var
/fim_temp/dev-logging-calibrate_rating_curves-44-2/03100204/logs/huc_03100204_unit.log:2386:Command exited with non-zero status 2

calibrate_rating_curves:149

Pipfile Pipfile.lock aws bin boot data dependencies dev entrypoint.sh etc fim-home fim_temp foss_fim home lib lib32 lib64 libx32 media mnt opt outputs proc root run sbin srv sys tmp usr var An Error has occurred - Exit Code is 2 Pipfile Pipfile.lock aws bin boot data dependencies dev entrypoint.sh etc fim-home fim_temp foss_fim home lib lib32 lib64 libx32 media mnt opt outputs proc root run sbin srv sys tmp usr var
/fim_temp/dev-logging-calibrate_rating_curves-149-2/03100204/logs/huc_03100204_unit.log:2510:Command exited with non-zero status 2

fim_pipeline:117
No all_errors.log

fim_pipeline:132
all_errors.log empty

fim_post_processing:42
No all_errors.log

fim_post_processing:137
all_errors.log empty

fim_process_huc:140
No all_errors.log

fim_process_huc:203
No all_errors.log

fim_process_huc:253
No all_errors.log

fim_process_huc:257
No all_errors.log

process_branch:20

Pipfile Pipfile.lock aws bin boot data dependencies dev entrypoint.sh etc fim-home fim_temp foss_fim home lib lib32 lib64 libx32 media mnt opt outputs proc root run sbin srv sys tmp usr var An Error has occurred - Exit Code is 6 Pipfile Pipfile.lock aws bin boot data dependencies dev entrypoint.sh etc fim-home fim_temp foss_fim home lib lib32 lib64 libx32 media mnt opt outputs proc root run sbin srv sys tmp usr var
/fim_temp/dev-logging-process_branch-20-2/03100204/logs/huc_03100204_unit.log:342:Command exited with non-zero status 6

process_branch:81

Pipfile Pipfile.lock aws bin boot data dependencies dev entrypoint.sh etc fim-home fim_temp foss_fim home lib lib32 lib64 libx32 media mnt opt outputs proc root run sbin srv sys tmp usr var An Error has occurred - Exit Code is 6 Pipfile Pipfile.lock aws bin boot data dependencies dev entrypoint.sh etc fim-home fim_temp foss_fim home lib lib32 lib64 libx32 media mnt opt outputs proc root run sbin srv sys tmp usr var
/fim_temp/dev-logging-process_branch-81-2/03100204/logs/huc_03100204_unit.log:2372:Command exited with non-zero status 6

run_by_branch:50

/fim_temp/dev-logging-run_by+branch-50-2/03100204/logs/huc_03100204_unit.log:346:Command exited with non-zero status 2
/fim_temp/dev-logging-run_by+branch-50-2/03100204/logs/huc_03100204_unit.log:377:Command exited with non-zero status 2
/fim_temp/dev-logging-run_by+branch-50-2/03100204/logs/huc_03100204_unit.log:408:Command exited with non-zero status 2
/fim_temp/dev-logging-run_by+branch-50-2/03100204/logs/huc_03100204_unit.log:439:Command exited with non-zero status 2
/fim_temp/dev-logging-run_by+branch-50-2/03100204/logs/huc_03100204_unit.log:470:Command exited with non-zero status 2
/fim_temp/dev-logging-run_by+branch-50-2/03100204/logs/huc_03100204_unit.log:501:Command exited with non-zero status 2

run_huc:309

Pipfile Pipfile.lock aws bin boot data dependencies dev entrypoint.sh etc fim-home fim_temp foss_fim home lib lib32 lib64 libx32 media mnt opt outputs proc root run sbin srv sys tmp usr var An Error has occurred - Exit Code is 2 Pipfile Pipfile.lock aws bin boot data dependencies dev entrypoint.sh etc fim-home fim_temp foss_fim home lib lib32 lib64 libx32 media mnt opt outputs proc root run sbin srv sys tmp usr var
/fim_temp/dev-logging-run_huc-309-2/03100204/logs/huc_03100204_unit.log:339:Command exited with non-zero status 2

run_huc:330

Pipfile Pipfile.lock aws bin boot data dependencies dev entrypoint.sh etc fim-home fim_temp foss_fim home lib lib32 lib64 libx32 media mnt opt outputs proc root run sbin srv sys tmp usr var An Error has occurred - Exit Code is 2 Pipfile Pipfile.lock aws bin boot data dependencies dev entrypoint.sh etc fim-home fim_temp foss_fim home lib lib32 lib64 libx32 media mnt opt outputs proc root run sbin srv sys tmp usr var
/fim_temp/dev-logging-run_huc-330-2/03100204/logs/huc_03100204_unit.log:2373:Command exited with non-zero status 2

run_huc:361

Pipfile Pipfile.lock aws bin boot data dependencies dev entrypoint.sh etc fim-home fim_temp foss_fim home lib lib32 lib64 libx32 media mnt opt outputs proc root run sbin srv sys tmp usr var An Error has occurred - Exit Code is 2 Pipfile Pipfile.lock aws bin boot data dependencies dev entrypoint.sh etc fim-home fim_temp foss_fim home lib lib32 lib64 libx32 media mnt opt outputs proc root run sbin srv sys tmp usr var
/fim_temp/dev-logging-run_huc-361-2/03100204/logs/huc_03100204_unit.log:2524:Command exited with non-zero status 2

@RobHanna-NOAA
Copy link
Copy Markdown
Contributor Author

huh..... I know about the whole "pipfile...." part and don't know how to fix that. I was going to ignore that part for now. I am surprised about some of your other problems. I do expect some duplication in errors but somethign seems weird here.

@RobHanna-NOAA
Copy link
Copy Markdown
Contributor Author

Maybe I checked in a last minute bug

@RobHanna-NOAA
Copy link
Copy Markdown
Contributor Author

dang.. I see my bug. I will get back to you. last minute bug added in my testing

@mluck mluck self-requested a review February 3, 2026 19:59
@RobHanna-NOAA
Copy link
Copy Markdown
Contributor Author

Update: I just talked to Ali.

  • Ali is going to work with Matt and look at the code to get his head around how it works. He is going to finish up src/process_rerun_calibration_huc.sh which might be done and have Matt help test that part. He will also finish the WIP rerun_calibration.py file that I did not finish. It is most done but not fully. Getting your head around how the "tee" commands work with StdERr, StdOut and exit codes is the key to all of these changes. That will help give the final tidbits needed to finish up rerun_calibration.py. Note: I was experimenting in there and did not finish, so some of those changes may no longer have value.
  • Matt will help get Ali up to speed. He has already started testing based on various fail combinations in the "pipeline" workflow. There are some bugs still in that part that I accidently checked in.
  • Rob will fix the bug Matt brought up in messages above, but may not be able to until tomorrow afternoon.

Getting these changes all in are critical for the BED. Well.. . clarification. The Pipeline parts have to be working by this Saturday's BED. If we get cornered on time, we can merge in the WIP changes from process_rerun_calibration_huc.sh and rerun_calibration.py and make a second branch / and PR after the BED to finish getting those working correctly.

@RobHanna-NOAA RobHanna-NOAA changed the title WIP PR: Logging upgrades WIP PR: Logging and Error handling upgrades Feb 3, 2026
@AliForghani-NOAA
Copy link
Copy Markdown
Collaborator

AliForghani-NOAA commented Feb 5, 2026

For rerun calibration code, I fixed three minor bugs. In addition, implemented three design improvements as shown in this commit:

1- Ensure error scanning runs on failure
File: src/process_rerun_calibration_huc.sh
Previously, when a calibration failed, the script exited before scanning logs for errors. The workflow has been updated so that scan_logs_for_errors is always executed before exit $code, ensuring error logs are populated even when the calibration terminates early.

2- Wait for background grep jobs to complete
File: src/process_rerun_calibration_huc.sh (around line 151)
Added a wait command at the end of the scan_logs_for_errors() function to ensure all background grep processes finish before the script exits. This prevents truncated or incomplete error logs.

3- Stream subprocess output in real time
Previously, subprocess.run(capture_output=True) buffered all stdout/stderr in memory and did not display output to the user; logs were only available in files written via tee. This has been replaced with subprocess.Popen(), streaming stdout line-by-line and forwarding it immediately to the screen_queue. As a result, users can now monitor calibration progress in real time while preserving full output for error logging and postmortem analysis.

@RobHanna-NOAA RobHanna-NOAA changed the title WIP PR: Logging and Error handling upgrades [5pt] WIP PR: Logging and Error handling upgrades Feb 5, 2026
@mluck
Copy link
Copy Markdown
Contributor

mluck commented Feb 6, 2026

Here's the result of re-running the previous test of adding exit 2 at the following lines in the following files. I realize that without the trap some of the missing log files may actually be correct.

No all_errors.log file exists for

fim_pipeline:117
fim_post_processing:47
fim_process_huc:154
fim_process_huc:258
fim_process_huc:272

all_errors.log file exists but is empty for

fim_pipeline:132
fim_post_processing:140

Incorrect status error for

/fim_temp/dev-logging-process_branch-21-2/03100204/logs/huc_03100204_unit.log:342:Command exited with non-zero status 6

/fim_temp/dev-logging-process_branch-82-2/03100204/logs/huc_03100204_unit.log:2372:Command exited with non-zero status 6

These appear to be correct all_errors.log:

/fim_temp/dev-logging-calibrate_rating_curves-44-2/03100204/logs/huc_03100204_unit.log:2386:Command exited with non-zero status 2

/fim_temp/dev-logging-calibrate_rating_curves-161-2/03100204/logs/huc_03100204_unit.log:2517:Command exited with non-zero status 2

/fim_temp/dev-logging-run_by_branch-50-2/03100204/logs/huc_03100204_unit.log:346:Command exited with non-zero status 2
/fim_temp/dev-logging-run_by_branch-50-2/03100204/logs/huc_03100204_unit.log:377:Command exited with non-zero status 2
/fim_temp/dev-logging-run_by_branch-50-2/03100204/logs/huc_03100204_unit.log:408:Command exited with non-zero status 2
/fim_temp/dev-logging-run_by_branch-50-2/03100204/logs/huc_03100204_unit.log:439:Command exited with non-zero status 2
/fim_temp/dev-logging-run_by_branch-50-2/03100204/logs/huc_03100204_unit.log:470:Command exited with non-zero status 2
/fim_temp/dev-logging-run_by_branch-50-2/03100204/logs/huc_03100204_unit.log:501:Command exited with non-zero status 2

/fim_temp/dev-logging-run_huc-309-2/03100204/logs/huc_03100204_unit.log:339:Command exited with non-zero status 2

/fim_temp/dev-logging-run_huc-362-2/03100204/logs/huc_03100204_unit.log:339:Command exited with non-zero status 2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working enhancement New feature or request High Priority

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[13 pts] Logging fixes and improvements

4 participants