Resume nightly testing of GDASApp #1313

RussTreadon-NOAA opened this issue Oct 8, 2024 · 3 comments · May be fixed by #1355

@RussTreadon-NOAA

Directory ci contains the following scripts:

driver.sh  gw_driver.sh  hera.sh  orion.sh  run_ci.sh  run_gw_ci.sh  stable_driver.sh

along with directory validation.

stable_driver.sh was previously run via cron (an example entry is sketched after this list) to

  • clone global-workflow develop
  • update jedi hashes in sorc/gdas.cd
  • build g-w with updated sorc/gdas.cd
  • run GDASApp ctests
  • if all tests passed, push the updated sorc/gdas.cd to the GDASApp feature/stable-nightly branch
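
The crontab itself is not recorded in this issue; a hypothetical entry, with placeholder paths, schedule, and log handling, would look something like

# hypothetical nightly crontab entry for stable_driver.sh
00 01 * * * /path/to/GDASApp/ci/stable_driver.sh >> /path/to/logs/stable_nightly.log 2>&1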

The nightly cron was turned off after several failures.

This issue is opened to document the work needed to resume nightly testing of GDASApp.

@RussTreadon-NOAA

Set up a working copy of the ci directory in my space on Hera, turned off mail to Cory and Guillaume, and executed stable_driver.sh. Everything ran fine up to

+ ctest -R gdasapp --output-on-failure

Some of the queued tests failed to run within 1500 seconds (the default ctest per-test timeout) of being submitted, and downstream dependent jobs then failed.

The following tests FAILED:
        1953 - test_gdasapp_WCDA-3DVAR-C48mx500_gdasstage_ic_202103241200 (Timeout)
        1954 - test_gdasapp_WCDA-3DVAR-C48mx500_gdasfcst_202103241200 (Timeout)
        1955 - test_gdasapp_WCDA-3DVAR-C48mx500_gdasprepoceanobs_202103241800 (Timeout)
        1956 - test_gdasapp_WCDA-3DVAR-C48mx500_gdasmarinebmat_202103241800 (Timeout)
        1957 - test_gdasapp_WCDA-3DVAR-C48mx500_gdasmarineanlinit_202103241800 (Failed)
        1958 - test_gdasapp_WCDA-3DVAR-C48mx500_gdasmarineanlvar_202103241800 (Failed)
        1959 - test_gdasapp_WCDA-3DVAR-C48mx500_gdasmarineanlchkpt_202103241800 (Failed)
        1960 - test_gdasapp_WCDA-3DVAR-C48mx500_gdasmarineanlfinal_202103241800 (Failed)
        1966 - test_gdasapp_atm_jjob_var_inc (Failed)
        1967 - test_gdasapp_atm_jjob_var_final (Failed)
        1973 - test_gdasapp_atm_jjob_ens_inc (Failed)
        1974 - test_gdasapp_atm_jjob_ens_final (Failed)

As a result, ctest returned a non-zero exit code and the working copy of develop with the updated JEDI hashes was not pushed to feature/stable-nightly.

We need a more robust mechanism for running the ctests. One option is to submit all the jobs to the debug queue. A potential problem is that Hera only allows a user two debug jobs in the queue at a time. Since stable_driver.sh runs the ctests sequentially this limit is normally not an issue, but it could be if the user were running other debug jobs at the same time.
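
Another option, assuming the 1500 second limit is simply the default ctest per-test timeout, is to lengthen the timeout and rerun tests that hit it. The values below are placeholders; --timeout and --repeat are standard ctest options (--repeat requires CMake 3.17 or newer).

# sketch: raise the per-test timeout and rerun tests that time out
ctest -R gdasapp --output-on-failure --timeout 3600 --repeat after-timeout:2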

As a test, WORKFLOW_BUILD=OFF was set prior to the build. ctest successfully ran the 24 non-workflow tests, and the following git commands in stable_driver.sh worked

++ cat log.ctest
++ grep 'tests passed'
+ npassed='100% tests passed, 0 tests failed out of 24'
+ '[' 0 -eq 0 ']'
+ echo 'Tests:                                 *SUCCESS*'
++ date
+ echo 'Tests: Completed at Fri Oct  4 02:11:29 UTC 2024'
+ echo 'Tests: 100% tests passed, 0 tests failed out of 24'
+ echo '```'
+ exit 0
+ ci_status=0
+ total=0
+ '[' 0 -eq 0 ']'
+ cd /scratch1/NCEPDEV/da/Russ.Treadon/CI/GDASApp/stable/20241004/global-workflow/sorc/gdas.cd
+ git stash
No local changes to save
+ total=0
+ '[' 0 -ne 0 ']'
+ git checkout feature/stable-nightly
warning: unable to rmdir 'sorc/bufr-query': Directory not empty
warning: unable to rmdir 'sorc/da-utils': Directory not empty
Switched to a new branch 'feature/stable-nightly'
M       parm/jcb-algorithms
M       parm/jcb-gdas
M       sorc/fv3-jedi
M       sorc/ioda
M       sorc/iodaconv
M       sorc/jcb
M       sorc/oops
M       sorc/saber
M       sorc/soca
M       sorc/ufo
M       sorc/vader
branch 'feature/stable-nightly' set up to track 'origin/feature/stable-nightly'.
+ total=0
+ '[' 0 -ne 0 ']'

The next git command, git merge develop, failed with

+ git merge develop
Note: Fast-forwarding submodule sorc/fv3-jedi to 731fcf4cbf541f37ac0531b2504fcc4108e1f6ee
Failed to merge submodule sorc/oops (commits don't follow merge-base)
CONFLICT (submodule): Merge conflict in sorc/oops
Recursive merging with submodules currently only supports trivial cases.
Please manually handle the merging of each conflicted submodule.
This can be accomplished with the following steps:
 - go to submodule (sorc/oops), and either merge commit e6485c0a
   or update to an existing commit which has merged those changes
 - come back to superproject and run:

      git add sorc/oops

   to record the above merge or update
 - resolve any other conflicts in the superproject
 - commit the resulting index in the superproject
Automatic merge failed; fix conflicts and then commit the result.
+ total=1
+ '[' 1 -ne 0 ']'
+ echo 'Unable to merge develop'
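
For reference, the manual resolution that git describes above amounts to something like the following (whether to merge e6485c0a or check out a commit that already contains it depends on which sorc/oops hash develop expects):

# resolve the sorc/oops submodule conflict by hand, per the git message above
cd sorc/oops
git merge e6485c0a        # or check out a commit that already includes these changes
cd ../..
git add sorc/oops         # record the resolved submodule pointer in the superproject
# resolve any remaining superproject conflicts, then commit the index
git commit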

@RussTreadon-NOAA

Reran ci/stable_driver.sh on Hera under role.jedipara following the merge of g-w PR #2978 into develop. As expected, several GDASApp ctest jobs failed

The following tests FAILED:
        1951 - test_gdasapp_fv3jedi_fv3inc (Failed)
        1963 - test_gdasapp_WCDA-3DVAR-C48mx500_gdas_marineanlvar_202103241800 (Failed)
        1964 - test_gdasapp_WCDA-3DVAR-C48mx500_gdas_marineanlchkpt_202103241800 (Failed)
        1965 - test_gdasapp_WCDA-3DVAR-C48mx500_gdas_marineanlfinal_202103241800 (Failed)
        1969 - test_gdasapp_atm_jjob_var_init (Failed)
        1970 - test_gdasapp_atm_jjob_var_run (Failed)
        1971 - test_gdasapp_atm_jjob_var_inc (Failed)
        1972 - test_gdasapp_atm_jjob_var_final (Failed)
        1974 - test_gdasapp_atm_jjob_ens_letkf (Failed)
        1976 - test_gdasapp_atm_jjob_ens_obs (Failed)
        1977 - test_gdasapp_atm_jjob_ens_sol (Failed)
        1978 - test_gdasapp_atm_jjob_ens_inc (Failed)
        1979 - test_gdasapp_atm_jjob_ens_final (Failed)
        1981 - test_gdasapp_bufr2ioda_insitu_profile_argo (Failed)
        1982 - test_gdasapp_bufr2ioda_insitu_profile_bathy (Failed)
        1983 - test_gdasapp_bufr2ioda_insitu_profile_glider (Failed)
        1984 - test_gdasapp_bufr2ioda_insitu_profile_tesac (Failed)
        1985 - test_gdasapp_bufr2ioda_insitu_profile_tropical (Failed)
        1986 - test_gdasapp_bufr2ioda_insitu_profile_xbtctd (Failed)
        1987 - test_gdasapp_bufr2ioda_insitu_surface_drifter (Failed)
        1988 - test_gdasapp_bufr2ioda_insitu_surface_trkob (Failed)

Below is a preliminary examination of the failures.

test_gdasapp_fv3jedi_fv3inc failed with an error that appears to be related to updated JEDI hashes bringing in changes from the Model Variable Renaming Sprint

fv3jedi_vc_model2geovals_mod.changevar unknown field: delp. Not in input field and no transform case specified.
fv3jedi_vc_model2geovals_mod.changevar unknown field: delp. Not in input field and no transform case specified.
fv3jedi_vc_model2geovals_mod.changevar unknown field: delp. Not in input field and no transform case specified.
fv3jedi_vc_model2geovals_mod.changevar unknown field: delp. Not in input field and no transform case specified.
fv3jedi_vc_model2geovals_mod.changevar unknown field: delp. Not in input field and no transform case specified.
Abort(1) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
Abort(1) on node 2 (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
Abort(1) on node 3 (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
Abort(1) on node 4 (rank 4 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 4
Abort(1) on node 5 (rank 5 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
fv3jedi_vc_model2geovals_mod.changevar unknown field: delp. Not in input field and no transform case specified.
slurmstepd: error: *** STEP 1720377.0 ON h22c26 CANCELLED AT 2024-10-29T13:50:50 ***

test_gdasapp_WCDA-3DVAR-C48mx500_gdas_marineanlvar_202103241800 failed with an error that appears to be related to updated JEDI hashes bringing in changes from the Model Variable Renaming Sprint

 0: OOPS_STATS IncrementalAssimilation iteration 0      - Runtime:    11.16 sec,  Local Memory:   338.18 Mb
 0: Unable to find field metadata for: sea_surface_height_above_geoid
 0: OOPS Ending   2024-10-29 14:15:32 (UTC+0000)
 2: Unable to find field metadata for: sea_surface_height_above_geoid
 4: Unable to find field metadata for: sea_surface_height_above_geoid
 6: Unable to find field metadata for: sea_surface_height_above_geoid
 8: Unable to find field metadata for: sea_surface_height_above_geoid
10: Unable to find field metadata for: sea_surface_height_above_geoid
12: Unable to find field metadata for: sea_surface_height_above_geoid
14: Unable to find field metadata for: sea_surface_height_above_geoid
 0: Abort(1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
 2: Abort(1) on node 2 (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
 4: Abort(1) on node 4 (rank 4 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 4
 6: Abort(1) on node 6 (rank 6 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 6
 8: Abort(1) on node 8 (rank 8 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 8

test_gdasapp_atm_jjob_var_init failed due to inconsistent variables between g-w and GDASApp. GDASApp test/atm/global-workflow/config.yaml uses JCB_ALGO_YAML_VAR but g-w config.atmanl still uses JCB_ALGO_YAML.

    jcb_algo_config = parse_j2yaml(task_config.JCB_ALGO_YAML, task_config)
  File "/scratch1/NCEPDEV/da/role.jedipara/CI/GDASApp/stable/20241029/global-workflow/ush/python/wxflow/yaml_file.py", line 183, in parse_j2yaml
    raise FileNotFoundError(f"Input j2yaml file {path} does not exist!")
FileNotFoundError: Input j2yaml file @JCB_ALGO_YAML@ does not exist!
+ slurm_script[1]: postamble slurm_script 1730211510 1
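
One way to reconcile the two would be to rename the variable on the g-w side so that config.atmanl and the GDASApp test configuration agree. The sketch below is illustrative only; the file path and templated value are assumptions based on the error message above.

# hypothetical rename in g-w parm/config/gfs/config.atmanl
-export JCB_ALGO_YAML=@JCB_ALGO_YAML@
+export JCB_ALGO_YAML_VAR=@JCB_ALGO_YAML_VAR@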

test_gdasapp_atm_jjob_ens_letkf and test_gdasapp_atm_jjob_ens_obs failed due to inconsistencies introduced by the updated JEDI hashes, which include changes from the Model Variable Renaming Sprint

5: ABORT: FieldMetadata::getLongNameFromAnyName: Searching for a field called skin_temperature_at_surface_where_sea in the long, short and io names but not found anywhere.
5:        in file '/scratch1/NCEPDEV/da/role.jedipara/CI/GDASApp/stable/20241029/global-workflow/sorc/gdas.cd/bundle/fv3-jedi/src/fv3jedi/FieldMetadata/FieldsMetadata.cc', line 142
4: ABORT: FieldMetadata::getLongNameFromAnyName: Searching for a field called skin_temperature_at_surface_where_sea in the long, short and io names but not found anywhere.
4:        in file '/scratch1/NCEPDEV/da/role.jedipara/CI/GDASApp/stable/20241029/global-workflow/sorc/gdas.cd/bundle/fv3-jedi/src/fv3jedi/FieldMetadata/FieldsMetadata.cc', line 142
1: ABORT: FieldMetadata::getLongNameFromAnyName: Searching for a field called skin_temperature_at_surface_where_sea in the long, short and io names but not found anywhere.
1:        in file '/scratch1/NCEPDEV/da/role.jedipara/CI/GDASApp/stable/20241029/global-workflow/sorc/gdas.cd/bundle/fv3-jedi/src/fv3jedi/FieldMetadata/FieldsMetadata.cc', line 142
2: Abort(1) on node 2 (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
3: Abort(1) on node 3 (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
5: Abort(1) on node 5 (rank 5 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
1: Abort(1) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1

All of the test_gdasapp_bufr2ioda_insitu_* jobs failed with

  File "/scratch1/NCEPDEV/da/role.jedipara/CI/GDASApp/stable/20241029/global-workflow/sorc/gdas.cd/ush/ioda/bufr2ioda/marine/b2i/b2iconverter/bufr2ioda_converter.py", line 7, in <module>
    from pyiodaconv import bufr
ModuleNotFoundError: No module named 'pyiodaconv'
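
pyiodaconv is the Python package provided by iodaconv, so this looks like a PYTHONPATH problem in the test environment rather than a converter bug. A possible workaround is sketched below; the install location is an assumption about where the build places the package.

# hypothetical fix: expose the iodaconv python package to the bufr2ioda tests
# (the lib/python3.10 path is an assumption; adjust to the actual gdas.cd build/install tree)
export PYTHONPATH=/path/to/global-workflow/sorc/gdas.cd/build/lib/python3.10:${PYTHONPATH}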

@RussTreadon-NOAA

Update: the 10/29/2024 GFS v18 JEDI Transition tag-up outlined a strategy for moving forward on this issue.

Created branch feature/resume_nightly for development pertaining to this issue.
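
For the record, creating and publishing the branch amounts to the following (the remote name is assumed to be origin on the GDASApp repository):

git checkout -b feature/resume_nightly
git push -u origin feature/resume_nightly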
