@TonyB9000
Collaborator

Summary

Unifies the non-blocking zstash behavior across the "create" and "update" operations.

Addresses issue #361.

@TonyB9000 TonyB9000 requested a review from forsyth2 February 21, 2025 16:10
Collaborator

@forsyth2 forsyth2 left a comment

@TonyB9000 I left some initial review comments. I want to spend more time studying the code to understand how everything gets called/passed around though.

zstash/create.py Outdated
  # Transfer to HPSS. Always keep a local copy.
  logger.debug(f"{ts_utc()}: calling hpss_put() for {get_db_filename(cache)}")
- hpss_put(hpss, get_db_filename(cache), cache, keep=True)
+ hpss_put(hpss, get_db_filename(cache), cache, keep=args.keep)
Collaborator

This is specifically for archiving the database. I think we do want to always keep that, no?

Collaborator Author

Yes, I agree. That was a mistake. (But it always seems to remain in any case - a mystery)

zstash/create.py Outdated
  # (zstash create)
  args: argparse.Namespace = parser.parse_args(sys.argv[2:])
- if args.hpss and args.hpss.lower() == "none":
+ if not args.hpss or args.hpss.lower() == "none":
Collaborator

Parentheses just for clarity: if (not args.hpss) or (args.hpss.lower() == "none"):

| args.hpss | args.hpss.lower() == "none" | args.non_blocking | original behavior | new behavior | change |
| --- | --- | --- | --- | --- | --- |
| T | T | T | args.hpss = "none", args.keep = True | args.hpss = "none", args.keep = True | N/A |
| T | T | F | args.hpss = "none" | args.hpss = "none", args.keep = True | Sets args.keep = True |
| T | F | T | args.keep = True | Nothing | No longer sets args.keep = True |
| T | F | F | Nothing | Nothing | N/A |
| F | N/A | T | args.keep = True | args.hpss = "none", args.keep = True | Sets args.hpss = "none" |
| F | N/A | F | Nothing | args.hpss = "none", args.keep = True | Sets args.hpss = "none", args.keep = True |

Can you confirm these are the expected changes in behavior?

Collaborator Author

How did you arrive at the first two rows? Nothing in that code involves the status of "non-blocking".

Correct me if I'm wrong, but testing "if args.hpss" would only fail if the user included no "hpss" argument on the command line. That should be the same as "hpss=none" (unless some hidden config sets it elsewhere - I did not consider that).

In any case (to my knowledge), the only time we intend to FORCE "keep" is when hpss=none. According to the "help" text, nothing that "non-blocking" does (True or False) affects "keep".

Thus, rows 3 and 4 should not be seeing "keep = True" if the user did not specify keep.

Collaborator

I was looking at the combined behavior of

    if args.hpss and args.hpss.lower() == "none":
        args.hpss = "none"
    if args.non_blocking:
        args.keep = True

becoming

    if not args.hpss or args.hpss.lower() == "none":
        args.hpss = "none"
        args.keep = True
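
The table can be cross-checked mechanically by running both versions over each input combination. The sketch below is illustrative only: `old_logic`, `new_logic`, and `compare` are hypothetical wrappers around the two snippets above, `SimpleNamespace` stands in for the parsed `argparse.Namespace`, and `args.keep` is assumed to start as `False`.

```python
from itertools import product
from types import SimpleNamespace


def old_logic(args):
    # Behavior before this PR: "none" is normalized, and non-blocking forces keep.
    if args.hpss and args.hpss.lower() == "none":
        args.hpss = "none"
    if args.non_blocking:
        args.keep = True
    return args


def new_logic(args):
    # Behavior on this branch: a missing or "none" hpss forces keep.
    if (not args.hpss) or (args.hpss.lower() == "none"):
        args.hpss = "none"
        args.keep = True
    return args


def compare(hpss, non_blocking, keep=False):
    # Return ((hpss, keep) under the old logic, (hpss, keep) under the new logic).
    before = old_logic(SimpleNamespace(hpss=hpss, non_blocking=non_blocking, keep=keep))
    after = new_logic(SimpleNamespace(hpss=hpss, non_blocking=non_blocking, keep=keep))
    return (before.hpss, before.keep), (after.hpss, after.keep)


if __name__ == "__main__":
    # One row per combination in the table: "none", a real path, and a missing value.
    for hpss, non_blocking in product(["NONE", "/some/hpss/path", None], [True, False]):
        old, new = compare(hpss, non_blocking)
        print(f"hpss={hpss!r} non_blocking={non_blocking}: old={old} new={new}")
```

Any row where the two tuples differ corresponds to an entry in the "change" column.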

Collaborator

only fail if the user included no "hpss" argument on the command line.

Correct, and I don't think that is possible because we set it as required:

    required.add_argument(
        "--hpss",
        type=str,
        help=(
            'path to storage on HPSS. Set to "none" for local archiving. It also can be a Globus URL, '
            'globus://<GLOBUS_ENDPOINT_UUID>/<PATH>. Names "alcf" and "nersc" are recognized as referring to the ALCF HPSS '
            "and NERSC HPSS endpoints, e.g. globus://nersc/~/my_archive."
        ),
        required=True,
    )
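
A minimal sketch of that claim, using only the standard library: a parser that declares `--hpss` with `required=True` exits before any later logic runs when the flag is omitted. `build_parser` and `parse_hpss` are hypothetical helpers, not zstash functions, and the help text is abbreviated.

```python
import argparse


def build_parser():
    # Mirrors the shape of the snippet above; only --hpss is declared here.
    parser = argparse.ArgumentParser(prog="zstash create")
    required = parser.add_argument_group("required named arguments")
    required.add_argument("--hpss", type=str, required=True, help="path to storage on HPSS")
    return parser


def parse_hpss(argv):
    # argparse reports a missing required option by raising SystemExit.
    try:
        return build_parser().parse_args(argv).hpss
    except SystemExit:
        return "<rejected>"
```

With this setup, `parse_hpss([])` is rejected while `parse_hpss(["--hpss", "none"])` returns the string unchanged, so the `not args.hpss` branch acts mainly as a defensive guard.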

Thus, rows 3 and 4 should not be seeing "keep = True" if the user did not specify keep.

Ok, that makes sense.

zstash/globus.py Outdated
return True
return False

gv_push = 0
Collaborator

Why gv_push? A more descriptive name might be better. Maybe tar_file_count?

Collaborator Author

True, but it was just a way for me to track things. We could change it.

I wanted a variable to track "actual transfer submitted" (pushed), as opposed to just submitted to our globus_transfer() function, which may just add it to a pending transfer and return. I'll make it "gv_tarfiles_pushed".

  )
  transfer_data.add_item(src_path, dst_path)
- transfer_data["label"] = subdir_label + " " + filename
+ transfer_data["label"] = label
Collaborator

Note to self: label is defined to be exactly the same thing above already.

Collaborator Author

Right.

    for src_path in prev_transfers:
        os.remove(src_path)
    prev_transfers = curr_transfers
    curr_transfers = list()
Collaborator

You can just use = [] instead of = list().

Collaborator Author

I used to do that - but was cautioned against it (don't recall why). I'd be happy either way.

Collaborator

Hmm interesting, I wonder why. = [] definitely seems more "pythonic" to me, as is echoed on https://stackoverflow.com/questions/5790860/whats-the-difference-between-and-vs-list-and-dict.

zstash/update.py Outdated
  args: argparse.Namespace = parser.parse_args(sys.argv[2:])
- if args.hpss and args.hpss.lower() == "none":

+ if not args.hpss or args.hpss.lower() == "none":
Collaborator

Parentheses, as in create, would be good: if (not args.hpss) or (args.hpss.lower() == "none"):

Collaborator Author

True. I was relying upon the default precedence ("not" binds more tightly than "or"), and also upon short-circuiting, where testing (A or B) never tests B when A is true, as it is unnecessary (useful when testing B might cause an exception).

I added the parentheses.

Collaborator

Also to the shortcut-pass where testing (A or B) never tests B when A is true, as it is unnecessary (useful when testing B might cause an exception).

Yes, the parentheses are only for human readers. They shouldn't affect the code at all.
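
The short-circuit point can be illustrated with a small sketch. The function names are hypothetical; the condition is the one under review.

```python
def is_local_archive(hpss):
    # (not hpss) is evaluated first; hpss.lower() only runs when hpss is truthy.
    return (not hpss) or (hpss.lower() == "none")


def is_local_archive_unguarded(hpss):
    # Without the guard, a None value has no .lower() and raises AttributeError.
    return hpss.lower() == "none"
```

Calling `is_local_archive(None)` returns `True` without ever touching `.lower()`, whereas `is_local_archive_unguarded(None)` raises.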

zstash/update.py Outdated

    if not args.hpss or args.hpss.lower() == "none":
        args.hpss = "none"
        args.keep - True
Collaborator

= True

Collaborator Author

Ah! That will make a difference! :) Good catch!
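
For context on why this typo does not fail loudly: `args.keep - True` is a valid Python expression statement, so it runs without error and simply discards its result. A minimal sketch, with `SimpleNamespace` standing in for the parsed arguments:

```python
from types import SimpleNamespace

args = SimpleNamespace(keep=False)

# The typo: a subtraction whose result (-1) is computed and thrown away.
args.keep - True
keep_after_typo = args.keep

# The fix: an assignment.
args.keep = True
keep_after_fix = args.keep
```

Here `keep_after_typo` is still `False`, which is why the flag was never forced on.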

@TonyB9000
Collaborator Author

@forsyth2 Allow me to make some changes to address the clear mistakes above. Should take just a moment.

@forsyth2
Collaborator

forsyth2 commented Mar 4, 2025

Allow me to make some changes to address the clear mistakes above. Should take just a moment.

@TonyB9000 Can you push those changes?

I've also reviewed the code logic; this looks good to me, aside from the already suggested changes.

Following the logic of the lists of transferred tars

hpss_utils.add_files -> hpss.hpss_put -> hpss.hpss_transfer:

        if transfer_type == "put":
            if not keep:
                if (scheme != "globus") or (
                    globus_status == "SUCCEEDED"
                ):
                    # Note: This is intended to fulfill the default removal of successfully-transfered
                    # tar files when keep=False, irrespective of non-blocking status
                    logger.info(f"{ts_utc()}: DEBUG: deleting transfered files {prev_transfers}")
                    for src_path in prev_transfers:
                        os.remove(src_path)
                    prev_transfers = curr_transfers
                    curr_transfers = list()

Globus succeeded. We don't have to worry about these tars anymore; they've been transferred.
Delete them and reset the lists.

Earlier in hpss.hpss_transfer, we saw:

curr_transfers.append(file_path)

which is how curr_transfers builds up the list of tars currently being transferred.

Following the logic of `gv_push`

In globus.globus_transfer:

        # DEBUG: review accumulated items in TransferData
        logger.info(f"{ts_utc()}: TransferData: accumulated items:")
        attribs = transfer_data.__dict__
        for item in attribs["data"]["DATA"]:
            if item["DATA_TYPE"] == "transfer_item":
                gv_push += 1
                print(f"   (routine)  PUSHING (#{gv_push}) STORED source item: {item['source_path']}", flush=True)

Increment for every transfer_item we encounter.

In globus.globus_finalize:

    if transfer_data:
        # DEBUG: review accumulated items in TransferData
        logger.info(f"{ts_utc()}: FINAL TransferData: accumulated items:")
        attribs = transfer_data.__dict__
        for item in attribs["data"]["DATA"]:
            if item["DATA_TYPE"] == "transfer_item":
                gv_push += 1
                print(f"    (finalize) PUSHING ({gv_push}) source item: {item['source_path']}", flush=True)

        # SUBMIT new transfer here
        logger.info(f"{ts_utc()}: DIVING: Submit Transfer for {transfer_data['label']}")

Again, increment for every transfer_item we encounter.

gv_push is only ever incremented, never reset to 0. From Tony:

I wanted a variable to track "actual transfer submitted" (pushed), as opposed to just submitted to our globus_transfer() function, which may just add it to a pending transfer and return.

So, gv_push simply counts the number of transfer_items encountered throughout the entire run.

@forsyth2
Collaborator

forsyth2 commented Mar 4, 2025

We'll also need to fix the pre-commit check before merging.

@forsyth2
Collaborator

forsyth2 commented Mar 6, 2025

@TonyB9000 Can you please push those changes you mentioned? I can add a commit fixing the pre-commit checks. I'm hoping to merge this today, so I can make a new zstash release candidate. Thanks!

@TonyB9000
Collaborator Author

@forsyth2 I will get that done within the next hour. I've finally gotten "zstash check" to behave as expected. I made a small change to the "polling" frequency in the block/wait, so it does not fill the log with hundreds of announcements.

A low-disk condition was a factor in earlier failures. We should employ df-check logic to avoid unexpected out-of-disk-space conditions.
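
One possible shape for such a check, sketched with the standard library's `shutil.disk_usage`. The helper name `has_free_space` and the `safety_factor` margin are assumptions for illustration, not existing zstash code.

```python
import shutil


def has_free_space(path, required_bytes, safety_factor=1.1):
    # Compare the free bytes on the filesystem holding `path` against the
    # space we expect to need, padded by a safety margin.
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes * safety_factor
```

A caller could run this against the cache directory before writing each tar file and stop early with a clear message instead of failing mid-write.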

@forsyth2
Collaborator

forsyth2 commented Mar 6, 2025

@TonyB9000 Ok sounds good. Once you push that commit, I'll review the changes and push a commit to fix any pre-commit errors, and then make a zstash RC so @golaz can test in the next Unified RC.

@TonyB9000
Collaborator Author

@forsyth2 When I push my changes, I get the option:

The upstream branch of your current branch does not match
the name of your current branch.  To push to the upstream branch
on the remote, use

    git push origin HEAD:non-block-testing

To push to the branch of the same name on the remote, use

    git push origin HEAD

I thought I had pushed previously, but may have chosen the wrong option (so you did not see the changes?)

Which should I use? My local branch is named "non-block-testing-fix", but the remote is apparently "non-block-testing".

@forsyth2
Collaborator

forsyth2 commented Mar 6, 2025

The remote is named non-block-testing-fix too (at the top of this PR page). Try git push origin non-block-testing-fix

@TonyB9000
Collaborator Author

OK, that seemed to work.

@forsyth2
Collaborator

forsyth2 commented Mar 6, 2025

I added 734ea5c. I'm getting a couple errors on the unit tests though:

======================================================================
FAIL: testUpdateCacheHPSS (tests.test_update.TestUpdate)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/global/u1/f/forsyth/ez/zstash/tests/test_update.py", line 170, in testUpdateCacheHPSS
    self.helperUpdateCache("testUpdateCacheHPSS", HPSS_ARCHIVE)
  File "/global/u1/f/forsyth/ez/zstash/tests/test_update.py", line 142, in helperUpdateCache
    self.stop(error_message)
  File "/global/u1/f/forsyth/ez/zstash/tests/base.py", line 143, in stop
    self.fail(error_message)
AssertionError: The zstash cache does not contain expected files.
It has: ['index.db', '000001.tar', '000000.tar', '000002.tar', '000003.tar', '000004.tar']

======================================================================
FAIL: testUpdateKeepHPSS (tests.test_update.TestUpdate)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/global/u1/f/forsyth/ez/zstash/tests/test_update.py", line 163, in testUpdateKeepHPSS
    self.helperUpdateKeep("testUpdateKeepHPSS", HPSS_ARCHIVE)
  File "/global/u1/f/forsyth/ez/zstash/tests/test_update.py", line 112, in helperUpdateKeep
    self.stop(error_message)
  File "/global/u1/f/forsyth/ez/zstash/tests/base.py", line 143, in stop
    self.fail(error_message)
AssertionError: The zstash cache does not contain expected files.
It has: ['index.db', '000001.tar', '000000.tar', '000002.tar', '000003.tar', '000004.tar']

----------------------------------------------------------------------
Ran 8 tests in 33.143s

FAILED (failures=2)

@forsyth2
Collaborator

forsyth2 commented Mar 6, 2025

These errors don't appear on main.

pip install . && python -m unittest tests/test_*.py

only gives:

======================================================================
FAIL: testLs (tests.test_globus.TestGlobus)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/global/u1/f/forsyth/ez/zstash/tests/test_globus.py", line 182, in testLs
    self.helperLsGlobus(
  File "/global/u1/f/forsyth/ez/zstash/tests/test_globus.py", line 169, in helperLsGlobus
    self.create(use_hpss, zstash_path, cache=self.cache)
  File "/global/u1/f/forsyth/ez/zstash/tests/base.py", line 300, in create
    self.check_strings(cmd, output + err, expected_present, expected_absent)
  File "/global/u1/f/forsyth/ez/zstash/tests/base.py", line 187, in check_strings
    self.stop(error_message)
  File "/global/u1/f/forsyth/ez/zstash/tests/base.py", line 143, in stop
    self.fail(error_message)
AssertionError: Command=`zstash create --cache=zstash --hpss=globus://6c54cade-bde5-45c1-bdea-f4bd71dba2cc/~/zstash_test/ zstash_test`. Errors=['This was not supposed to be found, but was: ERROR.']

----------------------------------------------------------------------
Ran 69 tests in 415.188s

FAILED (failures=1)

I don't understand how that happened; main was passing the tests when I made zstash v1.4.4rc1.

Actually this error appears on this branch (non-block-testing-fix) too if I run all the tests:

======================================================================
FAIL: testLs (tests.test_globus.TestGlobus)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/global/u1/f/forsyth/ez/zstash/tests/test_globus.py", line 182, in testLs
    self.helperLsGlobus(
  File "/global/u1/f/forsyth/ez/zstash/tests/test_globus.py", line 169, in helperLsGlobus
    self.create(use_hpss, zstash_path, cache=self.cache)
  File "/global/u1/f/forsyth/ez/zstash/tests/base.py", line 300, in create
    self.check_strings(cmd, output + err, expected_present, expected_absent)
  File "/global/u1/f/forsyth/ez/zstash/tests/base.py", line 187, in check_strings
    self.stop(error_message)
  File "/global/u1/f/forsyth/ez/zstash/tests/base.py", line 143, in stop
    self.fail(error_message)
AssertionError: Command=`zstash create --cache=zstash --hpss=globus://6c54cade-bde5-45c1-bdea-f4bd71dba2cc/~/zstash_test/ zstash_test`. Errors=['This was not supposed to be found, but was: ERROR.']

======================================================================
FAIL: testUpdateCacheHPSS (tests.test_update.TestUpdate)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/global/u1/f/forsyth/ez/zstash/tests/test_update.py", line 170, in testUpdateCacheHPSS
    self.helperUpdateCache("testUpdateCacheHPSS", HPSS_ARCHIVE)
  File "/global/u1/f/forsyth/ez/zstash/tests/test_update.py", line 142, in helperUpdateCache
    self.stop(error_message)
  File "/global/u1/f/forsyth/ez/zstash/tests/base.py", line 143, in stop
    self.fail(error_message)
AssertionError: The zstash cache does not contain expected files.
It has: ['index.db', '000001.tar', '000000.tar', '000002.tar', '000003.tar', '000004.tar']

======================================================================
FAIL: testUpdateKeepHPSS (tests.test_update.TestUpdate)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/global/u1/f/forsyth/ez/zstash/tests/test_update.py", line 163, in testUpdateKeepHPSS
    self.helperUpdateKeep("testUpdateKeepHPSS", HPSS_ARCHIVE)
  File "/global/u1/f/forsyth/ez/zstash/tests/test_update.py", line 112, in helperUpdateKeep
    self.stop(error_message)
  File "/global/u1/f/forsyth/ez/zstash/tests/base.py", line 143, in stop
    self.fail(error_message)
AssertionError: The zstash cache does not contain expected files.
It has: ['index.db', '000001.tar', '000000.tar', '000002.tar', '000003.tar', '000004.tar']

----------------------------------------------------------------------
Ran 69 tests in 390.687s

FAILED (failures=3)

Ok, I'm going to try to debug this and maybe add some more testing (per #367). We can't make a new zstash RC at the moment.

@TonyB9000
Collaborator Author

TonyB9000 commented Mar 6, 2025

@forsyth2 I have never run into that error, but I only tested "update" as follows:

zstash create --hpss <the_remote_path> FIRST-set-of-files

(wiped out all local FIRST-set-of-files AND index.db)

zstash update --hpss <the_remote_path> SECOND-set-of-files

(and verified that "remote" contains ALL the files applied)

@TonyB9000
Collaborator Author

@forsyth2 There was no overlap between FIRST and SECOND set of files, nor did I try to use the same file(name) with altered content. I was focused only upon the "non-blocking" behavior.

@forsyth2
Collaborator

forsyth2 commented Mar 7, 2025

I've been trying to play around with this, with no real success so far. A couple things:

  1. The Globus test worked on main after I re-authenticated my NERSC endpoint, but it just hangs on this branch. Indeed, if I do the toy problem setup from "How do I run the Globus unit test?" (#329), it works on main, but hangs on this branch.
  2. The update tests pass on main, but not on this branch:

| Test name | Actual files in the cache | Expected files in the cache |
| --- | --- | --- |
| testUpdateCacheHPSS | ['index.db', '000001.tar', '000000.tar', '000002.tar', '000003.tar', '000004.tar'] | ["index.db"] |
| testUpdateKeepHPSS | ['index.db', '000001.tar', '000000.tar', '000002.tar', '000003.tar', '000004.tar'] | ["index.db", "000003.tar", "000004.tar", "000001.tar", "000002.tar"] |

So, the cache test is keeping files unnecessarily and the keep test is keeping 000000.tar unnecessarily (or the expected results need to be updated). (The differing order is fine; the compare function ignores order).
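
To make the mismatch explicit, an order-insensitive comparison can report which files are unexpected and which are missing. This is a sketch only: `compare_cache` is a hypothetical helper, not the compare function used in the zstash tests, and the lists are copied from the results above.

```python
from collections import Counter


def compare_cache(actual, expected):
    # Multiset difference in both directions, so ordering never matters.
    actual_count, expected_count = Counter(actual), Counter(expected)
    unexpected = sorted((actual_count - expected_count).elements())
    missing = sorted((expected_count - actual_count).elements())
    return unexpected, missing


ACTUAL = ["index.db", "000001.tar", "000000.tar", "000002.tar", "000003.tar", "000004.tar"]
EXPECTED_CACHE = ["index.db"]
EXPECTED_KEEP = ["index.db", "000003.tar", "000004.tar", "000001.tar", "000002.tar"]
```

For the keep test, `compare_cache(ACTUAL, EXPECTED_KEEP)` isolates `000000.tar` as the only unexpected file; for the cache test, all five tars are unexpected.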

@forsyth2
Collaborator

forsyth2 commented Mar 7, 2025

@TonyB9000 For the cache test, I notice

        if transfer_type == "put":
            if not keep:
                if (scheme != "globus") or (
                    globus_status == "SUCCEEDED" and not non_blocking
                ):
                    os.remove(file_path)

becomes

        if transfer_type == "put":
            if not keep:
                if (scheme != "globus") or (globus_status == "SUCCEEDED"):
                    # Note: This is intended to fulfill the default removal of successfully-transfered
                    # tar files when keep=False, irrespective of non-blocking status
                    logger.debug(
                        f"{ts_utc()}: deleting transfered files {prev_transfers}"
                    )
                    for src_path in prev_transfers:
                        os.remove(src_path)
                    prev_transfers = curr_transfers
                    curr_transfers = list()

on this branch. That is, now we only remove prev_transfers, whereas before we were deleting the current file_path. This appears to pose a problem when we go to transfer the index.db, since keep is set to True for that file, meaning we never get around to removing prev_transfers.
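
The effect described here can be reproduced with a simplified model of the two lists. This is a sketch under stated assumptions, not the zstash implementation: `simulate` is a hypothetical function, the cache is an in-memory list rather than files on disk, and every non-kept transfer is assumed to reach the deletion branch (a non-Globus scheme, or a "SUCCEEDED" status).

```python
def simulate(transfers):
    # transfers: list of (file_name, keep) pairs, in the order they are put.
    # Returns the file names left in the cache at the end of the run.
    cache = []
    prev_transfers, curr_transfers = [], []
    for file_name, keep in transfers:
        cache.append(file_name)
        curr_transfers.append(file_name)
        if not keep:
            # Only the PREVIOUS batch is deleted; the current one is deferred.
            for name in prev_transfers:
                cache.remove(name)
            prev_transfers = curr_transfers
            curr_transfers = []
    return cache
```

In this model the final tar is always left behind, because the last call (for `index.db`, with `keep=True`) skips the deletion branch. For example, `simulate([("000000.tar", False), ("index.db", True)])` leaves both files in the cache.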

@TonyB9000
Collaborator Author

@forsyth2 I recall complaining that "--keep" itself seemed to work (always keeping the cache tar files), but when omitted, the behavior was hard to understand: sometimes files would be kept irrespective of the flag. This was true of both "create" and "update". In particular, with non-blocking=True (where some transfers could involve multiple tar files at once), the "SUCCEEDED" reported when submitting a new tar transfer did not indicate which files had previously been transferred successfully, so I could see no mechanism by which they could be removed.

In blocking mode, this is less a problem, as only ONE tar file is involved in any transfer.

Prior to this branch (and prior to the non-blocking fix, applied to create), tar files would routinely remain, despite the absence of the "--keep" flag. I could not see a mechanism to conduct the removal reliably.

I wonder if the behavior involves "globus_finalize".

When you say "the cache test" (as opposed to the "keep" test), are you referring to when the user supplies a custom location for the local tar-files with "--cache "?

@forsyth2
Collaborator

forsyth2 commented Mar 7, 2025

When you say "the cache test" (as opposed to the "keep" test), are you referring to when the user supplies a custom location for the local tar-files with "--cache "?

Yes, I mean the automated test using https://github.com/E3SM-Project/zstash/blob/main/tests/test_update.py#L115 helperUpdateCache.

I wonder if the behavior involves "globus_finalize".

The Globus-specific test is the only automated test for the Globus functionality. That shouldn't be touched in these 2 tests.

@forsyth2
Collaborator

forsyth2 commented Mar 7, 2025

Ok, I've confirmed the issue isn't related to --cache (error still occurs if I do the same steps without that set).

@TonyB9000
Collaborator Author

I mention "globus finalize" because it invokes transfers just as the routine "hpss_transfer" does, but may handle the transfers (external to the globus functionality itself) differently.

I am unclear how the tests https://github.com/E3SM-Project/zstash/blob/main/tests/test_update.py#L115 test the new functionality properly. Nor do I understand how "expected behavior" aligns with what the help-text describes. The table:


    # option | Update | UpdateDryRun | UpdateKeep | UpdateCache | TestZstash.add_files (used in multiple tests)|
    # --hpss    |x|x|x|x|x|
    # --cache   | | | |x|b|
    # --dry-run | |x| | | |
    # --keep    | | |x| |b|
    # -v        | | | | |b|

does not distinguish blocking from non-blocking behaviors.

If the previous version "passed" these tests (properly removing the "expected" tar-files), I need to see where in the actual run codes (not these test drivers) the behavior is manifest.

@forsyth2 forsyth2 mentioned this pull request Mar 7, 2025
@forsyth2
Collaborator

forsyth2 commented Mar 7, 2025

I added a commit (49fd87b) to debug/improve testing, but I've only run into more issues. I made a stand-alone script version of the unit test, and the script seems to work despite paralleling the unit test almost exactly. Unfortunately, I'm going to need to debug more.

I am unclear how the tests https://github.com/E3SM-Project/zstash/blob/main/tests/test_update.py#L115 test the new functionality properly.

Well, they were testing basic functionality and they shouldn't be broken by adding new functionality.

Nor do I understand how "expected behavior" aligns with what the help-text describes.

If keep isn't specified, we should be removing tars from the cache after they transfer. As mentioned above, that seems to work correctly in my stand-alone script version of the failing test, but not in the failing test itself.

The table [...] does not distinguish blocking from non-blocking behaviors.

The table is from the early days of zstash testing. The Globus functionality has significantly complicated testing (and the functional code itself) so much so that I'm seriously considering possible refactorings -- #370, #367/#369 -- to make it easier to understand. Basically, the non-blocking parameter isn't in that table because it's never tested in the unit tests (only stand-alone scripts we've used for testing).

If the previous version "passed" these tests (properly removing the "expected" tar-files), I need to see where in the actual run codes (not these test drivers) the behavior is manifest.

My answer if the unit test is failing appropriately: I believe the code change in #363 (comment) is what is causing this, but I can't be certain. This is another reason why I think a refactoring might be required: since we always keep the index.db, we need some way to tell zstash to still remove prev_transfers if we're just dealing with index.db. There's a great deal of state that is hard to follow now. (We can't just delete prev_transfers if keep == True and file_name == "index.db", because it could very well have been the case that keep has been True all along.)

My answer if the unit test itself is broken (i.e., my stand-alone script is correct): in this case, there'd be nothing of note in the run code itself.

@forsyth2
Collaborator

forsyth2 commented Mar 10, 2025

8050fb5 fixes the zstash update tests, but I'm still debugging the Globus test.

@forsyth2
Collaborator

f4a661c fixes the Globus test, but importantly I changed the polling interval back to what it was before. @TonyB9000 is this an acceptable change?

I'm also running into a new error when running all the unit tests, but not when I run the extract tests alone. I suspect this was introduced by the second-to-last commit (8050fb5).

======================================================================
FAIL: testExtractParallelHPSS (tests.test_extract_parallel.TestExtractParallel)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/global/u1/f/forsyth/ez/zstash/tests/test_extract_parallel.py", line 120, in testExtractParallelHPSS
    self.helperExtractParallel("testExtractParallelHPSS", HPSS_ARCHIVE)
  File "/global/u1/f/forsyth/ez/zstash/tests/test_extract_parallel.py", line 86, in helperExtractParallel
    self.stop(error_message)
  File "/global/u1/f/forsyth/ez/zstash/tests/base.py", line 143, in stop
    self.fail(error_message)
AssertionError: The tars were printed in this order: ['000000.tar', "'000000.tar']", '000000.tar', '000000.tar', '000000.tar', '000001.tar', "'000001.tar']", '000001.tar', '000001.tar', '000001.tar', "'000000.tar']", '000002.tar', "'000000.tar',", "'000002.tar']", '000002.tar', '000002.tar', '000002.tar', "'000001.tar']", '000003.tar', "'000001.tar',", "'000003.tar']", '000003.tar', '000003.tar', '000003.tar', '000004.tar', "'000004.tar']", '000004.tar', '000004.tar', '000004.tar']
When it should have been in this order: ["'000000.tar',", "'000000.tar']", "'000000.tar']", "'000001.tar',", "'000001.tar']", "'000001.tar']", "'000002.tar']", "'000003.tar']", "'000004.tar']", '000000.tar', '000000.tar', '000000.tar', '000000.tar', '000001.tar', '000001.tar', '000001.tar', '000001.tar', '000002.tar', '000002.tar', '000002.tar', '000002.tar', '000003.tar', '000003.tar', '000003.tar', '000003.tar', '000004.tar', '000004.tar', '000004.tar', '000004.tar']

@forsyth2
Collaborator

I can reproduce that error with pip install . && python -m unittest tests/test_extrac*.py (running just the parallel tests doesn't seem to cause the error, but running at least two test files does) up until removing 49fd87b. That is, 49fd87b caused this issue... but there are only test changes and logger changes in that commit.

@forsyth2
Collaborator

forsyth2 commented Mar 11, 2025

It turns out the failing test was relying on reading the tars in a certain order, so the extra logging statements messed that up. I just took those extra statements out -- the changes in 59fa442 are enough to get it passing.

Collaborator

@forsyth2 forsyth2 left a comment

@TonyB9000 I think this is good to merge, but before I merge it, a few questions (see associated comments on this review).

Also, per #363 (comment), is the change at f4a661c#diff-883c2a8c42588679fed46ac7b1d96a0497842c87848bcbf10eb4f1733d357d87 reverting the polling_interval an acceptable change?

zstash/globus.py Outdated
return False


# TODO: What does gv stand for? Globus something? Global variable?
Collaborator

^

Collaborator Author

Global Variable. If I must use them, I like to label them as such.

Collaborator

Ok, I think I'm going to expand gv to global_variable then, so it's clear.

Collaborator Author

Good idea. That might discourage people from using them - should be a standard!

    last_task_id = None

    if transfer_data:
        # DEBUG: review accumulated items in TransferData
Collaborator

This is just a note explaining this code block, right? Not a TODO that still needs to be addressed?

Collaborator Author

Correct!

@TonyB9000
Collaborator Author

"test was relying on reading the tars in a certain order, so the extra logging statements messed that up" That is weird, but I'd choose to make the comparisons operate over sorted values rather than omit logging messages in general. Then again, maybe these particular messages are unnecessary/unhelpful.
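A minimal sketch of what an order-insensitive comparison could look like, assuming the test captures the command's output as a string and that tar names look like `000000.tar`. The helper name `tar_names` and the sample log lines are hypothetical, not part of zstash's test suite.

```python
import re
from typing import List


def tar_names(output: str) -> List[str]:
    """Return the sorted, de-duplicated tar file names mentioned in captured output."""
    # Assumption: tar names are a run of word characters followed by ".tar".
    return sorted(set(re.findall(r"\b\w+\.tar\b", output)))


# Hypothetical captured output: tars appear out of order, with an extra log line between them.
sample_output = (
    "INFO: Transferring 000001.tar\n"
    "DEBUG: extra log line\n"
    "INFO: Transferring 000000.tar\n"
)

# The comparison no longer depends on the order or on unrelated log lines.
assert tar_names(sample_output) == ["000000.tar", "000001.tar"]
```

Comparing extracted, sorted values like this would let the extra logging stay in place, at the cost of no longer checking the order in which tars are reported.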

@TonyB9000
Collaborator Author

@forsyth2 "I changed the polling interval back to what it was before". Yes, that is OK. I was testing whether it was the cause of my seeing 120+ "success" messages in the log output, which seemed merely to reflect that the polling interval had been reached.

I would like to refactor/merge both "globus_wait()" and "globus_block_wait()", once we have a solid sense of the desired behavior. I have seen various examples of using task_wait, and they are often confusing regarding the relationship between "timeout" and "polling_interval". One behavior I want to avoid is hanging forever if the transfer itself hangs (returns "ACTIVE" forever). Hence the timeout-retries code. But then, how do we make that limit large enough when some transfers can take days?
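One way to picture the relationship is to keep each individual wait short and put a separate overall bound around the loop. The sketch below is only an illustration under that assumption: `wait_with_deadline`, `wait_once`, and `overall_timeout_s` are hypothetical names, and this is not the code in zstash/globus.py.

```python
import time
from typing import Callable


def wait_with_deadline(
    wait_once: Callable[[], bool],
    overall_timeout_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call wait_once() until it reports completion or the overall deadline passes.

    Assumption: wait_once blocks for at most one short per-call timeout and
    returns True if the task finished during that call, False otherwise.
    """
    deadline = clock() + overall_timeout_s
    while True:
        if wait_once():
            return True  # task completed
        if clock() >= deadline:
            return False  # give up instead of hanging forever
```

With the Globus SDK, `wait_once` could wrap something like `transfer_client.task_wait(task_id, timeout=..., polling_interval=...)`, which returns a boolean saying whether the task finished within that call's timeout. The sketch does not answer how large the overall bound should be for multi-day transfers; it only separates that choice from the per-call timeout and polling interval.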

@TonyB9000
Collaborator Author

TonyB9000 commented Mar 11, 2025

@forsyth2 I would like (eventually) to have (input) path be an added (optional) parameter for "update", rather than force the user to operate in the source-file directory. It is inconsistent with "create", where you can operate in directory X but load files from directory Y.

@forsyth2
Collaborator

"test was relying on reading the tars in a certain order, so the extra logging statements messed that up" That is weird - but I'd choose to make the comparisons operate over sorted values

The issue is that we're testing command line functions, not Python functions. So basically all the "unit" tests ("unit" in quotes because they rely on the system to run and are thus really integration tests) are just checking all output printed to the command line by a command. So, if there are log statements printing out more things, the unit tests can be fooled by earlier output.

"I changed the polling interval back to what it was before". Yes, that is OK.

Ok, great!

I would like to refactor/merge both "globus_wait()" and "globus_block_wait()"
I would like (eventually) to have (input) path be an added (optional) parameter for "update"

Yes, these issues + my comment above about "unit" tests + issues noted on #370 all point to a major refactor being needed. The codebase has become unwieldy to work with, with logic that is hard to follow & test.

Our team is going to have a meeting to plan out the next release once we get this Unified release done. I think as part of that we need to budget time for both 1) figuring out what a zstash refactor would even look like and 2) actually implementing that once decided.

And I think that this refactor design & implementation should be done in tandem with resolving #339 (we're going to need to be thinking about Globus fixes as part of the refactor anyway).

@forsyth2 forsyth2 force-pushed the non-block-testing-fix branch from ad11dd9 to dd227ca Compare March 11, 2025 17:30
@forsyth2 forsyth2 force-pushed the non-block-testing-fix branch from dd227ca to 78476eb Compare March 11, 2025 17:30
@forsyth2 forsyth2 merged commit c96e591 into main Mar 11, 2025
3 checks passed
@forsyth2 forsyth2 deleted the non-block-testing-fix branch March 11, 2025 17:32
@TonyB9000
Collaborator Author

@forsyth2 On refactoring zstash/globus: Recall that I have a Python workflow (dsm_manage_CMIP_production) that operates "CMIP-dataset-at-a-time" (conditionally zstash-extracting new native data from a local cache-archive when needed, conditionally fetching a remote archive as needed, etc.). This routine will "inherit" the credential-expiration issues of zstash/globus. I have striven to make my code sufficiently "stateful" that an exit and restart can automatically pick up where it left off, to avoid unnecessarily redoing work. Just something to keep in mind.

I am thinking that any Globus transfer that lasts more than 48 hours would certainly involve multiple tar files, so if there were a way to track per-tar-file completion, the tool should be able to pick up on a restart and transparently continue a broken set of transfers.
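A rough sketch of per-tar-file tracking, assuming a small JSON state file and a caller-supplied transfer function. `transfer_remaining`, `state_path`, and `transfer_one` are hypothetical names for illustration and do not exist in zstash.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, List


def transfer_remaining(
    tar_files: Iterable[str],
    state_path: Path,
    transfer_one: Callable[[str], None],
) -> List[str]:
    """Transfer every tar not yet recorded as done, recording each success immediately."""
    # Load the set of tars completed by earlier runs, if a state file exists.
    done = set(json.loads(state_path.read_text())) if state_path.exists() else set()
    transferred = []
    for tar in tar_files:
        if tar in done:
            continue  # finished in an earlier run
        # May raise (e.g. on expired credentials); earlier successes stay recorded.
        transfer_one(tar)
        done.add(tar)
        state_path.write_text(json.dumps(sorted(done)))
        transferred.append(tar)
    return transferred
```

Because the state file is rewritten after each successful tar, a run that dies partway through can be restarted with the same arguments and will only attempt the tars that are still missing.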

3 participants