-
Notifications
You must be signed in to change notification settings - Fork 10
Non block testing fix #363
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ing additions for activity tracing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TonyB9000 I left some initial review comments. I want to spend more time studying the code to understand how everything gets called/passed around though.
zstash/create.py
Outdated
| # Transfer to HPSS. Always keep a local copy. | ||
| logger.debug(f"{ts_utc()}: calling hpss_put() for {get_db_filename(cache)}") | ||
| hpss_put(hpss, get_db_filename(cache), cache, keep=True) | ||
| hpss_put(hpss, get_db_filename(cache), cache, keep=args.keep) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is specifically for archiving the database. I think we do want to always keep that, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree. That was a mistake. (But it always seems to remain in any case - a mystery)
zstash/create.py
Outdated
| # (zstash create) | ||
| args: argparse.Namespace = parser.parse_args(sys.argv[2:]) | ||
| if args.hpss and args.hpss.lower() == "none": | ||
| if not args.hpss or args.hpss.lower() == "none": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parentheses just for clarity: if (not args.hpss) or (args.hpss.lower() == "none"):
args.hpss |
args.hpss.lower() == "none" |
args.non_blocking |
original behavior | new behavior | change |
|---|---|---|---|---|---|
| T | T | T | args.hpss = "none", args.keep = True |
args.hpss = "none", args.keep = True |
N/A |
| T | T | F | args.hpss = "none" |
args.hpss = "none", args.keep = True |
Sets args.keep = True |
| T | F | T | args.keep = True |
Nothing | No longer sets args.keep = True |
| T | F | F | Nothing | Nothing | N/A |
| F | N/A | T | args.keep = True |
args.hpss = "none", args.keep = True |
Sets args.hpss = "none" |
| F | N/A | F | Nothing | args.hpss = "none", args.keep = True |
Sets args.hpss = "none", args.keep = True |
Can you confirm these are the expected changes in behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did you arrive that the first two rows? Nothing in that code involves the status of "non-blocking".
Correct me if I'm wrong, but testing "if args.hpss" would only fail if the user included no "hpss" argument on the command line. That should be the same as "hpss=none" (unless some hidden config sets it elsewhere - I did not consider that).
In any case, (to my knowledge), the only time we intend to FORCE "keep" is when hpss=none. According the the "help" text, there is nothing that "non-blocking" (True or False) does to effect "keep".
Thus, rows 3 and 4 should not be seeing "keep = True" if the user did not specify keep.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking at the combined behavior of
if args.hpss and args.hpss.lower() == "none":
args.hpss = "none"
if args.non_blocking:
args.keep = True
becoming
if not args.hpss or args.hpss.lower() == "none":
args.hpss = "none"
args.keep = True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only fail if the user included no "hpss" argument on the command line.
Correct, and I don't think that is possible because we set it as required:
required.add_argument(
"--hpss",
type=str,
help=(
'path to storage on HPSS. Set to "none" for local archiving. It also can be a Globus URL, '
'globus://<GLOBUS_ENDPOINT_UUID>/<PATH>. Names "alcf" and "nersc" are recognized as referring to the ALCF HPSS '
"and NERSC HPSS endpoints, e.g. globus://nersc/~/my_archive."
),
required=True,
)Thus, rows 3 and 4 should not be seeing "keep = True" if the user did not specify keep.
Ok, that makes sense.
zstash/globus.py
Outdated
| return True | ||
| return False | ||
|
|
||
| gv_push = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why gv_push? A more descriptive name might be better. Maybe tar_file_count?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, but it was just a way for me to track things. We could change it.
I wanted a variable to track "actual transfer submitted" (pushed), as opposed to just submitted to our globus_transfer() function, which may just add it to a pending transfer and return. I'll make it "gv_tarfiles_pushed".
| ) | ||
| transfer_data.add_item(src_path, dst_path) | ||
| transfer_data["label"] = subdir_label + " " + filename | ||
| transfer_data["label"] = label |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to self: label is defined to be exactly the same thing above already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right.
| for src_path in prev_transfers: | ||
| os.remove(src_path) | ||
| prev_transfers = curr_transfers | ||
| curr_transfers = list() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just use = [] instead of = list().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used to do that - but was cautioned against it (don't recall why). I'd be happy either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm interesting, I wonder why. = [] definitely seems more "pythonic" to me, as is echoed on https://stackoverflow.com/questions/5790860/whats-the-difference-between-and-vs-list-and-dict.
zstash/update.py
Outdated
| args: argparse.Namespace = parser.parse_args(sys.argv[2:]) | ||
| if args.hpss and args.hpss.lower() == "none": | ||
|
|
||
| if not args.hpss or args.hpss.lower() == "none": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parentheses, as in create, would be good: if (not args.hpss) or (args.hpss.lower()) == "none":
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. I was relying upon the default ("not" applies only the the very next argument). Also to the shortcut-pass where testing (A or B) never tests B when A is true, as it is unnecessary (useful when testing B might cause an exception.
I added the parentheses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also to the shortcut-pass where testing (A or B) never tests B when A is true, as it is unnecessary (useful when testing B might cause an exception.
Yes, the parentheses are only for human readers. They shouldn't affect the code at all.
zstash/update.py
Outdated
|
|
||
| if not args.hpss or args.hpss.lower() == "none": | ||
| args.hpss = "none" | ||
| args.keep - True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
= True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! That will make a difference! :) Good catch!
|
@forsyth2 Allow me to make some changes to address the clear mistakes above. Should take just a moment. |
@TonyB9000 Can you push those changes? I've also reviewed the code logic; this looks good to me, aside from the already suggested changes. Following the logic of the lists of transferred tars
if transfer_type == "put":
if not keep:
if (scheme != "globus") or (
globus_status == "SUCCEEDED"
):
# Note: This is intended to fulfill the default removal of successfully-transfered
# tar files when keep=False, irrespective of non-blocking status
logger.info(f"{ts_utc()}: DEBUG: deleting transfered files {prev_transfers}")
for src_path in prev_transfers:
os.remove(src_path)
prev_transfers = curr_transfers
curr_transfers = list()Globus succeeded. We don't have to worry about these tars anymore; they've been transferred. Earlier in curr_transfers.append(file_path)which is how Following the logic of `gv_push`In # DEBUG: review accumulated items in TransferData
logger.info(f"{ts_utc()}: TransferData: accumulated items:")
attribs = transfer_data.__dict__
for item in attribs["data"]["DATA"]:
if item["DATA_TYPE"] == "transfer_item":
gv_push += 1
print(f" (routine) PUSHING (#{gv_push}) STORED source item: {item['source_path']}", flush=True)Increment for every In if transfer_data:
# DEBUG: review accumulated items in TransferData
logger.info(f"{ts_utc()}: FINAL TransferData: accumulated items:")
attribs = transfer_data.__dict__
for item in attribs["data"]["DATA"]:
if item["DATA_TYPE"] == "transfer_item":
gv_push += 1
print(f" (finalize) PUSHING ({gv_push}) source item: {item['source_path']}", flush=True)
# SUBMIT new transfer here
logger.info(f"{ts_utc()}: DIVING: Submit Transfer for {transfer_data['label']}")Again, increment for every
So, |
|
We'll also need to fix the pre-commit check before merging. |
|
@TonyB9000 Can you please push those changes you mentioned? I can add a commit fixing the pre-commit checks. I'm hoping to merge this today, so I can make a new |
|
@forsyth2 I will get that done within the next hour. I've finally gotten "zstash check" to behave as expected. I made a small change to the "polling" frequency in the blcck/wait (so it does not fill the log with hundreds of announcements. Low-disk condition was a factor in earlier failures. We should employ df-check logic to avoid unexpected out-of-disk-space conditions. |
|
@TonyB9000 Ok sounds good. Once you push that commit, I'll review the changes and push a commit to fix any pre-commit errors, and then make a zstash RC so @golaz can test in the next Unified RC. |
|
@forsyth2 WHen I push my changes, I get the option: I thought I had pushed previously, but may have chosen the wrong option (so you did not see the changes?) Which should I use? My local branch is named "non-block-testing-fix", but the remote is apparently "non-block-testing". |
|
The remote is named |
|
OK, that seemed to work. |
|
I added 734ea5c. I'm getting a couple errors on the unit tests though: |
|
These errors don't appear on only gives: I don't understand how that happened; Actually this error appears on this branch ( Ok, I'm going to try to debug this and maybe add some more testing (per #367). We can't make a new |
|
@forsyth2 I have never run into that error (but I only tested "update" as follows: (and verified that "remote" contains ALL the files applied) |
|
@forsyth2 There was no overlap between FIRST and SECOND set of files, nor did I try to use the same file(name) with altered content. I was focused only upon the "non-blocking" behavior. |
|
I've been trying to play around with this, with no real success so far. A couple things:
So, the |
|
@TonyB9000 For the cache test, I notice if transfer_type == "put":
if not keep:
if (scheme != "globus") or (
globus_status == "SUCCEEDED" and not non_blocking
):
os.remove(file_path)becomes if transfer_type == "put":
if not keep:
if (scheme != "globus") or (globus_status == "SUCCEEDED"):
# Note: This is intended to fulfill the default removal of successfully-transfered
# tar files when keep=False, irrespective of non-blocking status
logger.debug(
f"{ts_utc()}: deleting transfered files {prev_transfers}"
)
for src_path in prev_transfers:
os.remove(src_path)
prev_transfers = curr_transfers
curr_transfers = list()on this branch. That is, now we only remove |
|
@forsyth2 I recall complaining that "--keep" itself seemed to work (always Keeping the cache tar files), but when omitted, the behavior was hard to understand - sometimes files would be kept irrespective of the flag. This was true both of "create" and "update". In particular, with non-blocking=True, (where some transfers could involve multiple tarfiles at once), the "SUCCEEDED" reported when submitting a new tar-transfer did not provide which files had previously been transferred successfully, so I could see no mechansim by which they cold be removed. In blocking mode, this is less a problem, as only ONE tar file is involved in any transfer. Prior to this branch (and prior to the non-blocking fix, applied to create) tar-files would routinely remain, despite the absence of the _--keep" flag. I could not see a mechanism to conduct the removal reliably. I wonder of the behavior involves "globus_finalize", When you say "the cache test" (as opposed to the "keep" test), are you referring to when the user supplies a custom location for the local tar-files with "--cache "? |
Yes, I mean the automated test using https://github.com/E3SM-Project/zstash/blob/main/tests/test_update.py#L115
The Globus-specific test is the only automated test for the Globus functionality. That shouldn't be touched in these 2 tests. |
|
Ok, I've confirmed the issue isn't related to |
|
I mention "globus finalize" because it invokes transfers just as the routine "hpss_transfer" does, but may handle the transfers (external to the globus functionality itself) differently. I an unclear how the tests https://github.com/E3SM-Project/zstash/blob/main/tests/test_update.py#L115 test the new functionality properly. Nor do II understand how "expected behavior" aligns with what the help-text describes. The table: does not distinguish blocking from non-blocking behaviors. If the previous version "passed" these tests (properly removing the "expected" tar-files), I need to see where in the actual run codes (not these test drivers) the behavior is manifest. |
|
I added a commit (49fd87b) to debug/improve testing, but I've only run into more issues. I made a stand-alone script version of the unit test, and the script seems to work despite paralleling the unit test almost exactly. Unfortunately, I'm going to need to debug more.
Well, they were testing basic functionality and they shouldn't be broken by adding new functionality.
If
The table is from the early days of
My answer if the unit test is failing appropriately: I believe the code change in #363 (comment) is what is causing this, but I can't be certain. This is another reason why I think a refactoring might be required -- since we always keep the My answer if the unit test itself is broken (i.e., my stand-alone script is correct): in this case, there'd be nothing of note in the run code itself. |
|
8050fb5 fixes the |
|
f4a661c fixes the Globus test, but importantly I changed the polling interval back to what it was before. @TonyB9000 is this an acceptable change? I'm also running into a new error when running all the unit tests, but not when I run the |
|
I can reproduce that error with |
|
It turns out the failing test was relying on reading the tars in a certain order, so the extra logging statements messed that up. I just took those extra statements out -- the changes in 59fa442 are enough to get it passing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TonyB9000 I think this is good to merge, but before I merge it, a few questions (see associated comments on this review).
Also, per #363 (comment), is the change at f4a661c#diff-883c2a8c42588679fed46ac7b1d96a0497842c87848bcbf10eb4f1733d357d87 reverting the polling_interval an acceptable change?
zstash/globus.py
Outdated
| return False | ||
|
|
||
|
|
||
| # TODO: What does gv stand for? Globus something? Global variable? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Global Variable. If I must use them, I like to label them as such.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I think I'm going to expand gv to global_variable then, so it's clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. That might discourage people from using them - should be a standard!
| last_task_id = None | ||
|
|
||
| if transfer_data: | ||
| # DEBUG: review accumulated items in TransferData |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a note explaining this code block, right? Not a TODO that still needs to be addressed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct!
|
"test was relying on reading the tars in a certain order, so the extra logging statements messed that up" That is weird - but I'd choose to make the comparisons operate over sorted values rather than omit logging messages in general. Maybe these are unnecessary/unhelpful. |
|
@forsyth2 "I changed the polling interval back to what it was before". Yes, that is OK. I was testing whether it was thae cause of my seeing 120+ "success" messages in log output, which seemed to be merely reflecting that the polling interval has been reached. I would like to refactor/merge both "globus_wait()" and "globus_block_wait()", once we have a solid sense of the desired behavior. I have seen various examples of using task_wait and they are often confusing regarding the relationship between "timeout" and "polling_interval". One behavior I want to avoid is hanging-forever if the transfer itself hangs (returns "ACTIVE" forever.) Hence the timeout-retries code. But then, how to make it large enough when some transfers can take days? |
|
@forsyth2 I would like (eventually) to have (input) path be an added (optional) parameter for "update", rather than force the user to operate in the source-file directory. It is inconsistent with "create", where you can operate in directory X but load files from directory Y. |
The issue is that we're testing command line functions, not Python functions. So basically all the "unit" tests ("unit" in quotes because they rely on the system to run and are thus really integration tests) are just checking all output printed to the command line by a command. So, if there are log statements printing out more things, the unit tests can be fooled by earlier output.
Ok, great!
Yes, these issues + my comment above about "unit" tests + issues noted on #370 all point to a major refactor being needed. The codebase has become unwieldy to work with, with logic that is hard to follow & test. Our team is going to have a meeting to plan out the next release once we get this Unified release done. I think as part of that we need to budget time for both 1) figuring out what a And I think that this refactor design & implementation should be done in tandem with resolving #339 (we're going to need to be thinking about Globus fixes as part of the refactor anyway). |
ad11dd9 to
dd227ca
Compare
dd227ca to
78476eb
Compare
|
@forsyth2 On refactoring zstash/globus: Recall that I have a python workflow (dsm_manage_CMIP_production) that operates "CMIP-dataset-at-a-time" (conditionally zstash-extracting new native data from a local cache-archive when needed, and conditionally fetching a remote archive as needed, etc). This routine will "inherit" the credential-expiration issues of zstash/globus. I have striven to make my codes sufficiently "stateful" that an exit and restart can automatically pick-up where it left off. to avoid unnecessary re-do of efforts. Just something to keep in mind. I am thinking, any globus transfer that lasts more than 48 hours would certainly involve multiple tar-files, so if there were a way to track per-tar-file completion, the tool should be able to pick-up on a restart and transparently continue a broken set of transfers. |
Summary
Unifies the non-blocking zstash behavior between both "create" and "update" operations.
Addresses issue #361,