Draft: #54 with fixed merge conflicts. #59

Draft
wants to merge 7 commits into base branch tdv-PR44-fixed-merge-conflict

Conversation


@timothyfrankdavies (Collaborator) commented May 11, 2024

See #57

This PR is #57, but I:

  • Merged automation in a messy way:
    • Accepted all current changes to resolve merge conflicts.
    • Realised this branch and tdv-PR44-fixed-merge-conflict were actually one commit behind automation.
    • Cherry-picked the latest commit on this branch.
    • Merged automation into tdv-PR44-fixed-merge-conflict.
  • Merged tdv-PR44-fixed-merge-conflict, resolving conflicts by accepting the current changes.
  • Targeted tdv-PR44-fixed-merge-conflict.

I'll use it to leave some review comments. The original branch bmiszalski:bm_work can then be updated from automation (hopefully more cleanly), and this PR can be linked from the other PR and closed.

Brent Miszalski and others added 5 commits May 6, 2024 06:21
@timothyfrankdavies marked this pull request as draft May 11, 2024 03:39
Fixed twilight flats correction, wire solution in red arm and Python version to 3.10
@timothyfrankdavies changed the base branch from automation to tdv-PR44-fixed-merge-conflict May 11, 2024 04:45
@timothyfrankdavies (Collaborator, Author) left a comment

Just posting what I have so far, as I'm off work for the rest of the day.

I'll pick up the review again on Monday. Let me know if you have any questions or comments.

Comment on lines -13 to +16
"scipy==1.9.1",
"scipy>=1.9.1",
"numpy",
"matplotlib",
"photutils==1.8.0",
"photutils>=1.8.0",

We ran into an issue with a later version of scipy at one point: https://github.com/PyWiFeS/pipeline/pull/43/files#r1527907039

We should've added an issue for it, and it needs steps to reproduce the error before we can investigate. For now I think it's ok to proceed with this change, but we should retest on @felipeji's setup before we merge.
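
If that scipy regression does resurface, one option is to cap the range rather than pin an exact version. The upper bound below is hypothetical and would need to be confirmed against the version that actually failed:

    "scipy>=1.9.1,<1.12",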

Comment on lines -60 to +61
-        lazy_results = pool.imap(_unwrap_and_run, tasks, chunksize=chunksize)
+        # chunksize is not strictly necessary; can break some pools
+        lazy_results = pool.imap(_unwrap_and_run, tasks)  # , chunksize=chunksize)
As a general rule, it's best to delete code rather than leave it commented out; otherwise you wind up with a lot of outdated code sitting around 'just in case' someone wants to debug or reuse it later.

I'd lean towards leaving chunksize in, but defaulting it to 1 (the default value according to https://docs.python.org/release/2.6.6/library/multiprocessing.html#multiprocessing.pool.multiprocessing.Pool.imap) and moving the calculation to its own 'public' but unused function, with a docstring describing its purpose:


def get_chunksize_for_single_batch(tasks, max_processes):
    """
    Get the `chunksize` to be used such that `tasks` are divided evenly
    between processes, allowing them to be run in a single batch.

    For large numbers of tasks, consider setting a smaller chunksize.

    Note that chunksize is not strictly necessary, and it can break some pools
    """
    num_processes = _get_num_processes(max_processes)

    return math.ceil(len(tasks) / num_processes)


def map_tasks(tasks, max_processes=-1, chunksize=1):
    """
    Run the `tasks`, divided between up to `max_processes` processes in chunks of `chunksize`.

    Each task should follow the pattern in `get_task`, storing the function, args and kwargs to run.

    The results will be returned in order.
    """
    num_processes = _get_num_processes(max_processes)

    results = []
    with multiprocessing.Pool(num_processes) as pool:
        # `lazy_results` must be enumerated within the pool's scope or the threads will not complete.
        lazy_results = pool.imap(_unwrap_and_run, tasks, chunksize=chunksize)
        results = list(lazy_results)

    return results
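
For reference, a hypothetical usage sketch (`fit_slice` is a made-up placeholder for the per-slice work, and `get_task` is assumed to bundle the function with its args and kwargs as its docstring describes):

# Hypothetical usage; `fit_slice` stands in for the real per-slice function.
tasks = [get_task(fit_slice, (s,), {}) for s in range(25)]
chunksize = get_chunksize_for_single_batch(tasks, max_processes=8)
results = map_tasks(tasks, max_processes=8, chunksize=chunksize)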

Comment on lines +1514 to +1517
def run_slice(packaged_args):
    """
    A function to be used by multiprocessing in derive_wifes_optical_wave_solution to derive the wifes optical wave solution for each slice `s`.
    """

If a function isn't intended to be used outside the file, it's best to mark it as private with a leading underscore:
def _run_slice(packaged_args):
task = get_task(_run_slice,

You could then simplify the docstring a little:

"""
Derive the wifes optical wave solution for a single slice `s` in a single thread.
"""

Comment on lines +1518 to +1519
(s,inimg,arc_name,ref_arclines,ref_arcline_file,dlam_cut_start,flux_threshold_nsig,find_method,shift_method,exclude_from,exclude,epsilon,doalphapfit,automatic,verbose,decimate,sigma,alphapfile,doplot,savefigs,save_prefix) = packaged_args


These args should probably be split across multiple lines, but currently the file as a whole hasn't been formatted with Black.

So after this PR, we should follow up by running Black on this file.

For review, I've done a quick find-and-replace to split the line:

  (s,
   inimg,
   arc_name,
   ref_arclines,
   ref_arcline_file,
   dlam_cut_start,
   flux_threshold_nsig,
   find_method,
   shift_method,
   exclude_from,
   exclude,
   epsilon,
   doalphapfit,
   automatic,
   verbose,
   decimate,
   sigma,
   alphapfile,
   doplot,
   savefigs,
   save_prefix) = packaged_args

VSCode highlights a number of those args as unused, which means some unnecessary memory copying to the worker processes. Might be worth trimming them down (see the sketch after the list below).

Unused args:

   exclude_from,
   exclude,
   epsilon,
   doalphapfit,
   automatic,


   decimate,
   sigma,
   alphapfile,

   savefigs,
   save_prefix
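
A rough sketch of a trimmed-down version, keeping only the arguments that appear to be used (the exact set should be double-checked against the function body):

# Sketch only: pack just the arguments the slice worker actually reads.
packaged_args = (s, inimg, arc_name, ref_arclines, ref_arcline_file,
                 dlam_cut_start, flux_threshold_nsig, find_method,
                 shift_method, verbose, doplot)

def _run_slice(packaged_args):
    (s, inimg, arc_name, ref_arclines, ref_arcline_file,
     dlam_cut_start, flux_threshold_nsig, find_method,
     shift_method, verbose, doplot) = packaged_args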

Comment on lines +1522 to +1524
# since we are using multiprocessing over each slice, no need to use multiprocessing for each call to find_lines_and_guess_refs
multithread=False
max_processes=1

Generally it's best to declare new variables as late as possible, to keep them grouped with where they're used.

Also, max_processes shouldn't need to be set at all; it's unused when multithread=False.

flux_threshold_nsig=flux_threshold_nsig,
deriv_threshold_nsig=1.0,
multithread=multithread,
max_processes=max_processes,

As in the comment on lines 1522-1524 (can't link until the review is posted), I'd just set

multithread=False,
plot=step1plot

skipping max_processes and setting multithread directly.

You could add a comment above that parameter, in case someone looking at it later forgets that run_slice itself is single-threaded.

Comment on lines +1543 to +1551
# save some metadata to the return_dict. since this is always from the f[1] extension, a key without the slice `s` is sufficient
return_dict['grating'] = grating
return_dict['bin_x'] = bin_x
return_dict['bin_y'] = bin_y
return_dict['dateobs'] = dateobs
return_dict['tdk'] = tdk
return_dict['pmb'] = pmb
return_dict['rh'] = rh
return_dict['rma'] = rma

Wound up with a few points in the one comment:

1:

Could declare & initialize the dictionary in one go here:

return_dict = {
  'grating': grating,
  etc.
}

2:

But if these values aren't changing, I'd say it's best to add everything at the end of the function instead.

In part because I think it's easiest to use the function if its return value is all defined in one place, so you can see all the headers at a glance.

3:

Might even be better to return a tuple instead of a dictionary:

return (grating, bin_x, bin_y, dateobs, etc.

If any fields aren't clear from the value, you could make them variables first:

slice_rk = new_r
return (..... slice_rk, slice_bla, etc.)

It'd save derive_wifes_optical_wave_solution from having to parse the key names, but the caller would need to be careful with the ordering of the returned values:

(grating, bin_x, ..., slice_rk, slice_bla, ...) = run_slice(packaged_args)

Neither option is really ideal, but it's a private function, so it doesn't need to be perfectly structured. Given we don't currently have unit tests, I think a tuple is the most likely to (correctly) crash out if the private function was changed but the caller wasn't.
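
A rough sketch of the tuple variant (`slice_results` is a hypothetical placeholder for the per-slice fit outputs currently stored in return_dict):

def _run_slice(packaged_args):
    ...
    # Return a flat tuple instead of a dict; the header fields come first.
    return (grating, bin_x, bin_y, dateobs, tdk, pmb, rh, rma, slice_results)

# In derive_wifes_optical_wave_solution, unpack in exactly the same order,
# so any mismatch fails loudly rather than silently looking up missing keys.
(grating, bin_x, bin_y, dateobs, tdk, pmb, rh, rma, slice_results) = _run_slice(packaged_args)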

@bmiszalski

Dear Tim,
Thank you for reviewing my work, and thanks for your patience.
Could you please let me know whether you have finished your review of our work?
Provided I make the small changes requested, do you think the resultant branch would then be ready to merge into the automation branch?
Might you have some time to dedicate to finalising the work you and I have done, so that the automation branch can be brought up to date?
Best regards,
Brent.
