Conversation
I note that the unit tests are failing, but I think that was already an issue, and not due to my changes.
I wanted to reiterate one point: I think the NN-based in-painting algorithm can potentially be improved significantly to get much better results. For example, the loss function probably needs to be changed to better suit the problem (and things like this). With the current code, these types of changes can be easily implemented. I plan to work on these improvements in the near future; it would also be a good project for a student or postdoc. But it will be helpful to at least get the code into cosipy for now, and then make further improvements later.
@ckarwin I haven't gone through the code yet, but I read your description and comments. A few thoughts for now:
Ok, thanks for the update, @israelmcmc.
Closing and reopening the PR to trigger the tests now that they have been fixed in the interfaces branch.
# Conflicts:
#   cosipy/background_estimation/ContinuumEstimation.py
@ckarwin I now fixed all the unit tests in the interfaces branch, and I also resolved the conflict from this PR (I pushed a commit to your branch), so it's all working. I still need to work on my pytorch setup to be able to run all this, but the tutorial looks good. One thing I thought: I think
Thanks, @israelmcmc! For the unit tests, I put a command at the top that skips the entire test if pytorch isn't available, but I think I can easily change this so that it only skips the NN part. I suppose it wouldn't be too difficult to make
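The "skip only the NN part" idea can be sketched with the stdlib `unittest` skip machinery (the same pattern works with `pytest.mark.skipif`). This is a hypothetical sketch, not the actual cosipy test file; the class and test names are illustrative.

```python
# Sketch: skip only the NN-dependent tests when pytorch / torch_geometric
# are missing, while the simple-interpolation tests still run everywhere.
import unittest

try:
    import torch            # noqa: F401
    import torch_geometric  # noqa: F401
    HAS_TORCH = True
except ImportError:
    HAS_TORCH = False


class TestContinuumEstimation(unittest.TestCase):

    def test_simple_interpolation(self):
        # Runs on every system; no optional dependencies needed.
        self.assertTrue(True)

    @unittest.skipUnless(HAS_TORCH, "pytorch/torch_geometric not installed")
    def test_nn_inpainting(self):
        # Exercised only when the optional dependencies are present.
        self.assertTrue(HAS_TORCH)
```

With this layout, a missing dependency shows up as a skipped test in the report rather than silently skipping the whole module.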
When I run pytest, I'm getting an error. The problem seems to be that I'm running on an M chip, which does not have an MKL implementation. Following the tip from Mohammad, I tried a hack. Mohammad is able to run the tutorial (I haven't tried) and I asked him to run pytest as well; he's also getting a similar error. I think the conclusion is that some of the operations of torch_geometric are simply not supported on an M chip, and these operations are exercised by the test but not the tutorial. Here are a couple of related issues I found. Now, to proceed:
Thanks for looking into this, @israelmcmc. I suspect this is due to the use of UNet (from torch_geometric), which is tested in pytest but not in the tutorial. In the main code, the model is defined here: The default model of the code is the GCN, and this is what I've mostly been testing so far. In the unit test, you can see where I test the UNet in lines 43-48. Can you try commenting these lines out to see if it fixes the problem? If that is the root of the problem, then it's something within UNet that is not specifically controlled. If we can trace it down to this, it may be possible to come up with a reasonable workaround if the UNet is really desired, or maybe we can just default to GCN when the UNet is not compatible, e.g. on an M-chip Mac, as you suggested. What do you think?
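The "default to GCN on incompatible systems" idea could look something like the sketch below. The `choose_model` function is invented for illustration (only the `'gcn'`/`'unet'` option names come from the discussion above); the platform check uses only the stdlib.

```python
# Sketch: fall back to the GCN model when the platform (e.g. Apple
# Silicon) is known not to support the torch_geometric UNet ops.
import platform


def choose_model(requested="unet"):
    """Return a model name, falling back to 'gcn' on Apple Silicon."""
    on_apple_silicon = (platform.system() == "Darwin"
                        and platform.machine() == "arm64")
    if requested == "unet" and on_apple_silicon:
        # UNet exercises ops unsupported on this backend; warn and fall back.
        print("UNet not supported on Apple Silicon; falling back to 'gcn'.")
        return "gcn"
    return requested
```

The fallback should be logged loudly so the user knows a different model was trained than the one they asked for.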
Thanks @ckarwin. Yes, that makes sense. I'll try commenting out UNet and report back. If that's the problem, we can indeed default to another model based on the system.
I did the test as you suggested and the problem is solved: I changed 'unet' to 'gcn' and the unit test executed successfully. One solution could be to design our own UNet graph neural network that does not use layers unsupported by MPS. The structure might be different, though.
```python
if gpu_count > 1:
    logger.info("Multi-GPU training not enabled; using a single GPU.")
return torch.device('cuda')
```
It is better to also add the MPS backend:

```python
elif torch.backends.mps.is_available():
    return torch.device('mps')
```
Hi @mabp1992, can you please clarify exactly where this line should go? For reference, the select_device method is copied below:

```python
def select_device(self):
    """Device selection with GPU compatibility check."""
    if torch.cuda.is_available():
        gpu_count = torch.cuda.device_count()
        logger.info(f"GPUs available: {gpu_count}")
        capability = torch.cuda.get_device_capability(0)
        major, minor = capability
        # check compatibility (may need to be generalized):
        if major < 3 or (major == 3 and minor < 7):
            logger.info(f"[WARNING] GPU {torch.cuda.get_device_name(0)} "
                        f"(capability {major}.{minor}) is not supported by this PyTorch build.")
            logger.info("Falling back to CPU.")
            return torch.device('cpu')
        else:
            logger.info(f"Using GPU 0: {torch.cuda.get_device_name(0)} (capability {major}.{minor})")
            if gpu_count > 1:
                logger.info("Multi-GPU training not enabled; using a single GPU.")
            return torch.device('cuda')
    else:
        logger.info("No GPU detected. Using CPU.")
        return torch.device('cpu')
```
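One natural placement is after the CUDA branch fails and before the CPU fallback, i.e. an `elif` between the two. The control flow can be sketched without importing torch; in this hypothetical sketch the boolean parameters stand in for `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
# Sketch of the device-selection precedence: cuda > mps > cpu.
# The booleans stand in for the real torch availability queries so the
# branching can be shown (and tested) without a torch installation.
def select_device(cuda_available, mps_available):
    """Return a device string, preferring CUDA, then MPS, then CPU."""
    if cuda_available:
        # ... the existing CUDA capability checks would stay here ...
        return 'cuda'
    elif mps_available:
        # The suggested MPS branch slots in here: after the CUDA check,
        # before the CPU fallback.
        return 'mps'
    else:
        return 'cpu'
```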
Thanks for looking into this, @mabp1992! I'm glad to know it resolved the issue. I agree that a long-term solution would be to design the UNet to avoid these types of issues. Actually, the UNet hasn't been tested much, and it likely needs some more development. For the short term (with DC4 in view), I think we can do what @israelmcmc suggested and default to another model based on the system.
Yeah, that sounds good. I think we'll come up against multiple issues like this now that we're starting to use pytorch, GPU libraries, or anything designed for a specific system. While we are developing all this, I think it's enough to just catch whether the user has an unsupported system (an M chip in this case), and if so, either fall back or fail gracefully with a message explaining the limitation (so people don't spend time debugging a known limitation). The limitation should also be noted in the docstrings.
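"Failing gracefully with a message" could be a small up-front check like the hypothetical sketch below. The function name and message are invented for illustration; the unsupported combination (UNet on Apple Silicon) is the one identified above.

```python
# Sketch: detect a known-unsupported configuration up front and raise a
# readable error, instead of letting a low-level backend failure surface
# deep inside training.
import platform


def check_supported(model="unet"):
    """Raise RuntimeError with a clear message on known-bad combinations."""
    on_apple_silicon = (platform.system() == "Darwin"
                        and platform.machine() == "arm64")
    if model == "unet" and on_apple_silicon:
        raise RuntimeError(
            "The UNet model is not supported on Apple Silicon "
            "(some torch_geometric ops are unavailable there). "
            "Use model='gcn' instead."
        )
    return True
```

The same limitation would then be repeated in the class docstring, as suggested above.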
I merged
The continuum BG estimation class is ready to be checked. Below I give a high-level summary of what is contained in this PR and some of the key points.

- I have added an example notebook: BG_estimationNN_example.ipynb. There you will find an explanation of the algorithm, as well as examples of the functionality. The tutorial also includes a spectral fit using the estimated BG. The spectral fit follows the same procedure as outlined in the spectral fit example in docs/api/interfaces/examples/crab.
- The previous BG estimation class has been removed entirely. The simple interpolation method that was used for the in-painting in the old class has been added as a subclass in the new class (ContinuumEstimationNN.py). The code for the old algorithm has been optimized significantly: it used to take on the order of hours to run this method, and it now takes on the order of seconds. This is nice to have as a baseline of comparison for the NN-based in-painting methods (more on this below).
- I updated the tutorial auto-run so that the new tutorial can be run automatically from the terminal.
- This class requires pytorch and torch geometric. For now, I am assuming that these are optional dependencies for cosipy, and the class is set up so that nothing should break if the user doesn't have these libraries.
- I added unit tests, but they will only run if pytorch and torch geometric are installed. I believe this is coded properly so that the tests should just be skipped if the dependencies are not installed. I'm not sure how this will work out for the coverage requirement of the pull request, so please keep this in mind.
- As shown in the tutorial, the NN-based algorithm (using the supervised mode) performs decently: the recovered Crab spectrum is within ~10% of the spectrum recovered using the ideal BG model. The NN-based in-painting performs better than the simple in-painting algorithm in terms of the recovered Crab spectrum; however, the simple algorithm results in a better likelihood. I don't quite understand this at the moment. In general, the algorithm needs further testing and development to reach optimal results. The good news is that most of the fundamental code is in place now, which will make it easier to test and refine.
- I spent a lot of time looking over the new refactoring and trying to understand the logic better. As part of this, I started using pycharm to browse the code, which is helpful. However, for the continuum BG estimation class, I decided not to make this an independent BG protocol. My reasoning here is that I don't think it's really necessary: as shown in the tutorial, I can use the FreeNormBinnedBackground interface and just replace the input BG (h5) file.
- I think this might touch on a more general consideration which should be discussed further (maybe not necessarily as part of this PR). It's indeed necessary for the code to be modular; however, at a certain point we also run the risk of things becoming unnecessarily complicated. This is something we should avoid.
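The optional-dependency setup described above is commonly implemented with a guarded import plus a late check, roughly like this hypothetical sketch (the class and method names here are illustrative placeholders, not the actual cosipy API):

```python
# Sketch: import pytorch lazily so the package works without it, and
# raise a clear error only when the NN functionality is actually used.
try:
    import torch
    HAS_TORCH = True
except ImportError:
    torch = None
    HAS_TORCH = False


class ContinuumEstimationNN:
    """BG estimation: the NN path needs pytorch, the simple path does not."""

    def estimate_simple(self):
        # Interpolation-based in-painting; no optional dependencies needed.
        return "simple-estimate"

    def estimate_nn(self):
        if not HAS_TORCH:
            raise ImportError(
                "NN-based estimation requires pytorch and torch_geometric; "
                "install them or use the simple interpolation method."
            )
        return "nn-estimate"
```

This keeps import-time behavior safe for users without pytorch, while still giving an actionable message when the NN path is requested.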