Interfaces continuum bg #10

Closed
ckarwin wants to merge 15 commits into israelmcmc:interfaces from ckarwin:interfaces_continuum_bg

Conversation

@ckarwin

@ckarwin ckarwin commented Dec 18, 2025

The continuum BG estimation class is ready to be checked. Below I give a high-level summary of what is contained in this PR and some of the key points.

I have added an example notebook: BG_estimationNN_example.ipynb. There you will find an explanation of the algorithm, as well as examples of the functionality. The tutorial also includes a spectral fit using the estimated BG. The spectral fit follows the same procedure as outlined in the spectral fit example in docs/api/interfaces/examples/crab.

The previous BG estimation class has been removed entirely. The simple interpolation method that was used for the in-painting in the old class has been added as a subclass in the new class (ContinuumEstimationNN.py). The code for the old algorithm has been optimized significantly: it used to take on the order of hours to run this method, and it now takes on the order of seconds. This is nice to have as a baseline of comparison for the NN-based in-painting methods (more on this below).
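To illustrate the kind of speedup described above: the actual class operates on binned HEALPix maps, but the core idea of simple interpolation in-painting can be sketched on a plain 2D grid with numpy, where one vectorized pass over the whole map replaces a per-pixel Python loop. The function below is a generic sketch, not the cosipy implementation.

```python
import numpy as np

def inpaint_simple(image, mask, n_iter=50):
    """Fill masked pixels by iteratively averaging valid neighbours.

    image : 2D array of counts; mask : True where pixels must be filled.
    Each iteration is a handful of whole-array operations, which is the
    kind of vectorization that turns hours of runtime into seconds.
    """
    filled = np.where(mask, 0.0, image).astype(float)
    known = (~mask).astype(float)   # 1 where a pixel has a value
    for _ in range(n_iter):
        num = np.zeros_like(filled)
        den = np.zeros_like(filled)
        # Sum neighbour values and weights via array shifts
        # (np.roll wraps at the edges; fine for a sketch)
        for shift, axis in [(1, 0), (-1, 0), (1, 1), (-1, 1)]:
            num += np.roll(filled * known, shift, axis=axis)
            den += np.roll(known, shift, axis=axis)
        mean = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
        # Only overwrite the masked pixels; measured ones stay fixed
        filled = np.where(mask, mean, filled)
        known = np.where(den > 0, 1.0, known)
    return filled
```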

I updated the tutorial auto-run so that the new tutorial can be run automatically from the terminal.

This class requires pytorch and torch geometric. For now, I am assuming that these are optional dependencies for cosipy, and the class is set up so that nothing should break if the user doesn't have these libraries.

I added unit tests, but they will only run if pytorch and torch geometric are installed. I believe this is coded properly so that the tests simply skip if the dependencies are not installed. I'm not sure how this will interact with the coverage requirement of the pull request, so please keep this in mind.
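The optional-dependency guard described above is typically a module-level try/except around the imports, with a flag that the NN entry points check. A minimal sketch (the flag and function names here are illustrative, not cosipy's actual ones):

```python
# Hypothetical module-level guard; cosipy's actual import handling may differ.
try:
    import torch            # noqa: F401
    import torch_geometric  # noqa: F401
    HAS_TORCH = True
except ImportError:
    HAS_TORCH = False

def estimate_bg_nn(*args, **kwargs):
    """NN-based estimation; refuses cleanly when the extras are missing."""
    if not HAS_TORCH:
        raise ImportError(
            "pytorch and torch_geometric are required for NN-based "
            "in-painting; install them or use the simple interpolation mode."
        )
    ...
```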

As shown in the tutorial, the NN-based algorithm (using the supervised mode) performs decently: the recovered Crab spectrum is within ~10% of the spectrum recovered using the ideal BG model. The NN-based in-painting performs better than the simple in-painting algorithm in terms of the recovered Crab spectrum. However, the simple algorithm results in a better likelihood; I don't quite understand this at the moment. In general, the algorithm needs to be tested and developed further to get optimal results. The good news is that most of the fundamental code is now in place, which will make it easier to test and refine.

I spent a lot of time looking over the new refactoring and trying to understand the logic better. As part of this, I started using PyCharm to browse the code, which is helpful. However, for the continuum BG estimation class, I decided not to make it an independent BG protocol. My reasoning is that I don't think it's really necessary: as shown in the tutorial, I can use the FreeNormBinnedBackground interface and just replace the input BG (h5) file.

I think this touches on a more general consideration that should be discussed further (though not necessarily as part of this PR). The code does need to be modular; however, at a certain point we also run the risk of things becoming unnecessarily complicated. This is something we should avoid.

@ckarwin
Author

ckarwin commented Dec 18, 2025

I note that the unit tests are failing, but I think that was already an issue and not due to my changes.

@ckarwin
Author

ckarwin commented Dec 19, 2025

I wanted to reiterate one point: I think the NN-based in-painting algorithm can potentially be improved significantly to get much better results. For example, the loss function probably needs to be changed to better suit the problem (and things like this). With the current code, these types of changes can be easily implemented. I plan to work on these improvements in the near future. It will also be a good project for a student or postdoc. But it will be helpful to at least get the code into cosipy for now, and then make further improvements later.

@israelmcmc
Owner

@ckarwin I haven't gone through the code yet, but I read your description and comments. A few thoughts for now:

My reasoning here is that I don't think it's really necessary. As shown in the tutorial, I can use the FreeNormBinnedBackground interface and just replace the input BG (h5) file.
Yes, I agree in this case. The output of your method is a fixed model shape that will just be re-scaled, so there's no need to develop something new. If, in the future, the NN can be generalized to have other parameters beyond a normalization factor, then we can try to have a separate interface implementation.

I note that the unit tests are failing, but I think that was already an issue and not due to my changes.
Yes, indeed. It's one of the reasons I'm waiting to go through this PR. I need to fix things on my end first. Working on it!

I plan to work on these improvements in the near future. It will also be a good project for a student or postdoc. But it will be helpful to at least get the code into cosipy for now, and then make further improvements later.
Yes, I agree.

@ckarwin
Author

ckarwin commented Feb 10, 2026

Ok, thanks for the update, @israelmcmc.

@israelmcmc
Owner

Closing and reopening the PR to trigger the tests now that they have been fixed in the interfaces branch.

@israelmcmc israelmcmc closed this Feb 16, 2026
@israelmcmc israelmcmc reopened this Feb 16, 2026
@israelmcmc
Owner

@ckarwin I have now fixed all the unit tests in the interfaces branch, and I also resolved the conflict from this PR (I pushed a commit to your branch), so it's all working.

I still need to work on my pytorch setup to be able to run all this, but the tutorial looks good.

One thought: I think ContinuumEstimationNN should inherit from ContinuumEstimationInterp, and not the other way around. That way we won't lose the ability to perform the simple background estimation if pytorch is not installed. It will also be easier to have unit tests for all the code that doesn't depend on pytorch.

@ckarwin
Author

ckarwin commented Feb 17, 2026

Thanks, @israelmcmc!

With the current setup, ContinuumEstimationInterp should still work even if the user doesn't have pytorch. The pytorch import is wrapped in a try statement, and if it's not present, all the methods needed for the simple in-painting should still work.

For the unit tests, I put a command at the top that skips the entire test module if pytorch isn't available, but I think I can easily change this so that it only skips the NN part.
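Restricting the skip to the NN-dependent tests could look like the sketch below. It uses stdlib unittest so it is self-contained; the test names are hypothetical, and the cosipy test suite (which runs under pytest) would use `@pytest.mark.skipif` with the same availability flag instead.

```python
import importlib.util
import unittest

# True only when the optional NN dependency is importable
HAS_TORCH = importlib.util.find_spec("torch") is not None

class TestContinuumEstimation(unittest.TestCase):
    def test_simple_inpainting(self):
        # Runs everywhere: the interpolation path has no torch dependency
        self.assertTrue(True)

    @unittest.skipUnless(HAS_TORCH, "pytorch not installed")
    def test_nn_inpainting(self):
        # Only exercised when the optional dependencies are present
        import torch
        self.assertTrue(torch.is_tensor(torch.zeros(1)))
```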

I suppose it wouldn't be too difficult to make ContinuumEstimationInterp the superclass. But I think the NN method should probably be the main one. What do you think? Do you see some fundamental reason to change this (if things work without pytorch as is)?

@israelmcmc
Owner

When I run pytest, I'm getting

/Users/imartin5/software/cosipy_test/cosipy/background_estimation/ContinuumEstimationNN.py:906: in estimate_bg
    self.inpainted_maps = self.train_inpaint(input_data_map, mask_map, nside,
/Users/imartin5/software/cosipy_test/cosipy/background_estimation/ContinuumEstimationNN.py:422: in train_inpaint
    pred = model(data.x, data.edge_index).squeeze()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/Users/imartin5/software/miniforge3/envs/cosipy_test/lib/python3.12/site-packages/torch/nn/modules/module.py:1776: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/Users/imartin5/software/miniforge3/envs/cosipy_test/lib/python3.12/site-packages/torch/nn/modules/module.py:1787: in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/Users/imartin5/software/miniforge3/envs/cosipy_test/lib/python3.12/site-packages/torch_geometric/nn/models/graph_unet.py:107: in forward
    edge_index, edge_weight = self.augment_adj(edge_index, edge_weight,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = GraphUNet(1, 32, 1, depth=3, pool_ratios=[0.5, 0.5, 0.5])
edge_index = tensor([[ 0,  0,  0,  0,  0,  0,  1,  1,  1,  1,  1,  1,  2,  2,  2,  2,  2,  2,
          3,  3,  3,  3,  3,  3,  4, ...  6, 10, 11,  9,  6,  2,  7, 11,  8, 10,  7,  3,  4,  8,  9,
          0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11]])
edge_weight = tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1....., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
num_nodes = 12

    def augment_adj(self, edge_index: Tensor, edge_weight: Tensor,
                    num_nodes: int) -> PairTensor:
        edge_index, edge_weight = remove_self_loops(edge_index, edge_weight)
        edge_index, edge_weight = add_self_loops(edge_index, edge_weight,
                                                 num_nodes=num_nodes)
        adj = to_torch_csr_tensor(edge_index, edge_weight,
                                  size=(num_nodes, num_nodes))
>       adj = (adj @ adj).to_sparse_coo()
               ^^^^^^^^^
E       RuntimeError: addmm: computation on CPU is not implemented for SparseCsr + SparseCsr @ SparseCsr without MKL. PyTorch built with MKL has better support for addmm with sparse CPU tensors.

The problem seems to be that I'm running on an M chip, which does not have an MKL implementation. Following the tip from Mohammad, I hacked select_device to use MPS, which seems to be the appropriate backend for an M chip. After that, I'm getting:

/Users/imartin5/software/cosipy_test/cosipy/background_estimation/ContinuumEstimationNN.py:909: in estimate_bg
    self.inpainted_maps = self.train_inpaint(input_data_map, mask_map, nside,
/Users/imartin5/software/cosipy_test/cosipy/background_estimation/ContinuumEstimationNN.py:425: in train_inpaint
    pred = model(data.x, data.edge_index).squeeze()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/Users/imartin5/software/miniforge3/envs/cosipy_test/lib/python3.12/site-packages/torch/nn/modules/module.py:1776: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/Users/imartin5/software/miniforge3/envs/cosipy_test/lib/python3.12/site-packages/torch/nn/modules/module.py:1787: in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/Users/imartin5/software/miniforge3/envs/cosipy_test/lib/python3.12/site-packages/torch_geometric/nn/models/graph_unet.py:107: in forward
    edge_index, edge_weight = self.augment_adj(edge_index, edge_weight,
/Users/imartin5/software/miniforge3/envs/cosipy_test/lib/python3.12/site-packages/torch_geometric/nn/models/graph_unet.py:143: in augment_adj
    adj = to_torch_csr_tensor(edge_index, edge_weight,
/Users/imartin5/software/miniforge3/envs/cosipy_test/lib/python3.12/site-packages/torch_geometric/utils/sparse.py:277: in to_torch_csr_tensor
    crow_indices=index2ptr(edge_index[0], size[0]),
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

index = tensor([ 0,  0,  0,  0,  0,  0,  0,  1,  1,  1,  1,  1,  1,  1,  2,  2,  2,  2,
         2,  2,  2,  3,  3,  3,  3,  3...8,  8,  8,  9,  9,  9,  9,  9,  9,  9, 10, 10,
        10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11], device='mps:0')
size = 12

    def index2ptr(index: Tensor, size: Optional[int] = None) -> Tensor:
        if size is None:
            size = int(index.max()) + 1 if index.numel() > 0 else 0
    
>       return torch._convert_indices_from_coo_to_csr(
            index, size, out_int32=index.dtype != torch.int64)
E       NotImplementedError: The operator 'aten::_convert_indices_from_coo_to_csr.out' is not currently implemented for the MPS device. If you want this op to be considered for addition please comment on https://github.com/pytorch/pytorch/issues/141287 and mention use-case, that resulted in missing op as well as commit hash 449b1768410104d3ed79d3bcfe4ba1d65c7f22c0. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.

Mohammad is able to run the tutorial (I haven't tried it myself), and I asked him to run pytest as well. He's getting a similar error:

test.py:9
  /Users/mabapple2025/Projects/NASA/COSI/ckarwin/cosipy/docs/tutorials/background_estimation/continuum_estimation/test.py:9: UserWarning: Sparse CSR tensor support is in beta state. If you miss a functionality in the sparse tensor support, please submit a feature request to https://github.com/pytorch/pytorch/issues. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/SparseCsrTensorImpl.cpp:51.)
    csr = torch.sparse_csr_tensor(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================================================== short test summary info ===========================================================
ERROR test.py - NotImplementedError: Could not run 'new_compressed_tensor' from the 'mps:0' device.)

I think the conclusion is that some of the operations in torch_geometric are simply not supported on an M chip, and these operations were exercised by the tests but not the tutorial. Here are a couple of related issues I found:
pyg-team/pytorch_geometric#9567
pyg-team/pytorch_geometric#8977

Now, to proceed:

  • @ckarwin The problem in particular seems to be with CSR sparse matrices. Is that something you use explicitly that could potentially be changed to COO, or do you not have control over this because it's all internal to torch_geometric?
  • I'll look into whether there's a way to run these tests using the GitHub VMs for the automatic pipelines.
  • We need to catch the M chip during installation and warn the user that this is not supported.
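On the CSR point: the failing call in the traceback, `aten::_convert_indices_from_coo_to_csr`, compresses the sorted COO row indices into the CSR row-pointer array. A torch-free numpy sketch of that conversion shows how little the operation itself involves (the function name here is illustrative):

```python
import numpy as np

def coo_rows_to_csr_ptr(row_indices, num_rows):
    """Convert sorted COO row indices into a CSR row-pointer array.

    COO stores one (row, col) pair per edge; CSR replaces the row
    indices with a pointer array of length num_rows + 1, where
    ptr[i]:ptr[i+1] spans row i's entries. This is what
    aten::_convert_indices_from_coo_to_csr does, and the op the
    MPS backend does not implement.
    """
    rows = np.asarray(row_indices)
    return np.searchsorted(rows, np.arange(num_rows + 1))

# Row indices of a tiny 3-node graph's edges, grouped by source node:
ptr = coo_rows_to_csr_ptr([0, 0, 1, 2, 2, 2], 3)  # → [0, 2, 3, 6]
```

Since `to_torch_csr_tensor` is called inside GraphUNet's `augment_adj`, the CSR layout is chosen internally by torch_geometric rather than by the caller, which is what makes a user-side COO workaround difficult.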

@ckarwin
Author

ckarwin commented Feb 19, 2026

Thanks for looking into this, @israelmcmc.

I suspect this is due to the use of the UNet (from torch_geometric), which is tested in pytest but not in the tutorial. In the main code, the model is defined here:
model = PyGGraphUNet(in_channels=1, hidden_channels=32, out_channels=1, depth=3).to(self.device)

The default model of the code is the GCN, and this is what I've mostly been testing so far. In the unit tests, you can see where I test the UNet in lines 43 - 48. Can you try commenting these lines out to see if it fixes the problem?

If that is the root of the problem, then it's something within the UNet that is not explicitly controlled. If we can trace it down to this, it may be possible to come up with a reasonable workaround if the UNet is really desired, or maybe we can just default to the GCN when the UNet is not compatible, e.g. on an M-chip Mac, as you suggested. What do you think?
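The fallback idea sketched above could be as simple as a platform check at model-selection time. Everything here is illustrative (the function name, the string model keys, and the exact gate are not cosipy API):

```python
import platform
import sys

def select_model_type(requested="gcn", on_apple_silicon=None):
    """Fall back from 'unet' to 'gcn' on Apple silicon, where
    torch_geometric's GraphUNet hits unimplemented sparse-CSR ops.

    on_apple_silicon can be passed explicitly (e.g. for testing);
    by default it is detected from the running system.
    """
    if on_apple_silicon is None:
        on_apple_silicon = (sys.platform == "darwin"
                            and platform.machine() == "arm64")
    if requested == "unet" and on_apple_silicon:
        # Warn rather than fail: the GCN path works everywhere
        print("GraphUNet is not supported on this system; "
              "falling back to GCN.")
        return "gcn"
    return requested
```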

@israelmcmc
Owner

Thanks @ckarwin. Yes, that makes sense. I'll try commenting out the UNet and report back. If that's the problem, we can indeed default to another model based on the system.

@mabp1992

mabp1992 commented Feb 24, 2026

I did the test as you mentioned, and that solved the problem: I changed 'unet' to 'gcn' and the unit test executed successfully. One solution could be to design our own UNet graph neural network that avoids layers not supported by MPS. The structure might be different, though.

    if gpu_count > 1:
        logger.info("Multi-GPU training not enabled; using a single GPU.")
    return torch.device('cuda')

It would be better to also handle the MPS architecture:

    elif torch.backends.mps.is_available():
        return torch.device('mps')

@ckarwin
Author


Hi @mabp1992, can you please clarify exactly where this line should go? For reference, the select_device method is copied below:

    def select_device(self):
        """Device selection with GPU compatibility check."""
        if torch.cuda.is_available():
            gpu_count = torch.cuda.device_count()
            logger.info(f"GPUs available: {gpu_count}")
            capability = torch.cuda.get_device_capability(0)
            major, minor = capability

            # Check compatibility (may need to be generalized):
            if major < 3 or (major == 3 and minor < 7):
                logger.info(f"[WARNING] GPU {torch.cuda.get_device_name(0)} "
                            f"(capability {major}.{minor}) is not supported by this PyTorch build.")
                logger.info("Falling back to CPU.")
                return torch.device('cpu')
            else:
                logger.info(f"Using GPU 0: {torch.cuda.get_device_name(0)} (capability {major}.{minor})")
                if gpu_count > 1:
                    logger.info("Multi-GPU training not enabled; using a single GPU.")
                return torch.device('cuda')
        else:
            logger.info("No GPU detected. Using CPU.")
            return torch.device('cpu')
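One possible branch ordering can be sketched torch-free by passing availability flags in explicitly; in the real method the MPS test would be `torch.backends.mps.is_available()`, and (per the discussion above) an MPS device may still need guards for the unsupported sparse ops. This is a sketch of the control flow only, not a drop-in replacement:

```python
def pick_device(cuda_ok, mps_ok, capability=(8, 0)):
    """Mirror select_device's branch order: CUDA first (with the
    capability check), then MPS on Apple silicon, then CPU.

    cuda_ok / mps_ok stand in for torch.cuda.is_available() and
    torch.backends.mps.is_available(); capability stands in for
    torch.cuda.get_device_capability(0).
    """
    if cuda_ok:
        major, minor = capability
        if major < 3 or (major == 3 and minor < 7):
            return "cpu"   # unsupported GPU -> fall back, as in the original
        return "cuda"
    if mps_ok:             # the branch @mabp1992 suggested adding
        return "mps"
    return "cpu"
```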

@ckarwin
Author

ckarwin commented Feb 24, 2026

Thanks for looking into this, @mabp1992! I'm glad to know it resolved the issue.

I agree that a long-term solution would be to design the UNet to avoid these types of issues. Actually, the UNet hasn't been tested much, and it likely needs some more development.

For the short term (with DC4 in view), I think we can do what @israelmcmc suggested, and default to another model based on the system.

@israelmcmc
Owner

Yeah, that sounds good. I think we'll run into multiple issues like this now that we're starting to use pytorch, GPU libraries, and other things designed for specific systems. While we develop all of this, I think it's enough to catch whether the user has an unsupported system (an M chip in this case) and, if so, either fall back or fail gracefully with a message explaining the limitation (so people don't spend time debugging a known limitation). The limitation should also be noted in the docstrings.
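The fail-gracefully option could look like the guard below, raised before any torch work starts so the user sees an explanation instead of a deep traceback. The function, its arguments, and the exact gate are illustrative; the defaults read the running system, and explicit arguments make it testable:

```python
import platform
import sys

def require_supported_backend(machine=None, system=None,
                              feature="GraphUNet-based in-painting"):
    """Fail fast with an explanatory error on known-unsupported systems.

    machine/system default to the running platform; pass them
    explicitly to exercise the check on any machine.
    """
    machine = machine if machine is not None else platform.machine()
    system = system if system is not None else sys.platform
    if system == "darwin" and machine == "arm64":
        raise RuntimeError(
            f"{feature} is not supported on Apple-silicon Macs: "
            "torch_geometric uses sparse-CSR ops that the MPS backend "
            "does not implement. Use the GCN model, or run on a machine "
            "with CUDA or an MKL-enabled CPU build of pytorch."
        )
```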

@israelmcmc
Owner

I merged interfaces into develop, so I'm closing this in favor of cositools#498.

@ckarwin @mabp1992 Please continue the discussion there.

@israelmcmc israelmcmc closed this Mar 9, 2026