Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Special case bool dereferencing (consistent with PyTorch) #410

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

jrbyrnes
Copy link

@jrbyrnes jrbyrnes commented Feb 9, 2023

A PyTorch Test has (roughly) the following implementation

  uint8_t *x = // some data 0 or 2-255 // host code
  void *temp = x; // host code
  bool *val = static_cast<bool *>(temp); // host code
  // host || device code -- dereference and use 

This is invoking undefined behavior.

This PR identified this problem and resolved it for the CPU case. Basically it special cases dereferencing bool * (via c10::load). And this c10::load is used when iteratively performing the nonzero op on the tensor data.

This patch extends the fix to our iterators.

@nolmoonen
Copy link
Collaborator

@jrbyrnes bool* val = static_cast<bool*>(temp) breaks strict aliasing rules and is undefined behavior. If PyTorch wants to invoke undefined behavior, that's up to PyTorch, but it isn't the responsibility of rocPRIM to account for this.

To work around your problem, you could create a rocprim::transform_iterator to wrap the val pointer, and pass that to rocPRIM.

@jrbyrnes
Copy link
Author

Hi @nolmoonen , thanks for your thoughts! In fact, pushing it back to PyTorch was my first response. However, there has been a specific ask to solve via ROCm since the same behavior (i.e. alias violation) results in test pass transparently for CUDA, yet fails for HIP (ref SWDEV-357998). However, if you are fundamentally opposed to this patch, please confirm and I will see what I can do in PyTorch. Thanks.

@bcahoon
Copy link

bcahoon commented Feb 14, 2023

Hi @nolmoonen, my understanding is that bool *val = static_cast<bool *>(temp) doesn't break the strict aliasing rule if the underlying object type is a character type. In code segment above, it depends on underlying object assigned to x. That said, I'm not implying that rocPRIM is the correct place to fix this issue, as I'm sure there are other users of transform_iterator that would be impacted by this change.

@nolmoonen
Copy link
Collaborator

We are currently investigating the consequences of applying the PR.

Aside from that:
@bcahoon bool* val = static_cast<bool*>(temp) breaks the strict aliasing rule, even if the underlying object type is a character type. The following is would be allowed:

bool* x = // some data, true or false
uint8_t* tmp = reinterpret_cast<uint8_t*>(x);
// dereference tmp and write 0 if evals to false, 2-255 if evals to true, but do check that sizeof(bool) == sizeof(uint8_t)
// dereference x and check that it is equal to original array

https://en.cppreference.com/w/cpp/language/reinterpret_cast section "Type aliasing" states these rules.

@nolmoonen
Copy link
Collaborator

Sorry, correction: the example I gave is also not allowed. While manipulating the data through the char* is allowed, it is not allowed to write a value outside of the range of values that the bool can represent.

@doctorcolinsmith
Copy link
Collaborator

We think this may be fixed in pytorch and therefore not require any change to rocPRIM. @stanleytsang-amd will confirm.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants