Re-implement SYCL backend parallel_for to improve bandwidth utilization #1976
base: main
include/oneapi/dpl/pstl/utils.h
Outdated
{
    template <typename _Tp>
    void
    operator()(__lazy_ctor_storage<_Tp> __storage) const
Why do you pass the __storage parameter by value?
Great catch. I have made this an lvalue reference.
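To illustrate why the by-value signature matters here, the following is a minimal sketch using a hypothetical `lazy_storage` type (a simplified stand-in, not oneDPL's actual `__lazy_ctor_storage`). It assumes the functor is meant to destroy the object held in the caller's storage; with a by-value parameter, the destruction would act on a copy and the caller's storage would be left untouched.

```cpp
#include <cassert>
#include <new>
#include <utility>

// Hypothetical stand-in for a lazily constructed storage slot.
template <typename T>
struct lazy_storage
{
    alignas(T) unsigned char buf[sizeof(T)];
    bool constructed = false;

    template <typename... Args>
    void construct(Args&&... args)
    {
        ::new (static_cast<void*>(buf)) T(std::forward<Args>(args)...);
        constructed = true;
    }

    void destroy()
    {
        if (constructed)
        {
            reinterpret_cast<T*>(buf)->~T();
            constructed = false;
        }
    }
};

// Takes the storage by value: only a temporary copy is destroyed.
struct destroy_by_value
{
    template <typename T>
    void operator()(lazy_storage<T> storage) const
    {
        storage.destroy();
    }
};

// Takes the storage by lvalue reference: the caller's object is destroyed.
struct destroy_by_ref
{
    template <typename T>
    void operator()(lazy_storage<T>& storage) const
    {
        storage.destroy();
    }
};
```

In this sketch, calling `destroy_by_value` leaves the caller's `constructed` flag set, while `destroy_by_ref` clears it.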
__par_backend_hetero::access_mode::read_write>(
    __tag, ::std::forward<_ExecutionPolicy>(__exec), __first1, __last1, __first2, __f);
auto __n = __last1 - __first1;
if (__n <= 0)
In what case is __n < 0 true?
Never, if a valid sequence is passed :) I switched to __n == 0.
// Path that intentionally disables vectorization for algorithms with a scattered access pattern (e.g. binary_search)
template <typename... _Ranges>
class walk_scalar_base
Why is walk_scalar_base declared as a class, but
template <typename _ExecutionPolicy, typename _F, typename _Range>
struct walk1_vector_or_scalar : public walk_vector_or_scalar_base<_Range>
declared as a struct?
I have made them all structs for consistency.
__vector_path(_IsFull __is_full, const _ItemId __idx, _Range __rng) const
{
    // This is needed to enable vectorization
    auto __raw_ptr = __rng.begin();
- I think that __raw_ptr isn't a very good name, because begin() is usually associated with an iterator, whereas "raw" usually suggests a pointer.
- Do we really need the local variable __raw_ptr here? Can we pass __rng.begin() directly into the __vector_walk call instead of that variable?
In the contexts in which we vectorize, begin() does return pointers, but I agree the name is confusing.
I have addressed this in a different way due to a performance issue. With uint8_t types, I found that the compiler was not properly vectorizing, even when calling begin() on the set of ranges within the kernel, leading to performance regressions (about 30% slower than where we should be). Calling begin() from the host and passing the result to the submitter for use in the kernel resolves the issue and gives us good performance.
Since begin() is called on all ranges and passed through the bricks from the submitter, I have switched from the _Rng naming to _Acc here, as the underlying type may not be a range. Additional template types are also needed.
Update
Please see the comment: #1976 (comment). All of the begin() calls in this context have been removed.
So now we have 3 entities with defined
Do these constexpr variables really have different semantics? If the semantics of these entities are the same, maybe it makes sense to do some redesign so that we have only one entity.
In some moments the implementation details remind me
But what if, instead of two different functions
template <typename _IsFull, typename _ItemId>
void
__vector_path(_IsFull __is_full, const _ItemId __idx, _Range __rng) const
{
    // This is needed to enable vectorization
    auto __raw_ptr = __rng.begin();
    oneapi::dpl::__par_backend_hetero::__vector_walk<__base_t::__preferred_vector_size>{__n}(__is_full, __idx, __f,
                                                                                              __raw_ptr);
}
// _IsFull is ignored here. We assume that boundary checking has been already performed for this index.
template <typename _IsFull, typename _ItemId>
void
__scalar_path(_IsFull, const _ItemId __idx, _Range __rng) const
{
    __f(__rng[__idx]);
}
we have two functions with the same name and the same format, except for the first parameter type, which will be used as some
Please take a look at
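One possible reading of this suggestion is overload selection on a tag type: a single function name with two overloads that differ only in their first parameter. The sketch below assumes `std::true_type` / `std::false_type` as the tag and uses hypothetical names (`walk1_sketch`, `path`, `increment_all`); it is not the design adopted in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical brick: one overloaded member function, selected by a tag type.
template <typename F>
struct walk1_sketch
{
    F f;
    std::size_t n;
    static constexpr std::size_t vec_size = 4; // illustrative value only

    // Vector path: each index covers up to vec_size consecutive elements.
    template <typename T>
    void path(std::true_type, std::size_t idx, T* data) const
    {
        for (std::size_t i = idx; i < idx + vec_size && i < n; ++i)
            f(data[i]);
    }

    // Scalar path: each index covers exactly one element.
    template <typename T>
    void path(std::false_type, std::size_t idx, T* data) const
    {
        f(data[idx]);
    }
};

// Increments every element, choosing the path at compile time.
template <bool Vectorize>
std::vector<int> increment_all(std::vector<int> v)
{
    auto inc = [](int& x) { ++x; };
    using brick_t = walk1_sketch<decltype(inc)>;
    brick_t brick{inc, v.size()};
    constexpr std::size_t step = Vectorize ? brick_t::vec_size : 1;
    for (std::size_t idx = 0; idx < v.size(); idx += step)
        brick.path(std::integral_constant<bool, Vectorize>{}, idx, v.data());
    return v;
}
```

Both instantiations produce the same result; only the amount of work assigned to each index differs.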
One more point:
First round of review. I've not gotten to all the details yet, but this is enough to be interesting.
These three cases are all unique when you consider that they define
The three unique cases I mention are the following:
Signed-off-by: Matthew Michel <[email protected]>
We were mixing some unique kernel naming strategies in this test (integer call numbers + wrapper classes). The wrapper class approach is the clearest to me here, so I have switched to use it throughout.
CPU devices have a large max work-group size (e.g. 8192), which has led to excessive test times in some Windows + CPU + Debug CI tests. The SYCL parallel for implementation only uses a work-group size of 512, so we limit it to this.
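The cap described in this commit message can be sketched as a simple clamp. The helper name `test_work_group_size` and the constant name are hypothetical; only the values 512 and 8192 come from the message above. In SYCL code the device limit would typically come from `device.get_info<sycl::info::device::max_work_group_size>()`, which is passed in here as a plain integer so the sketch needs no SYCL runtime.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Work-group size used by the parallel for implementation, per the commit message.
constexpr std::size_t parallel_for_wg_size = 512;

// Hypothetical helper: size the test for the smaller of the device limit and
// the size the implementation actually uses.
inline std::size_t test_work_group_size(std::size_t device_max_wg_size)
{
    return std::min(device_max_wg_size, parallel_for_wg_size);
}
```

A CPU device reporting 8192 would then be tested with 512, while a device with a smaller limit keeps its own value.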
…n of functor objects
High Level Description
This PR improves hardware bandwidth utilization of oneDPL's SYCL backend parallel for pattern through two ideas:
Implementation Details
binary_search)
To implement this approach, the parallel for kernel rewrite from #1870 was adopted with additional changes to handle vectorization paths. Additionally, generic vectorization and strided loop utilities have been defined with the intention that they be applicable to other portions of the codebase as well. Tests have been expanded to ensure coverage of vectorization paths.
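As a rough illustration of how a vectorized walk and a strided loop can fit together, here is a host-only sketch in plain C++ that emulates work items with an outer loop. All names (`vector_walk_sketch`, `strided_for_sketch`, `doubled`) and the vector size of 4 are assumptions for illustration, and the full/partial check is a runtime `bool` here, whereas the snippets quoted in the review pass it as a type (`_IsFull`). This is not the PR's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Applies f to up to VecSize consecutive elements starting at idx.
template <std::size_t VecSize, typename F, typename T>
void vector_walk_sketch(bool is_full, std::size_t idx, std::size_t n, F f, T* data)
{
    if (is_full)
    {
        // Fixed trip count: the form a compiler can most easily vectorize.
        for (std::size_t i = 0; i < VecSize; ++i)
            f(data[idx + i]);
    }
    else
    {
        // Tail: fewer than VecSize elements remain, so check bounds.
        for (std::size_t i = idx; i < n; ++i)
            f(data[i]);
    }
}

// Emulates num_items work items; each handles one vector-sized chunk and then
// strides ahead by the total number of elements covered per pass.
template <std::size_t VecSize, typename F, typename T>
void strided_for_sketch(std::size_t num_items, std::size_t n, F f, T* data)
{
    const std::size_t stride = num_items * VecSize;
    for (std::size_t item = 0; item < num_items; ++item)
    {
        for (std::size_t idx = item * VecSize; idx < n; idx += stride)
        {
            const bool is_full = idx + VecSize <= n;
            vector_walk_sketch<VecSize>(is_full, idx, n, f, data);
        }
    }
}

// Doubles every element using a vector size of 4.
inline std::vector<int> doubled(std::vector<int> v, std::size_t num_items)
{
    strided_for_sketch<4>(num_items, v.size(), [](int& x) { x *= 2; }, v.data());
    return v;
}
```

Every element is visited exactly once regardless of how many emulated work items are used, which is the property a strided loop needs in order to decouple the launch size from the input size.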
This PR supersedes #1870. Initially, the plan was to merge this PR into #1870, but after comparing the diffs, I believe the most straightforward approach is to target this directly to main.