merge back 6.3 hotfixes to develop (#634)
* Remove website URL from comments (#600)

Referencing or using code from some websites is prohibited in this repository.
This change removes an informational reference in the comments.

* Fix rare memory access faults when using internal serial merge (#597)

* test: add tests for internal serial merge function

* refactor(detail/merge_path.hpp): removed code duplication

* fix(detail/merge_path.hpp): stricter boundary checking in serial merge

* fix(detail/block_sort_merge.hpp): fix missing block-wide sync

After a previous refactor, serial_merge no longer performed a block sync. This has now been re-added.

* feat: add unsafe variant of serial merge

* fix: use bounded version for serial merge to fix rare page faults

* test(test_internal_merge_path): clean up internal merge path tests

* style: standardize range_t<> construction

* fix(detail/merge_path.hpp): fix 'range_t<>::count1()' and 'range_t<>::count2()' return types to be same as encapsulated type

* perf(detail/merge_path.hpp): use const ref in function parameters

* refactor(detail/merge_path.hpp): replace redundant use of 'OffsetT' with 'unsigned int'

* chore: update changelog

* fix: restore missing thread sync

This got removed during a rebase.

* Add gfx1151 target (#601) (#603)

Co-authored-by: Stanley Tsang <[email protected]>

* Merge back 6.2 hotfixes (#607) (#620)

* Update dependency names for static builds (#557)

This also removes the line setting `BUILD_SHARED_LIBS` to `ON`, which was previously required to get the correctly named packages when not specifically compiling for a static build. Updates to the ROCmCMakeBuildTools (rocm-cmake) should mean this is no longer necessary.

* Fix BUILD_SHARED_LIBS for packaging (#558)

* Fix the dependencies of the static packages (#563)

* cmake: don't set CMAKE_C_COMPILER, as rocPRIM is a CXX project (#567)

* add developer guidelines (#555) (#574)



* Update Read the Docs config to Python 3.10 and latest rocm-docs-core (#564) (#579)

* Cherry-pick: Optimize block_reduce_warp_reduce when block size is the same as warp size (#599)

* Optimize block_reduce_warp_reduce when block size == warp size

* Make conditional constexpr

* Fix conflict in concepts.rst

---------

Co-authored-by: Lauren Wrubleski <[email protected]>
Co-authored-by: Steve Leung <[email protected]>
Co-authored-by: randyh62 <[email protected]>
Co-authored-by: Nol Moonen <[email protected]>
Co-authored-by: Sam Wu <[email protected]>

* Changed precondition for edge case in serial_merge to prevent assertion error (#622)

* added std::min to ensure no out-of-bounds access

* fixed typo keys->keys1

* updated changelog

* reverted std::min

* implemented suggested logic

* edited to conform to standards (#618)

* Memory leak fix for multiple rocPRIM unit tests (#614)

* fixed mem leak in test_config_dispatch.cpp

* added missing hip free for method==4 in test_block_scan.kernels

* added GraphHelper class that does not cause memory leak due to using hipGraphCreate

* replaced old hipGraph helpers with new class in device_bin_search

* changed HIP_CHECK_NON_VOID to HIP_CHECK

* fixed mem leak in device_bin_search

* added additional functions

* changed out old calls to hipGraphCreate to new GraphHelper class

* added missing stream sync for hipgraph_algs

* n

* added missing hipFree and HIP_CHECK for lookback_reproducibility

* added missing hipFree in test_discard_iterator

* fixed test failures

* removed extra hipFree

* removed unused variables

* updated change log

* removed redundant function

---------

Co-authored-by: Your Name <[email protected]>
Co-authored-by: root <[email protected]>

* changed to using gHelper

* merged 6.3 version of merge_sort_merge_path (failing large indices test)

* updated mergepath to have fix for large indices

* updated the changelog for 6.3 (#632)

* updated changelog for test_utils_hipgraph change

* updated changelog

---------

Co-authored-by: Wayne Franz <[email protected]>
Co-authored-by: Nara <[email protected]>
Co-authored-by: amd-garydeng <[email protected]>
Co-authored-by: Stanley Tsang <[email protected]>
Co-authored-by: Lauren Wrubleski <[email protected]>
Co-authored-by: Steve Leung <[email protected]>
Co-authored-by: randyh62 <[email protected]>
Co-authored-by: Nol Moonen <[email protected]>
Co-authored-by: Sam Wu <[email protected]>
Co-authored-by: spolifroni-amd <[email protected]>
Co-authored-by: Your Name <[email protected]>
Co-authored-by: root <[email protected]>
13 people authored Nov 1, 2024
1 parent b7af962 commit 44159bd
Showing 34 changed files with 1,066 additions and 957 deletions.
47 changes: 24 additions & 23 deletions CHANGELOG.md
@@ -1,7 +1,6 @@
# Changelog for rocPRIM

Documentation for rocPRIM is available at
[https://rocm.docs.amd.com/projects/rocPRIM/en/latest/](https://rocm.docs.amd.com/projects/rocPRIM/en/latest/).
Full documentation for rocPRIM is available at [https://rocm.docs.amd.com/projects/rocPRIM/en/latest/](https://rocm.docs.amd.com/projects/rocPRIM/en/latest/).

## (Unreleased) rocPRIM 3.4.0 for ROCm 6.4.0

@@ -19,41 +18,43 @@ Documentation for rocPRIM is available at

### Upcoming changes

## (Unreleased) rocPRIM-3.3.0 for ROCm 6.3.0
## rocPRIM 3.3.0 for ROCm 6.3.0

### Added

* Add --test smoke option in rtest.py. It will run a subset of tests such that the total test time is in 5 minutes. Use python3 ./rtest.py --test smoke or python3 ./rtest.py -t smoke to execute smoke test.
* Option `--seed` to benchmarks to specify a seed for the generation of random inputs. The default behavior is to keep using a random seed per benchmark measurement.
* Added configuration autotuning to device partition (`rocprim::partition`, `rocprim::partition_two_way`, and `rocprim::partition_three_way`), device select (`rocprim::select`, `rocprim::unique`, and `rocprim::unique_by_key`), and device reduce by key (`rocprim::reduce_by_key`) for improved performance on selected architectures.
* Added `rocprim::uninitialized_array` which provides uninitialized storage in local memory for user-defined types.
* The `--test smoke` option has been added to `rtest.py`. When `rtest.py` is called with this option it runs a subset of tests such that the total test time is 5 minutes. Use `python3 ./rtest.py --test smoke` or `python3 ./rtest.py -t smoke` to run the smoke test.
* The `--seed` option has been added to `run_benchmarks.py`. The `--seed` option specifies a seed for the generation of random inputs. When the option is omitted, the default behavior is to use a random seed for each benchmark measurement.
* Added configuration autotuning to device partition (`rocprim::partition`, `rocprim::partition_two_way`, and `rocprim::partition_three_way`), to device select (`rocprim::select`, `rocprim::unique`, and `rocprim::unique_by_key`), and to device reduce by key (`rocprim::reduce_by_key`) to improve performance on selected architectures.
* Added `rocprim::uninitialized_array` to provide uninitialized storage in local memory for user-defined types.
* Added large segment support for `rocprim::segmented_reduce`.
* Added a parallel `nth_element` device function similar to `std::nth_element`, this function rearranges elements smaller than the n-th before and bigger than the n-th after the n-th element.
* Added a parallel `nth_element` device function similar to `std::nth_element`. `nth_element` places elements that are smaller than the nth element before the nth element, and elements that are bigger than the nth element after the nth element.
* Added deterministic (bitwise reproducible) algorithm variants `rocprim::deterministic_inclusive_scan`, `rocprim::deterministic_exclusive_scan`, `rocprim::deterministic_inclusive_scan_by_key`, `rocprim::deterministic_exclusive_scan_by_key`, and `rocprim::deterministic_reduce_by_key`. These provide run-to-run stable results with non-associative operators such as float operations, at the cost of reduced performance.
* Added a parallel `partial_sort` and `partial_sort_copy` device function similar to `std::partial_sort` and `std::partial_sort_copy`, these functions rearranges elements such that the elements are the same as a sorted list up to and including the middle index.
* Added support of sizes larger than 2^32 in `device_merge_sort`.
* Added parallel `partial_sort` and `partial_sort_copy` device functions similar to `std::partial_sort` and `std::partial_sort_copy`. `partial_sort` and `partial_sort_copy` arrange elements such that the elements are in the same order as a sorted list up to and including the middle index.
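
The postconditions described for `nth_element` and `partial_sort` can be illustrated on the host with their standard-library counterparts. This is only a sketch of the semantics, not of the rocPRIM device API: the rocPRIM signatures are not shown in this changelog, so none are assumed, and the helper names below (`nth_element_reference`, `partial_sort_reference`, `is_partitioned_around_nth`) are hypothetical. Note that the changelog describes the middle index as inclusive, so the sketch sorts `middle + 1` elements, whereas `std::partial_sort` treats its `middle` iterator as exclusive.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Checks the nth_element postcondition: nothing before index n is greater
// than v[n], and nothing after index n is smaller than v[n].
inline bool is_partitioned_around_nth(const std::vector<int>& v, std::size_t n)
{
    assert(n < v.size());
    const bool left_ok
        = std::all_of(v.begin(), v.begin() + n, [&](int x) { return x <= v[n]; });
    const bool right_ok
        = std::all_of(v.begin() + n + 1, v.end(), [&](int x) { return x >= v[n]; });
    return left_ok && right_ok;
}

// Host reference: after the call, index n holds the value it would hold in a
// fully sorted sequence. The order inside each side is unspecified.
inline std::vector<int> nth_element_reference(std::vector<int> v, std::size_t n)
{
    assert(n < v.size());
    std::nth_element(v.begin(), v.begin() + n, v.end());
    return v;
}

// Host reference with an inclusive middle index, as worded in the changelog:
// indices [0, middle] match the fully sorted sequence.
inline std::vector<int> partial_sort_reference(std::vector<int> v, std::size_t middle)
{
    assert(middle < v.size());
    std::partial_sort(v.begin(), v.begin() + middle + 1, v.end());
    return v;
}
```

For the input `{9, 1, 8, 2, 7, 3, 6, 4, 5}`, `nth_element_reference(in, 4)` places `5` at index 4, and `partial_sort_reference(in, 2)` starts with `1, 2, 3`; the remaining elements may appear in any order.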

### Changed

* Modified the input size in device adjacent difference benchmarks. Observed performance with these benchmarks might be different.
* Changed the default seed for `device_benchmark_segmented_reduce`.
* Changed `test_utils_hipgraphs.hpp` to be a class `GraphHelper` with internal graph and graph instances

### Resolved issues
### Removed

* Fixed an issue in rtest.py where if the build folder was made without release or debug directory it would crash the program
* Fixed an issue where while running rtest.py on windows and passing in an absolute path to `--install_dir` causes a `FileNotFound` error.
* rocPRIM functions are no longer forcefully inlined on Windows, significantly reducing the build
time in debug builds.
* `block_load`, `block_store`, `block_shuffle`, `block_exchange` and `warp_exchange` now use placement `new` instead of copy
assignment (`operator=`) when writing to local memory. This fixes the behavior of custom types with non-trivial copy assignments.
* Fixed a bug in the generation of input data for benchmarks, which caused incorrect performance to be reported in specific cases. It may affect the reported performance for one-byte types (`uint8_t` and `int8_t`) and instantiations of `custom_type`. Specifically, device binary search, device histogram, device merge and warp sort are affected.
* Fixed a bug for `rocprim::merge_path_search` where using `unsigned` offsets would output wrong results.
* Fixed a bug for `rocprim::thread_load` and `rocprim::thread_store` where `float` and `double` were not casted to the correct type resulting in wrong results.
* Fix tests failing when compiling with `-D_GLIBCXX_ASSERTIONS=ON`.
* `rocprim::thread_load()` and `rocprim::thread_store()` have been deprecated. Use `dereference()` instead.

### Upcoming changes
### Resolved issues

* `rocprim::thread_load` and `rocprim::thread_store` will be deprecated. Use dereference instead. Not all of those functions are available on every device architecture, and their usage can hurt performance, because inline assembly inhibits optimizations.
* Resolved an issue in `rtest.py` where it crashed if the `build` folder was created without `release` or `debug` subdirectories.
* Resolved an issue with `rtest.py` on Windows where passing an absolute path to `--install_dir` caused a `FileNotFound` error.
* rocPRIM functions are no longer forcefully inlined on Windows. This significantly reduces the build time of debug builds.
* `block_load`, `block_store`, `block_shuffle`, `block_exchange`, and `warp_exchange` now use placement `new` instead of copy assignment (`operator=`) when writing to local memory. This fixes the behavior of custom types with non-trivial copy assignments.
* Fixed a bug in the generation of input data for benchmarks, which caused incorrect performance to be reported in specific cases. It may affect the reported performance for one-byte types (`uint8_t` and `int8_t`) and instantiations of `custom_type`. Specifically, device binary search, device histogram, device merge and warp sort are affected.
* Fixed a bug for `rocprim::merge_path_search` where using `unsigned` offsets would produce incorrect results.
* Fixed a bug for `rocprim::thread_load` and `rocprim::thread_store` where `float` and `double` were not cast to the correct type, resulting in incorrect results.
* Resolved an issue where tests were failing when they were compiled with `-D_GLIBCXX_ASSERTIONS=ON`.
* Resolved an issue where algorithms that used an internal serial merge routine caused a memory access fault that resulted in potential performance drops when using block sort, device merge sort (block merge), device merge, device partial sort, and device sort (merge sort).
* Fixed memory leaks in unit tests due to missing calls to `hipFree()` and the incorrect use of hipGraphs.
* Fixed an issue where certain inputs to `block_sort_merge()`, `device_merge_sort_merge_path()`, `device_merge()`, and `warp_sort_stable()` caused an assertion error during the call to `serial_merge()`.
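
The placement `new` entry above can be sketched on the host as follows. Copy assignment assumes the destination already holds a live object, which is not the case for uninitialized storage, so a type with a non-trivial `operator=` can misbehave; constructing in place avoids calling `operator=` at all. The type `tracked` and the helper `store_uninitialized` are hypothetical names for illustration and are not part of rocPRIM.

```cpp
#include <cassert>
#include <new>

// Call counters kept in function-local statics so the sketch has no globals.
inline int& copy_constructions() { static int n = 0; return n; }
inline int& copy_assignments() { static int n = 0; return n; }

// Hypothetical user-defined type with a non-trivial copy assignment.
struct tracked
{
    int value;

    explicit tracked(int v) : value(v) {}
    tracked(const tracked& other) : value(other.value) { ++copy_constructions(); }
    tracked& operator=(const tracked& other)
    {
        value = other.value;
        ++copy_assignments();
        return *this;
    }
};

// Hypothetical helper: starts the lifetime of a T in raw storage by
// copy-constructing it there, instead of assigning to a not-yet-live object.
template<class T>
T* store_uninitialized(void* raw, const T& src)
{
    assert(raw != nullptr);
    return ::new (raw) T(src);
}
```

Called on a suitably aligned byte buffer, `store_uninitialized` runs the copy constructor exactly once and never touches `operator=`.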

## rocPRIM-3.2.1 for ROCm 6.2.1

33 changes: 20 additions & 13 deletions rocprim/include/rocprim/block/detail/block_sort_merge.hpp
@@ -25,7 +25,6 @@
#include "../../detail/merge_path.hpp"
#include "../../detail/various.hpp"
#include "../../warp/detail/warp_sort_stable.hpp"
#include "../../warp/warp_sort.hpp"

BEGIN_ROCPRIM_NAMESPACE

@@ -385,10 +384,14 @@ class block_sort_merge
diag0_local,
compare_function);
const unsigned int keys2_beg_local = diag0_local - keys1_beg_local;
range_t range_local
= {keys1_beg_local + keys1_beg, keys1_end, keys2_beg_local + keys1_end, keys2_end};

serial_merge(keys_shared, thread_keys, range_local, compare_function);
range_t<> range_local{keys1_beg_local + keys1_beg,
keys1_end,
keys2_beg_local + keys1_end,
keys2_end};

serial_merge<false>(keys_shared, thread_keys, range_local, compare_function);
::rocprim::syncthreads();
}
}

@@ -426,15 +429,19 @@ class block_sort_merge
diag0_local,
compare_function);
const unsigned int keys2_beg_local = diag0_local - keys1_beg_local;
range_t range_local
= {keys1_beg_local + keys1_beg, keys1_end, keys2_beg_local + keys1_end, keys2_end};

serial_merge(keys_shared,
thread_keys,
values_shared,
thread_values,
range_local,
compare_function);

range_t<> range_local{keys1_beg_local + keys1_beg,
keys1_end,
keys2_beg_local + keys1_end,
keys2_end};

serial_merge<false>(keys_shared,
thread_keys,
values_shared,
thread_values,
range_local,
compare_function);
::rocprim::syncthreads();
}
}
};
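
The hunks above pass a four-field range to `serial_merge<false>` and then synchronize the block. The changed `merge_path.hpp` is not shown here, so the following is only a host-side sketch of what a bounds-checked serial merge could look like, under stated assumptions: `merge_range` and `bounded_serial_merge` are hypothetical names, the field order mirrors the braced initializer in the diff, and the meaning of the `false` template argument is not modelled. The block-wide `::rocprim::syncthreads()` has no host equivalent and is omitted.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical stand-in for the four-field range in the diff:
// [begin1, end1) indexes the first sorted run, [begin2, end2) the second.
struct merge_range
{
    unsigned int begin1, end1, begin2, end2;

    unsigned int count1() const { return end1 - begin1; }
    unsigned int count2() const { return end2 - begin2; }
};

// Merges up to N items from the two runs of `keys` into `out`.
// Each read is guarded by its run's end index, so no element outside
// either run is ever loaded, even when one run is exhausted early.
template<class T, std::size_t N, class Compare>
void bounded_serial_merge(const T* keys, std::array<T, N>& out, merge_range r, Compare comp)
{
    unsigned int i = r.begin1;
    unsigned int j = r.begin2;
    for(std::size_t k = 0; k < N; ++k)
    {
        const bool has1 = i < r.end1;
        const bool has2 = j < r.end2;
        if(!has1 && !has2)
        {
            break; // both runs exhausted: leave the remaining outputs untouched
        }
        // Take from the second run only when it is strictly smaller, which
        // keeps the merge stable. Short-circuiting prevents comp from reading
        // keys[i] or keys[j] past the end of a run.
        const bool take2 = has2 && (!has1 || comp(keys[j], keys[i]));
        out[k] = take2 ? keys[j++] : keys[i++];
    }
}
```

With `keys = {1, 4, 7, 2, 3, 9}` and the range `{0, 3, 3, 6}`, a four-item output receives `1, 2, 3, 4`. When fewer than `N` items remain in the two runs, the trailing outputs keep their previous values.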