Develop stream 2024-10-29 #631

Open
wants to merge 142 commits into base: develop
fdf01cf
remove HIP-CPU support
Snektron Sep 9, 2024
b8037ab
Resolve: IssueMove ROCPRIM_DETAIL_HIP_SYNC_AND_RETURN_ON_ERROR to sep…
cenxuantian Sep 5, 2024
170ff5c
rebase and add RETURN_ON_ERROR to the header
cenxuantian Sep 10, 2024
d0d46bf
Added naive implementation for adjacent_find plus tests and benchmarks
Beanavil Aug 23, 2024
feda114
Improved benchmark by only taking into account relevant processed ele…
Beanavil Aug 27, 2024
cfe1849
Use a faster reduction operation
Beanavil Aug 23, 2024
a0d4243
Added block-reduction kernel with early exit
Beanavil Aug 23, 2024
32bad70
Improved test with random first pair
Beanavil Aug 27, 2024
b492542
Get grid_size for maximum occupancy
Beanavil Aug 27, 2024
8e4c886
Improved test coverage
Beanavil Aug 29, 2024
f279aee
Implement early exit with sequential blocks execution
Beanavil Aug 27, 2024
398c92f
Use a dynamic tile_id as in find_first_of for faster stable results
Beanavil Sep 2, 2024
8691694
Added documentation for adjacent_find
Beanavil Sep 4, 2024
111f8dd
Added tuning for adjacent_difference
Beanavil Sep 5, 2024
a873498
Modified tuning so that non-arithmetic types use default configs
Beanavil Sep 5, 2024
d4e75fc
Changed initialization mechanism of kernel's output element
Beanavil Sep 5, 2024
0cbab2e
Fixed tests from review comments
Beanavil Sep 9, 2024
9581dbe
Simplified input transform logic
Beanavil Sep 9, 2024
0f29bff
Added tuned configs
Beanavil Sep 13, 2024
90837c4
Removed duplicated ROCPRIM_DETAIL_HIP_SYNC_AND_RETURN_ON_ERROR
Beanavil Sep 16, 2024
0b49bd7
Resolve "Refactor benchmarks to use a byte-based size"
MiloLurati Sep 17, 2024
40792ae
Added a rocprim::numeric_limits to support uint128 and int128 and cha…
NB4444 Sep 6, 2024
837a913
Create generate_limit to ensure floating point custom types are handl…
NB4444 Sep 9, 2024
0af9e32
Add rocprim::numeric_limits to numeric_limits_custom_test_type
NB4444 Sep 10, 2024
697f4ed
Expected output fix block_radix_sort test for custom_test_type<float>…
NB4444 Sep 12, 2024
f6785b8
Docs fix numeric_limits
NB4444 Sep 13, 2024
2e480b7
Added numeric_limits to changelog
NB4444 Sep 13, 2024
5742e67
Added a rocprim::uint128_t and rocprim::int128_t
NB4444 Sep 13, 2024
470721d
Implemented find_end with tests and benchmark
NB4444 Aug 16, 2024
23b5f26
Updated find_end benchmark with generate_limits
NB4444 Aug 21, 2024
a2bb18b
Added different input pattern for benchmark and added multiple items per
NB4444 Aug 21, 2024
1f90e8f
Added different key_size to tests for find_end
NB4444 Aug 26, 2024
7df0beb
Added shared memory kernel for find_end
NB4444 Aug 23, 2024
75fbcde
Changed find_end to search with reverse iterator
NB4444 Aug 30, 2024
1017453
Added tests for different compare function
NB4444 Aug 30, 2024
8bd5161
Change benchmark to no longer early exit and choosing shared mem kern…
NB4444 Sep 3, 2024
5eeb19d
Extra check search kernel to prevent unnessary global search
NB4444 Sep 3, 2024
f696c3c
Documentation for find_end
NB4444 Sep 3, 2024
3d53fce
Changed find_end to make it easier to create search
NB4444 Sep 4, 2024
5bd3b32
Fix docs errors find_end
NB4444 Sep 4, 2024
52a6c14
Changes for reviews find_end
NB4444 Sep 6, 2024
2763a84
Fix rebasing issues find_end
NB4444 Sep 9, 2024
acfa119
Added find_end to rocprim header
NB4444 Sep 9, 2024
eb795f8
Fix build error after adding headers
NB4444 Sep 9, 2024
aa53a4f
Use byte-based size in benchmark
NB4444 Sep 13, 2024
dcc6894
Remove double defines
NB4444 Sep 16, 2024
b904bc4
Added search function with tests and benchmark
NB4444 Sep 9, 2024
694eb8d
Fix documentation find_end and search
NB4444 Sep 26, 2024
cf98de8
Add device_search to rocprim.hpp header
NB4444 Sep 27, 2024
ef55bc1
add device_ptr usility
Snektron Sep 30, 2024
0083417
replace high_resolution_clock with steady_clock
Snektron Sep 30, 2024
3d2f0b4
fix values_input generation in segmented radix sort benchmark
Snektron Sep 30, 2024
94b2e8b
properly namespace ROCPRIM_RETURN_ON_ERROR
Snektron Sep 30, 2024
b3c6994
Set c++ version to 17 and create warning
NB4444 Oct 25, 2024
9b6dc7a
Fix no_discard warning c++17
NB4444 Oct 25, 2024
7d2484b
Set CI tests to c++14
NB4444 Oct 25, 2024
e1f1dec
Build for both c++ 14 and 17
NB4444 Oct 25, 2024
122d1ce
Add large sizes test to device_radix_sort
NB4444 Oct 25, 2024
67c5868
Added more test coverage segmented_radix_sort
NB4444 Oct 25, 2024
b64973c
fix not working with const_iterators
parbenc Oct 25, 2024
7dd865a
fix: use bytes instead of size for scan tuning benchmarks
Beanavil Oct 25, 2024
4a20eba
fix: use bytes instead of size for segmented_radix_sort tuning benchm…
Beanavil Oct 25, 2024
5fdf088
Resolve "Partial sort optimization: make use of radix sort"
MiloLurati Oct 25, 2024
f6b75c5
doc: address the upper bound restrictions on Channels for device_hist…
Beanavil Oct 25, 2024
a88d8fb
doc: explicitly state that ActiveChannels is bounded by Channels
Beanavil Oct 25, 2024
6b84c28
batch memcpy tests with random seed
yssamtu Oct 25, 2024
eec2295
follow clang format
yssamtu Oct 25, 2024
12439bf
add newline at the end
yssamtu Oct 25, 2024
516e4b0
make rocprim::reverse_iterator align with that of std
yssamtu Oct 25, 2024
1233e6a
minor change
yssamtu Oct 25, 2024
56a793a
add constexpr
yssamtu Oct 25, 2024
79cb4c6
adjust format
yssamtu Oct 25, 2024
23fcd70
add warnings
yssamtu Oct 25, 2024
ede11d9
adjust format
yssamtu Oct 25, 2024
2303978
change the way of triggering warnings
yssamtu Oct 25, 2024
9d462bf
adjust format
yssamtu Oct 25, 2024
f91747a
minor change
yssamtu Oct 25, 2024
84a86ba
adjust format
yssamtu Oct 25, 2024
bffd719
clear warnings
yssamtu Oct 25, 2024
993c37f
adjust format
yssamtu Oct 25, 2024
9458906
correct warning behaviours
yssamtu Oct 25, 2024
00ecf2e
adjust format
yssamtu Oct 25, 2024
be8d70f
adjust format
yssamtu Oct 25, 2024
8fdf99b
update changelog and fix warning issue
yssamtu Oct 25, 2024
0b271da
fix ambiguous issue
yssamtu Oct 25, 2024
6951912
move a CHANGELOG entry to Deprecations section
yssamtu Oct 25, 2024
d01ee48
feat: add support for predicated flagged device select
Beanavil Oct 25, 2024
20b48e0
feat: add tests (with large indices) for predicated flagged device se…
Beanavil Oct 25, 2024
e1182fd
feat: add config tuning and benchmarks for predicated and flagged dev…
Beanavil Oct 25, 2024
df36611
fix: add missing template parameter to partition-based autotune templ…
Beanavil Oct 25, 2024
2fa4bfb
Add tuned configs
Beanavil Oct 25, 2024
4788289
Fix clang-format hang
Beanavil Oct 25, 2024
d4d90b3
Fix ambiguous error make_reverse_iterator
NB4444 Oct 28, 2024
657d7f0
Resolve "Config tuning and dynamic dispatch for device merge"
MiloLurati Oct 28, 2024
ec2aa44
add search_n algo
cenxuantian Sep 16, 2024
4e1743a
add test
cenxuantian Sep 17, 2024
fc72ce7
Add google test for search_n & tested the functionality
cenxuantian Sep 17, 2024
4bad501
Add benchmark
cenxuantian Sep 17, 2024
22421e3
Add Doc & add custom type for benchmark
cenxuantian Sep 18, 2024
7c5092e
Remove unused variables
cenxuantian Sep 18, 2024
ba6153c
Add NonBlockStream support
cenxuantian Sep 18, 2024
50b63e3
Remove unused type alias
cenxuantian Sep 18, 2024
92fbc87
Refactor search_n for loop, &dit comments &
cenxuantian Sep 20, 2024
3fa70cd
Add More tests & Fixed some bugs
cenxuantian Sep 20, 2024
986a4d5
Add more benckmarks
cenxuantian Sep 24, 2024
ff2b167
Add document
cenxuantian Sep 24, 2024
a898403
Refactor benchmarks
cenxuantian Sep 24, 2024
8e403a1
Replace another DOXYGEN_DOCUMENTATION_BUILD and some minor modifications
cenxuantian Sep 24, 2024
3df360a
Fix build debug error
cenxuantian Sep 24, 2024
8139fbd
Optimize algo with large input
cenxuantian Sep 25, 2024
ef6da2e
add impl2
cenxuantian Sep 26, 2024
04e326f
Optimize
cenxuantian Sep 27, 2024
5108c9d
Move hipMalloc vars to temp_memory
cenxuantian Sep 30, 2024
d3094e8
Rewrite benchmarks
cenxuantian Oct 1, 2024
9f17434
Resolve
cenxuantian Oct 4, 2024
7f39798
Fix bugs -- several occurrences of consecutive full blocks
cenxuantian Oct 7, 2024
874c4e9
Many modifications, fixed the bugs and edited the tests and benchmarks
cenxuantian Oct 9, 2024
73bd933
Optimised the block_search_n_kernel
cenxuantian Oct 9, 2024
40ff702
2nd version search_n implementation for large input
cenxuantian Oct 14, 2024
4897960
Add thread level search_n algorithm
cenxuantian Oct 15, 2024
5518552
Add optimizations
cenxuantian Oct 15, 2024
980054f
Edit benchmarks
cenxuantian Oct 15, 2024
11fc510
remove unused variables
cenxuantian Oct 16, 2024
18bcf5e
remove unused variables and remove __restrict__
cenxuantian Oct 16, 2024
af5f677
fix the bug on windows
cenxuantian Oct 16, 2024
0679cf1
fix bug and modify benchmakrs and tests
cenxuantian Oct 17, 2024
d09f947
fix bugs in benchmarks and search_n_impl
cenxuantian Oct 18, 2024
ac8de49
Oh yes
cenxuantian Oct 23, 2024
b236c1e
Apply 1 suggestion(s) to 1 file(s)
cenxuantian Oct 23, 2024
c51d895
apply some suggestions
cenxuantian Oct 23, 2024
e0dfa8c
edit doc
cenxuantian Oct 23, 2024
a8d3766
replace search_n_min_kernal by rocprim:reduce
cenxuantian Oct 23, 2024
63d315f
fixed some benchmarks bugs
cenxuantian Oct 24, 2024
96ea3c7
remove graph support
cenxuantian Oct 30, 2024
87f63ff
resolve not compile on win
cenxuantian Oct 30, 2024
d403145
Add graph support and modified the design a little
cenxuantian Oct 30, 2024
84c379b
resolve test fail on windws
cenxuantian Oct 30, 2024
b8b6f9f
fix gfx960 benchmark dead lock
cenxuantian Oct 30, 2024
afad937
Add device_search_n to rocprim.hpp
cenxuantian Nov 1, 2024
5659468
replace HIP_CHECK by ROCPRIM_RETURN_ON_ERROR
cenxuantian Nov 1, 2024
64833f4
fix: fix doxygen error due to __launch_bounds__ macro
Beanavil Nov 1, 2024
b388325
Implement 6.3 hotfixes for added/modified tests
Beanavil Nov 4, 2024
11 changes: 11 additions & 0 deletions .gitlab-ci.yml
@@ -114,6 +114,7 @@ copyright-date:
-D AMDGPU_TEST_TARGETS=$GPU_TARGETS
-D CMAKE_C_COMPILER_LAUNCHER=phc_sccache_c
-D CMAKE_CXX_COMPILER_LAUNCHER=phc_sccache_cxx
-D CMAKE_CXX_STANDARD=14
-S $CI_PROJECT_DIR
-B $BUILD_DIR
- cmake
@@ -182,6 +183,7 @@ build:cmake-minimum-apt:
-D BUILD_EXAMPLE=ON
-D GPU_TARGETS=$GPU_TARGETS
-D AMDGPU_TEST_TARGETS=$GPU_TARGETS
-D CMAKE_CXX_STANDARD="$BUILD_VERSION"
-S $CI_PROJECT_DIR
-B $BUILD_DIR
- cmake --build $BUILD_DIR
@@ -210,6 +212,7 @@ build:cmake-latest:
matrix:
- BUILD_TYPE: Release
BUILD_TARGET: [BENCHMARK, TEST]
BUILD_VERSION: [14, 17]

build:cmake-minimum:
needs: []
@@ -220,6 +223,7 @@ build:cmake-minimum:
matrix:
- BUILD_TYPE: [Debug, Release]
BUILD_TARGET: [BENCHMARK, TEST]
BUILD_VERSION: 14

build:package:
stage: build
@@ -236,6 +240,7 @@ build:package:
-G Ninja
-D CMAKE_CXX_COMPILER="$AMDCLANG"
-D CMAKE_BUILD_TYPE=Release
-D CMAKE_CXX_STANDARD=14
-B $PACKAGE_DIR
-S $CI_PROJECT_DIR
- cd $PACKAGE_DIR
@@ -268,6 +273,7 @@ build:windows:
-D CMAKE_CXX_COMPILER:PATH="${env:HIP_PATH}\bin\clang++.exe"
-D CMAKE_PREFIX_PATH:PATH="${env:HIP_PATH}"
-D CMAKE_BUILD_TYPE="$BUILD_TYPE"
-D CMAKE_CXX_STANDARD=14
- cmake --build "$CI_PROJECT_DIR/build"
artifacts:
paths:
@@ -314,6 +320,7 @@ autotune:build:
-D GPU_TARGETS=$GPU_TARGETS
-D CMAKE_C_COMPILER_LAUNCHER=phc_sccache_c
-D CMAKE_CXX_COMPILER_LAUNCHER=phc_sccache_cxx
-D CMAKE_CXX_STANDARD=14
- cmake --build . --target $BENCHMARK_TARGETS
- 'rm -rf $BUILD_DIR/benchmark/benchmark*.parallel'
# The autotune benchmarks get very large, above GitLab's upload limit. Fortunately they compress well.
@@ -339,6 +346,7 @@ test:
matrix:
- BUILD_TYPE: Release
BUILD_TARGET: TEST
BUILD_VERSION: 14
script:
- cd $BUILD_DIR
- cmake
@@ -395,6 +403,7 @@ test-windows-release:
-D CMAKE_CXX_COMPILER="$AMDCLANG"
-D CMAKE_BUILD_TYPE=Release
-D GPU_TARGETS=$GPU_TARGETS
-D CMAKE_CXX_STANDARD=14
-S "$CI_PROJECT_DIR/test/extra"
-B "$CI_PROJECT_DIR/package_test"
- cmake --build "$CI_PROJECT_DIR/package_test"
@@ -416,6 +425,7 @@ test:install:
-G Ninja
-D CMAKE_CXX_COMPILER="$AMDCLANG"
-D CMAKE_BUILD_TYPE=Release
-D CMAKE_CXX_STANDARD=14
-B build
-S $CI_PROJECT_DIR
- $SUDO_CMD cmake --build build --target install
@@ -458,6 +468,7 @@ benchmark:
matrix:
- BUILD_TYPE: Release
BUILD_TARGET: BENCHMARK
BUILD_VERSION: 14
extends:
- .cmake-minimum
- .gpus:rocm
11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -11,18 +11,29 @@ Full documentation for rocPRIM is available at [https://rocm.docs.amd.com/projec
* Added the parallel `find_first_of` device function with autotuned configurations. This function is similar to `std::find_first_of`; it searches for the first occurrence of any of the provided elements.
* Added the `--emulation` option for `rtest.py`
* Unit tests can be run with `[--emulation|-e|--test|-t]=<test_name>`
* Added a parallel device-level function, `rocprim::adjacent_find`, similar to the C++ Standard Library `std::adjacent_find` algorithm.
* Added configuration autotuning to device adjacent find (`rocprim::adjacent_find`) for improved performance on selected architectures.
* Added `rocprim::numeric_limits`, an extension of `std::numeric_limits` that includes support for 128-bit integers.
* Added `rocprim::int128_t` and `rocprim::uint128_t`, which are the `__int128_t` and `__uint128_t` types.
* Added the parallel `search` and `find_end` device functions, similar to `std::search` and `std::find_end`. These functions search for the first and last occurrence of the sequence, respectively.
* Added a parallel device-level function, `rocprim::search_n`, similar to the C++ Standard Library `std::search_n` algorithm.
* Added new constructors and a `base` function, and added `constexpr` specifier to all functions in `rocprim::reverse_iterator` to improve parity with the C++17 `std::reverse_iterator`.
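
The search-style entries in this list are described by reference to their C++ Standard Library counterparts. As a host-only sketch of those semantics, the following uses only `std::adjacent_find`, `std::search`, `std::find_end` and `std::search_n`; it does not call rocPRIM, and the helper names (`index_in`, `first_adjacent_pair`, `first_occurrence`, `last_occurrence`, `first_run`, `run_examples`) are hypothetical names chosen for illustration. The rocPRIM device functions take additional arguments (temporary storage, stream, and so on) that are not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

using vec = std::vector<int>;

// Hypothetical helper: turn an iterator into an index; v.size() means "not found".
std::size_t index_in(const vec& v, vec::const_iterator it)
{
    return static_cast<std::size_t>(std::distance(v.begin(), it));
}

// std::adjacent_find: first position whose element equals its successor.
std::size_t first_adjacent_pair(const vec& v)
{
    return index_in(v, std::adjacent_find(v.begin(), v.end()));
}

// std::search: start of the first occurrence of `key` in `v`.
std::size_t first_occurrence(const vec& v, const vec& key)
{
    return index_in(v, std::search(v.begin(), v.end(), key.begin(), key.end()));
}

// std::find_end: start of the last occurrence of `key` in `v`.
std::size_t last_occurrence(const vec& v, const vec& key)
{
    return index_in(v, std::find_end(v.begin(), v.end(), key.begin(), key.end()));
}

// std::search_n: start of the first run of `count` consecutive copies of `value`.
std::size_t first_run(const vec& v, std::size_t count, int value)
{
    return index_in(v, std::search_n(v.begin(), v.end(), count, value));
}

// Worked example on one input sequence.
void run_examples()
{
    const vec v{1, 2, 3, 3, 1, 2, 3, 3, 3};
    assert(first_adjacent_pair(v) == 2u);        // the pair (3, 3) at indices 2 and 3
    assert(first_occurrence(v, {1, 2}) == 0u);   // first {1, 2} starts at index 0
    assert(last_occurrence(v, {1, 2}) == 4u);    // last {1, 2} starts at index 4
    assert(first_run(v, 3, 3) == 6u);            // three consecutive 3s start at index 6
    assert(first_occurrence(v, {9}) == v.size()); // no match: end of the range
}
```

Calling `run_examples()` from a `main` function exercises all four helpers; each returns `v.size()` when nothing matches, which corresponds to the standard algorithms returning the end iterator.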

### Changed
* Changed the subset of tests that are run as smoke tests so that the smoke test completes faster and never exceeds 2 GB of VRAM usage. Use `python rtest.py [--emulation|-e|--test|-t]=smoke` to run these tests.
* The `rtest.py` options have changed. `rtest.py` must now be run with either `--test|-t` or `--emulation|-e`, but not both.

* Removed HIP-CPU support. HIP-CPU support was experimental and broken.
* Changed the C++ version from 14 to 17. C++14 will be deprecated in the next major release.

### Resolved issues
* Fixed an issue where `rmake.py` would generate incorrect CMake commands in a Linux environment.
* Fixed an issue where `rocprim::partial_sort_copy` would yield a compile error if the input iterator is const.
* Fixed incorrect type traits for 128-bit signed and unsigned integers.
* Fixed compilation issue when `rocprim::radix_key_codec<...>` is specialized with a 128-bit integer.

### Upcoming changes
* Using the initialisation constructor of `rocprim::reverse_iterator` will emit a deprecation warning. It will be marked as explicit in the next major release.
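
For context on what an explicit constructor and a `base` function imply, here is a minimal host-only sketch using `std::reverse_iterator`, whose constructor from an iterator is already explicit. It does not use `rocprim::reverse_iterator`; that the rocPRIM type will behave the same way is an assumption based on the changelog's stated goal of parity, and `last_index_of` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper: index of the last element equal to `value`,
// or v.size() if there is none.
std::size_t last_index_of(const std::vector<int>& v, int value)
{
    using rev_it = std::reverse_iterator<std::vector<int>::const_iterator>;

    // Direct initialisation is required because the constructor is explicit;
    // `rev_it rbegin = v.end();` would not compile.
    const rev_it rbegin{v.end()};
    const rev_it rend{v.begin()};

    // Searching forwards through reverse iterators walks the vector backwards.
    const rev_it hit = std::find(rbegin, rend, value);
    if(hit == rend)
    {
        return v.size();
    }

    // base() returns the underlying forward iterator, which points one
    // element past the one the reverse iterator refers to.
    return static_cast<std::size_t>(std::distance(v.begin(), hit.base())) - 1;
}
```

For `{4, 7, 4, 9}` and the value `4`, the reverse search stops at the second `4`, `base()` points at the `9`, and the helper returns index 2.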

## rocPRIM 3.3.0 for ROCm 6.3.0

77 changes: 39 additions & 38 deletions CMakeLists.txt
@@ -48,7 +48,6 @@ option(BUILD_BENCHMARK "Build benchmarks" OFF)
option(BUILD_NAIVE_BENCHMARK "Build naive benchmarks" OFF)
option(BUILD_EXAMPLE "Build examples" OFF)
option(BUILD_DOCS "Build documentation (requires sphinx)" OFF)
option(USE_HIP_CPU "Prefer HIP-CPU runtime instead of HW acceleration" OFF)
# Disables building tests, benchmarks, examples
option(ONLY_INSTALL "Only install" OFF)
option(BUILD_CODE_COVERAGE "Build with code coverage enabled" OFF)
@@ -70,50 +69,57 @@ endif()
set(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE CACHE BOOL "Add paths to linker search and installed rpath")

# Set CXX flags
set(CMAKE_CXX_STANDARD 14)
if (NOT DEFINED CMAKE_CXX_STANDARD)
set(CMAKE_CXX_STANDARD 17)
endif()
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)

if (CMAKE_CXX_STANDARD EQUAL 14)
message(WARNING "C++14 will be deprecated in the next major release")
elseif(NOT CMAKE_CXX_STANDARD EQUAL 17)
message(FATAL_ERROR "Only C++14 and C++17 are supported")
endif()

if(DEFINED BUILD_SHARED_LIBS)
set(PKG_BUILD_SHARED_LIBS ${BUILD_SHARED_LIBS})
else()
else()
set(PKG_BUILD_SHARED_LIBS ON)
endif()
endif()
set(BUILD_SHARED_LIBS OFF) # don't build client dependencies as shared
if(NOT USE_HIP_CPU)
# Get dependencies (required here to get rocm-cmake)
include(cmake/Dependencies.cmake)
# Use target ID syntax if supported for GPU_TARGETS
if (NOT DEFINED AMDGPU_TARGETS)
set(GPU_TARGETS "all" CACHE STRING "GPU architectures to compile for")

# Get dependencies (required here to get rocm-cmake)
include(cmake/Dependencies.cmake)
# Use target ID syntax if supported for GPU_TARGETS
if (NOT DEFINED AMDGPU_TARGETS)
set(GPU_TARGETS "all" CACHE STRING "GPU architectures to compile for")
else()
set(GPU_TARGETS "${AMDGPU_TARGETS}" CACHE STRING "GPU architectures to compile for")
endif()
set_property(CACHE GPU_TARGETS PROPERTY STRINGS "all")

if(GPU_TARGETS STREQUAL "all")
if(BUILD_ADDRESS_SANITIZER)
# ASAN builds require xnack
rocm_check_target_ids(DEFAULT_AMDGPU_TARGETS
TARGETS "gfx908:xnack+;gfx90a:xnack+;gfx940:xnack+;gfx941:xnack+;gfx942:xnack+"
)
else()
set(GPU_TARGETS "${AMDGPU_TARGETS}" CACHE STRING "GPU architectures to compile for")
endif()
set_property(CACHE GPU_TARGETS PROPERTY STRINGS "all")

if(GPU_TARGETS STREQUAL "all")
if(BUILD_ADDRESS_SANITIZER)
# ASAN builds require xnack
rocm_check_target_ids(DEFAULT_AMDGPU_TARGETS
TARGETS "gfx908:xnack+;gfx90a:xnack+;gfx940:xnack+;gfx941:xnack+;gfx942:xnack+"
)
else()
rocm_check_target_ids(DEFAULT_AMDGPU_TARGETS
TARGETS "gfx803;gfx900:xnack-;gfx906:xnack-;gfx908:xnack-;gfx90a:xnack-;gfx90a:xnack+;gfx940;gfx941;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1151;gfx1200;gfx1201"
)
endif()

set(GPU_TARGETS "${DEFAULT_AMDGPU_TARGETS}" CACHE STRING "GPU architectures to compile for" FORCE)
rocm_check_target_ids(DEFAULT_AMDGPU_TARGETS
TARGETS "gfx803;gfx900:xnack-;gfx906:xnack-;gfx908:xnack-;gfx90a:xnack-;gfx90a:xnack+;gfx940;gfx941;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1151;gfx1200;gfx1201"
)
endif()

# TODO: Fix VerifyCompiler for HIP on Windows
if (NOT WIN32)
include(cmake/VerifyCompiler.cmake)
endif()
list(APPEND CMAKE_PREFIX_PATH ${ROCM_PATH} ${ROCM_PATH}/hip ${ROCM_PATH}/llvm ${ROCM_ROOT}/llvm ${ROCM_ROOT} ${ROCM_ROOT}/hip)
find_package(hip REQUIRED CONFIG PATHS ${HIP_DIR} ${ROCM_PATH} /opt/rocm)
set(GPU_TARGETS "${DEFAULT_AMDGPU_TARGETS}" CACHE STRING "GPU architectures to compile for" FORCE)
endif()

# TODO: Fix VerifyCompiler for HIP on Windows
if (NOT WIN32)
include(cmake/VerifyCompiler.cmake)
endif()
list(APPEND CMAKE_PREFIX_PATH ${ROCM_PATH} ${ROCM_PATH}/hip ${ROCM_PATH}/llvm ${ROCM_ROOT}/llvm ${ROCM_ROOT} ${ROCM_ROOT}/hip)
find_package(hip REQUIRED CONFIG PATHS ${HIP_DIR} ${ROCM_PATH} /opt/rocm)

# FOR HANDLING ENABLE/DISABLE OPTIONAL BACKWARD COMPATIBILITY for FILE/FOLDER REORG
option(BUILD_FILE_REORG_BACKWARD_COMPATIBILITY "Build with file/folder reorg with backward compatibility enabled" OFF)
if(ROCPRIM_INSTALL AND BUILD_FILE_REORG_BACKWARD_COMPATIBILITY AND NOT WIN32)
@@ -130,11 +136,6 @@ if(BUILD_CODE_COVERAGE)
add_link_options(--coverage)
endif()

if(USE_HIP_CPU)
# Get dependencies
include(cmake/Dependencies.cmake)
endif()

# Setup VERSION
set(VERSION_STRING "3.3.0")
rocm_setup_version(VERSION ${VERSION_STRING})
7 changes: 1 addition & 6 deletions README.md
@@ -14,7 +14,7 @@ develop performant GPU-accelerated code on AMD ROCm platforms.
* Including
[HIP-clang](https://github.com/ROCm/HIP/blob/master/INSTALL.md#hip-clang)
compiler
* C++14
* C++17
* Python 3.6 or higher (HIP on Windows only, required only for install script)
* Visual Studio 2019 with Clang support (HIP on Windows only)
* Strawberry Perl (HIP on Windows only)
@@ -110,11 +110,6 @@ You can build and install rocPRIM on Linux or Windows.
# before 'cmake' or setting cmake option 'CMAKE_CXX_COMPILER' to path to the compiler.
# Using HIP-clang:
[CXX=hipcc] cmake -DBUILD_BENCHMARK=ON ../.
#
# ! EXPERIMENTAL !
# Alternatively one may build using the experimental (and highly incomplete) HIP-CPU back-end for host-side
# execution using any C++17 conforming compiler (supported by HIP-CPU). AMDGPU_* options are unavailable in this case.
# USE_HIP_CPU - OFF by default

# Build
make -j4
26 changes: 8 additions & 18 deletions benchmark/CMakeLists.txt
@@ -68,24 +68,10 @@ function(add_rocprim_benchmark BENCHMARK_SOURCE)
rocprim
benchmark::benchmark
)
if(NOT USE_HIP_CPU)
target_link_libraries(${BENCHMARK_TARGET}
PRIVATE
rocprim_hip
)
else()
target_link_libraries(${BENCHMARK_TARGET}
PRIVATE
Threads::Threads
hip_cpu_rt::hip_cpu_rt
)
if(STL_DEPENDS_ON_TBB)
target_link_libraries(${BENCHMARK_TARGET}
PRIVATE
TBB::tbb
)
endif()
endif()
target_link_libraries(${BENCHMARK_TARGET}
PRIVATE
rocprim_hip
)

target_compile_options(${BENCHMARK_TARGET}
PRIVATE
@@ -134,9 +120,11 @@ add_rocprim_benchmark(benchmark_block_scan.cpp)
add_rocprim_benchmark(benchmark_block_sort.cpp)
add_rocprim_benchmark(benchmark_config_dispatch.cpp)
add_rocprim_benchmark(benchmark_device_adjacent_difference.cpp)
add_rocprim_benchmark(benchmark_device_adjacent_find.cpp)
add_rocprim_benchmark(benchmark_device_batch_memcpy.cpp)
add_rocprim_benchmark(benchmark_device_binary_search.cpp)
add_rocprim_benchmark(benchmark_device_find_first_of.cpp)
add_rocprim_benchmark(benchmark_device_find_end.cpp)
add_rocprim_benchmark(benchmark_device_histogram.cpp)
add_rocprim_benchmark(benchmark_device_merge.cpp)
add_rocprim_benchmark(benchmark_device_merge_sort.cpp)
@@ -156,7 +144,9 @@ add_rocprim_benchmark(benchmark_device_run_length_encode.cpp)
add_rocprim_benchmark(benchmark_device_scan.cpp)
add_rocprim_benchmark(benchmark_device_scan_deterministic.cpp)
add_rocprim_benchmark(benchmark_device_scan_by_key.cpp)
add_rocprim_benchmark(benchmark_device_search.cpp)
add_rocprim_benchmark(benchmark_device_scan_by_key_deterministic.cpp)
add_rocprim_benchmark(benchmark_device_search_n.cpp)
add_rocprim_benchmark(benchmark_device_select.cpp)
add_rocprim_benchmark(benchmark_device_segmented_radix_sort_keys.cpp)
add_rocprim_benchmark(benchmark_device_segmented_radix_sort_pairs.cpp)
8 changes: 8 additions & 0 deletions benchmark/ConfigAutotuneSettings.cmake
@@ -33,6 +33,10 @@ function(read_config_autotune_settings file list_across_names list_across output
set(list_across "${TUNING_TYPES};\
true;false true;32 64 128 256 512 1024" PARENT_SCOPE)
set(output_pattern_suffix "@DataType@_@Left@_@InPlace@_@BlockSize@" PARENT_SCOPE)
elseif(file STREQUAL "benchmark_device_adjacent_find")
set(list_across_names "InputType;BlockSize" PARENT_SCOPE)
set(list_across "${TUNING_TYPES};64 128 256 512 1024" PARENT_SCOPE)
set(output_pattern_suffix "@InputType@_@BlockSize@" PARENT_SCOPE)
elseif(file STREQUAL "benchmark_device_histogram")
set(list_across_names "DataType;BlockSize" PARENT_SCOPE)
set(list_across "${TUNING_TYPES};64 128 256" PARENT_SCOPE)
@@ -115,5 +119,9 @@ DataType;BlockSize;" PARENT_SCOPE)
set(list_across_names "DataType;BlockSize" PARENT_SCOPE)
set(list_across "${LIMITED_TUNING_TYPES};32 64 128 256 512 1024" PARENT_SCOPE)
set(output_pattern_suffix "@DataType@_@BlockSize@" PARENT_SCOPE)
elseif(file STREQUAL "benchmark_device_merge")
set(list_across_names "KeyType;ValueType;BlockSize" PARENT_SCOPE)
set(list_across "${TUNING_TYPES};rocprim::empty_type ${LIMITED_TUNING_TYPES};32 64 128 256 512 1024" PARENT_SCOPE)
set(output_pattern_suffix "@KeyType@_@ValueType@_@BlockSize@" PARENT_SCOPE)
endif()
endfunction()
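
The new `benchmark_device_adjacent_find` and `benchmark_device_merge` entries each pair a list of parameter names with one value list per name and an `@Name@`-style output pattern. Reading this as "one configuration per combination of values" is an assumption, since the consuming CMake code is not part of this diff. Under that assumption, the expansion can be sketched in C++ as a Cartesian product; `substitute` and `expand` are hypothetical helpers, and the `int`/`float` types used in the note below stand in for `${TUNING_TYPES}`, whose contents are not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Replace every "@name@" placeholder in `pattern` with `value`.
std::string substitute(std::string pattern, const std::string& name, const std::string& value)
{
    const std::string token = "@" + name + "@";
    for(std::size_t pos = pattern.find(token); pos != std::string::npos;
        pos = pattern.find(token, pos + value.size()))
    {
        pattern.replace(pos, token.size(), value);
    }
    return pattern;
}

// Expand `pattern` over the Cartesian product of the value lists.
// values[i] holds the candidate values for names[i].
std::vector<std::string> expand(const std::string&                           pattern,
                                const std::vector<std::string>&              names,
                                const std::vector<std::vector<std::string>>& values)
{
    assert(names.size() == values.size());

    std::vector<std::string> results{pattern};
    for(std::size_t i = 0; i < names.size(); ++i)
    {
        std::vector<std::string> next;
        for(const std::string& partial : results)
        {
            for(const std::string& value : values[i])
            {
                next.push_back(substitute(partial, names[i], value));
            }
        }
        results = std::move(next);
    }
    return results;
}
```

With the names `InputType` and `BlockSize`, the assumed types `int` and `float`, and the block sizes `64 128 256 512 1024` from the adjacent find entry, expanding `@InputType@_@BlockSize@` yields ten suffixes, from `int_64` through `float_1024`.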