@adityacodes30 adityacodes30 commented Jul 17, 2025

Integrate HPX algorithms with NVIDIA CCCL (Thrust)

This PR implements integration between NVIDIA Thrust and HPX, enabling HPX algorithms to dispatch to GPU-accelerated Thrust implementations through HPX's execution policy system.

Google Summer of Code 2025 Project: HPX-Thrust Integration

Completed Milestones

Phase 1: Foundation Architecture

  • Execution Policy System - Complete policy hierarchy with sync/async support
    • thrust_policy - Host execution with thrust::host
    • thrust_device_policy - Device execution with thrust::device
    • thrust_task_policy - Asynchronous GPU execution with HPX futures

Phase 2: Algorithm Integration

  • Universal Algorithm Dispatch - Single tag_invoke overload supporting all algorithms
  • Algorithm Mapping - 50+ HPX algorithms mapped to Thrust equivalents:
    Core: fill, copy, transform, for_each, generate
    Reductions: reduce, count, count_if, all_of, any_of, none_of
    Search: find, find_if, find_first_of, equal, mismatch
    Sorting: sort, stable_sort, partial_sort, is_sorted
    Modifying: reverse, unique, remove_if, replace
    Numeric: inclusive_scan, exclusive_scan, transform_reduce
    

Phase 3: Asynchronous Execution

  • HPX Future Integration - Async policies return hpx::future<T>
  • CUDA Stream Management - Proper stream handling via hpx::cuda::experimental::target
  • Non-blocking Execution - Using thrust::cuda::par_nosync for async operations
  • Event-based Synchronization - HPX future completion tied to CUDA events
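
The launch-then-signal pattern described in this phase can be illustrated with a host-only analogy. The sketch below uses no CUDA and none of HPX's API: `fake_stream`, `async_fill`, and `is_ready` are hypothetical names invented for illustration, and `std::future` stands in for `hpx::future`. Work is queued without running (the non-blocking launch), and a completion marker queued behind it plays the role of the recorded event that makes the future ready.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical host-only stand-in for a CUDA stream: work is queued at launch
// time and only runs when drain() is called, mimicking a non-blocking launch.
class fake_stream {
public:
    void enqueue(std::function<void()> work) { queue_.push_back(std::move(work)); }

    // Run everything queued so far, in order, then clear the queue.
    void drain() {
        for (auto& work : queue_) work();
        queue_.clear();
    }

private:
    std::vector<std::function<void()>> queue_;
};

// Queue the fill, then queue a completion marker (the analogue of a recorded
// event) that fulfils the promise behind the returned future.
inline std::future<void> async_fill(fake_stream& stream, std::vector<int>& data, int value) {
    auto done = std::make_shared<std::promise<void>>();
    std::future<void> result = done->get_future();
    stream.enqueue([&data, value] { std::fill(data.begin(), data.end(), value); });
    stream.enqueue([done] { done->set_value(); });
    return result;
}

// Non-blocking readiness check.
inline bool is_ready(std::future<void> const& f) {
    return f.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
}
```

The caller gets the future back immediately; it only becomes ready once the queued work and the marker behind it have both run, which is the ordering guarantee an event recorded on a stream provides.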

Phase 4: Testing & Validation

  • Test Suite (9 tests):
    Test 1-2: Synchronous host/device execution
    Test 3-5: Multiple algorithms (fill, transform, reduce)
    Test 6-8: Asynchronous execution with explicit/default targets
    
  • Build System Integration - CMake support with CUDA compilation
  • Memory Management - thrust::device_vector usage patterns

Phase 5: Documentation & Examples

  • Documentation - Usage patterns, API reference, integration guide
  • Examples - 7 example programs demonstrating features

Technical Architecture

Core Components

// 1. Execution Policies (policy.hpp)
hpx::thrust::thrust_policy policy;           // Host execution
hpx::thrust::thrust_device_policy dev_policy; // Device execution  
hpx::thrust::thrust_task_policy task_policy;  // Async execution

// 2. Universal Algorithm Dispatch (algorithms.hpp)
template<typename HPXTag, typename ThrustPolicy, typename... Args>
auto tag_invoke(HPXTag, ThrustPolicy&&, Args&&...);

// 3. Algorithm Mapping (algorithm_map.hpp) 
template<> struct algorithm_map<hpx::fill_t> {
    template<typename Policy, typename... Args>
    static constexpr decltype(auto) invoke(Policy&&, Args&&...);
};
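
To make the interplay between the dispatch function and the algorithm map concrete, here is a minimal sketch in standard C++ only. It is not the PR's code: `fill_t`, `reduce_t`, `backend_policy`, `dispatch`, and `fill_then_sum` are hypothetical stand-ins, and the mapped calls go to `std::fill` and `std::accumulate` where a real mapping would presumably forward to the Thrust equivalents.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical stand-ins for algorithm tag types (not the real HPX tags).
struct fill_t {};
struct reduce_t {};

// Hypothetical stand-in for a backend execution policy.
struct backend_policy {};

// Primary template is declared but not defined, so dispatching an
// unmapped algorithm fails at compile time.
template <typename Tag>
struct algorithm_map;

template <>
struct algorithm_map<fill_t> {
    template <typename Policy, typename... Args>
    static decltype(auto) invoke(Policy&&, Args&&... args) {
        // Assumption: a real mapping would forward the policy to the backend.
        return std::fill(std::forward<Args>(args)...);
    }
};

template <>
struct algorithm_map<reduce_t> {
    template <typename Policy, typename... Args>
    static decltype(auto) invoke(Policy&&, Args&&... args) {
        return std::accumulate(std::forward<Args>(args)...);
    }
};

// One generic entry point: the tag type selects the mapping, so adding an
// algorithm means adding a specialization rather than a new overload.
template <typename Tag, typename Policy, typename... Args>
decltype(auto) dispatch(Tag, Policy&& policy, Args&&... args) {
    return algorithm_map<Tag>::invoke(
        std::forward<Policy>(policy), std::forward<Args>(args)...);
}

// Usage: fill four elements with 42, then sum them through the same entry point.
inline int fill_then_sum() {
    std::vector<int> data(4);
    dispatch(fill_t{}, backend_policy{}, data.begin(), data.end(), 42);
    return dispatch(reduce_t{}, backend_policy{}, data.begin(), data.end(), 0);
}
```

Keeping the per-algorithm knowledge in specializations is what allows a single dispatch overload to cover every mapped algorithm.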

Usage Examples

// Synchronous GPU execution
thrust::device_vector<int> data(1000);
hpx::fill(hpx::thrust::thrust_device_policy{}, data.begin(), data.end(), 42);

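// Asynchronous GPU execution with futures
thrust::device_vector<int> input(1000, 1), output(1000);
auto task_policy = hpx::thrust::thrust_task_policy{};
// Device execution needs a __device__ lambda (nvcc --extended-lambda)
auto future = hpx::transform(task_policy, input.begin(), input.end(),
                            output.begin(), [] __device__ (int x) { return x * 2; });
auto result = future.get(); // Synchronize with GPU completion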

Files Added

source/hpx/libs/core/thrust/
├── include/hpx/thrust/
│   ├── policy.hpp                     Execution policies
│   ├── algorithms.hpp               Universal dispatch
│   └── detail/algorithm_map.hpp     Algorithm mappings
├── tests/unit/
│   └── thrust_policy_test.cu         Comprehensive tests
├── examples/                         example programs
├── docs/index.rst                    Documentation
└── CMakeLists.txt                    Build configuration

Any background context you want to provide?

Implementation of the HPX-Thrust integration. This provides the foundation for running Thrust's GPU-accelerated parallel algorithms through HPX. The project was part of Google Summer of Code 2025.

Checklist

  • I have added a new feature and have added tests to go along with it.
  • I have fixed a bug and have added a regression test.
  • I have added a test using random numbers; I have made sure it uses a seed, and that random numbers generated are valid inputs for the tests.

This commit adds a basic poc implementation for thrust policy and algorithm dispatch based on it
@StellarBot
Can one of the admins verify this patch?

codacy-production bot commented Jul 17, 2025

Coverage summary from Codacy

Coverage variation: -0.70%
Diff coverage: ∅ (not applicable)

Coverage variation details

                                  Coverable lines   Covered lines     Coverage
Common ancestor commit (3746bd5)  263487            227324            86.28%
Head commit (8e2b2a7)             216318 (-47169)   185106 (-42218)   85.57% (-0.70%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details

                      Coverable lines   Covered lines   Diff coverage
Pull request (#6744)  0                 0               ∅ (not applicable)

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

@Pansysk75 Pansysk75 marked this pull request as ready for review August 20, 2025 10:25
@Pansysk75 Pansysk75 requested a review from hkaiser as a code owner August 20, 2025 10:25

@hkaiser hkaiser left a comment

Thank you for your work on this!

Add par_nosync execution policy support, which allows async execution by leveraging the existing async_cuda infrastructure, and add tag_invoke policy branching

adityacodes30 commented Aug 20, 2025

Thank you for the review, @hkaiser. I pushed another commit that is much more representative of the current state and includes many changes; the previous commit was an outdated initial proof-of-concept draft. I am also still adding algorithms to algorithm_map, along with CMake checks.

I will incorporate the general feedback mentioned above in the forthcoming commit. In the meantime, if you could look at the implementation logic and suggest any improvements, I will include those as well.

Also, @Pansysk75 and I were wondering whether we should move the Thrust implementation out of async_commit into its own directory, since keeping it there might be confusing given that not all of Thrust is 'async'. Would love your opinion here.

hkaiser commented Aug 20, 2025

Also, @Pansysk75 and I were wondering whether we should move the Thrust implementation out of async_commit into its own directory, since keeping it there might be confusing given that not all of Thrust is 'async'. Would love your opinion here.

Yes, turning this into a separate HPX module (e.g., libs/core/thrust, or similar) would be appreciated. Most importantly, this would avoid having to add an unneeded module dependency to async_cuda. You can use the script create_module_skeleton.py to generate an initial skeleton for this.

@adityacodes30

The Thrust implementation depends on async_cuda for the async Thrust policy, since we use the target from there to obtain the stream and the future when dispatching with par_nosync.


@hkaiser hkaiser left a comment

Great work! Thanks!

@@ -0,0 +1,603 @@
// Copyright (c) 2025 Aditya Sapra
Member

I would have preferred for the policies to be derived from the HPX policy base class as this reduces the amount of code duplication considerably. For now, your code is fine. Please consider changing it if there is sufficient time left, however.

Author

okay let me make another branch with that approach and get it validated

adityacodes30 commented Aug 30, 2025

@hkaiser I made the requested changes; the refactoring that derives the policies from the base class is at https://github.com/adityacodes30/hpx/tree/policyRefactoring. Let me know if it looks fine and I will merge it into this branch. I also had to add a guard in any_sender because I was getting errors during device compilation with nvcc (inline constexpr empty_vtable_t<T> empty_vtable{};).

@adityacodes30 adityacodes30 requested a review from hkaiser August 30, 2025 22:50
@Pansysk75

@adityacodes30 Could you bring this branch up to date with master? Also, could you take care of the clang_format and cmake_format issues? Thanks!

@adityacodes30

Sure!
