diff --git a/README.md b/README.md index 0e38ddb..fc2836f 100644 --- a/README.md +++ b/README.md @@ -3,12 +3,129 @@ CUDA Stream Compaction **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Name: Gizem Dal + * [LinkedIn](https://www.linkedin.com/in/gizemdal), [personal website](https://www.gizemdal.com/) +* Tested on: Predator G3-571 Intel(R) Core(TM) i7-7700HQ CPU @ 2.80 GHz - Personal computer (borrowed my friend's computer for the semester) -### (TODO: Your README) +**Project Description** -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +The main focus of this project is implementing GPU stream compaction and other parallel algorithms in CUDA, which are widely used for accelerating path tracers and are a good exercise in parallel algorithmic thinking. I implemented several versions of the Scan (Prefix Sum) algorithm: CPU scan, naive GPU scan, work-efficient GPU scan and GPU Thrust library scan. I then used some of these scan implementations to implement stream compaction on the CPU and GPU. All of these implementations are timed in order to compare the runtimes of the different approaches and support a comprehensive performance analysis. +**Project Tasks** + +* **CPU Algorithms** + * CPU Scan + * CPU Stream Compaction (without Scan) + * CPU Stream Compaction (with Scan) +* **GPU Algorithms** + * Naive Scan + * Work-Efficient Scan + * Work-Efficient Stream Compaction + * Thrust Scan + +**Performance Analysis** + +Both CPU and GPU timing functions are wrapped in a performance timer class so that time costs can be measured conveniently. We use *std::chrono* for high-precision CPU timing and CUDA events to measure GPU performance. 
In order to measure the performance of the implementations themselves, I **have not** included the cost of initial and final memory operations on the GPU, such as cudaMalloc() or cudaMemcpy(), in the timed runtimes. + +As the first step of the performance analysis, let's compare the performance of the different scan implementations. To achieve this, I passed different block sizes to the GPU while keeping the input size constant. I defined a single input array of randomly generated integers but read two different numbers of elements from it: 2^20 (an exact power of two) and 2^20 - 3 (three elements fewer). I will refer to the first size as **POT (Power Of Two)** and the second as **NPOT (Non-Power Of Two)**. + +![Block size versus Runtime](img/optimal_blocksize.png) +*Timed performances of CPU, GPU naive, GPU work-efficient & GPU thrust scan functions with different block sizes* + +As expected, changing the block size doesn't have a notable impact on the CPU scan. However, there is a significant performance gap between the CPU scan calls with POT and NPOT. Since I run the CPU scan with POT before the CPU scan with NPOT, it's possible that the array is cached after the first test, which would unintentionally make the second test run faster. + +Using a block size of 32 makes the naive and work-efficient GPU scan functions significantly inefficient with large inputs. Block sizes greater than or equal to 64 show fluctuating time costs for the naive GPU scan with both POT and NPOT. With the current array size, a block size of 512 gives the best results for this particular scan implementation. Although the time costs for POT and NPOT inputs differ at block sizes 64, 128, 256 and 1024, the difference is negligible. For the work-efficient implementation, a block size of 128 is optimal for the current input size. 
Past this optimum at blockSize=128, increasing the block size further reduces efficiency. With the optimal block size, this implementation runs faster than the CPU and naive GPU scans for both POT and NPOT inputs. + +The GPU thrust scan is the fastest by a significant margin, with very little fluctuation across block sizes. It's possible that thrust uses techniques to reduce memory latency. Increasing the block size doesn't have a notable impact on thrust performance; in fact, it can sometimes be very slightly slower with POT inputs. + +As the next step of the analysis, I measured the performance of all the scan and stream compaction implementations with different input sizes while keeping the block size at 128. + +**Array Size versus Scan Measured Performance** + +![Array size vs Scan Performance](img/scangraph.png) + +**Array Size versus Stream Compaction Measured Performance** + +![Array size vs Compaction Performance](img/compactiongraph.png) + +I also measured performance with very large input sizes (greater than 1 million). Since this data is difficult to represent in a graph, I have included tables below for both scan and stream compaction. + +**Array Size versus Scan Measured Performance** + +Method | CPU | Naive | Work Efficient | Thrust +:---: | :---: | :---: | :---: | :---: +Array size = 2^20 | 5.2834 ms | 1.813 ms | 1.628 ms | 0.26 ms +Array size = 2^25 | 152.3 ms | 75.14 ms | 50 ms | 2.18 ms + +**Array Size versus Stream Compaction Measured Performance** + +Method | CPU (with scan) | Work Efficient +:---: | :---: | :---: +Array size = 2^20 | 8.75 ms | 1.85 ms +Array size = 2^25 | 256.065 ms | 53.5 ms + +The CPU implementations are the fastest with small inputs; however, their runtimes scale up very quickly once the input size approaches 1 million. 
The work-efficient and thrust scan implementations don't outperform the naive implementation until we reach much larger input sizes, where we benefit more from parallelism. Although the work-efficient approach runs significantly faster than the naive approach past the 1 million mark, it still becomes inefficient very quickly, while the thrust implementation remains fast in comparison. This could be due to a memory I/O bottleneck in the work-efficient implementation, which could be resolved by using shared memory instead of global memory. + +To provide insight into how performance is measured, I have included sample performance test output below. These tests use a block size of 128 and an input size of 2^18. + +``` +**************** +** SCAN TESTS ** +**************** + [ 3 28 39 7 17 30 13 2 29 5 29 6 44 ... 1 0 ] +==== cpu scan, power-of-two (First run) ==== + elapsed time: 1.1686ms (std::chrono Measured) + [ 0 3 31 70 77 94 124 137 139 168 173 202 208 ... 6421970 6421971 ] +==== cpu scan, power-of-two (Second run) ==== + elapsed time: 1.1714ms (std::chrono Measured) + [ 0 3 31 70 77 94 124 137 139 168 173 202 208 ... 6421970 6421971 ] +==== cpu scan, non-power-of-two (First run) ==== + elapsed time: 0.4859ms (std::chrono Measured) + [ 0 3 31 70 77 94 124 137 139 168 173 202 208 ... 6421921 6421944 ] + passed +==== cpu scan, non-power-of-two (Second run) ==== + elapsed time: 0.4544ms (std::chrono Measured) + [ 0 3 31 70 77 94 124 137 139 168 173 202 208 ... 
6421921 6421944 ] + passed +==== naive scan, power-of-two ==== + elapsed time: 0.415968ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 0.421344ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 0.527264ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 0.503168ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 0.306752ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.299136ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 1 2 1 3 1 2 1 0 1 1 1 0 2 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 0.8395ms (std::chrono Measured) + [ 1 2 1 3 1 2 1 1 1 1 2 3 2 ... 1 3 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.7936ms (std::chrono Measured) + [ 1 2 1 3 1 2 1 1 1 1 2 3 2 ... 1 3 ] + passed +==== cpu compact with scan ==== + elapsed time: 2.1656ms (std::chrono Measured) + [ 1 2 1 3 1 2 1 1 1 1 2 3 2 ... 
1 3 ] + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 0.43952ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 0.443168ms (CUDA Measured) + passed +``` diff --git a/img/arraysizeNPOT.png b/img/arraysizeNPOT.png new file mode 100644 index 0000000..53fd7ef Binary files /dev/null and b/img/arraysizeNPOT.png differ diff --git a/img/arraysizePOT2.png b/img/arraysizePOT2.png new file mode 100644 index 0000000..3bd46d1 Binary files /dev/null and b/img/arraysizePOT2.png differ diff --git a/img/compactiongraph.png b/img/compactiongraph.png new file mode 100644 index 0000000..6f4b051 Binary files /dev/null and b/img/compactiongraph.png differ diff --git a/img/optimal_blocksize.png b/img/optimal_blocksize.png new file mode 100644 index 0000000..7fa883e Binary files /dev/null and b/img/optimal_blocksize.png differ diff --git a/img/scangraph.png b/img/scangraph.png new file mode 100644 index 0000000..77d1261 Binary files /dev/null and b/img/scangraph.png differ diff --git a/src/main.cpp b/src/main.cpp index 896ac2b..a33136f 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -13,7 +13,7 @@ #include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE = 1 << 18; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int *a = new int[SIZE]; int *b = new int[SIZE]; @@ -35,13 +35,26 @@ int main(int argc, char* argv[]) { // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct. // At first all cases passed because b && c are all zeroes. 
zeroArray(SIZE, b); - printDesc("cpu scan, power-of-two"); + printDesc("cpu scan, power-of-two (First run)"); StreamCompaction::CPU::scan(SIZE, b, a); printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); printArray(SIZE, b, true); + zeroArray(SIZE, b); + printDesc("cpu scan, power-of-two (Second run)"); + StreamCompaction::CPU::scan(SIZE, b, a); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); + printArray(SIZE, b, true); + + zeroArray(SIZE, c); + printDesc("cpu scan, non-power-of-two (First run)"); + StreamCompaction::CPU::scan(NPOT, c, a); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); + printArray(NPOT, b, true); + printCmpResult(NPOT, b, c); + zeroArray(SIZE, c); - printDesc("cpu scan, non-power-of-two"); + printDesc("cpu scan, non-power-of-two (Second run)"); StreamCompaction::CPU::scan(NPOT, c, a); printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); printArray(NPOT, b, true); @@ -54,11 +67,11 @@ int main(int argc, char* argv[]) { //printArray(SIZE, c, true); printCmpResult(SIZE, b, c); - /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan + /*For bug-finding only: Array of 1s to help find bugs in stream compaction or scan onesArray(SIZE, c); printDesc("1s array for finding bugs"); StreamCompaction::Naive::scan(SIZE, c, a); - printArray(SIZE, c, true); */ + printArray(SIZE, c, true);*/ zeroArray(SIZE, c); printDesc("naive scan, non-power-of-two"); diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 2ed6d63..7b4c201 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -23,7 +23,13 @@ namespace StreamCompaction { * which map to 0 will be removed, and elements which map to 1 will be kept. 
*/ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO + int index = threadIdx.x + (blockIdx.x * blockDim.x); + + if (index >= n) return; + bools[index] = (idata[index] == 0) ? 0 : 1; } /** @@ -32,7 +38,10 @@ namespace StreamCompaction { */ __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { - // TODO + int index = threadIdx.x + (blockIdx.x * blockDim.x); + + if (index >= n) return; + if (bools[index] == 1) odata[indices[index]] = idata[index]; } } diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 719fa11..75dacf9 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -19,10 +19,34 @@ namespace StreamCompaction { */ void scan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + for (int i = 0; i < n; ++i) { + int prefix_idx = i - 1; + if (prefix_idx < 0) { + odata[i] = 0; + } + else { + odata[i] = odata[prefix_idx] + idata[prefix_idx]; + } + } timer().endCpuTimer(); } + /** + * CPU scan (prefix sum) as a helper method. + * For performance analysis, this is implemented as a simple for loop. + */ + void scanImplementation(int n, int* odata, const int* idata) { + for (int i = 0; i < n; ++i) { + int prefix_idx = i - 1; + if (prefix_idx < 0) { + odata[i] = 0; + } + else { + odata[i] = odata[prefix_idx] + idata[prefix_idx]; + } + } + } + /** * CPU stream compaction without using the scan function. 
* @@ -30,9 +54,16 @@ namespace StreamCompaction { */ int compactWithoutScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + int count = 0; + for (int i = 0; i < n; ++i) { + int elem = idata[i]; + if (elem != 0) { + odata[count] = elem; + count++; + } + } timer().endCpuTimer(); - return -1; + return count; } /** @@ -42,9 +73,32 @@ namespace StreamCompaction { */ int compactWithScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + int* temp = new int[n]; // create temp array + // fill temp array with 0s and 1s + for (int i = 0; i < n; ++i) { + int elem = idata[i]; + if (elem != 0) { + temp[i] = 1; + } + else { + temp[i] = 0; + } + } + // run scan + int* scanned = new int[n] {0}; + StreamCompaction::CPU::scanImplementation(n, scanned, temp); + // scatter + int count = 0; + for (int i = 0; i < n; ++i) { + if (temp[i] == 1) { + odata[scanned[i]] = idata[i]; + count++; + } + } timer().endCpuTimer(); - return -1; + delete[] scanned; + delete[] temp; + return count; } } } diff --git a/stream_compaction/cpu.h b/stream_compaction/cpu.h index 873c047..1b913c9 100644 --- a/stream_compaction/cpu.h +++ b/stream_compaction/cpu.h @@ -8,6 +8,8 @@ namespace StreamCompaction { void scan(int n, int *odata, const int *idata); + void scanImplementation(int n, int* odata, const int* idata); + int compactWithoutScan(int n, int *odata, const int *idata); int compactWithScan(int n, int *odata, const int *idata); diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 2db346e..a7a3e6e 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -1,7 +1,11 @@ #include #include +#include #include "common.h" #include "efficient.h" +#include "thrust.h" + +#define blockSize 128 namespace StreamCompaction { namespace Efficient { @@ -12,13 +16,69 @@ namespace StreamCompaction { return timer; } + __global__ void kernUpSweep(int n, int d, int* in, int pow2_d, int pow2_d1) { + int index = 
threadIdx.x + (blockIdx.x * blockDim.x); + + if (index >= n) return; + if (index % pow2_d1 == 0) { + in[index + pow2_d1 - 1] += in[index + pow2_d - 1]; + } + } + + __global__ void kernDownSweep(int n, int d, int* in, int pow2_d, int pow2_d1) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + + if (index >= n) return; + if (index % pow2_d1 == 0) { + int left = in[index + pow2_d - 1]; + in[index + pow2_d - 1] = in[index + pow2_d1 - 1]; + in[index + pow2_d1 - 1] += left; + } + } + + /** + * Helper method to calculate the distance from the nearest power of 2 greater than or equal to n + */ + int distanceFromPowTwo(int n) { + int pos = ceil(log2(n)); + return int(1 << pos) - n; + } + + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + int* input; + int numItems = n; + int zerosToPad = distanceFromPowTwo(n); + if (zerosToPad == 0) { + cudaMalloc((void**)&input, numItems * sizeof(int)); + cudaMemcpy(input, idata, sizeof(int) * numItems, cudaMemcpyHostToDevice); + } + else { + numItems += zerosToPad; + cudaMalloc((void**)&input, numItems * sizeof(int)); + cudaMemcpy(input + zerosToPad, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + cudaMemset(input, 0, zerosToPad * sizeof(int)); + } + dim3 fullBlocksPerGrid((numItems + blockSize - 1) / blockSize); timer().startGpuTimer(); - // TODO + // up sweep + for (int d = 0; d <= ilog2ceil(numItems) - 1; ++d) { + int pow2_d = 1 << d; + int pow2_d1 = 1 << (d + 1); + kernUpSweep <<<fullBlocksPerGrid, blockSize>>> (numItems, d, input, pow2_d, pow2_d1); + } + // down sweep + cudaMemset(input + numItems - 1, 0, sizeof(int)); + for (int d = ilog2ceil(numItems) - 1; d >= 0; --d) { + int pow2_d = 1 << d; + int pow2_d1 = 1 << (d + 1); + kernDownSweep <<<fullBlocksPerGrid, blockSize>>> (numItems, d, input, pow2_d, pow2_d1); + } timer().endGpuTimer(); + cudaMemcpy(odata, input + zerosToPad, sizeof(int) * n, cudaMemcpyDeviceToHost); + cudaFree(input); } /** @@ -31,10 +91,50 @@ namespace StreamCompaction { * 
@returns The number of elements remaining after compaction. */ int compact(int n, int *odata, const int *idata) { + int* input; // input array + int* output; // output array + int* bools; // boolean array + int* scanned; // scan result + int zerosToPad = distanceFromPowTwo(n); + int numItems = n + zerosToPad; + cudaMalloc((void**)&input, numItems * sizeof(int)); + cudaMemcpy(input + zerosToPad, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + cudaMemset(input, 0, zerosToPad * sizeof(int)); + cudaMalloc((void**)&output, numItems * sizeof(int)); + cudaMalloc((void**)&bools, numItems * sizeof(int)); + cudaMalloc((void**)&scanned, numItems * sizeof(int)); + + dim3 fullBlocksPerGrid((numItems + blockSize - 1) / blockSize); + timer().startGpuTimer(); - // TODO + StreamCompaction::Common::kernMapToBoolean <<<fullBlocksPerGrid, blockSize>>> (numItems, bools, input); + // do scan here + cudaMemcpy(scanned, bools, sizeof(int) * numItems, cudaMemcpyDeviceToDevice); + for (int d = 0; d <= ilog2ceil(numItems) - 1; ++d) { + int pow2_d = 1 << d; + int pow2_d1 = 1 << (d + 1); + kernUpSweep <<<fullBlocksPerGrid, blockSize>>> (numItems, d, scanned, pow2_d, pow2_d1); + } + // down sweep + cudaMemset(scanned + numItems - 1, 0, sizeof(int)); + for (int d = ilog2ceil(numItems) - 1; d >= 0; --d) { + int pow2_d = 1 << d; + int pow2_d1 = 1 << (d + 1); + kernDownSweep <<<fullBlocksPerGrid, blockSize>>> (numItems, d, scanned, pow2_d, pow2_d1); + } + StreamCompaction::Common::kernScatter <<<fullBlocksPerGrid, blockSize>>> (numItems, output, input, bools, scanned); timer().endGpuTimer(); - return -1; + // compacted count = last exclusive-scan value + last boolean flag; + // counting zeroes in odata would miscount, since output past the + // compacted length is uninitialized device memory + int lastBool, lastScan; + cudaMemcpy(&lastBool, bools + numItems - 1, sizeof(int), cudaMemcpyDeviceToHost); + cudaMemcpy(&lastScan, scanned + numItems - 1, sizeof(int), cudaMemcpyDeviceToHost); + int result = lastBool + lastScan; + cudaMemcpy(odata, output, sizeof(int) * result, cudaMemcpyDeviceToHost); + cudaFree(input); + cudaFree(output); + cudaFree(bools); + cudaFree(scanned); + return result; } } } diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h index 803cb4f..cd68a4b 100644 --- a/stream_compaction/efficient.h +++ b/stream_compaction/efficient.h @@ -6,6 +6,8 @@ namespace 
StreamCompaction { namespace Efficient { StreamCompaction::Common::PerformanceTimer& timer(); + int distanceFromPowTwo(int n); + void scan(int n, int *odata, const int *idata); int compact(int n, int *odata, const int *idata); diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 4308876..8b57f89 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -3,6 +3,8 @@ #include "common.h" #include "naive.h" +#define blockSize 128 + namespace StreamCompaction { namespace Naive { using StreamCompaction::Common::PerformanceTimer; @@ -11,15 +13,57 @@ namespace StreamCompaction { static PerformanceTimer timer; return timer; } - // TODO: __global__ + + __global__ void kernNaiveScan(int n, int d, int* in1, int* in2, int* out, int pow2_d) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + + if (index >= n) return; + if (index >= pow2_d) { + out[index] = in1[index - pow2_d] + in1[index]; + } + else { + out[index] = in1[index]; // carry forward values that are already final + } + in2[index] = out[index]; + } + + __global__ void kernShiftArray(int n, int* in1, int* in2) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + + if (index >= n) return; + if (index == 0) in2[index] = 0; + else { + in2[index] = in1[index - 1]; + } + } /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
*/ void scan(int n, int *odata, const int *idata) { + // first allocate buffers and define kernel parameters + int* input; + int* input_temp; + int* output; + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + cudaMalloc((void**)&input, n * sizeof(int)); + cudaMalloc((void**)&input_temp, n * sizeof(int)); + cudaMalloc((void**)&output, n * sizeof(int)); + cudaMemcpy(input, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + cudaMemcpy(input_temp, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + cudaMemcpy(output, odata, sizeof(int) * n, cudaMemcpyHostToDevice); timer().startGpuTimer(); - // TODO + // append identity to beginning and shift array + kernShiftArray <<<fullBlocksPerGrid, blockSize>>> (n, input, input_temp); + std::swap(input_temp, input); + // make ilog2ceil(n) kernel calls for scan + for (int d = 1; d <= ilog2ceil(n); ++d) { + int pow2 = 1 << (d - 1); + kernNaiveScan <<<fullBlocksPerGrid, blockSize>>> (n, d, input, input_temp, output, pow2); + std::swap(input, input_temp); + } timer().endGpuTimer(); + cudaMemcpy(odata, input, sizeof(int) * n, cudaMemcpyDeviceToHost); + cudaFree(input); + cudaFree(input_temp); + cudaFree(output); } } } diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index 1def45e..f11bb95 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -18,11 +18,14 @@ namespace StreamCompaction { * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
*/ void scan(int n, int *odata, const int *idata) { + thrust::host_vector<int> host_in(idata, idata + n); + thrust::host_vector<int> host_out(odata, odata + n); + thrust::device_vector<int> dev_in = host_in; + thrust::device_vector<int> dev_out = host_out; timer().startGpuTimer(); - // TODO use `thrust::exclusive_scan` - // example: for device_vectors dv_in and dv_out: - // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + thrust::exclusive_scan(dev_in.begin(), dev_in.end(), dev_out.begin()); timer().endGpuTimer(); + thrust::copy(dev_out.begin(), dev_out.end(), odata); } } }