diff --git a/README.md b/README.md
index 0e38ddb..378353b 100644
--- a/README.md
+++ b/README.md
@@ -3,12 +3,122 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Xuanyi Zhou
+  * [LinkedIn](https://www.linkedin.com/in/xuanyi-zhou-661365192/), [Github](https://github.com/lukedan)
+* Tested on: Windows 10, i7-9750H @ 2.60GHz 32GB, RTX 2060 6GB
 
-### (TODO: Your README)
+## Features
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+- Regular CPU and thrust implementations of the scan and compact operations.
+- Naive implementation of the scan operation.
+- Recursive work-efficient scan implementation that supports arbitrary array lengths and makes use of shared memory and bank-conflict avoidance.
+- Efficient compact implementation that builds on the efficient scan operation.
+
+## Questions
+
+- Roughly optimize the block sizes of each of your implementations for minimal run time on your GPU.
+
+  For the naive scan, changing the block size does not affect execution time much, as the operation is bound by memory bandwidth. For the efficient scan and compaction, the optimal block size is around 64 to 128. Tracing the program shows that thrust also uses a block size of 128.
+
+- Compare all of these GPU Scan implementations (Naive, Work-Efficient, and Thrust) to the serial CPU version of Scan. Plot a graph of the comparison (with array size on the independent axis).
+
+  Below is a graph of the run times of all implementations:
+
+  ![](img/perf.png)
+
+  See `data.xlsx` for the raw data. Note that since the efficient scan implementation is recursive, some memory allocations are included in the measured time.
+
+  For small array sizes, all methods are fairly fast and the run times are relatively unpredictable. As the array size increases, the relationship between the run times becomes consistent: the naive scan is the slowest, followed by the CPU implementation, then the work-efficient scan, with the thrust implementation the fastest.
+
+  Looking at the Nsight timeline and kernel calls, it can be seen that the thrust implementation spawns far fewer threads than the work-efficient implementation, which requires one thread for every two array elements. The thrust implementation also uses far more registers per thread. This suggests that thrust may be using an optimization mentioned in the GPU Gems chapter: performing a scan over a few elements in each thread serially, then aggregating the results by scanning the per-thread sums. A sketch of this idea appears at the end of the next answer.
+
+  Interestingly, the gaps between copying data from the CPU to the GPU and back are roughly the same for the work-efficient and thrust implementations, as shown in the timeline, but the work-efficient implementation is busy throughout the gap while the thrust implementation is largely idle for its first half. I currently do not have a reasonable explanation for this.
+
+- Write a brief explanation of the phenomena you see here.
+
+  Since the naive implementation does not use any shared memory, it is understandable that it would be slower than the CPU implementation: the GPU has a much lower clock speed than the CPU and does not have nearly the same amount of cache.
+
+  The thrust implementation and the work-efficient implementation are both faster than the CPU version. The thrust version is about three times faster than the work-efficient one, possibly because it uses additional optimizations such as the one described in the previous answer.
+
+  The optimization described above likely works because it reduces the number of elements the parallel scan has to process by a large factor (4x if each thread sums 4 elements serially), while summing a few numbers serially is cheap and parallelizes perfectly across threads.
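+
+  To make this concrete, below is a minimal sketch of that optimization (hypothetical kernels written for this README, not thrust's actual code): each thread scans a small chunk serially, the per-thread totals are scanned separately (for example with the recursive `dev_scan` from this project), and the resulting offsets are added back.
+
+  ```
+  // Hypothetical illustration only. Each thread performs an exclusive scan of
+  // ELEMS_PER_THREAD consecutive elements in registers and records the chunk
+  // total in threadSums.
+  constexpr int ELEMS_PER_THREAD = 4;
+
+  __global__ void kernScanSerialChunks(int n, int *data, int *threadSums) {
+      int tid = blockIdx.x * blockDim.x + threadIdx.x;
+      int base = tid * ELEMS_PER_THREAD;
+      if (base >= n) {
+          return;
+      }
+      int running = 0;
+      for (int i = 0; i < ELEMS_PER_THREAD && base + i < n; ++i) {
+          int v = data[base + i];
+          data[base + i] = running;
+          running += v;
+      }
+      threadSums[tid] = running; // chunk total, to be scanned across threads
+  }
+
+  // After an exclusive scan of threadSums, each thread adds its chunk's
+  // global offset back onto its elements.
+  __global__ void kernAddChunkOffsets(int n, int *data, const int *scannedSums) {
+      int tid = blockIdx.x * blockDim.x + threadIdx.x;
+      int base = tid * ELEMS_PER_THREAD;
+      for (int i = 0; i < ELEMS_PER_THREAD && base + i < n; ++i) {
+          data[base + i] += scannedSums[tid];
+      }
+  }
+  ```
+
+  The parallel scan then only touches `n / ELEMS_PER_THREAD` partial sums instead of `n` elements, which would be consistent with the smaller thread counts observed in the thrust kernels.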
+- Paste the output of the test program into a triple-backtick block in your README.
+
+  The tests for radix sort can be seen at the end of the output.
+
+  ```
+  ****************
+  ** SCAN TESTS **
+  ****************
+  [ 45 4 11 40 10 5 35 48 33 44 0 28 24 ... 41 0 ]
+  ==== cpu scan, power-of-two ====
+  elapsed time: 78.9076ms (std::chrono Measured)
+  [ 0 45 49 60 100 110 115 150 198 231 275 275 303 ... -1007647599 -1007647558 ]
+  ==== cpu scan, non-power-of-two ====
+  elapsed time: 77.9638ms (std::chrono Measured)
+  [ 0 45 49 60 100 110 115 150 198 231 275 275 303 ... -1007647704 -1007647672 ]
+  passed
+  ==== naive scan, power-of-two ====
+  elapsed time: 133.709ms (CUDA Measured)
+  passed
+  ==== naive scan, non-power-of-two ====
+  elapsed time: 112.619ms (CUDA Measured)
+  passed
+  ==== work-efficient scan, power-of-two ====
+  elapsed time: 11.0549ms (CUDA Measured)
+  passed
+  ==== work-efficient scan, non-power-of-two ====
+  elapsed time: 11.0906ms (CUDA Measured)
+  passed
+  ==== thrust scan, power-of-two ====
+  elapsed time: 4.04538ms (CUDA Measured)
+  passed
+  ==== thrust scan, non-power-of-two ====
+  elapsed time: 4.0943ms (CUDA Measured)
+  passed
+
+  *****************************
+  ** STREAM COMPACTION TESTS **
+  *****************************
+  [ 3 0 2 2 1 0 0 2 2 3 2 3 1 ... 2 0 ]
+  ==== cpu compact without scan, power-of-two ====
+  elapsed time: 307.968ms (std::chrono Measured)
+  [ 3 2 2 1 2 2 3 2 3 1 3 2 2 ... 2 2 ]
+  passed
+  ==== cpu compact without scan, non-power-of-two ====
+  elapsed time: 304.929ms (std::chrono Measured)
+  [ 3 2 2 1 2 2 3 2 3 1 3 2 2 ... 2 2 ]
+  passed
+  ==== cpu compact with scan ====
+  elapsed time: 422.665ms (std::chrono Measured)
+  [ 3 2 2 1 2 2 3 2 3 1 3 2 2 ... 2 2 ]
+  passed
+  ==== work-efficient compact, power-of-two ====
+  elapsed time: 25.4493ms (CUDA Measured)
+  [ 3 2 2 1 2 2 3 2 3 1 3 2 2 ... 2 2 ]
+  passed
+  ==== work-efficient compact, non-power-of-two ====
+  elapsed time: 23.5873ms (CUDA Measured)
+  [ 3 2 2 1 2 2 3 2 3 1 3 2 2 ... 2 2 ]
+  passed
+  ==== thrust compact, power-of-two ====
+  elapsed time: 4.44602ms (CUDA Measured)
+  passed
+  ==== thrust compact, non-power-of-two ====
+  elapsed time: 4.03315ms (CUDA Measured)
+  passed
+
+  *********************
+  ** RADIX SORT TEST **
+  *********************
+  [ 943102324 1728649027 1523795418 230368144 1853983028 219035492 1373487995 539655339 345004302 1682352720 528619710 1142157171 735013686 ... 646714987 484939495 ]
+  ==== radix sort, power-of-two ====
+  elapsed time: 1305.98ms (CUDA Measured)
+  passed
+  [ 42 45 55 58 63 67 78 122 162 170 188 206 221 ... 1999999985 1999999998 ]
+  ==== radix sort, non-power-of-two ====
+  elapsed time: 1338.5ms (CUDA Measured)
+  passed
+  [ 42 45 55 58 63 67 78 122 162 170 188 206 221 ... 1999999985 1999999998 ]
+  Press any key to continue . . .
+  ```
diff --git a/data.xlsx b/data.xlsx
new file mode 100644
index 0000000..055bd1b
Binary files /dev/null and b/data.xlsx differ
diff --git a/img/perf.png b/img/perf.png
new file mode 100644
index 0000000..4b8e53a
Binary files /dev/null and b/img/perf.png differ
diff --git a/src/main.cpp b/src/main.cpp
index 896ac2b..fd74aea 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -7,13 +7,14 @@
  */
 
 #include <cstdio>
+#include <random>
 #include <stream_compaction/cpu.h>
 #include <stream_compaction/naive.h>
 #include <stream_compaction/efficient.h>
 #include <stream_compaction/thrust.h>
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 27; // feel free to change the size of array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];
@@ -137,16 +138,62 @@ int main(int argc, char* argv[]) {
     printDesc("work-efficient compact, power-of-two");
     count = StreamCompaction::Efficient::compact(SIZE, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(count, c, true);
+    printArray(count, c, true);
     printCmpLenResult(count, expectedCount, b, c);
 
     zeroArray(SIZE, c);
     printDesc("work-efficient compact, non-power-of-two");
     count = StreamCompaction::Efficient::compact(NPOT, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    printArray(count, c, true);
+    printCmpLenResult(count, expectedNPOT, b, c);
+
+    zeroArray(SIZE, c);
+    printDesc("thrust compact, power-of-two");
+    count = StreamCompaction::Thrust::compact(SIZE, c, a);
+    printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    //printArray(count, c, true);
+    printCmpLenResult(count, expectedCount, b, c);
+
+    zeroArray(SIZE, c);
+    printDesc("thrust compact, non-power-of-two");
+    count = StreamCompaction::Thrust::compact(NPOT, c, a);
+    printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
     //printArray(count, c, true);
     printCmpLenResult(count, expectedNPOT, b, c);
+
+    printf("\n");
+    printf("*********************\n");
+    printf("** RADIX SORT TEST **\n");
+    printf("*********************\n");
+
+    // generate the test array without genArray, which uses rand(): on this
+    // machine RAND_MAX is 0x7FFF, so not all bits would be exercised
+    std::default_random_engine rand(std::random_device{}());
+    std::uniform_int_distribution<int> dist(0, 2000000000);
+    for (std::size_t i = 0; i < SIZE; ++i) {
+        a[i] = dist(rand);
+    }
+    printArray(SIZE, a, true);
+
+    printDesc("radix sort, power-of-two");
+    std::memcpy(b, a, sizeof(int) * SIZE);
+    std::sort(b, b + SIZE);
+    StreamCompaction::Efficient::radix_sort(SIZE, c, a);
+    printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    printCmpResult(SIZE, b, c);
+    printArray(SIZE, c, true);
+
+    printDesc("radix sort, non-power-of-two");
+    std::memcpy(b, a, sizeof(int) * NPOT);
+    std::sort(b, b + NPOT);
+    StreamCompaction::Efficient::radix_sort(NPOT, c, a);
+    printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    printCmpResult(NPOT, b, c);
+    printArray(NPOT, c, true);
+
+    system("pause"); // stop the Win32 console from closing on exit
 
     delete[] a;
     delete[] b;
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
index 2ed6d63..8558538 100644
--- a/stream_compaction/common.cu
+++ b/stream_compaction/common.cu
@@ -23,7 +23,11 @@ namespace StreamCompaction {
      * which map to 0 will be removed, and elements which map to 1 will be kept.
      */
     __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
-        // TODO
+        int iSelf = blockIdx.x * blockDim.x + threadIdx.x;
+        if (iSelf >= n) {
+            return;
+        }
+        bools[iSelf] = idata[iSelf] != 0 ? 1 : 0;
     }
 
     /**
@@ -32,7 +36,14 @@ namespace StreamCompaction {
      */
     __global__ void kernScatter(int n, int *odata,
             const int *idata, const int *bools, const int *indices) {
-        // TODO
+        int iSelf = blockIdx.x * blockDim.x + threadIdx.x;
+        if (iSelf >= n) {
+            return;
+        }
+
+        // scatter only the kept elements, to their scanned indices
+        if (bools[iSelf] != 0) {
+            odata[indices[iSelf]] = idata[iSelf];
+        }
     }
 }
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index 719fa11..40b95f3 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -19,7 +19,13 @@ namespace StreamCompaction {
          */
         void scan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+
+            int total = 0;
+            for (int i = 0; i < n; ++i) {
+                odata[i] = total;
+                total += idata[i];
+            }
+
             timer().endCpuTimer();
         }
 
@@ -30,9 +36,17 @@ namespace StreamCompaction {
          */
         int compactWithoutScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+
+            int count = 0;
+            for (int i = 0; i < n; ++i) {
+                if (idata[i] != 0) {
+                    odata[count] = idata[i];
+                    ++count;
+                }
+            }
+
             timer().endCpuTimer();
-            return -1;
+            return count;
         }
 
         /**
@@ -42,9 +56,29 @@ namespace StreamCompaction {
          */
         int compactWithScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+
+            // exclusive scan of the keep/discard flags, stored in odata
+            int total = 0;
+            for (int i = 0; i < n; ++i) {
+                odata[i] = total;
+                total += (idata[i] == 0 ? 0 : 1);
+            }
+
+            // scatter; odata[i] != count means idata[i - 1] was kept
+            int count = 0;
+            for (int i = 1; i < n; ++i) {
+                if (odata[i] != count) {
+                    odata[count] = idata[i - 1];
+                    ++count;
+                }
+            }
+            if (idata[n - 1] != 0) {
+                odata[count] = idata[n - 1];
+                ++count;
+            }
+
             timer().endCpuTimer();
-            return -1;
+            return count;
         }
     }
 }
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 2db346e..3c5da31 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -1,26 +1,152 @@
+#include "efficient.h"
+
+#include <cassert>
+#include <utility>
+
 #include <cuda.h>
 #include <cuda_runtime.h>
+
 #include "common.h"
-#include "efficient.h"
 
 namespace StreamCompaction {
     namespace Efficient {
         using StreamCompaction::Common::PerformanceTimer;
-        PerformanceTimer& timer()
-        {
+        PerformanceTimer& timer() {
             static PerformanceTimer timer;
             return timer;
         }
 
+        // this many elements are processed by one block; the CUDA block size is half this
+        /*constexpr int log2BlockSize = 10; // 1024*/
+        /*constexpr int log2BlockSize = 9; // 512*/
+        /*constexpr int log2BlockSize = 8; // 256*/
+        constexpr int log2BlockSize = 7; // 128
+        /*constexpr int log2BlockSize = 6; // 64*/
+        constexpr int blockSize = 1 << log2BlockSize;
+
+        // number of shared memory banks (32) as a power of two
+        constexpr int log2BankSize = 5;
+
+        // pad shared memory indices so consecutive elements map to different banks
+        __device__ int conflictFreeIndex(int i) {
+            return i + (i >> log2BankSize);
+        }
+
+        __global__ void kernScanPerBlock(int *data, int *lastData) {
+            extern __shared__ int buffer[];
+            data += blockIdx.x * blockDim.x * 2;
+
+            int offset1 = conflictFreeIndex(threadIdx.x), offset2 = conflictFreeIndex(threadIdx.x + blockDim.x);
+
+            // copy data to shared memory; each thread handles two elements
+            buffer[offset1] = data[threadIdx.x];
+            buffer[offset2] = data[threadIdx.x + blockDim.x];
+
+            int lastElem = 0;
+            if (lastData && threadIdx.x == blockDim.x - 1) {
+                lastElem = buffer[offset2];
+            }
+            __syncthreads();
+
+            // upward (reduce) pass
+            for (int halfGap = 1; halfGap < blockDim.x; halfGap <<= 1) {
+                if (threadIdx.x < blockDim.x / halfGap) {
+                    int
+                        id1 = conflictFreeIndex((threadIdx.x * 2 + 1) * halfGap - 1),
+                        id2 = conflictFreeIndex((threadIdx.x * 2 + 2) * halfGap - 1);
+                    buffer[id2] += buffer[id1];
+                }
+                __syncthreads();
+            }
+
+            // this single step is equivalent to the last upward step, zeroing the
+            // root, and the first downward step: move the first half's sum into
+            // the last slot and zero the middle element
+            if (threadIdx.x == blockDim.x - 1) {
+                buffer[conflictFreeIndex(blockDim.x * 2 - 1)] = buffer[offset1];
+                buffer[offset1] = 0;
+            }
+            __syncthreads();
+
+            // downward pass
+            for (int halfGap = blockDim.x >> 1; halfGap >= 1; halfGap >>= 1) {
+                if (threadIdx.x < blockDim.x / halfGap) {
+                    int prevIdx = (threadIdx.x * 2 + 1) * halfGap - 1;
+                    int thisIdx = prevIdx + halfGap;
+                    prevIdx = conflictFreeIndex(prevIdx);
+                    thisIdx = conflictFreeIndex(thisIdx);
+                    int sum = buffer[thisIdx] + buffer[prevIdx];
+                    buffer[prevIdx] = buffer[thisIdx];
+                    buffer[thisIdx] = sum;
+                }
+                __syncthreads();
+            }
+
+            // copy data back
+            data[threadIdx.x] = buffer[offset1];
+            data[threadIdx.x + blockDim.x] = buffer[offset2];
+            // record this block's total so the caller can scan the block sums
+            if (lastData && threadIdx.x == blockDim.x - 1) {
+                lastData[blockIdx.x] = lastElem + buffer[offset2];
+            }
+        }
+
+        // no bounds check: only launched on buffers whose sizes are multiples of
+        // the block size
+        __global__ void kernAddConstantToBlock(int *data, const int *amount) {
+            data[blockIdx.x * blockDim.x + threadIdx.x] += amount[blockIdx.x];
+        }
+
+        void _computeSizes(int n, int log2BlockSize, int *numBlocks, int *bufferSize) {
+            *numBlocks = n >> log2BlockSize;
+            if ((n & ((1 << log2BlockSize) - 1)) != 0) {
+                ++*numBlocks;
+            }
+            *bufferSize = *numBlocks << log2BlockSize;
+        }
+
+        void dev_scan(int n, int *dev_data) {
+            assert((n & (blockSize - 1)) == 0);
+
+            if (n > blockSize) {
+                int numBlocks = n >> log2BlockSize, numIndirectBlocks, indirectSize;
+                _computeSizes(numBlocks, log2BlockSize, &numIndirectBlocks, &indirectSize);
+
+                int *buffer;
+                cudaMalloc(&buffer, sizeof(int) * indirectSize);
+
+                // scan each block, collecting the block totals into buffer
+                kernScanPerBlock<<<
+                    numBlocks, blockSize / 2, (blockSize + (blockSize >> log2BankSize)) * sizeof(int)
+                >>>(dev_data, buffer);
+                // recursively scan the block totals, then add them back
+                dev_scan(indirectSize, buffer);
+                kernAddConstantToBlock<<<numBlocks, blockSize>>>(dev_data, buffer);
+
+                cudaFree(buffer);
+            } else { // just scan the block
+                kernScanPerBlock<<<
+                    1, blockSize / 2, (blockSize + (blockSize >> log2BankSize)) * sizeof(int)
+                >>>(dev_data, nullptr);
+            }
+        }
+
+
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
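+         * The input is copied into a device buffer padded to a multiple of the
+         * block size, and the padding is zeroed so it does not affect the result.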
         */
        void scan(int n, int *odata, const int *idata) {
+            int numBlocks, bufferSize;
+            _computeSizes(n, log2BlockSize, &numBlocks, &bufferSize);
+
+            int *buffer;
+            cudaMalloc(&buffer, sizeof(int) * bufferSize);
+
+            cudaMemcpy(buffer, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+            // zero the padding; if integer overflow on the GPU were well-defined we
+            // would be able to get away without zeroing the rest
+            cudaMemset(buffer + n, 0, sizeof(int) * (bufferSize - n));
+
             timer().startGpuTimer();
-            // TODO
+            dev_scan(bufferSize, buffer);
             timer().endGpuTimer();
+
+            odata[0] = 0;
+            cudaMemcpy(odata, buffer, sizeof(int) * n, cudaMemcpyDeviceToHost);
+
+            cudaFree(buffer);
+            checkCUDAError("efficient scan");
         }
 
+
         /**
          * Performs stream compaction on idata, storing the result into odata.
          * All zeroes are discarded.
@@ -31,10 +157,107 @@
          * @returns The number of elements remaining after compaction.
          */
         int compact(int n, int *odata, const int *idata) {
+            constexpr int log2ScatterBlockSize = 6;
+            constexpr int scatterBlockSize = 1 << log2ScatterBlockSize;
+
+            int numBlocks, bufferSize;
+            _computeSizes(n, log2BlockSize, &numBlocks, &bufferSize);
+            int numScatterBlocks = (n + scatterBlockSize - 1) >> log2ScatterBlockSize;
+
+            int *data, *accum, *out;
+            cudaMalloc(&data, sizeof(int) * bufferSize);
+            cudaMalloc(&accum, sizeof(int) * bufferSize);
+            cudaMalloc(&out, sizeof(int) * bufferSize);
+
+            cudaMemcpy(data, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+
             timer().startGpuTimer();
-            // TODO
+            Common::kernMapToBoolean<<<numScatterBlocks, scatterBlockSize>>>(n, accum, data);
+            dev_scan(bufferSize, accum);
+            // the input doubles as the boolean array since kernScatter only checks
+            // for non-zero values
+            Common::kernScatter<<<numScatterBlocks, scatterBlockSize>>>(n, out, data, data, accum);
             timer().endGpuTimer();
-            return -1;
+
+            // the scan is exclusive, so add 1 if the last input element is kept
+            int last = idata[n - 1] != 0 ? 1 : 0, res;
+            cudaMemcpy(&res, accum + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+            res += last;
+            cudaMemcpy(odata, out, sizeof(int) * res, cudaMemcpyDeviceToHost);
+            checkCUDAError("efficient compaction");
+
+            cudaFree(data);
+            cudaFree(accum);
+            cudaFree(out);
+
+            return res;
+        }
+
+
+        __global__ void kernExtractBit(int n, int bit, int *odata, const int *idata) {
+            int iSelf = blockIdx.x * blockDim.x + threadIdx.x;
+            if (iSelf >= n) {
+                return;
+            }
+            odata[iSelf] = (idata[iSelf] & (1 << bit)) != 0 ? 1 : 0;
+        }
+
+        // no bounds check: only launched on buffers whose sizes are multiples of
+        // the block size
+        __global__ void kernNegate(int *odata, const int *idata) {
+            int iSelf = blockIdx.x * blockDim.x + threadIdx.x;
+            odata[iSelf] = idata[iSelf] == 0 ? 1 : 0;
+        }
+
+        __global__ void kernRadixSortScatter(
+            int n, int numFalses, int bit, int *odata, const int *idata, const int *trues, const int *falses
+        ) {
+            int iSelf = blockIdx.x * blockDim.x + threadIdx.x;
+            if (iSelf >= n) {
+                return;
+            }
+            int value = idata[iSelf], index;
+            if ((value & (1 << bit)) != 0) {
+                // elements with the bit set go after all elements with it unset
+                index = trues[iSelf] + numFalses;
+            } else {
+                index = falses[iSelf];
+            }
+            odata[index] = value;
+        }
+
+        void radix_sort(int n, int *odata, const int *idata) {
+            // sort on the low 31 bits only; the inputs are non-negative, so the
+            // sign bit is always zero
+            constexpr int numIntBits = sizeof(int) * 8 - 1;
+
+            int numBlocks, bufferSize;
+            _computeSizes(n, log2BlockSize, &numBlocks, &bufferSize);
+
+            int *data1, *data2, *trues, *falses;
+            cudaMalloc(&data1, sizeof(int) * n);
+            cudaMalloc(&data2, sizeof(int) * n);
+            cudaMalloc(&trues, sizeof(int) * bufferSize);
+            cudaMalloc(&falses, sizeof(int) * bufferSize);
+
+            cudaMemcpy(data1, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+
+            timer().startGpuTimer();
+            for (int i = 0; i < numIntBits; ++i) {
+                kernExtractBit<<<numBlocks, blockSize>>>(n, i, trues, data1);
+                kernNegate<<<numBlocks, blockSize>>>(falses, trues);
+                dev_scan(bufferSize, trues);
+                dev_scan(bufferSize, falses);
+                // the scans are exclusive, so the last element must be counted manually
+                int numFalses, lastElem;
+                cudaMemcpy(&lastElem, data1 + (n - 1), sizeof(int), cudaMemcpyDeviceToHost);
+                cudaMemcpy(&numFalses, falses + (n - 1), sizeof(int), cudaMemcpyDeviceToHost);
+                if ((lastElem & (1 << i)) == 0) {
+                    ++numFalses;
+                }
+                kernRadixSortScatter<<<numBlocks, blockSize>>>(n, numFalses, i, data2, data1, trues, falses);
+                std::swap(data1, data2);
+            }
+            timer().endGpuTimer();
+
+            cudaMemcpy(odata, data1, sizeof(int) * n, cudaMemcpyDeviceToHost);
+
+            cudaFree(data1);
+            cudaFree(data2);
+            cudaFree(trues);
+            cudaFree(falses);
+            checkCUDAError("radix sort");
         }
     }
 }
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
index 803cb4f..f7706d8 100644
--- a/stream_compaction/efficient.h
+++ b/stream_compaction/efficient.h
@@ -6,8 +6,13 @@ namespace StreamCompaction {
     namespace Efficient {
         StreamCompaction::Common::PerformanceTimer& timer();
 
+        /// Scans a device array. The input size must be a multiple of the block
+        /// size configured in efficient.cu (currently 128).
+        void dev_scan(int n, int *dev_data);
+
         void scan(int n, int *odata, const int *idata);
 
         int compact(int n, int *odata, const int *idata);
+
+        void radix_sort(int n, int *odata, const int *idata);
     }
 }
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 4308876..3286b33 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -11,15 +11,49 @@ namespace StreamCompaction {
             static PerformanceTimer timer;
             return timer;
         }
-        // TODO: __global__
+
+        constexpr int log2BlockSize = 7;
+        constexpr int block_size = 1 << log2BlockSize;
+
+        __global__ void kernScanPass(int n, int diff, int *odata, const int *idata) {
+            int iSelf = blockIdx.x * blockDim.x + threadIdx.x;
+            if (iSelf >= n) {
+                return;
+            }
+            if (iSelf >= diff) {
+                odata[iSelf] = idata[iSelf] + idata[iSelf - diff];
+            } else {
+                odata[iSelf] = idata[iSelf];
+            }
+        }
 
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
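+         * The naive scan performs ceil(log2 n) ping-pong passes over global
+         * memory; the pass with offset `diff` adds to each element the element
+         * `diff` slots before it, and the inclusive result is then shifted right
+         * by one element to make the scan exclusive.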
         */
        void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
+            int *buf1 = nullptr, *buf2 = nullptr;
+            cudaMalloc(&buf1, sizeof(int) * n);
+            cudaMalloc(&buf2, sizeof(int) * n);
+
+            cudaMemcpy(buf1, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+
+            {
+                timer().startGpuTimer();
+
+                int num_blocks = (n + block_size - 1) / block_size;
+                for (int diff = 1; diff < n; diff *= 2) {
+                    kernScanPass<<<num_blocks, block_size>>>(n, diff, buf2, buf1);
+                    std::swap(buf1, buf2);
+                }
+
+                timer().endGpuTimer();
+            }
+
+            // shift right by one element to turn the inclusive scan into an
+            // exclusive scan
+            odata[0] = 0;
+            cudaMemcpy(odata + 1, buf1, sizeof(int) * (n - 1), cudaMemcpyDeviceToHost);
+
+            cudaFree(buf1);
+            cudaFree(buf2);
         }
     }
 }
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e..2769bf5 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -18,11 +18,35 @@ namespace StreamCompaction {
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
         void scan(int n, int *odata, const int *idata) {
+            thrust::device_vector<int> dev_vec(idata, idata + n), dev_out(n);
+
             timer().startGpuTimer();
-            // TODO use `thrust::exclusive_scan`
+            // use `thrust::exclusive_scan`
             // example: for device_vectors dv_in and dv_out:
             // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            thrust::exclusive_scan(dev_vec.begin(), dev_vec.end(), dev_out.begin());
             timer().endGpuTimer();
+
+            thrust::copy(dev_out.begin(), dev_out.end(), odata);
+        }
+
+        struct isNonZero {
+            __host__ __device__ bool operator()(int x) {
+                return x != 0;
+            }
+        };
+
+        int compact(int n, int *odata, const int *idata) {
+            thrust::device_vector<int> dev_vec(idata, idata + n), dev_out(n);
+
+            timer().startGpuTimer();
+            // copy_if returns an iterator past the last kept element
+            int count = thrust::copy_if(
+                dev_vec.begin(), dev_vec.end(), dev_out.begin(), isNonZero()
+            ) - dev_out.begin();
+            timer().endGpuTimer();
+
+            thrust::copy(dev_out.begin(), dev_out.end(), odata);
+            return count;
         }
     }
 }
diff --git a/stream_compaction/thrust.h b/stream_compaction/thrust.h
index fe98206..a8991f9 100644
--- a/stream_compaction/thrust.h
+++ b/stream_compaction/thrust.h
@@ -7,5 +7,7 @@ namespace StreamCompaction {
         StreamCompaction::Common::PerformanceTimer& timer();
 
         void scan(int n, int *odata, const int *idata);
+
+        int compact(int n, int *odata, const int *idata);
     }
 }