diff --git a/README.md b/README.md index 0e38ddb..765bf7a 100644 --- a/README.md +++ b/README.md @@ -3,12 +3,239 @@ CUDA Stream Compaction **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +## Stream Compaction and Scan Algorithms with CUDA on the GPU +### Connie Chang + * [LinkedIn](https://www.linkedin.com/in/conniechang44), [Demo Reel](https://www.vimeo.com/ConChang/DemoReel) +* Tested on: Windows 10, Intel Xeon CPU E5-1630 v4 @ 3.70 GHz, GTX 1070 8GB (SIG Lab) -### (TODO: Your README) +## Introduction +This project explores different algorithms for stream compaction and exclusive scan, implemented in parallel with CUDA. We compare their performance to see which one is fastest. -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +![](img/figure-39-2.jpg) +An image from _GPU Gems_, illustrating the naive scan algorithm. +__Exclusive Scan__ +Exclusive scan takes an array of integers as input and outputs another array of integers of the same length. Each element of the output is the sum of all previous elements of the input, excluding the current element. (For example, output[i] = input[0] + ... + input[i - 1].) + +__Stream Compaction__ +Stream compaction takes an array of integers as input and removes all the zeroes in it. Therefore, it outputs another array of integers that is of the same length or shorter. (For example, an input of [1, 5, 0, 3, 6, 0, 9] would result in [1, 5, 3, 6, 9].) + +## CPU Implementation +The CPU implementation is the most basic of the implementations in this project. Its results become the baseline for comparison with the later methods. + +__Exclusive Scan__ +On the CPU, exclusive scan is a simple for loop summing up all the elements in the input array. The pseudocode is as follows: +``` +sum = 0 +for (i = 0 ... (n - 1)) // n is size of input + output[i] = sum + sum += input[i] +``` + +__Stream Compaction__ +Likewise, stream compaction loops through all elements of the input array. If an element is not a zero, it is appended to the end of the output. +``` +count = 0 +for (i = 0 ... (n - 1)) // n is size of input + if input[i] != 0 + output[count] = input[i] + count++ +``` + +__Stream Compaction using Scan__ +This implementation is more interesting because it uses two other algorithms (scan and scatter) to complete stream compaction. First, the input is mapped to an array of 0s and 1s. If the input element is a 0, then the map contains a 0 at that position. Otherwise, it contains a 1. For example, +``` +input = [1, 5, 0, 3, 6, 0, 9] +``` +maps to +``` +01 map = [1, 1, 0, 1, 1, 0, 1] +``` +Next, the 01 map is scanned using an exclusive scan. The result of the scan gives the index where each non-zero element goes in the final output. For our example, +``` +scan of 01 map = [0, 1, 2, 2, 3, 4, 4] +``` +Finally, to produce the result, we loop through the 01 map. If the current position contains a 1, we know the input element at that position belongs in the output. To find which index it goes to in the output, we look at the scan of the 01 map: the number at the current position is the final index. Then, we just copy the input element to that index in the output. + +Here's the pseudocode of the entire algorithm: +``` +// Map input to 01 map +for (i = 0 ... 
(n - 1)) // n is length of input + if (input[i] == 0) + 01map[i] = 0 + else + 01map[i] = 1 + +// Run exclusive scan on 01map +scan = exclusive scan(01map) + +// Scatter +for (i = 0 ... (n - 1)) + if 01map[i] == 1 + output[scan[i]] = input[i] +``` + +## GPU Naive Algorithm +The naive algorithm on the GPU takes advantage of the GPU's parallelism: each pass updates every element in parallel. The algorithm starts by replacing the current element with the sum of the current element and the one before it. Then, the current element is replaced by the sum of itself and the integer 2 positions before it. Then, it is the sum of itself and the integer 2^2=4 positions before. Then, itself and 2^3=8 positions before. And so on, until the offset exceeds the size of the array. The image below, from *GPU Gems*, illustrates this method: + +![](img/figure-39-2.jpg) + +Note that this creates an inclusive scan. We need to convert it to an exclusive scan by shifting every element to the right and adding a zero to the beginning. + +The pseudocode for the CPU side is: +``` +for (k = 1 ... ilog2ceil(n)) // n is size of input + Launch kernel, passing in k +``` + +The pseudocode for the device kernel is: +``` +if (index >= 2^(k-1)) + output[index] = input[index - 2^(k-1)] + input[index] +else + output[index] = input[index] +``` + +## GPU Work-Efficient Algorithm +The work-efficient algorithm is somewhat similar to the naive one but with some improvements (maybe? See Performance Analysis below). The algorithm has two steps: Up-Sweep and Down-Sweep. Up-Sweep starts by going to every other element and summing it with the previous element, replacing the current element with the sum. Then it does the same for the new sums, doubling the stride each time, until there is only one sum left. In a sense, you are building up a tree. The image below, taken from CIS 565 slides, demonstrates this: +![](img/work_efficient_up_sweep.PNG) + +The Down-Sweep is more complicated. First, the last element is set to 0. Then, each element passes its value to its left child in the tree. The right child becomes the sum of the current value and the left child's previous value. This process is repeated as we work down the tree. The image below, also taken from CIS 565 slides, illustrates this: +![](img/work_efficient_down_sweep.PNG) + +The pseudocode is as follows: +``` +// Up Sweep +for k = 0 ... (ilog2ceil(n) - 1) // n is size of input + Launch Up Sweep kernel + +Set last element of Up Sweep result to 0 + +// Down Sweep +for k = (ilog2ceil(n) - 1) ... 0 + Launch Down Sweep kernel +``` + +Pseudocode for the Up Sweep kernel: +``` +if (index % 2^(k+1) == 0) + data[index + 2^(k+1) - 1] += data[index + 2^k - 1] +``` + +Pseudocode for the Down Sweep kernel: +``` +if (index % 2^(k+1) == 0) + temp = data[index + 2^k - 1] + data[index + 2^k - 1] = data[index + 2^(k+1) - 1] // Set left child to current element + data[index + 2^(k+1) - 1] += temp // Set current element to sum of itself and left child +``` + +## GPU Thrust Function +For performance comparison, I also ran Thrust's implementation of exclusive scan (thrust::exclusive_scan). + +## Example Output from the Program +Here is an example of what the program outputs to the terminal. This run was done with an array size of 2^8. The tests were given in class, but I duplicated the Thrust tests so that each of them runs twice. +``` +**************** +** SCAN TESTS ** +**************** + [ 4 4 39 19 26 38 40 32 19 14 0 28 20 ... 2 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0.0005ms (std::chrono Measured) + [ 0 4 8 47 66 92 130 170 202 221 235 235 263 ... 
5837 5839 ] +==== cpu scan, non-power-of-two ==== + elapsed time: 0.0002ms (std::chrono Measured) + [ 0 4 8 47 66 92 130 170 202 221 235 235 263 ... 5785 5792 ] + passed +==== naive scan, power-of-two ==== + elapsed time: 0.058368ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 0.05632ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 0.149504ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 0.156672ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 4.34893ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 0.048128ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.063488ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.053248ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 0 2 3 1 0 0 2 2 3 2 0 2 0 ... 0 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 0.0008ms (std::chrono Measured) + [ 2 3 1 2 2 3 2 2 2 3 3 1 2 ... 2 1 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.0009ms (std::chrono Measured) + [ 2 3 1 2 2 3 2 2 2 3 3 1 2 ... 1 2 ] + passed +==== cpu compact with scan ==== + elapsed time: 0.0036ms (std::chrono Measured) + [ 2 3 1 2 2 3 2 2 2 3 3 1 2 ... 2 1 ] + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 0.4608ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 0.723968ms (CUDA Measured) + passed +Press any key to continue . . . +``` + +## Performance Analysis +The tables and graphs below show how long each algorithm took (in milliseconds) to scan/compact a given array size. To find the ideal block size, I ran each algorithm with block sizes of 128, 256, and 512, and used the fastest time for each. The CPU implementations are timed with std::chrono and the GPU implementations with CUDA event timers. No memory allocation is included in the timings. All the arrays in the analysis have sizes that are powers of two. However, the algorithms work on non-power-of-two sizes as well. Once array sizes reach the order of 2^19, cudaMalloc fails, presumably because the many large intermediate arrays exhaust the available memory. + +__Exclusive Scan Table (times in ms)__ + +| Array Size | CPU Scan | Naive Scan (block size 256) | Work-efficient Scan (block size 128) | Thrust Scan | +| ------------- | ------------- | ----- | ----- | ----- | +| 2^4 | 0.0002 | 0.050176 | 0.156672 | 0.050176 | +| 2^8 | 0.0006 | 0.057344 | 0.159744 | 0.050176 | +| 2^10 | 0.0012 | 0.064512 | 0.187392 | 0.066560 | +| 2^15 | 0.0325 | 0.130912 | 0.309696 | 0.260096 | +| 2^17 | 0.0985 | 0.435616 | 0.663552 | 0.339968 | +| 2^18 | 1.1116 | 0.820160 | 1.442270 | 0.589824 | + +__Exclusive Scan Graph__ +![](img/scan_performance.png) + +__Stream Compaction Table (times in ms)__ + +| Array Size | CPU Compact without scan | CPU Compact with scan | Work-efficient compact (block size 256) | +| ----- | ----- | ----- | ----- | +| 2^4 | 0.0002 | 0.0013 | 0.351232 | +| 2^8 | 0.0008 | 0.0038 | 0.369664 | +| 2^10 | 0.0025 | 0.0076 | 0.614400 | +| 2^15 | 0.0752 | 0.1144 | 0.759808 | +| 2^17 | 0.3076 | 0.7340 | 2.308100 | +| 2^18 | 0.6233 | 2.4728 | 3.639300 | + +__Stream Compaction Graph__ +![](img/compaction_performance.png) + + +__Analysis Questions__ + +_Can you find the performance bottlenecks? Is it memory I/O? Computation? 
Is it different for each implementation?_ + +My guess is that memory I/O is a bottleneck for the work-efficient methods, because of the step between the Up Sweep and Down Sweep of scan. I need to copy data from the GPU to the CPU just to set the last element to zero, and then copy it back to the GPU. This round trip between device and host is slow because the data has to travel over the PCIe bus. Likewise, stream compaction has extra memory copies between host and device. In fact, it copies data from the GPU to the CPU in order to pass the data to the scan function, only to have the scan function copy this data back to the GPU before computation begins. This is a very inefficient step. It could be avoided by moving the scan code into the compact function and altering it to work directly on the device arrays that compact already allocates. + +For naive, the bottleneck is probably at the end, where I convert the inclusive scan result to an exclusive scan. I do this sequentially on the CPU, looping through the entire array and shifting the values. This would be much more efficient on the GPU. + +_What might be happening inside the Thrust implementation?_ + +The first time Thrust runs exclusive scan, it takes a significant amount of time. For example, in the output above, it took 4.34893 ms, but the second time only took 0.048128 ms. Thrust must be instantiating something on the first run that it needs for exclusive scan. Those objects persist in memory throughout the lifetime of the program, so later calls to thrust::exclusive_scan do not need to instantiate them again and can run faster. diff --git a/builds/cis565_stream_compaction_test.VC.db b/builds/cis565_stream_compaction_test.VC.db new file mode 100644 index 0000000..f931000 Binary files /dev/null and b/builds/cis565_stream_compaction_test.VC.db differ diff --git a/img/compaction_performance.png b/img/compaction_performance.png new file mode 100644 index 0000000..5f248e9 Binary files /dev/null and b/img/compaction_performance.png differ diff --git a/img/performance.png b/img/performance.png new file mode 100644 index 0000000..3a038bf Binary files /dev/null and b/img/performance.png differ diff --git a/img/scan_performance.png b/img/scan_performance.png new file mode 100644 index 0000000..92e056c Binary files /dev/null and b/img/scan_performance.png differ diff --git a/img/work_efficient_down_sweep.PNG b/img/work_efficient_down_sweep.PNG new file mode 100644 index 0000000..5e51b55 Binary files /dev/null and b/img/work_efficient_down_sweep.PNG differ diff --git a/img/work_efficient_up_sweep.PNG b/img/work_efficient_up_sweep.PNG new file mode 100644 index 0000000..0dd559e Binary files /dev/null and b/img/work_efficient_up_sweep.PNG differ diff --git a/src/main.cpp b/src/main.cpp index 1850161..cf7b0db 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -88,6 +88,20 @@ int main(int argc, char* argv[]) { //printArray(SIZE, c, true); printCmpResult(SIZE, b, c); + zeroArray(SIZE, c); + printDesc("thrust scan, power-of-two"); + StreamCompaction::Thrust::scan(SIZE, c, a); + printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + //printArray(SIZE, c, true); + printCmpResult(SIZE, b, c); + + zeroArray(SIZE, c); + printDesc("thrust scan, non-power-of-two"); + StreamCompaction::Thrust::scan(NPOT, c, a); + printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + //printArray(NPOT, c, true); + printCmpResult(NPOT, b, c); + zeroArray(SIZE, c); 
printDesc("thrust scan, non-power-of-two"); StreamCompaction::Thrust::scan(NPOT, c, a); diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 8fc0211..45c8d90 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -24,6 +24,17 @@ namespace StreamCompaction { */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { // TODO + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + + if (index < n) { + int stop = idata[index]; + if (idata[index] == 0) { + bools[index] = 0; + } + else { + bools[index] = 1; + } + } } /** @@ -33,6 +44,13 @@ namespace StreamCompaction { __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { // TODO + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + + if (index < n) { + if (bools[index] == 1) { + odata[indices[index]] = idata[index]; + } + } } } diff --git a/stream_compaction/common.h b/stream_compaction/common.h index 99a1b04..2d7a781 100644 --- a/stream_compaction/common.h +++ b/stream_compaction/common.h @@ -13,6 +13,8 @@ #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__) #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__) +const int blockSize = 128; + /** * Check for CUDA errors; print and exit if there was a problem. */ diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 05ce667..fa6e3fa 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -1,50 +1,94 @@ -#include -#include "cpu.h" - +#include +#include "cpu.h" + #include "common.h" - -namespace StreamCompaction { - namespace CPU { + +namespace StreamCompaction { + namespace CPU { using StreamCompaction::Common::PerformanceTimer; PerformanceTimer& timer() { static PerformanceTimer timer; return timer; - } - - /** - * CPU scan (prefix sum). - * For performance analysis, this is supposed to be a simple for loop. - * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first. - */ - void scan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); - } - - /** - * CPU stream compaction without using the scan function. - * - * @returns the number of elements remaining after compaction. - */ - int compactWithoutScan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); - return -1; - } - - /** - * CPU stream compaction using scan and scatter, like the parallel version. - * - * @returns the number of elements remaining after compaction. - */ - int compactWithScan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); - return -1; - } - } -} + } + + /** + * CPU scan (prefix sum). + * For performance analysis, this is supposed to be a simple for loop. + * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first. + */ + void scan(int n, int *odata, const int *idata) { + timer().startCpuTimer(); + // TODO + int sum = 0; + for (unsigned int i = 0; i < n; i++) { + odata[i] = sum; + sum += idata[i]; + } + + timer().endCpuTimer(); + } + + /** + * CPU stream compaction without using the scan function. + * + * @returns the number of elements remaining after compaction. 
+ */ + int compactWithoutScan(int n, int *odata, const int *idata) { + timer().startCpuTimer(); + // TODO + int count = 0; + for (unsigned int i = 0; i < n; i++) { + if (idata[i] != 0) { + odata[count] = idata[i]; + count++; + } + } + + timer().endCpuTimer(); + return count; + } + + /** + * CPU stream compaction using scan and scatter, like the parallel version. + * + * @returns the number of elements remaining after compaction. + */ + int compactWithScan(int n, int *odata, const int *idata) { + timer().startCpuTimer(); + // TODO + int* map01 = new int[n]; + int* scanned = new int[n]; + for (unsigned int i = 0; i < n; i++) { + if (idata[i] == 0) { + map01[i] = 0; + } + else { + map01[i] = 1; + } + } + + int sum = 0; + for (unsigned int i = 0; i < n; i++) { + scanned[i] = sum; + sum += map01[i]; + } + + unsigned int count = 0; + for (unsigned int i = 0; i < n; i++) { + //if (scanned[i] != scanned[i - 1]) { + if (map01[i] == 1) { + odata[scanned[i]] = idata[i]; + count++; + } + } + + timer().endCpuTimer(); + + delete map01; + delete scanned; + + return count; + } + } +} diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 36c5ef2..94c1618 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -5,20 +5,108 @@ namespace StreamCompaction { namespace Efficient { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; + } + + // Kernel that does a up-sweep + __global__ void kernUpSweep(int n, int levelPowerOne, int levelPower, int *odata) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + + int divide = index / levelPowerOne; + if (index - (divide * levelPowerOne) == 0) { + odata[index + levelPowerOne - 1] += odata[index + levelPower - 1]; + } + + } + + // Kernel that does a down-sweep + __global__ void kernDownSweep(int n, int levelPowerPlusOne, int levelPower, int *odata) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + + int divide = index / levelPowerPlusOne; + if (index - (divide * levelPowerPlusOne) == 0) { + int temp = odata[index + levelPower - 1]; + odata[index + levelPower - 1] = odata[index + levelPowerPlusOne - 1]; + odata[index + levelPowerPlusOne - 1] += temp; + } + } + + /** + * Performs prefix-sum (aka scan) on idata, storing the result into odata. + */ + void scan(int n, int *odata, const int *idata) { + scan(n, odata, idata, true); } /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ - void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); + void scan(int n, int *odata, const int *idata, const bool time) { + + // Initialize blockSize and fullBlocksPerGrid + int blockSize = 128; + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + + // Initialize variables and device arrays + int totalLevels = ilog2ceil(n); + int arraySize = pow(2, totalLevels); // To handle non-power of two lengths + int *dev_array; + + // Allocate device array. 
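+ // NOTE: elements between n and arraySize are padding for the sweep tree and are assumed to be zero; a cudaMemset(dev_array, 0, arraySize * sizeof(int)) after the allocation below would guarantee this for non-power-of-two inputs.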
+ cudaMalloc((void**) &dev_array, arraySize * sizeof(int)); + checkCUDAError("cudaMalloc dev_array failed!"); + + // Copy input data into dev_array + cudaMemcpy(dev_array, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + + if (time) { + timer().startGpuTimer(); + } // TODO - timer().endGpuTimer(); + // Go through the levels for Up Sweep + for (int level = 0; level < totalLevels; level++) { + int levelPowerOne = pow(2, level + 1); + int levelPower = pow(2, level); + + // invoke kernel + kernUpSweep << < fullBlocksPerGrid, blockSize >> > (n, levelPowerOne, levelPower, dev_array); + + } + + // Copy values to a temporary array + int* temp_array = new int[arraySize]; + cudaMemcpy(temp_array, dev_array, sizeof(int) * arraySize, cudaMemcpyDeviceToHost); + + // Set the last element to zero + temp_array[arraySize - 1] = 0; + + // Copy array back to GPU + cudaMemcpy(dev_array, temp_array, sizeof(int) * arraySize, cudaMemcpyHostToDevice); + + // Go through the levels for Down Sweep + for (int level = totalLevels - 1; level >= 0; level--) { + int levelPowerPlusOne = pow(2, level + 1); + int levelPower = pow(2, level); + + // invoke kernel + kernDownSweep << < fullBlocksPerGrid, blockSize >> > (n, levelPowerPlusOne, levelPower, dev_array); + + } + + // Copy data from GPU to output array + cudaMemcpy(odata, dev_array, sizeof(int) * n, cudaMemcpyDeviceToHost); + + if (time) { + timer().endGpuTimer(); + } + + // Free memory + cudaFree(dev_array); + delete[] temp_array; } /** @@ -31,10 +119,72 @@ namespace StreamCompaction { * @returns The number of elements remaining after compaction. */ int compact(int n, int *odata, const int *idata) { + int blockSize = 256; + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + + // Device arrays + int *dev_inData; + int *dev_outData; + int *dev_bool; + int *dev_scan; + + // Allocate device arrays. + cudaMalloc((void**) &dev_inData, n * sizeof(int)); + checkCUDAError("cudaMalloc dev_inData failed!"); + + cudaMalloc((void**) &dev_outData, n * sizeof(int)); + checkCUDAError("cudaMalloc dev_outData failed!"); + + cudaMalloc((void**) &dev_bool, n * sizeof(int)); + checkCUDAError("cudaMalloc dev_bool failed!"); + + cudaMalloc((void**) &dev_scan, n * sizeof(int)); + checkCUDAError("cudaMalloc dev_scan failed!"); + timer().startGpuTimer(); // TODO + + // Map to booleans + cudaMemcpy(dev_inData, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + Common::kernMapToBoolean << < fullBlocksPerGrid, blockSize >> > (n, dev_bool, dev_inData); + + // Create host arrays that will be passed into scan + int *scan_inData = new int[n]; + int *scan_outData = new int[n]; + cudaMemcpy(scan_inData, dev_bool, sizeof(int) * n, cudaMemcpyDeviceToHost); + + bool lastOne = scan_inData[n - 1]; // Remember if last bool is a 1. Will be used later. 
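+ // The exclusive scan drops the last element's contribution, so the compacted count computed below must be incremented when the last boolean is a 1.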
+ + // Scan + scan(n, scan_outData, scan_inData, false); + + // Use result from scan to find how many elements are compacted + int count = scan_outData[n - 1]; + if (lastOne) { + count++; + } + + // Copy scan result to device + cudaMemcpy(dev_scan, scan_outData, sizeof(int) * n, cudaMemcpyHostToDevice); + + // Perform scatter + Common::kernScatter << < fullBlocksPerGrid, blockSize >> > (n, dev_outData, + dev_inData, dev_bool, dev_scan); + + // Copy result to CPU + cudaMemcpy(odata, dev_outData, sizeof(int) * n, cudaMemcpyDeviceToHost); + timer().endGpuTimer(); - return -1; + + // Free memory + cudaFree(dev_inData); + cudaFree(dev_bool); + cudaFree(dev_scan); + + delete scan_inData; + delete scan_outData; + + return count; } } } diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h index 803cb4f..277f0d7 100644 --- a/stream_compaction/efficient.h +++ b/stream_compaction/efficient.h @@ -8,6 +8,8 @@ namespace StreamCompaction { void scan(int n, int *odata, const int *idata); + void scan(int n, int *odata, const int *idata, const bool time); + int compact(int n, int *odata, const int *idata); } } diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 9218f8e..cd73645 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -3,23 +3,90 @@ #include "common.h" #include "naive.h" + namespace StreamCompaction { namespace Naive { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; } // TODO: __global__ + // Kernel that does a naive INCLUSIVE scan + __global__ void kernNaiveScan(int n, int levelPower, int *odata, const int *idata) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + + if (index >= n) { + return; + } + + if (index >= levelPower) { + odata[index] = idata[index - levelPower] + idata[index]; + } + else { + odata[index] = idata[index]; + } + } /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
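+ * Uses two device buffers (dev_write and dev_read) that are ping-ponged after each pass so a pass never reads and writes the same array.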
 */ void scan(int n, int *odata, const int *idata) { + + // Initialize blockSize and fullBlocksPerGrid + int blockSize = 256; + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + + // Initialize variables and device arrays + int totalLevels = ilog2ceil(n); + int *dev_write; + int *dev_read; + + // Allocate device arrays + cudaMalloc((void**) &dev_write, n * sizeof(int)); + checkCUDAError("cudaMalloc dev_write failed!"); + + cudaMalloc((void**) &dev_read, n * sizeof(int)); + checkCUDAError("cudaMalloc dev_read failed!"); + + // Copy input data into dev_read + cudaMemcpy(dev_read, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + + timer().startGpuTimer(); - // TODO + // TODO + + // Go through the levels of Naive scan + for (int level = 1; level <= totalLevels; level++) { + int levelPower = pow(2, level - 1); + + // invoke kernel + kernNaiveScan << < fullBlocksPerGrid, blockSize >> > (n, levelPower, dev_write, dev_read); + + // Ping-pong write and read arrays + int* temp = dev_write; + dev_write = dev_read; + dev_read = temp; + } + + // Copy final values into temporary array + int* tempArray = new int[n]; + cudaMemcpy(tempArray, dev_read, sizeof(int) * n, cudaMemcpyDeviceToHost); + + // Copy values from tempArray while shifting them right to convert the inclusive scan to an exclusive scan + odata[0] = 0; + for (int i = 1; i < n; i++) { + odata[i] = tempArray[i - 1]; + } + timer().endGpuTimer(); + + // Free memory + delete[] tempArray; + cudaFree(dev_write); + cudaFree(dev_read); + } } } diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index 36b732d..a8bde3a 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -8,20 +8,32 @@ namespace StreamCompaction { namespace Thrust { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; } /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + + // Copy the input into a host_vector, then mirror it in a device_vector + thrust::host_vector<int> dev_thrust_idataHost = thrust::host_vector<int>(idata, idata + n); + thrust::device_vector<int> dev_thrust_idata = dev_thrust_idataHost; + + thrust::host_vector<int> dev_thrust_odataHost = thrust::host_vector<int>(n, 0); + thrust::device_vector<int> dev_thrust_odata = dev_thrust_odataHost; + timer().startGpuTimer(); // TODO use `thrust::exclusive_scan` // example: for device_vectors dv_in and dv_out: // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + + thrust::exclusive_scan(dev_thrust_idata.begin(), dev_thrust_idata.end(), dev_thrust_odata.begin()); + + thrust::copy(dev_thrust_odata.begin(), dev_thrust_odata.end(), odata); + timer().endGpuTimer(); } }