diff --git a/README.md b/README.md
index 0e38ddb..d611e93 100644
--- a/README.md
+++ b/README.md
@@ -3,12 +3,244 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Keyi Yu
+  * [LinkedIn](https://www.linkedin.com/in/keyi-linda-yu-8b1178137/)
+* Tested on: Windows 10, i7-10750H @ 2.60GHz 16GB, GeForce RTX 2070 with Max-Q 8228MB (My Machine)
 
-### (TODO: Your README)
+Project 2: Stream Compaction
+=====================================
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+In this project, I implemented scan (prefix sum) on both the CPU and the GPU, plus GPU stream compaction in CUDA, from scratch. I then tested the implementations on both power-of-two-sized and non-power-of-two-sized arrays. These algorithms are widely used and will be important for accelerating the path tracer later on.
+
+![](images/1.gif)
+
+Contents
+-------------------------------------
+- [Introduction](#introduction)
+- [Scan](#scan)
+  - [Method1: CPU](#method1-cpu)
+  - [Method2: Naive GPU](#method2-naive-gpu)
+  - [Method3: Work Efficient GPU](#method3-work-efficient-gpu)
+- [Stream Compaction](#stream-compaction)
+  - [Method1: CPU](#method1-cpu-1)
+  - [Method2: Work-efficient GPU](#method2-work-efficient-gpu)
+- [Performance Analysis and Questions](#performance-analysis-and-questions)
+- [Debug](#debug)
+- [Extra Credit](#extra-credit)
+- [Reference](#reference)
+
+Introduction
+-------------------------------------
+A simple and common parallel algorithm building block is the all-prefix-sums operation, commonly known as scan. Scan has many uses, including sorting, lexical analysis, and string comparison (see [NVIDIA GPU Gems 3](https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html)). In this project, I implemented stream compaction on top of the scan operation.
+
+Scan
+-------------------------------------
+Scan comes in two flavors: exclusive and inclusive. An exclusive scan can be generated from an inclusive scan by shifting the array right by one element and inserting the identity. In this project, I implemented exclusive scan.
+
+### Method1: CPU
+The CPU implementation is straightforward: a for loop iterates over the array and adds each entry to the running sum. prefixSum[i] holds the sum of the first i entries, so prefixSum[0] = 0 and the result is an exclusive scan. The pseudocode is:
+```
+prefixSum[0] = 0
+for each i in [1, n - 1]:
+    prefixSum[i] = arr[i - 1] + prefixSum[i - 1]
+```
+
+### Method2: Naive GPU
+This method is similar to parallel reduction. The algorithm is based on the scan presented by Hillis and Steele (1986) and demonstrated for GPUs by Horn (2005). The figures below show the procedure.
+
+Note that this method performs O(n log n) addition operations, while Method1 performs only O(n); the naive scan is therefore not work-efficient.
+
+Another problem with the pseudocode shown in the picture is that the in-place algorithm breaks down once the array is larger than a single warp (32 threads on G80), because threads in different warps are not synchronized. So we have to use two arrays: an input array and an output array. At each iteration, we perform the additions and store the results in the output array; in the next iteration, the previous output array becomes the input array, and so on. The pseudocode in [Example 2](img/example-2.jpg) and the sketch below show this idea.
+
+![](img/naivescan.png)
+![naivescan](img/example-2.jpg)
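+
+For reference, a minimal sketch of one such ping-pong step (the kernel name and signature here are illustrative; the actual kernel `naiveScanParallel` in `stream_compaction/naive.cu` later in this diff follows the same pattern):
+```cuda
+// One Hillis-Steele scan step: element i adds in the value `offset` slots to
+// its left; elements with i < offset are copied through unchanged.
+__global__ void naiveScanStep(int n, int offset, const int* in, int* out) {
+    int i = blockIdx.x * blockDim.x + threadIdx.x;
+    if (i >= n) return;
+    out[i] = (i >= offset) ? in[i - offset] + in[i] : in[i];
+}
+// Host side: for d = 1 .. ceil(log2 n), launch with offset = 1 << (d - 1),
+// swapping the roles of `in` and `out` between iterations.
+```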
+
+### Method3: Work Efficient GPU
+This algorithm is based on the one presented by Blelloch (1990). The idea is to treat the array as a balanced binary tree and sweep it up to and back down from the root to compute the prefix sum, as sketched after the figures below.
+
+Step1: Up Sweep
+![](img/upsweep.png)
+
+Step2: Down Sweep
+![](img/downsweep.png)
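+
+A minimal sketch of the two sweeps, in one common formulation (illustrative names; the project's kernels in `stream_compaction/efficient.cu` implement the same sweeps, first with a modulo test over all indices and then with the compact thread mapping described under Extra Credit). The array is zero-padded to a power-of-two length `num`:
+```cuda
+// Up-sweep level: fold each left child into its parent.
+__global__ void upSweep(int num, int stride, int* data) {
+    int k = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
+    if (k >= num) return;
+    data[k + stride - 1] += data[k + stride / 2 - 1];
+}
+
+// Down-sweep level: push partial sums back down the tree.
+__global__ void downSweep(int num, int stride, int* data) {
+    int k = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
+    if (k >= num) return;
+    int t = data[k + stride / 2 - 1];                 // save left child
+    data[k + stride / 2 - 1] = data[k + stride - 1];  // left child <- parent
+    data[k + stride - 1] += t;                        // parent <- parent + old left child
+}
+// Host side: run upSweep for stride = 2, 4, ..., num; zero data[num - 1];
+// then run downSweep for stride = num, num / 2, ..., 2, launching
+// num / stride threads each time.
+```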
+
+Stream Compaction
+-------------------------------------
+### Method1: CPU
+This method iterates over the array and keeps only the entries we want:
+```
+for each entry arr[i] in arr, i in [0, n - 1]:
+    if arr[i] != 0:
+        append it to the result array
+    else:
+        do nothing
+```
+### Method2: Work-efficient GPU
+With the work-efficient scan implemented above, this algorithm is easy to build. It has three steps:
+
+1. In parallel, determine whether each entry is zero or not:
+```
+if entry arr[i] == 0:
+    map[i] = 0;
+else:
+    map[i] = 1;
+```
+
+2. Run the work-efficient exclusive scan on the map array from step 1.
+3. In parallel, scatter: write each non-zero entry arr[i] to the output array at the index given by the scan result scan[i].
+
+## Performance Analysis and Questions
### Roughly optimize the block sizes of each of your implementations for minimal run time on your GPU.
I varied the block size with the array size fixed at 2^20. The figure below shows the performance of the three algorithms at different block sizes. All three methods perform best with block sizes of 128, 256, and 512, so I chose 128 as the block size for the following analysis.

![](img/performance-vs-blocksize.png)

### Compare all of these GPU Scan implementations (Naive, Work-Efficient, and Thrust) to the serial CPU version of Scan. Plot a graph of the comparison (with array size on the independent axis).
Here are some graphs and program output showing the performance of the four methods mentioned above.
1. Performance vs Array size (Power-of-Two)

Large range | Small range
:-------------------------:|:-------------------------:
![](img/performance-vs-size-power.png) | ![](img/performance-smaller.png)

2. Performance vs Array size (Non-Power-of-Two)

Large range | Small range
:-------------------------:|:-------------------------:
![](img/performance-vs-size-non-power.png) | ![](img/performance-non-smaller.png)

Array size greatly affects the performance of all four algorithms. When the data size is small (e.g., 256 or 65536), the CPU outperforms the GPU implementations. However, as the array size increases, the GPU versions outperform the CPU by up to 10x.

Nsight's performance analysis shows that host-to-device and device-to-host memory transfers still take place. As the array size increases, the CPU takes much longer to compute, so those transfer times become negligible by comparison. There is also a `cudaEventCreate` call, which costs some time. I believe these overheads are the bottleneck of the GPU algorithms, even for the Thrust implementation.

3. POT vs NPOT

![](img/pot-vs-npot.png)

The figure shows the performance of thrust::exclusive_scan on a Power-of-Two (POT) array and a Non-Power-of-Two (NPOT) array. The result is expected, because we need to pad the NPOT array with zeros before the up-sweep, which costs extra time and memory. I suspect the performance is so close because the NPOT array is only slightly smaller than the POT array.

4. Output in CMD
```
****************
** SCAN TESTS **
****************
    [ 0 30 45 48 26 27 21 49 46 22 34 10 47 ... 17 0 ]
==== cpu scan, power-of-two ====
   elapsed time: 4.3786ms (std::chrono Measured)
    [ 0 0 30 75 123 149 176 197 246 292 314 348 358 ... 25686177 25686194 ]
==== cpu scan, non-power-of-two ====
   elapsed time: 1.4756ms (std::chrono Measured)
    [ 0 0 30 75 123 149 176 197 246 292 314 348 358 ... 25686095 25686126 ]
    passed
==== naive scan, power-of-two ====
   elapsed time: 0.7624ms (CUDA Measured)
    passed
==== naive scan, non-power-of-two ====
   elapsed time: 0.76464ms (CUDA Measured)
    passed
==== work-efficient scan, power-of-two ====
   elapsed time: 1.03267ms (CUDA Measured)
    passed
==== work-efficient scan, non-power-of-two ====
   elapsed time: 1.03699ms (CUDA Measured)
    passed
==== thrust scan, power-of-two ====
   elapsed time: 0.196608ms (CUDA Measured)
    passed
==== thrust scan, non-power-of-two ====
   elapsed time: 0.211552ms (CUDA Measured)
    passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
    [ 2 2 1 0 0 1 1 3 0 0 2 0 1 ... 3 0 ]
==== cpu compact without scan, power-of-two ====
   elapsed time: 2.3306ms (std::chrono Measured)
    [ 2 2 1 1 1 3 2 1 1 2 3 2 2 ... 1 3 ]
    passed
==== cpu compact without scan, non-power-of-two ====
   elapsed time: 2.3035ms (std::chrono Measured)
    [ 2 2 1 1 1 3 2 1 1 2 3 2 2 ... 1 1 ]
    passed
==== cpu compact with scan ====
   elapsed time: 9.6631ms (std::chrono Measured)
    [ 2 2 1 1 1 3 2 1 1 2 3 2 2 ... 1 3 ]
    passed
==== work-efficient compact, power-of-two ====
   elapsed time: 1.14995ms (CUDA Measured)
    passed
==== work-efficient compact, non-power-of-two ====
   elapsed time: 1.15507ms (CUDA Measured)
    passed
```

## Debug
1. Naive Scan

- I ping-pong two buffers, with arr1 as the input and arr2 as the output. Copying data back between them each iteration was inefficient, so I use a flag to track which buffer is the input at each iteration instead.
- I need to determine which buffer holds the final result. At first I assumed it would always be dev_arr2, but depending on the array size, either dev_arr1 or dev_arr2 can hold the answer.

2. Efficient scan: my program worked with array sizes < 16 but crashed with sizes >= 16 with an illegal memory access error.
- Bug 1: I forgot to free the device buffers.
- Bug 2 (main problem): At first I tried to avoid modulo operations when implementing kernUpSweep and kernDownSweep, but that implementation had problems, so I went back to using a mod operation to decide whether the current thread should do the computation.

## Extra Credit
1. Improve performance of work-efficient scan

   The work-efficient scan performs worse than the naive scan because more and more threads go idle after the first few iterations, so we should launch only as many threads as actually do work at each iteration.

   This requires a mapping from thread ID to array index (see the sketch below):
   ```
   compute the number of elements touched at this iteration, the stride, and the start index
   threadId runs from 0 to numberTouchedAtThisIteration
   array index = threadId * strideAtThisIteration + startIndexAtThisIteration
   ```
   The number of threads launched at each iteration drops sharply, and the performance is much better than the previous version. But there is still a gap between the optimized method and Thrust's exclusive scan.
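A concrete sketch of this mapping for the up-sweep (illustrative names; `kernUpSweepOptimized` in `stream_compaction/efficient.cu` below uses the same index math):
```cuda
// At iteration d (1-based), only num >> d threads are launched; thread t is
// mapped directly to the tree node it updates, so no thread is idle.
__global__ void upSweepCompact(int threadCount, int stride, int start, int* data) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= threadCount) return;
    int i = start + t * stride;        // start = stride - 1, stride = 1 << d
    data[i] += data[i - stride / 2];   // fold in the left sibling
}
```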
I think I could close the remaining gap by applying the optimizations mentioned in the CUDA Performance lecture, such as using shared memory; the sketch after the output below illustrates that idea.
```
****************
** SCAN TESTS **
****************
    [ 32 37 45 13 5 4 16 28 34 3 47 5 21 ... 45 0 ]
==== cpu scan, power-of-two ====
   elapsed time: 65.1664ms (std::chrono Measured)
    [ 0 32 69 114 127 132 136 152 180 214 217 264 269 ... 410887714 410887759 ]
==== cpu scan, non-power-of-two ====
   elapsed time: 24.5645ms (std::chrono Measured)
    [ 0 32 69 114 127 132 136 152 180 214 217 264 269 ... 410887594 410887640 ]
    passed
==== naive scan, power-of-two ====
   elapsed time: 13.1808ms (CUDA Measured)
    passed
==== naive scan, non-power-of-two ====
   elapsed time: 13.1909ms (CUDA Measured)
    passed
==== work-efficient scan, power-of-two ====
   elapsed time: 17.7448ms (CUDA Measured)
    passed
==== work-efficient scan, non-power-of-two ====
   elapsed time: 17.7584ms (CUDA Measured)
    passed
==== thrust scan, power-of-two ====
   elapsed time: 0.685632ms (CUDA Measured)
    passed
==== thrust scan, non-power-of-two ====
   elapsed time: 0.69632ms (CUDA Measured)
    passed
==== work-efficient scan optimized, power-of-two ====
   elapsed time: 5.94595ms (CUDA Measured)
    passed
==== work-efficient scan optimized, non-power-of-two ====
   elapsed time: 5.95277ms (CUDA Measured)
    passed
```
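For reference, a minimal sketch of the shared-memory idea mentioned above, in the style of GPU Gems 3, ch. 39. This is not implemented in this project; it scans a single block of power-of-two size `n`, with one thread per two elements:
```cuda
// Per-block exclusive (Blelloch) scan in shared memory.
__global__ void blockScan(int n, const int* in, int* out) {
    extern __shared__ int temp[];
    int tid = threadIdx.x;
    temp[2 * tid]     = in[2 * tid];      // each thread loads two elements
    temp[2 * tid + 1] = in[2 * tid + 1];
    int offset = 1;
    for (int d = n >> 1; d > 0; d >>= 1) {        // up-sweep
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset <<= 1;
    }
    if (tid == 0) temp[n - 1] = 0;                // clear the root
    for (int d = 1; d < n; d <<= 1) {             // down-sweep
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            int t = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();
    out[2 * tid]     = temp[2 * tid];
    out[2 * tid + 1] = temp[2 * tid + 1];
}
// Launch: blockScan<<<1, n / 2, n * sizeof(int)>>>(n, dev_in, dev_out);
```
A full-array version would scan each block this way, scan the per-block sums, and add those sums back in a final pass (not shown).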
## Reference

1. [Parallel Algorithms Slides](https://onedrive.live.com/view.aspx?resid=A6B78147D66DD722!95162&ithint=file%2cpptx&authkey=!AFs_WCO3UbQLC40)
2. [NVIDIA GPU Gems 3](https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html)
diff --git a/img/downsweep.png b/img/downsweep.png
new file mode 100644
index 0000000..a1944c4
Binary files /dev/null and b/img/downsweep.png differ
diff --git a/img/naivescan.png b/img/naivescan.png
new file mode 100644
index 0000000..f45f2d0
Binary files /dev/null and b/img/naivescan.png differ
diff --git a/img/performance-non-smaller.png b/img/performance-non-smaller.png
new file mode 100644
index 0000000..d97b0bc
Binary files /dev/null and b/img/performance-non-smaller.png differ
diff --git a/img/performance-smaller.png b/img/performance-smaller.png
new file mode 100644
index 0000000..5469ae1
Binary files /dev/null and b/img/performance-smaller.png differ
diff --git a/img/performance-vs-blocksize.png b/img/performance-vs-blocksize.png
new file mode 100644
index 0000000..c04088b
Binary files /dev/null and b/img/performance-vs-blocksize.png differ
diff --git a/img/performance-vs-size-non-power.png b/img/performance-vs-size-non-power.png
new file mode 100644
index 0000000..b69b111
Binary files /dev/null and b/img/performance-vs-size-non-power.png differ
diff --git a/img/performance-vs-size-power.png b/img/performance-vs-size-power.png
new file mode 100644
index 0000000..34a6ce4
Binary files /dev/null and b/img/performance-vs-size-power.png differ
diff --git a/img/pot-vs-npot.png b/img/pot-vs-npot.png
new file mode 100644
index 0000000..513d484
Binary files /dev/null and b/img/pot-vs-npot.png differ
diff --git a/img/upsweep.png b/img/upsweep.png
new file mode 100644
index 0000000..d92cba8
Binary files /dev/null and b/img/upsweep.png differ
diff --git a/src/main.cpp b/src/main.cpp
index 896ac2b..e69ab2e 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -13,7 +13,7 @@
 #include <stream_compaction/thrust.h>
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 24; // feel free to change the size of array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];
@@ -47,6 +47,7 @@ int main(int argc, char* argv[]) {
     printArray(NPOT, b, true);
     printCmpResult(NPOT, b, c);
 
+    zeroArray(SIZE, c);
     printDesc("naive scan, power-of-two");
     StreamCompaction::Naive::scan(SIZE, c, a);
@@ -95,6 +96,22 @@ int main(int argc, char* argv[]) {
     //printArray(NPOT, c, true);
     printCmpResult(NPOT, b, c);
 
+    zeroArray(SIZE, c);
+    printDesc("work-efficient scan optimized, power-of-two");
+    StreamCompaction::Efficient::scanOptimized(SIZE, c, a);
+    printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    //printArray(SIZE, c, true);
+    printCmpResult(SIZE, b, c);
+
+    zeroArray(SIZE, c);
+    printDesc("work-efficient scan optimized, non-power-of-two");
+    StreamCompaction::Efficient::scanOptimized(NPOT, c, a);
+    printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    //printArray(NPOT, c, true);
+    printCmpResult(NPOT, b, c);
+
     printf("\n");
     printf("*****************************\n");
     printf("** STREAM COMPACTION TESTS **\n");
diff --git a/stream_compaction/common.h b/stream_compaction/common.h
index d2c1fed..eb1daf2 100644
--- a/stream_compaction/common.h
+++ b/stream_compaction/common.h
@@ -12,6 +12,7 @@
 #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
 #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
+#define blockSize 128
 
 /**
  * Check for CUDA errors; print and exit if there was a problem.
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index 719fa11..a316eaf 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -20,6 +20,10 @@ namespace StreamCompaction {
         void scan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
             // TODO
+            // Exclusive scan: odata[i] = sum of idata[0 .. i-1].
+            odata[0] = 0;
+            for (int i = 1; i < n; i++) {
+                odata[i] = idata[i - 1] + odata[i - 1];
+            }
             timer().endCpuTimer();
         }
 
@@ -31,8 +35,14 @@ namespace StreamCompaction {
         int compactWithoutScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
             // TODO
+            int index = 0;
+            for (int i = 0; i < n; i++) {
+                if (idata[i] != 0) {
+                    odata[index++] = idata[i];
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+            return index;
         }
 
         /**
@@ -43,8 +53,32 @@ namespace StreamCompaction {
         int compactWithScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
             // TODO
+            int* mapdata = new int[n];
+            int* scanned = new int[n];
+            int count = 0;
+
+            // Map each element to 0/1 depending on whether it is non-zero.
+            for (int i = 0; i < n; i++) {
+                if (idata[i] != 0) {
+                    mapdata[i] = 1;
+                }
+                else {
+                    mapdata[i] = 0;
+                }
+            }
+            // Exclusive scan of the map gives each kept element its output index.
+            scanned[0] = 0;
+            for (int i = 1; i < n; i++) {
+                scanned[i] = scanned[i - 1] + mapdata[i - 1];
+            }
+            // Scatter the non-zero elements to their scanned indices.
+            for (int i = 0; i < n; i++) {
+                if (mapdata[i] == 1) {
+                    odata[scanned[i]] = idata[i];
+                    count++;
+                }
+            }
+            delete[] mapdata;   // fix: these two buffers were leaked before
+            delete[] scanned;
+
             timer().endCpuTimer();
-            return -1;
+            return count;
         }
     }
 }
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 2db346e..37d2dfb 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -3,24 +3,240 @@
 #include "common.h"
 #include "efficient.h"
 
+
 namespace StreamCompaction {
     namespace Efficient {
         using StreamCompaction::Common::PerformanceTimer;
+        int* dev_extend;
+        int* dev_map;
+        int* dev_scan;
+        int* dev_scatter;
+        int* dev_data;
+
         PerformanceTimer& timer()
         {
             static PerformanceTimer timer;
             return timer;
         }
+
+        // One up-sweep level: every element whose (index + 1) is a multiple
+        // of 2^(d+1) accumulates the partial sum 2^d slots to its left.
+        __global__ void kernUpSweep(int n, int d, int* data) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= n) {
+                return;
+            }
+            int power = 1 << (d + 1);
+            int step = 1 << d;
+            if (index != 0 && (index + 1) % power == 0) {
+                data[index] += data[index - step];
+            }
+        }
+
+        // Optimized up-sweep: only the threads that do work are launched,
+        // and thread k maps directly to its tree node (see Extra Credit).
+        __global__ void kernUpSweepOptimized(int n, int stride, int* data, int start) {
+            int k = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (k >= n) {
+                return;
+            }
+            int index = k * stride + start;
+            data[index] += data[index - stride / 2];
+        }
+
+        // One down-sweep level: at nodes whose (index + 1) is a multiple of
+        // 2^(d+1), swap the left child up and push the sum down.
+        __global__ void kernDownSweep(int n, int d, int* data) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= n) {
+                return;
+            }
+            int power = 1 << (d + 1);
+            int step = power >> 1;
+            if (index != 0 && (index + 1) % power == 0) {
+                int t = data[index - step];
+                data[index - step] = data[index];
+                data[index] += t;
+            }
+        }
+
+        // Optimized down-sweep: with index = start + k * stride and
+        // start = stride - 1, (index + 1) is always a multiple of stride,
+        // so every launched thread does work and no modulo test is needed.
+        __global__ void kernDownSweepOptimized(int n, int stride, int* data, int start) {
+            int k = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (k >= n) {
+                return;
+            }
+            int index = start + k * stride;
+            int t = data[index - stride / 2];
+            data[index - stride / 2] = data[index];
+            data[index] += t;
+        }
+
+        // Copy idata into a zero-padded array of length extendNum.
+        __global__ void kernExtendArr(int extendNum, int n, int* idata, int* odata) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= extendNum) {
+                return;
+            }
+            if (index >= n) {
+                odata[index] = 0;
+            }
+            else {
+                odata[index] = idata[index];
+            }
+        }
+
+        // Map each element to 0/1 depending on whether it is non-zero.
+        __global__ void kernMap(int n, int* idata, int* odata) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= n) {
+                return;
+            }
+            odata[index] = idata[index] == 0 ? 0 : 1;
+        }
+
+        // Write `value` into data[n]; only the single thread with index == n
+        // writes (used to zero the root before the down-sweep).
+        __global__ void kernSetValue(int n, int value, int* data) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index != n) {
+                return;
+            }
+            data[index] = value;
+        }
+
+        // Same purpose for the optimized path: writes data[stride - 1].
+        __global__ void kernSetValueOptimized(int n, int value, int* data, int stride) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= n) {
+                return;
+            }
+            data[index + stride - 1] = value;
+        }
+
+        // Scatter: each non-zero element lands at the index its scan gives it.
+        __global__ void kernScatter(int n, int* idata, int* scan, int* odata) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= n) {
+                return;
+            }
+            if (idata[index] != 0) {
+                odata[scan[index]] = idata[index];
+            }
+        }
+
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
         void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
             // TODO
+            // Pad non-power-of-two input up to the next power of two.
+            int ceil = ilog2ceil(n);
+            int num = 1 << ceil;
+
+            dim3 threadsPerBlock(blockSize);
+            // The grid must cover the padded length num (not n), otherwise
+            // the tail of the extended array never gets a thread.
+            dim3 fullBlocksPerGrid((num + blockSize - 1) / blockSize);
+
+            cudaMalloc((void**)&dev_extend, num * sizeof(int));
+            checkCUDAError("cudaMalloc dev_extend failed!");
+
+            cudaMalloc((void**)&dev_data, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_data failed!");
+
+            cudaMemcpy(dev_data, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+            timer().startGpuTimer();
+
+            kernExtendArr<<<fullBlocksPerGrid, threadsPerBlock>>>(num, n, dev_data, dev_extend);
+
+            // Up-sweep levels d = 0 .. ceil-1 (a d == ceil pass touches nothing).
+            for (int d = 0; d < ceil; d++) {
+                kernUpSweep<<<fullBlocksPerGrid, threadsPerBlock>>>(num, d, dev_extend);
+            }
+
+            // Zero the root before sweeping back down.
+            kernSetValue<<<fullBlocksPerGrid, threadsPerBlock>>>(num - 1, 0, dev_extend);
+
+            for (int d = ceil - 1; d >= 0; d--) {
+                kernDownSweep<<<fullBlocksPerGrid, threadsPerBlock>>>(num, d, dev_extend);
+            }
             timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_extend, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+            cudaFree(dev_extend);
+            cudaFree(dev_data);
         }
 
+        void scanOptimized(int n, int* odata, const int* idata) {
+            dim3 threadsPerBlock(blockSize);
+
+            // Pad non-power-of-two input up to the next power of two.
+            int ceil = ilog2ceil(n);
+            int num = 1 << ceil;
+
+            cudaMalloc((void**)&dev_extend, num * sizeof(int));
+            checkCUDAError("cudaMalloc dev_extend failed!");
+
+            cudaMalloc((void**)&dev_data, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_data failed!");
+
+            cudaMemcpy(dev_data, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+            dim3 fullBlocksPerGrid((num + blockSize - 1) / blockSize);
+
+            timer().startGpuTimer();
+
+            kernExtendArr<<<fullBlocksPerGrid, threadsPerBlock>>>(num, n, dev_data, dev_extend);
+
+            // Launch exactly the number of threads that do work at each level.
+            for (int d = 1; d <= ceil; d++) {
+                int threadNum = 1 << (ceil - d);
+                int stride = 1 << d;
+                int start = stride - 1;
+                fullBlocksPerGrid = (threadNum + blockSize - 1) / blockSize;
+
+                kernUpSweepOptimized<<<fullBlocksPerGrid, threadsPerBlock>>>(threadNum, stride, dev_extend, start);
+            }
+
+            // Zero the root (writes dev_extend[num - 1]).
+            kernSetValueOptimized<<<1, threadsPerBlock>>>(1, 0, dev_extend, num);
+
+            for (int d = ceil - 1; d >= 0; d--) {
+                int threadNum = 1 << (ceil - d - 1);
+                int stride = 1 << (d + 1);
+                int start = stride - 1;
+                fullBlocksPerGrid = (threadNum + blockSize - 1) / blockSize;
+
+                kernDownSweepOptimized<<<fullBlocksPerGrid, threadsPerBlock>>>(threadNum, stride, dev_extend, start);
+            }
+            timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_extend, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+            cudaFree(dev_extend);
+            cudaFree(dev_data);
+        }
+
         /**
          * Performs stream compaction on idata, storing the result into odata.
          * All zeroes are discarded.
@@ -31,10 +247,78 @@ namespace StreamCompaction {
          * @returns The number of elements remaining after compaction.
          */
         int compact(int n, int *odata, const int *idata) {
+            // Pad non-power-of-two input up to the next power of two.
+            int ceil = ilog2ceil(n);
+            int num = 1 << ceil;
+
+            int* host_scan = new int[num];
+
+            dim3 threadsPerBlock(blockSize);
+            // Cover the padded length num, not n.
+            dim3 fullBlocksPerGrid((num + blockSize - 1) / blockSize);
+
+            cudaMalloc((void**)&dev_scan, num * sizeof(int));
+            checkCUDAError("cudaMalloc dev_scan failed!");
+
+            cudaMalloc((void**)&dev_scatter, num * sizeof(int));
+            checkCUDAError("cudaMalloc dev_scatter failed!");
+
+            cudaMalloc((void**)&dev_data, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_data failed!");
+
+            cudaMalloc((void**)&dev_extend, num * sizeof(int));
+            checkCUDAError("cudaMalloc dev_extend failed!");
+
+            cudaMemcpy(dev_data, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
             timer().startGpuTimer();
             // TODO
+            // Extend non-power-of-two input with zeros.
+            kernExtendArr<<<fullBlocksPerGrid, threadsPerBlock>>>(num, n, dev_data, dev_extend);
+
+            // Map to 0/1.
+            kernMap<<<fullBlocksPerGrid, threadsPerBlock>>>(num, dev_extend, dev_scan);
+
+            // Exclusive scan of the map.
+            for (int d = 0; d < ceil; d++) {
+                kernUpSweep<<<fullBlocksPerGrid, threadsPerBlock>>>(num, d, dev_scan);
+            }
+
+            kernSetValue<<<fullBlocksPerGrid, threadsPerBlock>>>(num - 1, 0, dev_scan);
+
+            for (int d = ceil - 1; d >= 0; d--) {
+                kernDownSweep<<<fullBlocksPerGrid, threadsPerBlock>>>(num, d, dev_scan);
+            }
+            // Scatter the non-zero elements to their scanned indices.
+            kernScatter<<<fullBlocksPerGrid, threadsPerBlock>>>(num, dev_extend, dev_scan, dev_scatter);
             timer().endGpuTimer();
-            return -1;
+
+            cudaMemcpy(odata, dev_scatter, n * sizeof(int), cudaMemcpyDeviceToHost);
+            cudaMemcpy(host_scan, dev_scan, num * sizeof(int), cudaMemcpyDeviceToHost);
+
+            cudaFree(dev_extend);
+            cudaFree(dev_data);
+            cudaFree(dev_scan);
+            cudaFree(dev_scatter);
+
+            // host_scan[i] is the exclusive count of non-zero entries before i.
+            // For power-of-two n, host_scan[n - 1] misses the last element, so
+            // add it back if it is non-zero; for non-power-of-two n,
+            // host_scan[n] (inside the zero padding) already holds the total.
+            int count = host_scan[n - 1] + (idata[n - 1] != 0 ? 1 : 0);
+            if ((1 << ceil) != n) {
+                count = host_scan[n];
+            }
+            delete[] host_scan;
+
+            return count;
         }
     }
 }
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
index 803cb4f..d1db131 100644
--- a/stream_compaction/efficient.h
+++ b/stream_compaction/efficient.h
@@ -8,6 +8,8 @@ namespace StreamCompaction {
         void scan(int n, int *odata, const int *idata);
 
+        void scanOptimized(int n, int* odata, const int* idata);
+
         int compact(int n, int *odata, const int *idata);
     }
 }
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 4308876..f00cfd4 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -3,6 +3,7 @@
 #include "common.h"
 #include "naive.h"
 
+
 namespace StreamCompaction {
     namespace Naive {
         using StreamCompaction::Common::PerformanceTimer;
@@ -11,15 +12,106 @@ namespace StreamCompaction {
             static PerformanceTimer timer;
             return timer;
         }
+
+        // One naive (Hillis-Steele) step: each element with index >= power
+        // adds in the element `power` slots to its left; the rest copy through.
+        __global__ void naiveScanParallel(int n, int power, int* idata, int* odata) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= n) {
+                return;
+            }
+            if (index >= power) {
+                odata[index] = idata[index - power] + idata[index];
+            }
+            else {
+                odata[index] = idata[index];
+            }
+        }
+
+        // Convert an inclusive scan to an exclusive scan: shift right by one
+        // and insert the identity (0) at the front.
+        __global__ void convert(int n, int* idata, int* odata) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= n) {
+                return;
+            }
+            if (index == 0) {
+                odata[index] = 0;
+            }
+            else {
+                odata[index] = idata[index - 1];
+            }
+        }
+
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
-        void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
+        void scan(int n, int* odata, const int* idata) {
             // TODO
+            dim3 threadsPerBlock(blockSize);
+            dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+            int* dev_arr1;
+            int* dev_arr2;
+            // direction == 1 means dev_arr1 is the current input buffer;
+            // direction == -1 means dev_arr2 is.
+            int direction = 1;
+
+            cudaMalloc((void**)&dev_arr1, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_arr1 failed!");
+            cudaMalloc((void**)&dev_arr2, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_arr2 failed!");
+
+            cudaMemcpy(dev_arr1, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            timer().startGpuTimer();
+
+            int ceil = ilog2ceil(n);
+            for (int d = 1; d <= ceil; d++) {
+                int power = 1 << (d - 1);
+                if (direction == 1) {
+                    naiveScanParallel<<<fullBlocksPerGrid, threadsPerBlock>>>(n, power, dev_arr1, dev_arr2);
+                }
+                else {
+                    naiveScanParallel<<<fullBlocksPerGrid, threadsPerBlock>>>(n, power, dev_arr2, dev_arr1);
+                }
+                direction *= -1;  // ping-pong the buffers
+            }
+            // The inclusive result is in dev_arr1 if direction == 1, else in
+            // dev_arr2; convert it to an exclusive scan in the other buffer.
+            if (direction == 1) {
+                convert<<<fullBlocksPerGrid, threadsPerBlock>>>(n, dev_arr1, dev_arr2);
+            }
+            else {
+                convert<<<fullBlocksPerGrid, threadsPerBlock>>>(n, dev_arr2, dev_arr1);
+            }
             timer().endGpuTimer();
+
+            if (direction == 1) {
+                cudaMemcpy(odata, dev_arr2, n * sizeof(int), cudaMemcpyDeviceToHost);
+            }
+            else {
+                cudaMemcpy(odata, dev_arr1, n * sizeof(int), cudaMemcpyDeviceToHost);
+            }
+
+            cudaFree(dev_arr1);
+            cudaFree(dev_arr2);
         }
     }
 }
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e..cc0b964 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -14,15 +14,38 @@ namespace StreamCompaction {
             static PerformanceTimer timer;
             return timer;
         }
+
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
         void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
             // TODO use `thrust::exclusive_scan`
             // example: for device_vectors dv_in and dv_out:
             // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+
+            int* dev_idata, * dev_odata;
+            cudaMalloc((void**)&dev_idata, n * sizeof(int));
+            cudaMalloc((void**)&dev_odata, n * sizeof(int));
+
+            cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+            // Wrap the raw device pointers so Thrust runs the scan on the GPU
+            // instead of falling back to the host path.
+            thrust::device_ptr<int> dev_thrust_odata(dev_odata);
+            thrust::device_ptr<int> dev_thrust_idata(dev_idata);
+
+            timer().startGpuTimer();
+            thrust::exclusive_scan(dev_thrust_idata, dev_thrust_idata + n, dev_thrust_odata);
+            timer().endGpuTimer();
+
+            cudaMemcpy(odata, thrust::raw_pointer_cast(dev_thrust_odata), n * sizeof(int), cudaMemcpyDeviceToHost);
+
+            // Fix: these buffers were leaked before.
+            cudaFree(dev_idata);
+            cudaFree(dev_odata);
         }
     }
 }