diff --git a/README.md b/README.md
index 0e38ddb..94fda1f 100644
--- a/README.md
+++ b/README.md
@@ -3,12 +3,149 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Janine Liu
+  * [LinkedIn](https://www.linkedin.com/in/liujanine/), [personal website](https://www.janineliu.com/).
+* Tested on: Windows 10, i7-10750H CPU @ 2.60GHz, 16GB, GeForce RTX 2070 8192 MB (personal computer)
 
-### (TODO: Your README)
+### GPU Stream Compaction
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+This project involved implementing the Scan and Compact algorithms that will be used in later projects, and comparing their performance on both the GPU and the CPU across different array sizes. In detail, this project includes:
+* a CPU version of Scan,
+* a CPU version of Compact that does not use the Scan algorithm,
+* a CPU version of Compact that uses Scan,
+* a naive GPU version of Scan,
+* a work-efficient GPU version of Scan, and
+* a work-efficient GPU version of Compact that reuses the work-efficient Scan's code.
+
+All three CPU algorithms are serialized; no multi-threading was incorporated. The Thrust library's version of Scan was also compared against the rest of these algorithms as an additional reference.
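+
+For reference, the sketch below illustrates the behavior every implementation above is meant to reproduce: an exclusive prefix sum, and a compaction that removes the zeroes from an array. This is only an illustration of the expected input/output semantics (the function names here are placeholders), not the project's timed code:
+
+```cpp
+#include <vector>
+
+// Exclusive prefix sum: out[i] = sum of in[0..i-1], with out[0] = 0.
+std::vector<int> exclusiveScan(const std::vector<int>& in) {
+    std::vector<int> out(in.size(), 0);
+    for (size_t i = 1; i < in.size(); i++) {
+        out[i] = out[i - 1] + in[i - 1];
+    }
+    return out;
+}
+
+// Stream compaction: keep only the non-zero elements, preserving their order.
+std::vector<int> compact(const std::vector<int>& in) {
+    std::vector<int> out;
+    for (int x : in) {
+        if (x != 0) out.push_back(x);
+    }
+    return out;
+}
+
+// exclusiveScan({3, 1, 7, 0, 4}) -> {0, 3, 4, 11, 11}
+// compact({3, 0, 7, 0, 4})       -> {3, 7, 4}
+```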
+
+## Performance Analysis Methods
+
+The CPU and GPU algorithms were timed during their execution, and their times are reported in output printed to the terminal. An example of that output is as follows:
+
+```
+****************
+** SCAN TESTS **
+****************
+    [ 18 44 40 43 2 39 23 12 8 5 11 16 31 ... 5 0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 0.008ms    (std::chrono Measured)
+    [ 0 18 62 102 145 147 186 209 221 229 234 245 261 ... 100690 100695 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 0.0069ms    (std::chrono Measured)
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 0.04208ms    (CUDA Measured)
+    [ 0 18 62 102 145 147 186 209 221 229 234 245 261 ... 100690 100695 ]
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 0.038656ms    (CUDA Measured)
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 0.093504ms    (CUDA Measured)
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 0.083968ms    (CUDA Measured)
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 0.04832ms    (CUDA Measured)
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 0.04752ms    (CUDA Measured)
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [ 0 2 2 3 2 3 1 2 0 1 3 2 1 ... 1 0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 0.0082ms    (std::chrono Measured)
+    [ 2 2 3 2 3 1 2 1 3 2 1 3 3 ... 2 1 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 0.0082ms    (std::chrono Measured)
+    [ 2 2 3 2 3 1 2 1 3 2 1 3 3 ... 1 1 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 0.0287ms    (std::chrono Measured)
+    [ 2 2 3 2 3 1 2 1 3 2 1 3 3 ... 2 1 ]
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 0.124928ms    (CUDA Measured)
+    [ 2 2 3 2 3 1 2 1 3 2 1 3 3 ... 2 1 ]
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 0.122176ms    (CUDA Measured)
+    passed
+```
+
+To collect data, I ran the program five times in succession and recorded the time values displayed for the "power-of-two" arrays in each implementation. In each run, the program generates one new array of random values that all of the algorithms operate on, so within a run the input data is consistent across implementations. I varied the size of the arrays by powers of two, starting from 256. While I did not formally record values for the non-power-of-two arrays, I kept an eye on their values and noted when there were any outliers.
+
+## Scan Runtime Analysis
+
+The performance of the four scan functions is graphed below.
+
+![](img/scan_perf.png)
+
+* Initially, the GPU implementations of Scan seemed much slower than the CPU implementation, since they were tested on an array size of 256 and the CPU's base time is extremely small. However, as the array size was increased exponentially, the performance of both GPU implementations slowly approached that of the CPU implementation, and even surpassed it at an array size of 131,072.
+* The CPU's time increases by **38,932%** from an array size of 256 to 131,072, while the Naive GPU Scan slows down by **199%** and the Work-Efficient Scan by **343%**. (Percentages here are the increase in measured time relative to the time at array size 256.)
+* On the graph, the CPU has a linear trajectory, while both the Naive and Work-Efficient Scans follow a log(n) trajectory; if more array sizes were tested, this trend predicts that both GPU implementations would remain faster than the CPU's, and that the larger the array, the greater the GPU's advantage.
+
+Unfortunately, the Work-Efficient Scan is less efficient than the Naive Scan. Besides its higher base time, the slope of its curve is steeper than that of the Naive Scan. The implementation of the Work-Efficient Scan involves two `for` loops, one for the "up-sweep" of the initial array and one for the "down-sweep" on the result, and thus roughly twice as many kernel launches. Commenting out one of these `for` loops makes the Work-Efficient Scan slightly faster than the Naive Scan (although wrong, of course), which suggests that the doubled number of launches is what slows the implementation down. I confirmed this when I looked at the kernels' runtimes through the NSight analysis interface:
+
+![](img/scan_cuda_analysis.png)
+
+The total kernel runtime of the Work-Efficient Scan is about twice that of the Naive Scan. This points to a lack of optimization in the kernels themselves; they are not efficient enough to offset the doubled number of launches. Indeed, many of the threads in each launch do no useful work, and the array accesses are not contiguous in memory, both of which contribute to the slowness of the algorithm.
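+
+As a rough sanity check on the launch counts, the arithmetic below reproduces the "twice as many kernel launches" observation. This is a standalone sketch: `ilog2ceil` here is a local stand-in for the project's helper of the same name, and the counts mirror the loop structure described above rather than profiled data.
+
+```cpp
+#include <cstdio>
+
+// Number of scan levels needed for n elements (stand-in for the project's ilog2ceil).
+static int ilog2ceil(int n) {
+    int levels = 0;
+    while ((1 << levels) < n) {
+        levels++;
+    }
+    return levels;
+}
+
+int main() {
+    int n = 131072;
+    int naiveLaunches = ilog2ceil(n) + 1;          // one launch per level, plus the shift-to-exclusive pass
+    int efficientLaunches = 2 * ilog2ceil(n) + 1;  // up-sweep levels + down-sweep levels + root reset
+    printf("naive: %d launches, work-efficient: %d launches\n",
+           naiveLaunches, efficientLaunches);      // 18 vs. 35 launches for n = 131072
+    return 0;
+}
+```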
+
+Interestingly, the Thrust Scan's performance stays relatively constant no matter the array size. I could not find its kernels on NSight's timeline, even when it was the only function being called, so I cannot genuinely analyze its performance or hypothesize about why it is so consistent.
+
+## Compact Runtime Analysis
+
+The performance of the three compact functions is graphed below.
+
+![](img/compact_perf.png)
+
+These results align with the observations from the Scan part of this project.
+
+* As with the Scan implementations, the GPU implementation is slower than the CPU implementations at array size 256, but proves more efficient as the array size increases, running faster than both CPU implementations at an array size of 131,072.
+* For Compact without Scan on an array of 131,072, the CPU's time increases by **45,400%** compared to its time at array size 256, while Compact with Scan runs **24,539%** slower than its base time. This contrasts with the **124%** change in performance observed for the Work-Efficient Compact function.
+* Aside from some fluctuation in the graph, the Work-Efficient Compact follows a log(n) trajectory with a very small slope. This contrasts with the linear trajectories of both CPU implementations, mirroring what was demonstrated for Scan.
+
+As a whole, even though the base times of these work-efficient algorithms are high, their times grow much more slowly than those of their CPU counterparts, and they outperform the CPU implementations when applied to larger sets of data.
+
+## Outliers
+
+While I was collecting data, I observed some abnormal values that appeared and disappeared between runs of the program. For some sizes of the randomly-generated array, the algorithms produce fluctuating timings that deviate from the expected trajectory of their data. A notable example happens with the Thrust Scan: when I tested an array size of 1021 with block size 128 during my implementation process, the Thrust Scan reported **0.231424 ms**, **five times slower** than its expected average. It showed something similar at array size 32,768, spiking again to a value between 0.20 and 0.25 ms, equivalent in magnitude to the other lag spike. There was also some interference with the Work-Efficient Scan and Compact algorithms, especially the Work-Efficient Compact; the most out-of-place value I observed was **0.81392 ms** at array size 32,768. Though this instability explains the fluctuation in the slopes of their graphs, I cannot explain why it happens, because the size of the array stays constant (and subsequent trials show more expected results).
+
+## Block Size Optimization
+
+The process of optimizing the block size was informal; I ran the program a number of times and eyeballed the differences that resulted from different block sizes (though some changes were indiscernible). My testing values for the block size ranged from 32 to 512; those end values were inefficient, and it did not make sense to continue testing further in either direction.
+
+* There was a notable difference in performance between block sizes 32 and 64; with an array size of 32,768 and a block size of 32, the Work-Efficient Scan was about 0.1 ms slower than with block size 64.
+* It was difficult to discern a difference between block sizes 64 and 128.
+* A block size of 256 seems to make the Work-Efficient Scan faster, dropping its time by about 0.2 ms on average, but it taxes the Work-Efficient Compact slightly, by 0.1-0.2 ms.
+* There is much more performance fluctuation at 512, causing the Work-Efficient Scan and Compact to lag abnormally, so it is not ideal for these algorithms.
+
+## Radix Sort (Extra Credit)
+
+I implemented radix sort and wrote some test cases for it. Since radix sort draws upon the Scan algorithm for its implementation, its performance depends on the performance of the Work-Efficient Scan, which is not at its most efficient currently. A sketch of one sorting pass is given below.
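+
+For context, each pass of this split-based radix sort builds the same per-bit arrays that the kernels in `efficient.cu` compute (`b`, `e`, `f`, `t`, `d`) and then scatters. The CPU sketch below is illustrative only — the project's version runs each step as a kernel and uses the work-efficient scan to produce `f`:
+
+```cpp
+#include <vector>
+
+// One pass of a split-based radix sort on a single bit (CPU sketch).
+std::vector<int> radixPass(const std::vector<int>& in, int bit) {
+    int n = (int)in.size();
+    std::vector<int> b(n), e(n), f(n), out(n);
+    for (int i = 0; i < n; i++) {
+        b[i] = (in[i] >> bit) & 1;  // b: the bit currently being sorted on
+        e[i] = !b[i];               // e: 1 where that bit is 0 ("false")
+    }
+    int sum = 0;
+    for (int i = 0; i < n; i++) {   // f: exclusive scan of e (the GPU scan's job in this project)
+        f[i] = sum;
+        sum += e[i];
+    }
+    int totalFalses = sum;
+    for (int i = 0; i < n; i++) {
+        int t = i - f[i] + totalFalses;  // t: destination if the bit is 1
+        int d = b[i] ? t : f[i];         // d: final destination for element i
+        out[d] = in[i];
+    }
+    return out;
+}
+```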
An example of my test case outputs is below: + +``` +***************************** +** RADIX SORT TESTS ** +***************************** + [ 20 88 42 96 80 14 7 35 81 54 67 80 32 ... 75 0 ] +==== radix, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 1 1 ... 99 99 ] + passed +==== radix, non power-of-two ==== + passed +==== radix, unsorted array of 1 - 8 ==== + [ 0 1 2 3 4 5 6 7 ] + passed +==== radix, array with duplicates ==== + [ 1 4 28 42 42 53 129 129 ] + passed +==== radix, backwards sorted array ==== + passed +``` \ No newline at end of file diff --git a/img/compact_perf.png b/img/compact_perf.png new file mode 100644 index 0000000..111540a Binary files /dev/null and b/img/compact_perf.png differ diff --git a/img/example_output.png b/img/example_output.png new file mode 100644 index 0000000..803f059 Binary files /dev/null and b/img/example_output.png differ diff --git a/img/scan_cuda_analysis.png b/img/scan_cuda_analysis.png new file mode 100644 index 0000000..c7e958a Binary files /dev/null and b/img/scan_cuda_analysis.png differ diff --git a/img/scan_perf.png b/img/scan_perf.png new file mode 100644 index 0000000..9f86fc2 Binary files /dev/null and b/img/scan_perf.png differ diff --git a/src/main.cpp b/src/main.cpp index 896ac2b..536fa5b 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -13,11 +13,13 @@ #include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE = 1 << 10; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int *a = new int[SIZE]; int *b = new int[SIZE]; int *c = new int[SIZE]; +int *d = new int[8]; +int* e = new int[SIZE]; int main(int argc, char* argv[]) { // Scan tests @@ -44,29 +46,33 @@ int main(int argc, char* argv[]) { printDesc("cpu scan, non-power-of-two"); StreamCompaction::CPU::scan(NPOT, c, a); printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(NPOT, b, true); + //printArray(NPOT, c, true); printCmpResult(NPOT, b, c); - + zeroArray(SIZE, c); printDesc("naive scan, power-of-two"); StreamCompaction::Naive::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); - /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan - onesArray(SIZE, c); - printDesc("1s array for finding bugs"); - StreamCompaction::Naive::scan(SIZE, c, a); - printArray(SIZE, c, true); */ - zeroArray(SIZE, c); printDesc("naive scan, non-power-of-two"); StreamCompaction::Naive::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + //printArray(NPOT, c, true); printCmpResult(NPOT, b, c); + /* For debugging purposes + for (int i = 0; i < 8; i++) { + d[i] = i; + } + printDesc("1 - 8 array"); + StreamCompaction::Efficient::scan(8, c, d); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + printArray(8, c, true); + */ + zeroArray(SIZE, c); printDesc("work-efficient scan, power-of-two"); StreamCompaction::Efficient::scan(SIZE, c, a); @@ -137,7 +143,7 @@ int main(int argc, char* argv[]) { printDesc("work-efficient compact, power-of-two"); count = StreamCompaction::Efficient::compact(SIZE, c, a); 
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedCount, b, c); zeroArray(SIZE, c); @@ -147,8 +153,89 @@ int main(int argc, char* argv[]) { //printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); + printf("\n"); + printf("*****************************\n"); + printf("** RADIX SORT TESTS **\n"); + printf("*****************************\n"); + + genArray(SIZE - 1, a, 100); // Leave a 0 at the end to test that edge case + a[SIZE - 1] = 0; + printArray(SIZE, a, true); + + printDesc("radix, power-of-two"); + StreamCompaction::Radix::radixSort(SIZE, c, a); + printArray(SIZE, c, true); + for (int i = 0; i < SIZE - 1; i++) { + if (c[i] > c[i + 1]) { + printf(" FAILED\n"); + break; + } + else if (i == SIZE - 2) { + printf(" passed\n"); + } + } + + printDesc("radix, non power-of-two"); + StreamCompaction::Radix::radixSort(NPOT, c, a); + for (int i = 0; i < NPOT - 1; i++) { + if (c[i] > c[i + 1]) { + printf(" FAILED\n"); + printArray(NPOT, c, true); + break; + } + else if (i == NPOT - 2) { + printf(" passed\n"); + } + } + + d[0] = 4, d[1] = 7, d[2] = 2, d[3] = 6, d[4] = 3, d[5] = 5, d[6] = 1, d[7] = 0; + printDesc("radix, unsorted array of 1 - 8"); + StreamCompaction::Radix::radixSort(8, c, d); + printArray(8, c, true); + for (int i = 0; i < 8 - 1; i++) { + if (c[i] > c[i + 1]) { + printf(" FAILED\n"); + break; + } + else if (i == 8 - 2) { + printf(" passed\n"); + } + } + + d[0] = 42, d[1] = 42, d[2] = 129, d[3] = 129, d[4] = 53, d[5] = 1, d[6] = 4, d[7] = 28; + printDesc("radix, array with duplicates"); + StreamCompaction::Radix::radixSort(8, c, d); + printArray(8, c, true); + for (int i = 0; i < 8 - 1; i++) { + if (c[i] > c[i + 1]) { + printf(" FAILED\n"); + break; + } + else if (i == 8 - 2) { + printf(" passed\n"); + } + } + + for (int i = 0; i < SIZE; i++) { + b[i] = SIZE - i; + } + + printDesc("radix, backwards sorted array"); + StreamCompaction::Radix::radixSort(SIZE, c, b); + for (int i = 0; i < SIZE - 1; i++) { + if (c[i] > c[i + 1]) { + printf(" FAILED\n"); + break; + } + else if (i == SIZE - 2) { + printf(" passed\n"); + } + } + system("pause"); // stop Win32 console from closing on exit delete[] a; delete[] b; delete[] c; + delete[] d; + delete[] e; } diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 2ed6d63..a9c94ca 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -23,7 +23,11 @@ namespace StreamCompaction { * which map to 0 will be removed, and elements which map to 1 will be kept. */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO + int index = blockIdx.x * blockDim.x + threadIdx.x; + if (index >= n) { + return; + } + bools[index] = idata[index] == 0 ? 
0 : 1; } /** @@ -32,7 +36,14 @@ namespace StreamCompaction { */ __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { - // TODO + int index = blockIdx.x * blockDim.x + threadIdx.x; + if (index >= n) { + return; + } + + if (idata[index] != 0) { + odata[bools[index]] = idata[index]; + } } } diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 719fa11..14af4c3 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -19,7 +19,16 @@ namespace StreamCompaction { */ void scan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + for (int i = 0; i < n; i++) { + odata[i] = idata[i]; + } + + // exclusive scan + odata[0] = 0; + + for (int i = 0; i < n - 1; i++) { + odata[i + 1] = odata[i] + idata[i]; + } timer().endCpuTimer(); } @@ -30,9 +39,19 @@ namespace StreamCompaction { */ int compactWithoutScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + const int size = n; + // Map to temp array + + int oIndex = 0; + for (int i = 0; i < n; i++) { + if (idata[i] != 0) { + odata[oIndex] = idata[i]; + oIndex++; + } + } + timer().endCpuTimer(); - return -1; + return oIndex; } /** @@ -42,9 +61,35 @@ namespace StreamCompaction { */ int compactWithScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + const int size = n; + + // Map to temp array + int* tempArray = new int[size]; + for (int i = 0; i < n; i++) { + tempArray[i] = (idata[i] == 0) ? 0 : 1; + } + + // Exclusive scan + int* scannedArray = new int[size]; + for (int i = 0; i < n; i++) { + scannedArray[i] = tempArray[i]; + } + scannedArray[0] = 0; + for (int i = 0; i < n - 1; i++) { + scannedArray[i + 1] = scannedArray[i] + tempArray[i]; + } + + // Scatter + int count = 0; + for (int i = 0; i < n; i++) { + if (tempArray[i] == 1) { + odata[scannedArray[i]] = idata[i]; + count++; + } + } + timer().endCpuTimer(); - return -1; + return count; } } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 2db346e..5d6b4ca 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -3,6 +3,26 @@ #include "common.h" #include "efficient.h" +#define blockSize 256 +#define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__) + + +// General globals +int* dev_data; +int* dev_oData; + +// Scan and Compact +int* dev_scanData; +int* dev_boolData; + +// Radix sort +int* dev_bData; +int* dev_eData; +int* dev_fData; +int* dev_tData; +int* dev_dData; +int* dev_totalFalses; + namespace StreamCompaction { namespace Efficient { using StreamCompaction::Common::PerformanceTimer; @@ -12,13 +32,68 @@ namespace StreamCompaction { return timer; } + __global__ void kernStepUpSweep(int n, int *data, int pow2) { + int index = blockIdx.x * blockDim.x + threadIdx.x; + if (index >= n) { + return; + } + + if (index % (2 * pow2) == 0) { + data[index + 2 * pow2 - 1] += data[index + pow2 - 1]; + } + + } + + __global__ void kernSetRootNode(int n, int* data) { + int index = blockIdx.x * blockDim.x + threadIdx.x; + if (index >= n) { + return; + } + + data[n - 1] = 0; + } + + __global__ void kernStepDownSweep(int n, int* data, int pow2) { + int index = blockIdx.x * blockDim.x + threadIdx.x; + if (index >= n) { + return; + } + + if (index % (2 * pow2) == 0) { + int t = data[index + pow2 - 1]; + data[index + pow2 - 1] = data[index + 2 * pow2 - 1]; + data[index + 2 * pow2 - 1] += t; + } + } + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
          */
         void scan(int n, int *odata, const int *idata) {
+            int numBlocks = ceil((float)n / (float)blockSize);
+            int log2n = ilog2ceil(n);
+            const int size = (int)powf(2, log2n);
+
+            cudaMalloc((void**)&dev_data, sizeof(int) * size);
+            cudaMemcpy(dev_data, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+
             timer().startGpuTimer();
-            // TODO
+            for (int d = 0; d <= log2n - 1; d++) {
+                kernStepUpSweep<<<numBlocks, blockSize>>>(size, dev_data, (int)powf(2, d));
+            }
+
+            kernSetRootNode<<<1, 1>>>(size, dev_data);
+
+            for (int d = log2n - 1; d >= 0; d--) {
+                kernStepDownSweep<<<numBlocks, blockSize>>>(size, dev_data, (int)powf(2, d));
+            }
             timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_data, sizeof(int) * n, cudaMemcpyDeviceToHost);
+
+            cudaFree(dev_data);
         }
 
         /**
@@ -31,10 +106,170 @@ namespace StreamCompaction {
          * @returns The number of elements remaining after compaction.
          */
         int compact(int n, int *odata, const int *idata) {
+            int numBlocks = ceil((float)n / (float)blockSize);
+            int log2n = ilog2ceil(n);
+            const int size = (int)powf(2, log2n);
+
+            cudaMalloc((void**)&dev_data, sizeof(int) * size);
+            cudaMalloc((void**)&dev_boolData, sizeof(int) * size);
+            cudaMalloc((void**)&dev_oData, sizeof(int) * n);
+
+            cudaMemcpy(dev_data, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+
             timer().startGpuTimer();
-            // TODO
+            // Make temporary array
+            StreamCompaction::Common::kernMapToBoolean<<<numBlocks, blockSize>>>(size, dev_boolData, dev_data);
+
+            // Scan
+            for (int d = 0; d <= log2n - 1; d++) {
+                kernStepUpSweep<<<numBlocks, blockSize>>>(size, dev_boolData, (int)powf(2, d));
+            }
+
+            kernSetRootNode<<<1, 1>>>(size, dev_boolData);
+
+            for (int d = log2n - 1; d >= 0; d--) {
+                kernStepDownSweep<<<numBlocks, blockSize>>>(size, dev_boolData, (int)powf(2, d));
+            }
+
+            StreamCompaction::Common::kernScatter<<<numBlocks, blockSize>>>(n, dev_oData, dev_data, dev_boolData, nullptr);
             timer().endGpuTimer();
-            return -1;
+
+            int* boolArray = new int[n];
+            cudaMemcpy(odata, dev_oData, sizeof(int) * n, cudaMemcpyDeviceToHost);
+            cudaMemcpy(boolArray, dev_boolData, sizeof(int) * n, cudaMemcpyDeviceToHost);
+
+            cudaFree(dev_data);
+            cudaFree(dev_boolData);
+            cudaFree(dev_oData);
+
+            if (idata[n - 1] == 0) {
+                return boolArray[n - 1];
+            }
+
+            return boolArray[n - 1] + 1;
         }
     }
+
+
+    namespace Radix {
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
+        }
+
+        // bit number goes from 1 to n
+        __global__ void kernComputeBArray(int n, int bitNumber, int* bdata, const int* idata) {
+            int index = blockIdx.x * blockDim.x + threadIdx.x;
+            if (index >= n) {
+                return;
+            }
+
+            int data = idata[index];
+            int bit = (data & (1 << bitNumber - 1)) != 0;
+            bdata[index] = bit;
+        }
+
+        __global__ void kernComputeEArray(int n, int* edata, int* fdata, const int* bdata) {
+            int index = blockIdx.x * blockDim.x + threadIdx.x;
+            if (index >= n) {
+                return;
+            }
+            edata[index] = !(bdata[index]);
+            fdata[index] = edata[index];
+        }
+
+        __global__ void kernComputeTotalFalses(int n, int* out, int* edata, int* fdata) {
+            int index = blockIdx.x * blockDim.x + threadIdx.x;
+            if (index >= n) {
+                return;
+            }
+            *out = edata[n - 1] + fdata[n - 1];
+        }
+
+        __global__ void kernComputeTArray(int n, int* tdata, int* edata, int* fdata, int* totalFalses) {
+            int index = blockIdx.x * blockDim.x + threadIdx.x;
+            if (index >= n) {
+                return;
+            }
+            tdata[index] = index - fdata[index] + *totalFalses;
+        }
+
+        __global__ void kernComputeDArray(int n, int* ddata, int* bdata, int* fdata, int* tdata) {
+            int index = blockIdx.x * blockDim.x + threadIdx.x;
+            if (index >= n) {
+                return;
+            }
+            ddata[index] = bdata[index] ? tdata[index] : fdata[index];
+        }
+
+        __global__ void kernScatterRadix(int n, int* odata, int* ddata, const int* idata) {
+            int index = blockIdx.x * blockDim.x + threadIdx.x;
+            if (index >= n) {
+                return;
+            }
+
+            odata[ddata[index]] = idata[index];
+        }
+
+        void radixSort(int n, int* odata, const int* idata) {
+            int numBlocks = ceil((float)n / (float)blockSize);
+            int log2n = ilog2ceil(n);
+            const int size = (int)powf(2, log2n);
+
+            cudaMalloc((void**)&dev_data, sizeof(int) * n);
+            cudaMalloc((void**)&dev_bData, sizeof(int) * n);
+            cudaMalloc((void**)&dev_eData, sizeof(int) * n);
+            cudaMalloc((void**)&dev_fData, sizeof(int) * size);
+            cudaMalloc((void**)&dev_tData, sizeof(int) * n);
+            cudaMalloc((void**)&dev_dData, sizeof(int) * n);
+            cudaMalloc((void**)&dev_totalFalses, sizeof(int));
+
+            cudaMemcpy(dev_data, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+
+            // Find max element.
+            int maxElement = 0;
+            for (int i = 0; i < n; i++) {
+                if (idata[i] > maxElement)
+                    maxElement = idata[i];
+            }
+
+            for (int bitNumber = 1; maxElement / ((int)powf(2, bitNumber - 1)) > 0; bitNumber++) {
+                kernComputeBArray<<<numBlocks, blockSize>>>(n, bitNumber, dev_bData, dev_data);
+                kernComputeEArray<<<numBlocks, blockSize>>>(n, dev_eData, dev_fData, dev_bData);
+
+                // Compute f Array
+                for (int d = 0; d <= log2n - 1; d++) {
+                    StreamCompaction::Efficient::kernStepUpSweep<<<numBlocks, blockSize>>>(size, dev_fData, (int)powf(2, d));
+                }
+
+                StreamCompaction::Efficient::kernSetRootNode<<<1, 1>>>(size, dev_fData);
+
+                for (int d = log2n - 1; d >= 0; d--) {
+                    StreamCompaction::Efficient::kernStepDownSweep<<<numBlocks, blockSize>>>(size, dev_fData, (int)powf(2, d));
+                }
+
+                kernComputeTotalFalses<<<1, 1>>>(n, dev_totalFalses, dev_eData, dev_fData);
+                kernComputeTArray<<<numBlocks, blockSize>>>(n, dev_tData, dev_eData, dev_fData, dev_totalFalses);
+                kernComputeDArray<<<numBlocks, blockSize>>>(n, dev_dData, dev_bData, dev_fData, dev_tData);
+                kernScatterRadix<<<numBlocks, blockSize>>>(n, dev_data, dev_dData, dev_data);
+            }
+
+            cudaMemcpy(odata, dev_data, sizeof(int) * n, cudaMemcpyDeviceToHost);
+
+            cudaFree(dev_data);
+            cudaFree(dev_bData);
+            cudaFree(dev_eData);
+            cudaFree(dev_fData);
+            cudaFree(dev_tData);
+            cudaFree(dev_dData);
+            cudaFree(dev_totalFalses);
+
+        }
+    }
+
 }
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
index 803cb4f..f004913 100644
--- a/stream_compaction/efficient.h
+++ b/stream_compaction/efficient.h
@@ -10,4 +10,11 @@ namespace StreamCompaction {
 
         int compact(int n, int *odata, const int *idata);
     }
+
+    namespace Radix {
+        StreamCompaction::Common::PerformanceTimer& timer();
+
+        void radixSort(int n, int* odata, const int* idata);
+    }
+
 }
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 4308876..b9b6c71 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -2,6 +2,13 @@
 #include
 #include "common.h"
 #include "naive.h"
+#include
+
+#define blockSize 256
+#define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__)
+
+int *dev_idata;
+int* dev_odata;
 
 namespace StreamCompaction {
     namespace Naive {
@@ -11,15 +18,65 @@
             static PerformanceTimer timer;
             return timer;
         }
-        // TODO: __global__
+
+        __global__ void kernStepNaiveScan(int n, int *odata, int *idata, int pow2) {
+            int index = blockIdx.x * blockDim.x + threadIdx.x;
+            if (index >= n) {
+                return;
+            }
+
+            if (index >= pow2) {
+                odata[index] = idata[index - pow2] + idata[index];
+            }
+            else {
+                odata[index] = idata[index];
+            }
+        }
+
+        __global__ void kernMakeExclusive(int n, int *odata, int *idata) {
+            int index = blockIdx.x * blockDim.x + threadIdx.x;
+            if (index >= n) {
+                return;
+            }
+
+            if (index == 0) {
+                odata[index] = 0;
+            }
+            else {
+                odata[index] = idata[index - 1];
+            }
+        }
 
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
         void scan(int n, int *odata, const int *idata) {
+            int numBlocks = ceil((float)n / (float)blockSize);
+            const int size = n;
+            cudaMalloc((void**)&dev_idata, sizeof(int) * n);
+            cudaMalloc((void**)&dev_odata, sizeof(int) * n);
+            cudaMemcpy(dev_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+
             timer().startGpuTimer();
-            // TODO
+            int log2n = ilog2ceil(n);
+            for (int d = 1; d <= log2n; d++) {
+                kernStepNaiveScan<<<numBlocks, blockSize>>>(n, dev_odata, dev_idata, (int)powf(2, d - 1));
+                if (d < log2n) {
+                    int* tempPtr = dev_odata;
+                    dev_odata = dev_idata;
+                    dev_idata = tempPtr;
+                }
+            }
+
+            // The correct data will be in odata, now have to make exclusive and store
+            // in idata, contrary to the original name's intention
+            kernMakeExclusive<<<numBlocks, blockSize>>>(n, dev_idata, dev_odata);
             timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_idata, sizeof(int) * n, cudaMemcpyDeviceToHost);
+
+            cudaFree(dev_idata);
+            cudaFree(dev_odata);
         }
     }
 }
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e..c473f0a 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -6,6 +6,9 @@
 #include "common.h"
 #include "thrust.h"
 
+int* dev_inData;
+int* dev_outData;
+
 namespace StreamCompaction {
     namespace Thrust {
         using StreamCompaction::Common::PerformanceTimer;
@@ -18,10 +21,14 @@
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
         void scan(int n, int *odata, const int *idata) {
+
             timer().startGpuTimer();
-            // TODO use `thrust::exclusive_scan`
-            // example: for device_vectors dv_in and dv_out:
-            //   thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            thrust::host_vector<int> host_v(n);
+            thrust::copy(idata, idata + n, host_v.begin());
+            thrust::device_vector<int> dev_v = host_v;
+            thrust::device_vector<int> out_v(n);
+            thrust::exclusive_scan(dev_v.begin(), dev_v.end(), out_v.begin());
+            thrust::copy(out_v.begin(), out_v.end(), odata);
             timer().endGpuTimer();
         }
     }