diff --git a/README.md b/README.md
index 0e38ddb..e00d51a 100644
--- a/README.md
+++ b/README.md
@@ -3,12 +3,137 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Yichen Shou
+  * [LinkedIn](https://www.linkedin.com/in/yichen-shou-68023455/), [personal website](http://www.yichenshou.com/)
+* Tested on: Windows 10, i7-6500U @ 2.50GHz 12GB RAM, GeForce 940M 8GB (Personal Laptop)
 
-### (TODO: Your README)
+## Project Overview
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+This project implements parallel reduction and stream compaction on the GPU using CUDA. Sequential reduction and compaction are first implemented on the CPU to form a ground truth to compare against. Both naive and work-efficient versions of the parallel algorithms are then implemented on the GPU to measure the degree of speedup.
+
+Before starting, I tested the speed of my GPU with different block sizes. Sizes of 128, 512, and 1024 performed similarly to each other, while 256 was a bit slower. In the end I chose a block size of 128. Also note that the arrays I tested with only go up to 2^24 elements; past that I start to run out of memory and run into int overflow problems.
+
+### What is parallel reduction and stream compaction?
+
+Parallel reduction refers to any operation that takes an array of values and produces a single value. In this case, I'm computing the sum of an array of ints. This is trivial on the CPU: just loop over the array and add every element. If we store the sum of the values from index 0 through i at index i, we get an _inclusive scan_. On the other hand, storing the sum of the values from index 0 through i - 1 at index i yields an _exclusive scan_.
+
+Stream compaction uses exclusive scans to remove unwanted values from an array. From an initial array A of size n, a boolean array B of the same size is produced: B[i] contains a 1 if A[i] holds a valid value, and a 0 if A[i] holds an invalid value. An exclusive scan is then run on the boolean array B to produce array C. Finally, you iterate through B again: wherever B[i] is 1, C[i] gives the index at which to place A[i] in the final output array. This final pass through B, which routes valid values to their computed indices, is called a _scatter_; it throws out the invalid values and packs only the valid ones into the final array. Stream compaction is very useful in programs like path tracers.
+
+![](img/compactExplanation.PNG)
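+
+To make the scan-then-scatter flow concrete, here is a simplified CPU sketch. (The project's actual version, `compactWithScan` in `cpu.cu` further down in this diff, reuses `odata` for the scan in place; the separate `indices` buffer here is purely for readability.)
+
+```cpp
+#include <vector>
+
+// Compacts the non-zero elements of idata into odata; returns the count kept.
+int compactWithScanSketch(int n, int* odata, const int* idata) {
+    // Exclusive scan of the "is valid" booleans gives each element's output index.
+    std::vector<int> indices(n, 0);
+    for (int i = 1; i < n; i++) {
+        indices[i] = indices[i - 1] + (idata[i - 1] != 0 ? 1 : 0);
+    }
+    // Scatter: each valid element lands at its scanned index.
+    int count = 0;
+    for (int i = 0; i < n; i++) {
+        if (idata[i] != 0) {
+            odata[indices[i]] = idata[i];
+            count++;
+        }
+    }
+    return count;
+}
+```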
+
+### How to implement this on the GPU?
+
+The process is a bit more complicated on the GPU. Naively, you can iterate through the array multiple times and sum up neighbors using an offset. The offset is 2^(d-1), where d is the current pass. So on pass 1, you add elements i and i+1 and store the result at i+1. On pass 2, you add i and i+2 and store the result at i+2. Pass 3 adds i and i+4 and stores the result at i+4, and so on until everything has been added and stored at the last index. The complexity of this is O(n log n), compared to the CPU implementation's O(n).
+
+![](img/naiveParallel.PNG)
+
+The efficient version of the implementation is more complicated and has two steps: up-sweep and down-sweep. During up-sweep, elements in the array are added up in a balanced binary-tree-like fashion, similar to the naive version but with fewer operations. Because we're emulating a balanced binary tree, arrays whose sizes are not a power of 2 must be padded out to the next power of 2.
+
+![](img/upSweep.PNG)
+
+The down-sweep is a bit less straightforward. First, you set the last element to 0. Then, traversing down the binary tree, you store each node's value in its left child, and store the node's value plus the original value of the left child in the right child. Rinse and repeat until every leaf has been visited.
+
+![](img/downSweep.PNG)
+
+Finally, compact can be done in the same fashion after using this parallel method to compute the exclusive scan.
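+
+In kernel form, each level of the two sweeps maps directly onto the `kernUpSweep`/`kernDownSweep` kernels added in `efficient.cu` below (condensed here for illustration; the real down-sweep kernel receives the half-interval as a parameter):
+
+```cuda
+// One up-sweep level: every node at this depth adds its left child into itself.
+// interval is 2^(d+1) at depth d, doubling each level.
+__global__ void kernUpSweep(int n, int* data, int interval) {
+    int index = blockIdx.x * blockDim.x + threadIdx.x;
+    if (index < n && index % interval == 0) {
+        data[index + interval - 1] += data[index + (interval >> 1) - 1];
+    }
+}
+
+// One down-sweep level: push the parent down into the left child, and
+// parent + old left child into the right child.
+__global__ void kernDownSweep(int n, int* data, int interval) {
+    int index = blockIdx.x * blockDim.x + threadIdx.x;
+    if (index < n && index % interval == 0) {
+        int half = interval >> 1;
+        int t = data[index + half - 1];
+        data[index + half - 1] = data[index + interval - 1];
+        data[index + interval - 1] += t;
+    }
+}
+```
+
+The "optimized thread number" variants in the performance analysis below compute `index = thread * interval` instead of filtering with a modulo, so only as many threads as there are active tree nodes are launched at each level.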
+
+## Performance Analysis
+
+### CPU implementation
+
+![](img/cpuScan.PNG)
+
+First up is CPU scan by itself. The algorithm operating on a power-of-two array gets slightly outperformed by one operating on a non-power-of-two array. This makes sense because the non-power-of-two array is slightly shorter.
+
+![](img/cpuCompact.PNG)
+
+I did a little experiment with the CPU compact implementation. Normally, the algorithm iterates through the entire array once for the exclusive scan and then again for the scatter step, resulting in two full passes over the array. I tried implementing an "optimized" version that does both scan and scatter in a single pass. This version proved to be faster than the normal version, though still not as fast as compaction without scan, which probably has less computation to do since it doesn't concern itself with reduction at all.
+
+### GPU implementation
+
+![](img/gpuScan.PNG)
+
+The graph is a bit hard to read because each algorithm's power-of-two and non-power-of-two versions mostly overlap, but it's clear that the optimized efficient scan runs the fastest. The non-optimized efficient scan is initially slower than the naive scan, but emerges victorious after 2^16.
+
+The optimization simply involves some careful index calculations and tweaking the number of blocks launched to increase GPU throughput. In the regular efficient implementation, if you launch as many threads as there are array elements on every pass, most of the threads end up idle: up-sweep and down-sweep traverse binary trees, so with every level the number of threads needed halves (up-sweep) or doubles (down-sweep). Once you take that into account and launch only as many threads as needed, the speed increases because every thread in every warp is doing meaningful computation instead of sitting idle.
+
+![](img/gpuCompact.PNG)
+
+The same speed increase due to the optimization can also be seen with stream compaction on the GPU.
+
+### CPU vs. GPU vs. Thrust
+
+![](img/everythingScan.PNG)
+
+Comparing everything together, it's easy to see that naive scan is the slowest, followed by CPU scan and GPU scan. GPU scan appears slower than CPU scan for smaller array sizes, but triumphs over CPU after 2^18. Fastest of all is Thrust scan, which uses NVIDIA's own Thrust library, built to perform operations like scan in CUDA.
+
+![](img/everythingCompact.PNG)
+
+The compact graph shows the same idea, with GPU compact underperforming for smaller arrays and only winning over CPU compact as the arrays get larger. Surprisingly, GPU compact is still slower than the sequential CPU compact without scan. I think this is probably because I implemented the GPU algorithm entirely in global memory instead of shared memory. Reading and writing global memory is known to be much slower than using the shared memory local to each SM. Using shared memory also forces you to divide the array into smaller chunks, one per block, which leverages the spatial locality of array accesses. This is probably why the Thrust implementation is so fast; it may even apply further optimizations based on what it detects your GPU is capable of.
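+
+As an illustration of the idea (hypothetical; this project does not implement it), a naive Hillis-Steele scan of one block's chunk can be staged entirely through shared memory, so all log2(blockDim.x) passes hit fast on-chip memory; the per-block results would still need a second pass to stitch blocks together:
+
+```cuda
+#define BLOCK_SIZE 128  // assumes the kernel is launched with 128-thread blocks
+
+// Inclusive scan of each block's chunk using double-buffered shared memory.
+__global__ void kernBlockScanShared(int n, int* odata, const int* idata) {
+    __shared__ int temp[2][BLOCK_SIZE];
+    int tid = threadIdx.x;
+    int gid = blockIdx.x * blockDim.x + tid;
+    int in = 0, out = 1;
+    temp[in][tid] = (gid < n) ? idata[gid] : 0;
+    __syncthreads();
+    for (int offset = 1; offset < blockDim.x; offset <<= 1) {
+        if (tid >= offset) {
+            temp[out][tid] = temp[in][tid - offset] + temp[in][tid];
+        } else {
+            temp[out][tid] = temp[in][tid];
+        }
+        __syncthreads();
+        in = 1 - in;    // swap the double buffers
+        out = 1 - out;
+    }
+    if (gid < n) {
+        odata[gid] = temp[in][tid];
+    }
+}
+```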
+
+## Test program output for 2^20
+
+```
+****************
+** SCAN TESTS **
+****************
+    [  24   6   1  13  17  31  49  31  38  39  27  43   2 ...  44   0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 2.12622ms    (std::chrono Measured)
+==== cpu scan, non-power-of-two ====
+   elapsed time: 2.88553ms    (std::chrono Measured)
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 14.2575ms    (CUDA Measured)
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 14.2746ms    (CUDA Measured)
+    passed
+==== work-efficient scan, power-of-two, optimized thread number ====
+   elapsed time: 6.312ms    (CUDA Measured)
+    passed
+==== work-efficient scan, non-power-of-two, optimized thread number ====
+   elapsed time: 6.292ms    (CUDA Measured)
+    passed
+==== work-efficient scan, power-of-two, unoptimized thread number ====
+   elapsed time: 11.4663ms    (CUDA Measured)
+    passed
+==== work-efficient scan, non-power-of-two, unoptimized thread number ====
+   elapsed time: 11.4702ms    (CUDA Measured)
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 1.33098ms    (CUDA Measured)
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 1.9503ms    (CUDA Measured)
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [   1   3   1   2   2   2   0   0   1   1   3   1   3 ...   1   0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 4.2319ms    (std::chrono Measured)
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 5.80108ms    (std::chrono Measured)
+    passed
+==== cpu compact with scan ====
+   elapsed time: 5.31476ms    (std::chrono Measured)
+    passed
+==== cpu compact with scan and scatter in single loop ====
+   elapsed time: 8.32711ms    (std::chrono Measured)
+    passed
+==== work-efficient compact, power-of-two, optimized thread number ====
+   elapsed time: 8.73712ms    (CUDA Measured)
+    passed
+==== work-efficient compact, non-power-of-two, optimized thread number ====
+   elapsed time: 8.72816ms    (CUDA Measured)
+    passed
+==== work-efficient compact, power-of-two, unoptimized thread number ====
+   elapsed time: 13.9075ms    (CUDA Measured)
+    passed
+==== work-efficient compact, non-power-of-two, unoptimized thread number ====
+   elapsed time: 13.9254ms    (CUDA Measured)
+    passed
+```
\ No newline at end of file
diff --git a/data.xlsx b/data.xlsx
new file mode 100644
index 0000000..defebd1
Binary files /dev/null and b/data.xlsx differ
diff --git a/img/compactExplanation.PNG b/img/compactExplanation.PNG
new file mode 100644
index 0000000..8f74795
Binary files /dev/null and b/img/compactExplanation.PNG differ
diff --git a/img/cpuCompact.PNG b/img/cpuCompact.PNG
new file mode 100644
index 0000000..b753df2
Binary files /dev/null and b/img/cpuCompact.PNG differ
diff --git a/img/cpuScan.PNG b/img/cpuScan.PNG
new file mode 100644
index 0000000..f6288a2
Binary files /dev/null and b/img/cpuScan.PNG differ
diff --git a/img/downSweep.PNG b/img/downSweep.PNG
new file mode 100644
index 0000000..6dcf05b
Binary files /dev/null and b/img/downSweep.PNG differ
diff --git a/img/everythingCompact.PNG b/img/everythingCompact.PNG
new file mode 100644
index 0000000..bf705bf
Binary files /dev/null and b/img/everythingCompact.PNG differ
diff --git a/img/everythingScan.PNG b/img/everythingScan.PNG
new file mode 100644
index 0000000..0b7feeb
Binary files /dev/null and b/img/everythingScan.PNG differ
diff --git a/img/gpuCompact.PNG b/img/gpuCompact.PNG
new file mode 100644
index 0000000..f84769c
Binary files /dev/null and b/img/gpuCompact.PNG differ
diff --git a/img/gpuScan.PNG b/img/gpuScan.PNG
new file mode 100644
index 0000000..d0f7620
Binary files /dev/null and b/img/gpuScan.PNG differ
diff --git a/img/naiveParallel.PNG b/img/naiveParallel.PNG
new file mode 100644
index 0000000..6b3c751
Binary files /dev/null and b/img/naiveParallel.PNG differ
diff --git a/img/upSweep.PNG b/img/upSweep.PNG
new file mode 100644
index 0000000..47d2aa0
Binary files /dev/null and b/img/upSweep.PNG differ
diff --git a/src/main.cpp b/src/main.cpp
index 1850161..91dd8a2 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -13,142 +13,311 @@
 #include 
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
-const int NPOT = SIZE - 3; // Non-Power-Of-Two
+const int SIZE = 1 << 20; // feel free to change the size of array
+const int NPOT = SIZE - 5; // Non-Power-Of-Two
 
 int *a = new int[SIZE];
 int *b = new int[SIZE];
 int *c = new int[SIZE];
+
+void runtimeTest(int start, int end, int interval) {
+    for (int i = start; i <= end; i += interval) {
+        int size = 1 << i;
+        int npot = size - 7;
+
+        int *array1 = new int[size];
+        int *array2 = new int[size];
+        int *array3 = new int[size];
+
+        genArray(size - 1, array1, 50); // Leave a 0 at the end to test that edge case
+        array1[size - 1] = 0;
+
+        printf("---------- array size is 2^%d ----------\n", i);
printf("---------- scan ----------\n"); + + zeroArray(size, array2); + //printf("cpu scan, power-of-two: "); + StreamCompaction::CPU::scan(size, array2, array1); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), ""); + + zeroArray(size, array3); + //printf("cpu scan, non power-of-two: "); + StreamCompaction::CPU::scan(npot, array2, array1); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), ""); + + zeroArray(size, array3); + //printf("naive scan, power-of-two"); + StreamCompaction::Naive::scan(size, array3, array1); + printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), ""); + + zeroArray(size, array3); + //printf("naive scan, non-power-of-two"); + StreamCompaction::Naive::scan(npot, array3, array1); + printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), ""); + + zeroArray(size, array3); + //printf("work-efficient scan, power-of-two, optimized thread number"); + StreamCompaction::Efficient::scan(size, array3, array1); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), ""); + + zeroArray(size, array3); + //printf("work-efficient scan, non-power-of-two, optimized thread number"); + StreamCompaction::Efficient::scan(size, array3, array1); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), ""); + + zeroArray(size, array3); + //printf("work-efficient scan, power-of-two, unoptimized thread number"); + StreamCompaction::Efficient::scan(size, array3, array1, false); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), ""); + + zeroArray(size, array3); + //printf("work-efficient scan, non-power-of-two, unoptimized thread number"); + StreamCompaction::Efficient::scan(npot, array3, array1, false); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), ""); + + zeroArray(size, array3); + //printf("thrust scan, power-of-two"); + StreamCompaction::Thrust::scan(size, array3, array1); + printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), ""); + + zeroArray(size, array3); + //printf("thrust scan, non-power-of-two"); + StreamCompaction::Thrust::scan(npot, array3, array1); + printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), ""); + + printf("---------- compact ----------\n"); + + genArray(size - 1, array1, 4); // Leave a 0 at the end to test that edge case + array1[size - 1] = 0; + + zeroArray(size, array2); + //printf("cpu compact without scan, power-of-two"); + StreamCompaction::CPU::compactWithoutScan(size, array2, array1); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), ""); + + zeroArray(size, array3); + //printf("cpu compact without scan, non-power-of-two"); + StreamCompaction::CPU::compactWithoutScan(npot, array3, array1); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), ""); + + zeroArray(size, array3); + //printf("cpu compact with scan, power of two"); + StreamCompaction::CPU::compactWithScan(size, array3, array1); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), ""); + + zeroArray(size, array3); + //printf("cpu compact with scan, non-power of two"); + StreamCompaction::CPU::compactWithScan(npot, array3, array1); + 
+        printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "");
+
+        zeroArray(size, array3);
+        //printf("cpu compact with scan, power of two, optimized");
+        StreamCompaction::CPU::compactWithScanOnePass(size, array3, array1);
+        printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "");
+
+        zeroArray(size, array3);
+        //printf("cpu compact with scan, non-power of two, optimized");
+        StreamCompaction::CPU::compactWithScanOnePass(npot, array3, array1);
+        printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "");
+
+        zeroArray(size, array3);
+        //printf("work-efficient compact, power-of-two, optimized thread number");
+        StreamCompaction::Efficient::compact(size, array3, array1);
+        printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "");
+
+        zeroArray(size, array3);
+        //printf("work-efficient compact, non-power-of-two, optimized thread number");
+        StreamCompaction::Efficient::compact(npot, array3, array1);
+        printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "");
+
+        zeroArray(size, array3);
+        //printf("work-efficient compact, power-of-two, unoptimized thread number");
+        StreamCompaction::Efficient::compact(size, array3, array1, false);
+        printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "");
+
+        zeroArray(size, array3);
+        //printf("work-efficient compact, non-power-of-two, unoptimized thread number");
+        StreamCompaction::Efficient::compact(npot, array3, array1, false);
+        printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "");
+
+        printf("\n\n\n");
+
+        delete[] array1;
+        delete[] array2;
+        delete[] array3;
+    }
+    system("pause"); // stop Win32 console from closing on exit
+}
+
+
 int main(int argc, char* argv[]) {
+    //runtimeTest(6, 22, 2);
+
+    // Scan tests
+    {
+        printf("\n");
+        printf("****************\n");
+        printf("** SCAN TESTS **\n");
+        printf("****************\n");
+
+        genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case
+        a[SIZE - 1] = 0;
+        printArray(SIZE, a, true);
+
+        // initialize b using StreamCompaction::CPU::scan you implement
+        // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
+        // At first all cases passed because b && c are all zeroes.
+        zeroArray(SIZE, b);
+        printDesc("cpu scan, power-of-two");
+        StreamCompaction::CPU::scan(SIZE, b, a);
+        printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+        //printArray(SIZE, b, true);
+
+        zeroArray(SIZE, c);
+        printDesc("cpu scan, non-power-of-two");
+        StreamCompaction::CPU::scan(NPOT, c, a);
+        printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+        //printArray(NPOT, b, true);
+        printCmpResult(NPOT, b, c);
+
+        zeroArray(SIZE, c);
+        printDesc("naive scan, power-of-two");
+        StreamCompaction::Naive::scan(SIZE, c, a);
+        printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+        //printArray(SIZE, c, true);
+        printCmpResult(SIZE, b, c);
+
+        // For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
+        /*onesArray(SIZE, c);
+        printDesc("1s array for finding bugs");
+        StreamCompaction::Naive::scan(SIZE, c, a);
+        printArray(SIZE, c, true); */
+
+        zeroArray(SIZE, c);
+        printDesc("naive scan, non-power-of-two");
+        StreamCompaction::Naive::scan(NPOT, c, a);
+        printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+        //printArray(SIZE, c, true);
+        printCmpResult(NPOT, b, c);
+
+        zeroArray(SIZE, c);
+        printDesc("work-efficient scan, power-of-two, optimized thread number");
+        StreamCompaction::Efficient::scan(SIZE, c, a);
+        printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+        //printArray(SIZE, c, true);
+        printCmpResult(SIZE, b, c);
+
+        zeroArray(SIZE, c);
+        printDesc("work-efficient scan, non-power-of-two, optimized thread number");
+        StreamCompaction::Efficient::scan(NPOT, c, a);
+        printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+        //printArray(NPOT, c, true);
+        printCmpResult(NPOT, b, c);
+
+        zeroArray(SIZE, c);
+        printDesc("work-efficient scan, power-of-two, unoptimized thread number");
+        StreamCompaction::Efficient::scan(SIZE, c, a, false);
+        printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+        //printArray(SIZE, c, true);
+        printCmpResult(SIZE, b, c);
+
+        zeroArray(SIZE, c);
+        printDesc("work-efficient scan, non-power-of-two, unoptimized thread number");
+        StreamCompaction::Efficient::scan(NPOT, c, a, false);
+        printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+        //printArray(NPOT, c, true);
+        printCmpResult(NPOT, b, c);
+
+        zeroArray(SIZE, c);
+        printDesc("thrust scan, power-of-two");
+        StreamCompaction::Thrust::scan(SIZE, c, a);
+        printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+        //printArray(SIZE, c, true);
+        printCmpResult(SIZE, b, c);
+
+        zeroArray(SIZE, c);
+        printDesc("thrust scan, non-power-of-two");
+        StreamCompaction::Thrust::scan(NPOT, c, a);
+        printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+        //printArray(NPOT, c, true);
+        printCmpResult(NPOT, b, c);
+
+        printf("\n");
+        printf("*****************************\n");
+        printf("** STREAM COMPACTION TESTS **\n");
+        printf("*****************************\n");
+
+        // Compaction tests
+
+        genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case
+        a[SIZE - 1] = 0;
+        printArray(SIZE, a, true);
+
+        int count, expectedCount, expectedNPOT;
+
+        // initialize b using StreamCompaction::CPU::compactWithoutScan you implement
+        // We use b for further comparison. Make sure your StreamCompaction::CPU::compactWithoutScan is correct.
+        zeroArray(SIZE, b);
+        printDesc("cpu compact without scan, power-of-two");
+        count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
+        printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+        expectedCount = count;
+        //printArray(count, b, true);
+        printCmpLenResult(count, expectedCount, b, b);
+
+        zeroArray(SIZE, c);
+        printDesc("cpu compact without scan, non-power-of-two");
+        count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a);
+        printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+        expectedNPOT = count;
+        //printArray(count, c, true);
+        printCmpLenResult(count, expectedNPOT, b, c);
+
+        zeroArray(SIZE, c);
+        printDesc("cpu compact with scan");
+        count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
+        printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+        //printArray(count, c, true);
+        printCmpLenResult(count, expectedCount, b, c);
+
+        zeroArray(SIZE, c);
+        printDesc("cpu compact with scan and scatter in single loop");
+        count = StreamCompaction::CPU::compactWithScanOnePass(SIZE, c, a);
+        printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+        //printArray(count, c, true);
+        printCmpLenResult(count, expectedCount, b, c);
+
+        zeroArray(SIZE, c);
+        printDesc("work-efficient compact, power-of-two, optimized thread number");
+        count = StreamCompaction::Efficient::compact(SIZE, c, a);
+        printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+        //printArray(count, c, true);
+        printCmpLenResult(count, expectedCount, b, c);
+
+        zeroArray(SIZE, c);
+        printDesc("work-efficient compact, non-power-of-two, optimized thread number");
+        count = StreamCompaction::Efficient::compact(NPOT, c, a);
+        printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+        //printArray(count, c, true);
+        printCmpLenResult(count, expectedNPOT, b, c);
+
+        zeroArray(SIZE, c);
+        printDesc("work-efficient compact, power-of-two, unoptimized thread number");
+        count = StreamCompaction::Efficient::compact(SIZE, c, a, false);
+        printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+        //printArray(count, c, true);
+        printCmpLenResult(count, expectedCount, b, c);
+
+        zeroArray(SIZE, c);
+        printDesc("work-efficient compact, non-power-of-two, unoptimized thread number");
+        count = StreamCompaction::Efficient::compact(NPOT, c, a, false);
+        printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+        //printArray(count, c, true);
+        printCmpLenResult(count, expectedNPOT, b, c);
-    printf("\n");
-    printf("****************\n");
-    printf("** SCAN TESTS **\n");
-    printf("****************\n");
-
-    genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case
-    a[SIZE - 1] = 0;
-    printArray(SIZE, a, true);
-
-    // initialize b using StreamCompaction::CPU::scan you implement
-    // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
-    // At first all cases passed because b && c are all zeroes.
-    zeroArray(SIZE, b);
-    printDesc("cpu scan, power-of-two");
-    StreamCompaction::CPU::scan(SIZE, b, a);
-    printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
-    printArray(SIZE, b, true);
-
-    zeroArray(SIZE, c);
-    printDesc("cpu scan, non-power-of-two");
-    StreamCompaction::CPU::scan(NPOT, c, a);
-    printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
-    printArray(NPOT, b, true);
-    printCmpResult(NPOT, b, c);
-
-    zeroArray(SIZE, c);
-    printDesc("naive scan, power-of-two");
-    StreamCompaction::Naive::scan(SIZE, c, a);
-    printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
-    printCmpResult(SIZE, b, c);
-
-    /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
-    onesArray(SIZE, c);
-    printDesc("1s array for finding bugs");
-    StreamCompaction::Naive::scan(SIZE, c, a);
-    printArray(SIZE, c, true); */
-
-    zeroArray(SIZE, c);
-    printDesc("naive scan, non-power-of-two");
-    StreamCompaction::Naive::scan(NPOT, c, a);
-    printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
-    printCmpResult(NPOT, b, c);
-
-    zeroArray(SIZE, c);
-    printDesc("work-efficient scan, power-of-two");
-    StreamCompaction::Efficient::scan(SIZE, c, a);
-    printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
-    printCmpResult(SIZE, b, c);
-
-    zeroArray(SIZE, c);
-    printDesc("work-efficient scan, non-power-of-two");
-    StreamCompaction::Efficient::scan(NPOT, c, a);
-    printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(NPOT, c, true);
-    printCmpResult(NPOT, b, c);
-
-    zeroArray(SIZE, c);
-    printDesc("thrust scan, power-of-two");
-    StreamCompaction::Thrust::scan(SIZE, c, a);
-    printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
-    printCmpResult(SIZE, b, c);
-
-    zeroArray(SIZE, c);
-    printDesc("thrust scan, non-power-of-two");
-    StreamCompaction::Thrust::scan(NPOT, c, a);
-    printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(NPOT, c, true);
-    printCmpResult(NPOT, b, c);
-
-    printf("\n");
-    printf("*****************************\n");
-    printf("** STREAM COMPACTION TESTS **\n");
-    printf("*****************************\n");
-
-    // Compaction tests
-
-    genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case
-    a[SIZE - 1] = 0;
-    printArray(SIZE, a, true);
-
-    int count, expectedCount, expectedNPOT;
-
-    // initialize b using StreamCompaction::CPU::compactWithoutScan you implement
-    // We use b for further comparison. Make sure your StreamCompaction::CPU::compactWithoutScan is correct.
-    zeroArray(SIZE, b);
-    printDesc("cpu compact without scan, power-of-two");
-    count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
-    printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
-    expectedCount = count;
-    printArray(count, b, true);
-    printCmpLenResult(count, expectedCount, b, b);
-
-    zeroArray(SIZE, c);
-    printDesc("cpu compact without scan, non-power-of-two");
-    count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a);
-    printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
-    expectedNPOT = count;
-    printArray(count, c, true);
-    printCmpLenResult(count, expectedNPOT, b, c);
-
-    zeroArray(SIZE, c);
-    printDesc("cpu compact with scan");
-    count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
-    printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
-    printArray(count, c, true);
-    printCmpLenResult(count, expectedCount, b, c);
-
-    zeroArray(SIZE, c);
-    printDesc("work-efficient compact, power-of-two");
-    count = StreamCompaction::Efficient::compact(SIZE, c, a);
-    printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(count, c, true);
-    printCmpLenResult(count, expectedCount, b, c);
-
-    zeroArray(SIZE, c);
-    printDesc("work-efficient compact, non-power-of-two");
-    count = StreamCompaction::Efficient::compact(NPOT, c, a);
-    printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(count, c, true);
-    printCmpLenResult(count, expectedNPOT, b, c);
-
-    system("pause"); // stop Win32 console from closing on exit
-    delete[] a;
-    delete[] b;
-    delete[] c;
+        system("pause"); // stop Win32 console from closing on exit
+        delete[] a;
+        delete[] b;
+        delete[] c;
+    }
 }
diff --git a/src/testing_helpers.hpp b/src/testing_helpers.hpp
index 46337ab..d6cc4e3 100644
--- a/src/testing_helpers.hpp
+++ b/src/testing_helpers.hpp
@@ -1,8 +1,8 @@
 #pragma once
 
-#include 
-#include 
-#include 
+#include 
+#include 
+#include 
 #include 
 #include 
@@ -69,8 +69,8 @@ void printArray(int n, int *a, bool abridged = false) {
     printf("]\n");
 }
 
-template<typename T>
-void printElapsedTime(T time, std::string note = "")
-{
-    std::cout << "   elapsed time: " << time << "ms    " << note << std::endl;
+template<typename T>
+void printElapsedTime(T time, std::string note = "")
+{
+    std::cout << "   elapsed time: " << time << "ms    " << note << std::endl;
 }
\ No newline at end of file
diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt
index cdbef77..c8709e7 100644
--- a/stream_compaction/CMakeLists.txt
+++ b/stream_compaction/CMakeLists.txt
@@ -13,5 +13,5 @@ set(SOURCE_FILES
 
 cuda_add_library(stream_compaction
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_50
     )
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
index 8fc0211..23eacfa 100644
--- a/stream_compaction/common.cu
+++ b/stream_compaction/common.cu
@@ -18,12 +18,25 @@ void checkCUDAErrorFn(const char *msg, const char *file, int line) {
 
 namespace StreamCompaction {
    namespace Common {
+
+        /**
+         * Sets a single value in data to a certain value
+         */
+        __global__ void kernSetIndexInData(int n, int index, int value, int *data) {
+            if (index >= 0 && index < n) {
+                data[index] = value;
+            }
+        }
+
         /**
         * Maps an array to an array of 0s and 1s for stream compaction. Elements
        * which map to 0 will be removed, and elements which map to 1 will be kept.
        */
        __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
-            // TODO
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index < n) {
+                bools[index] = idata[index] != 0 ? 1 : 0;
+            }
        }
 
        /**
@@ -32,7 +45,12 @@
        */
        __global__ void kernScatter(int n, int *odata,
                const int *idata, const int *bools, const int *indices) {
-            // TODO
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index < n) {
+                if (bools[index] == 1) {
+                    odata[indices[index]] = idata[index];
+                }
+            }
        }
 
    }
diff --git a/stream_compaction/common.h b/stream_compaction/common.h
index 99a1b04..121c925 100644
--- a/stream_compaction/common.h
+++ b/stream_compaction/common.h
@@ -32,11 +32,15 @@ inline int ilog2ceil(int x) {
 
 namespace StreamCompaction {
    namespace Common {
+        __global__ void kernSetIndexInData(int n, int index, int value, int *data);
+
        __global__ void kernMapToBoolean(int n, int *bools, const int *idata);
 
        __global__ void kernScatter(int n, int *odata,
                const int *idata, const int *bools, const int *indices);
 
+        static int blockSize = 128;
+
        /**
         * This class is used for timing the performance
         * Uncopyable and unmovable
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index 05ce667..8cd66b3 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -1,15 +1,15 @@
 #include 
 #include "cpu.h"
-#include "common.h"
+#include "common.h"
 
 namespace StreamCompaction {
    namespace CPU {
-        using StreamCompaction::Common::PerformanceTimer;
-        PerformanceTimer& timer()
-        {
-            static PerformanceTimer timer;
-            return timer;
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
        }
 
        /**
@@ -19,7 +19,15 @@
         */
        void scan(int n, int *odata, const int *idata) {
            timer().startCpuTimer();
-            // TODO
+
+            if (n > 0) {
+                odata[0] = 0;
+
+                for (int i = 1; i < n; i++) {
+                    odata[i] = idata[i - 1] + odata[i - 1];
+                }
+            }
+
            timer().endCpuTimer();
        }
 
        /**
@@ -30,9 +38,17 @@
         */
        int compactWithoutScan(int n, int *odata, const int *idata) {
            timer().startCpuTimer();
-            // TODO
+
+            int count = 0;
+
+            for (int i = 0; i < n; i++) {
+                if (idata[i] != 0) {
+                    odata[count++] = idata[i];
+                }
+            }
+
            timer().endCpuTimer();
-            return -1;
+            return count;
        }
 
        /**
@@ -42,9 +58,57 @@
         */
        int compactWithScan(int n, int *odata, const int *idata) {
            timer().startCpuTimer();
-            // TODO
+
+            int count = 0;
+
+            if (n > 0) {
+
+                // first scan
+                odata[0] = 0;
+                for (int i = 1; i < n; i++) {
+                    odata[i] = (idata[i - 1] != 0 ? 1 : 0) + odata[i - 1];
+                }
+
+                // then scatter
+                for (int i = 0; i < n; i++) {
+                    if (idata[i] != 0) {
+                        odata[odata[i]] = idata[i];
+                        count++;
+                    }
+                }
+            }
+
            timer().endCpuTimer();
-            return -1;
+            return count;
        }
+
+        /**
+         * CPU stream compaction using scan and scatter in a single for loop
+         *
+         * @returns the number of elements remaining after compaction.
+         */
+        int compactWithScanOnePass(int n, int *odata, const int *idata) {
+            timer().startCpuTimer();
+
+            int count = 0;
+
+            if (n > 0) {
+                odata[0] = 0;
+                for (int i = 0; i < n; i++) {
+                    // first scan
+                    if (i < n - 1) {
+                        odata[i + 1] = (idata[i] != 0 ? 1 : 0) + odata[i];
+                    }
+                    // then scatter
+                    if (idata[i] != 0) {
+                        odata[odata[i]] = idata[i];
+                        count++;
+                    }
+                }
+            }
+
+            timer().endCpuTimer();
+            return count;
+        }
    }
 }
diff --git a/stream_compaction/cpu.h b/stream_compaction/cpu.h
index 236ce11..4573496 100644
--- a/stream_compaction/cpu.h
+++ b/stream_compaction/cpu.h
@@ -11,5 +11,7 @@ namespace StreamCompaction {
        int compactWithoutScan(int n, int *odata, const int *idata);
 
        int compactWithScan(int n, int *odata, const int *idata);
+
+        int compactWithScanOnePass(int n, int *odata, const int *idata);
    }
 }
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 36c5ef2..e3e7266 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -5,22 +5,172 @@
 
 namespace StreamCompaction {
    namespace Efficient {
-        using StreamCompaction::Common::PerformanceTimer;
-        PerformanceTimer& timer()
-        {
-            static PerformanceTimer timer;
-            return timer;
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
        }
 
+// #define DEBUG
+        int paddedData[1 << 16];  // host-side buffer for DEBUG dumps
+
+        __global__ void kernUpSweep(int n, int* buffer, int interval) {
+            int index = blockIdx.x * blockDim.x + threadIdx.x;
+            if (index < n && index % interval == 0) {
+                buffer[index + interval - 1] += buffer[index + (interval >> 1) - 1];
+            }
+        }
+
+        __global__ void kernDownSweep(int n, int* buffer, int interval, int smallInterval) {
+            int index = blockIdx.x * blockDim.x + threadIdx.x;
+            if (index < n && index % interval == 0) {
+                int t = buffer[index + smallInterval - 1];
+                buffer[index + smallInterval - 1] = buffer[index + interval - 1];
+                buffer[index + interval - 1] += t;
+            }
+        }
+
+        __global__ void kernEfficientUpSweep(int n, int* buffer, int interval) {
+            int index = (blockIdx.x * blockDim.x + threadIdx.x) * interval;
+            if (index < n) {
+                buffer[index + interval - 1] += buffer[index + (interval >> 1) - 1];
+            }
+        }
+
+        __global__ void kernEfficientDownSweep(int n, int* buffer, int interval, int smallInterval) {
+            int index = (blockIdx.x * blockDim.x + threadIdx.x) * interval;
+            if (index < n) {
+                int t = buffer[index + smallInterval - 1];
+                buffer[index + smallInterval - 1] = buffer[index + interval - 1];
+                buffer[index + interval - 1] += t;
+            }
+        }
+
        /**
         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
-        void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
+        void scan(int n, int *odata, const int *idata, bool efficient) {
+            // allocate a zero-padded copy of the input in global memory
+            int* dev_padded;
+
+            int logn = ilog2ceil(n);
+            int paddedSize = 1 << logn;
+            size_t originalSizeInBytes = n * sizeof(int);
+            size_t paddedSizeInBytes = paddedSize * sizeof(int);
+
+            cudaMalloc((void**)&dev_padded, paddedSizeInBytes);
+            checkCUDAError("cudaMalloc dev_padded failed!");
+
+            // initialize dev_padded to 0 and copy idata into it
+            cudaMemset(dev_padded, 0, paddedSizeInBytes);
+            checkCUDAError("cudaMemset failed");
+            cudaMemcpy(dev_padded, idata, originalSizeInBytes, cudaMemcpyHostToDevice);
+            checkCUDAError("cudaMemcpy from idata to dev_padded failed!");
+
+            timer().startGpuTimer();
+
+#ifdef DEBUG
+            printf("before start: [");
+            cudaMemcpy(paddedData, dev_padded, paddedSizeInBytes, cudaMemcpyDeviceToHost);
+            for (int i = 0; i < paddedSize; i++) {
+                printf("%d, ", paddedData[i]);
+            }
+            printf("] \n");
+#endif
+
+            scanHelper(paddedSize, logn, dev_padded, efficient);
+
+            timer().endGpuTimer();
+
+
+            // copy padded data back into odata
+            cudaMemcpy(odata, dev_padded, originalSizeInBytes, cudaMemcpyDeviceToHost);
+            checkCUDAError("cudaMemcpy from dev_padded to odata failed!");
+
+            // free the allocated array
+            cudaFree(dev_padded);
+            checkCUDAError("cudaFree on dev_padded failed");
        }
+
+        /**
+         * Helper function for scan that does the up-sweeps and down-sweeps
+         */
+        void scanHelper(int n, int logn, int* dev_buffer, bool efficient) {
+
+            if (efficient) {
+                for (int i = 0; i <= logn - 1; i++) {
+                    int interval = 1 << (i + 1);
+                    dim3 numBlocks(((n >> (i + 1)) + Common::blockSize - 1) / Common::blockSize);
+                    kernEfficientUpSweep << < numBlocks, Common::blockSize >> > (n, dev_buffer, interval);
+                    checkCUDAError("kernEfficientUpSweep failed!");
+                }
+            }
+            else {
+                dim3 fullBlocksPerGrid((n + Common::blockSize - 1) / Common::blockSize);
+
+                for (int i = 0; i <= logn - 1; i++) {
+                    int interval = 1 << (i + 1);
+                    kernUpSweep << < fullBlocksPerGrid, Common::blockSize >> > (n, dev_buffer, interval);
+                    checkCUDAError("kernUpSweep failed!");
+                }
+            }
+
+#ifdef DEBUG
+            printf("after up sweep: [");
+            cudaMemcpy(paddedData, dev_buffer, n * sizeof(int), cudaMemcpyDeviceToHost);
+            for (int i = 0; i < n; i++) {
+                printf("%d, ", paddedData[i]);
+            }
+            printf("] \n");
+#endif
+
+            // first set the last value to 0
+            cudaMemset(dev_buffer + n - 1, 0, sizeof(int));
+            checkCUDAError("cudaMemset on last element failed!");
+
+#ifdef DEBUG
+            printf("after setting last value: [");
+            cudaMemcpy(paddedData, dev_buffer, n * sizeof(int), cudaMemcpyDeviceToHost);
+            for (int i = 0; i < n; i++) {
+                printf("%d, ", paddedData[i]);
+            }
+            printf("] \n");
+#endif
+
+            // down sweep
+            if (efficient) {
+                for (int i = logn - 1; i >= 0; i--) {
+                    int smallInterval = 1 << i;
+                    int interval = 1 << (i + 1);
+                    dim3 numBlocks(((n >> (i + 1)) + Common::blockSize - 1) / Common::blockSize);
+                    kernEfficientDownSweep << < numBlocks, Common::blockSize >> > (n, dev_buffer, interval, smallInterval);
+                    checkCUDAError("kernEfficientDownSweep failed!");
+                }
+            }
+            else {
+                dim3 fullBlocksPerGrid((n + Common::blockSize - 1) / Common::blockSize);
+
+                for (int i = logn - 1; i >= 0; i--) {
+                    int smallInterval = 1 << i;
+                    int interval = 1 << (i + 1);
+                    kernDownSweep << < fullBlocksPerGrid, Common::blockSize >> > (n, dev_buffer, interval, smallInterval);
+                    checkCUDAError("kernDownSweep failed!");
+                }
+            }
+
+#ifdef DEBUG
+            printf("after downsweep: [");
+            cudaMemcpy(paddedData, dev_buffer, n * sizeof(int), cudaMemcpyDeviceToHost);
+            for (int i = 0; i < n; i++) {
+                printf("%d, ", paddedData[i]);
+            }
+            printf("] \n");
+#endif
+        }
+
+
        /**
         * Performs stream compaction on idata, storing the result into odata.
         * All zeroes are discarded.
@@ -30,11 +180,92 @@
         * @param idata  The array of elements to compact.
         * @returns      The number of elements remaining after compaction.
         */
-        int compact(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
-            return -1;
+        int compact(int n, int *odata, const int *idata, bool efficient) {
+            if (n <= 0) return -1;
+
+            int count = 0;
+
+            // ----------------------------------------------
+            // ---------- allocate global memory ------------
+            // ----------------------------------------------
+
+            int *dev_input, *dev_output, *dev_bools, *dev_indices;
+            int logn = ilog2ceil(n);
+            int paddedSize = 1 << logn;
+            size_t originalSizeInBytes = n * sizeof(int);
+            size_t paddedSizeInBytes = paddedSize * sizeof(int);
+            dim3 fullBlocksPerGrid((paddedSize + Common::blockSize - 1) / Common::blockSize);
+
+            cudaMalloc((void**)&dev_input, paddedSizeInBytes);
+            checkCUDAError("cudaMalloc dev_input failed!");
+            cudaMalloc((void**)&dev_output, paddedSizeInBytes);
+            checkCUDAError("cudaMalloc dev_output failed!");
+            cudaMalloc((void**)&dev_bools, paddedSizeInBytes);
+            checkCUDAError("cudaMalloc dev_bools failed!");
+            cudaMalloc((void**)&dev_indices, paddedSizeInBytes);
+            checkCUDAError("cudaMalloc dev_indices failed!");
+
+            // ----------------------------------------------------
+            // ---------- copy data into global memory ------------
+            // ----------------------------------------------------
+
+            // set dev_input to 0, then copy idata into it
+            cudaMemset(dev_input, 0, paddedSizeInBytes);
+            checkCUDAError("cudaMemset failed");
+            cudaMemcpy(dev_input, idata, originalSizeInBytes, cudaMemcpyHostToDevice);
+            checkCUDAError("cudaMemcpy from idata to dev_input failed!");
+
+            // ---------------------------------------
+            // ---------- start algorithm ------------
+            // ---------------------------------------
+
+            timer().startGpuTimer();
+
+            // turn input into boolean array
+            Common::kernMapToBoolean << < fullBlocksPerGrid, Common::blockSize >> > (paddedSize, dev_bools, dev_input);
+            checkCUDAError("kernMapToBoolean failed!");
+
+            // exclusive scan the boolean array
+            cudaMemcpy(dev_indices, dev_bools, paddedSizeInBytes, cudaMemcpyDeviceToDevice);
+            checkCUDAError("cudaMemcpy from dev_bools to dev_indices failed!");
+            scanHelper(paddedSize, logn, dev_indices, efficient);
+
+            // scatter
+            Common::kernScatter << < fullBlocksPerGrid, Common::blockSize >> > (paddedSize, dev_output, dev_input, dev_bools, dev_indices);
+            checkCUDAError("kernScatter failed!");
+
+            timer().endGpuTimer();
+
+
+            // -----------------------------------------------------
+            // ---------- read data from global memory ------------
+            // -----------------------------------------------------
+
+            // first, copy dev_bools into odata to get the count of non-zero elements
+            cudaMemcpy(odata, dev_bools, originalSizeInBytes, cudaMemcpyDeviceToHost);
+            checkCUDAError("cudaMemcpy from dev_bools to odata failed!");
+            for (int i = 0; i < n; i++) {
+                if (odata[i] != 0) count++;
+            }
+
+            // finally, copy dev_output into odata
+            cudaMemcpy(odata, dev_output, originalSizeInBytes, cudaMemcpyDeviceToHost);
+            checkCUDAError("cudaMemcpy from dev_output to odata failed!");
+
+            // ------------------------------------------
+            // ---------- free global memory ------------
+            // ------------------------------------------
+
+            cudaFree(dev_input);
+            checkCUDAError("cudaFree on dev_input failed");
+            cudaFree(dev_output);
+            checkCUDAError("cudaFree on dev_output failed");
+
+            cudaFree(dev_bools);
+            checkCUDAError("cudaFree on dev_bools failed");
+            cudaFree(dev_indices);
+            checkCUDAError("cudaFree on dev_indices failed");
+
+            return count;
        }
    }
 }
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
index 803cb4f..419bf2c 100644
--- a/stream_compaction/efficient.h
+++ b/stream_compaction/efficient.h
@@ -6,8 +6,10 @@ namespace StreamCompaction {
    namespace Efficient {
        StreamCompaction::Common::PerformanceTimer& timer();
 
-        void scan(int n, int *odata, const int *idata);
+        void scan(int n, int *odata, const int *idata, bool efficient = true);
 
-        int compact(int n, int *odata, const int *idata);
+        void scanHelper(int n, int logn, int* dev_buffer, bool efficient);
+
+        int compact(int n, int *odata, const int *idata, bool efficient = true);
    }
 }
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 9218f8e..dfd2368 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -5,21 +5,90 @@
 
 namespace StreamCompaction {
    namespace Naive {
-        using StreamCompaction::Common::PerformanceTimer;
-        PerformanceTimer& timer()
-        {
-            static PerformanceTimer timer;
-            return timer;
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
        }
 
-        // TODO: __global__
+        __global__ void kernNaiveScan(int n, int* input, int* output, int interval) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index < n) {
+                if (index < interval) {
+                    output[index] = input[index];
+                }
+                else {
+                    output[index] = input[index - interval] + input[index];
+                }
+            }
+        }
+
+        __global__ void kernMakeExclusive(int n, int* input, int* output) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index < n) {
+                if (index == 0) {
+                    output[index] = 0;
+                }
+                else {
+                    output[index] = input[index - 1];
+                }
+            }
+        }
 
        /**
         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
        void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
+            if (n <= 0) return;
+
+            // allocate 2 arrays on global memory
+            int* dev_buffer1;
+            int* dev_buffer2;
+            size_t sizeInBytes = n * sizeof(int);
+
+            cudaMalloc((void**)&dev_buffer1, sizeInBytes);
+            checkCUDAError("cudaMalloc dev_buffer1 failed!");
+
+            cudaMalloc((void**)&dev_buffer2, sizeInBytes);
+            checkCUDAError("cudaMalloc dev_buffer2 failed!");
+
+            // copy the data into global memory
+            cudaMemcpy(dev_buffer1, idata, sizeInBytes, cudaMemcpyHostToDevice);
+            checkCUDAError("cudaMemcpy from idata to dev_buffer1 failed!");
+            cudaMemcpy(dev_buffer2, dev_buffer1, sizeInBytes, cudaMemcpyDeviceToDevice);
+            checkCUDAError("cudaMemcpy from dev_buffer1 to dev_buffer2 failed!");
+
+            dim3 fullBlocksPerGrid((n + Common::blockSize - 1) / Common::blockSize);
+
+            timer().startGpuTimer();
+
+            int ceiling = ilog2ceil(n);
+            int** input = &dev_buffer1;
+            int** output = &dev_buffer2;
+
+            for (int i = 1; i <= ceiling; i++) {
+                int interval = 1 << (i - 1);
+                kernNaiveScan << < fullBlocksPerGrid, Common::blockSize >> > (n, *input, *output, interval);
+                checkCUDAError("kernNaiveScan failed");
+                std::swap(input, output);
+            }
+
+            // shift the output array to the right by 1
+            kernMakeExclusive << < fullBlocksPerGrid, Common::blockSize >> > (n, *input, *output);
+            checkCUDAError("kernMakeExclusive failed");
+
            timer().endGpuTimer();
+
+            // copy data back
+            cudaMemcpy(odata, *output, sizeInBytes, cudaMemcpyDeviceToHost);
+            checkCUDAError("cudaMemcpy from output to odata failed!");
+
+            // free the allocated arrays
+            cudaFree(dev_buffer1);
+            checkCUDAError("cudaFree on dev_buffer1 failed");
+            cudaFree(dev_buffer2);
+            checkCUDAError("cudaFree on dev_buffer2 failed");
        }
    }
 }
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 36b732d..d1ad099 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -8,21 +8,26 @@
 
 namespace StreamCompaction {
    namespace Thrust {
-        using StreamCompaction::Common::PerformanceTimer;
-        PerformanceTimer& timer()
-        {
-            static PerformanceTimer timer;
-            return timer;
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
        }
 
        /**
         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
        void scan(int n, int *odata, const int *idata) {
+            thrust::device_vector<int> dv_in(idata, idata + n);
+            thrust::device_vector<int> dv_out(n, 0);
+
            timer().startGpuTimer();
-            // TODO use `thrust::exclusive_scan`
-            // example: for device_vectors dv_in and dv_out:
-            // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+
+            thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+
            timer().endGpuTimer();
+
+            cudaMemcpy(odata, dv_out.data().get(), n * sizeof(int), cudaMemcpyDeviceToHost);
        }
    }
 }