diff --git a/README.md b/README.md
index 0e38ddb..f0d22f9 100644
--- a/README.md
+++ b/README.md
@@ -3,12 +3,271 @@
CUDA Stream Compaction
**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
-* (TODO) YOUR NAME HERE
- * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+## SIREESHA PUTCHA
+
+* [LinkedIn](https://www.linkedin.com/in/sireesha-putcha/)
-### (TODO: Your README)
+* [Facebook](https://www.facebook.com/sireesha.putcha98/)
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+* [Portfolio](https://sites.google.com/view/sireeshaputcha/home)
+* [Mail](mailto:sireesha@seas.upenn.edu)
+
+* Tested on a personal computer: Microsoft Windows 10 Pro,
+Processor: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz, 2601 MHz, 6 Core(s), 12 Logical Processor(s)
+
+GPU: NVIDIA GeForce RTX 2060
+
+## OUTPUT
+* SCAN
+
+![Scan Functions output](img/scanres.png)
+
+* STREAM COMPACTION
+
+![Stream Compaction output](img/scres.png)
+
+# ALGORITHM OVERVIEW
+
+This project implements GPU stream compaction in CUDA from scratch. Stream compaction is a widely used GPU primitive for removing unwanted elements from a data stream in parallel.
+
+The stream compaction implementations in this project simply remove `0`s
+from an array of `int`s.
+
+This project is meant to reorient our algorithmic thinking to the way of the GPU. On GPUs, many
+algorithms can benefit from massive parallelism and, in particular, data
+parallelism: executing the same code many times simultaneously with different
+data.
+
+The project includes a few different versions of the *Scan* (*Prefix Sum*)
+algorithm: first a CPU implementation, then the "naive" and
+"work-efficient" GPU implementations. Finally, these scans are used to build GPU stream compaction.
+
+**References**
+* The [slides on Parallel Algorithms](https://docs.google.com/presentation/d/1ETVONA7QDM-WqsEj4qVOGD6Kura5I6E9yqH-7krnwZ0/edit#slide=id.p126)
+for Scan, Stream Compaction, and Work-Efficient Parallel Scan.
+* GPU Gems 3, Chapter 39 - [Parallel Prefix Sum (Scan) with CUDA](https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html).
+  - This online version contains a few small errors (in superscripting, missing braces, bad indentation, etc.)
+  - We maintain a fix for this at [GPU Gem 3 Ch 39 Patch](https://github.com/CIS565-Fall-2017/Project2-Stream-Compaction/blob/master/INSTRUCTION.md#gpu-gem-3-ch-39-patch).
+* [Algorithm Examples](https://github.com/CIS565-Fall-2017/Project2-Stream-Compaction/blob/master/INSTRUCTION.md#algorithm-examples).
+* [Recitation slides](https://docs.google.com/presentation/d/1daOnWHOjMp1sIqMdVsNnvEU1UYynKcEMARc_W6bGnqE/edit?usp=sharing)
+
+The GPU stream compaction implementation lives inside the `stream_compaction` subproject.
+
+## Scan
+Scan is also called *all prefix sums*.
+
+![All prefix sums](img/allprefixsums.png)
+
+There are two types of scan:
+
+* Exclusive scan: element j of the result does not include element j of the input.
+
+* Inclusive scan: all elements up to and including element j are summed and written to element j of the output.
+
+![Exclusive and Inclusive Scans](img/exandinc.png)
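+
+As a concrete reference for the two flavors, a serial exclusive scan takes only a few lines; this mirrors what `StreamCompaction::CPU::scan` in this project does (a minimal sketch, bounds checks omitted, and the wrapper name `exclusiveScanCPU` is just for illustration):
+
+```cpp
+// Exclusive scan: odata[i] = sum of idata[0 .. i-1], with odata[0] = 0.
+// An inclusive scan would instead write odata[i] = sum of idata[0 .. i].
+void exclusiveScanCPU(int n, int* odata, const int* idata) {
+    if (n <= 0) return;
+    odata[0] = 0;
+    for (int i = 1; i < n; ++i) {
+        odata[i] = odata[i - 1] + idata[i - 1];
+    }
+}
+```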
+
+### Ways to implement Scan
+
+1) Naive Parallel Scan: in each pass, every thread reads two values and writes one partial sum.
+
+![Naive Scan](img/naive.png)
+
+2) Work-Efficient Parallel Scan: this scan treats the array as a balanced binary tree and runs in two phases:
+
+- Upsweep: this phase is similar to a parallel reduction. When it finishes, the sum of all elements is stored in the last element.
+
+![Upsweep](img/upsweep.png)
+
+- Downsweep: traverse back down the tree, using the partial sums to build the scan in place.
+
+![Downsweep](img/downsweep.png)
+
+## Stream Compaction
+Given an array, create a new array containing only the elements that meet a certain criterion. There are three steps in this process:
+
+1) Compute a temporary array of 0s and 1s based on the given condition.
+
+2) Run an exclusive scan on this temporary array.
+
+3) Scatter the elements: for each element whose value in the temporary array is 1, use the scan result as the index at which to write it into the final output array.
+
+![SC](img/sc.png)
+
+## Algorithm Examples
+
+* Scan:
+  - goal: produce a prefix sum array of a given array (we only care about exclusive scan here)
+  - input
+    - [1 5 0 1 2 0 3]
+  - output
+    - [0 1 6 6 7 9 9]
+* Stream Compaction:
+  - goal: closely and neatly pack the elements != 0
+  - input
+    - [1 5 0 1 2 0 3]
+  - output
+    - [1 5 1 2 3]
+* compactWithoutScan (CPU)
+  - an implementation of compact, so the goal, input, and output are the same as above
+  - simply loop through the input array while maintaining a pointer to the address where the next non-zero element should be written
+* compactWithScan (CPU/GPU)
+  - an implementation of compact, so the goal, input, and output are the same as above
+  - 3 steps
+    - map
+      + goal: map our original data array (integer, Light Ray, etc.) to a bool array
+      + input
+        - [1 5 0 1 2 0 3]
+      + output
+        - [1 1 0 1 1 0 1]
+    - scan
+      + take the output of the last step as input
+      + input
+        - [1 1 0 1 1 0 1]
+      + output
+        - [0 1 2 2 3 4 4]
+    - scatter
+      + preserve the non-zero elements and compact them into a new array
+      + input:
+        + original array
+          - [1 5 0 1 2 0 3]
+        + mapped array
+          - [1 1 0 1 1 0 1]
+        + scanned array
+          - [0 1 2 2 3 4 4]
+      + output:
+        - [1 5 1 2 3]
+      + This can be done in parallel on the GPU
+      + You can try multi-threading on the CPU if you want (not required and not our focus)
+      + for each element input[i] in the original array
+        - if it is non-zero (given by the mapped array)
+        - then put it at output[index], where index = scanned[i]
+
+# PROJECT OVERVIEW
+
+1) CPU Scan and Stream Compaction Implementation
+`stream_compaction/cpu.cu`:
+
+* `StreamCompaction::CPU::scan`: compute an exclusive prefix sum. For performance comparison.
+* `StreamCompaction::CPU::compactWithoutScan`: stream compaction without using
+the `scan` function.
+* `StreamCompaction::CPU::compactWithScan`: stream compaction using the `scan`
+function. Map the input array to an array of 0s and 1s, scan it, and use
+scatter to produce the output.
+
+2) Thrust Implementation of Scan:
+
+Performs a scan using `thrust::exclusive_scan(first, last, result)`.
+
+3) Naive GPU Scan:
+
+Computes the exclusive scan of the input array. The outer loop doubles the stride (depth) each iteration.
+In each pass we build an inclusive scan: every element adds in the partial sum located one stride to its left, reading from the previous pass's output.
+We then convert the inclusive scan to an exclusive one by shifting the results right by one and setting the first element of the output to zero.
+The passes ping-pong between two device buffers.
+
+4) Work-Efficient GPU Scan:
+
+We perform a work-efficient scan on the GPU based on the balanced-binary-tree model discussed above. If the input length is not a power of two, we first pad the array with zeros up to the next power of two.
+We then run the upsweep phase, which is equivalent to a parallel reduction, set the last element to zero, and run the downsweep phase, traversing back down the tree. A simplified sketch of this structure follows.
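+
+The host-side structure of the work-efficient scan looks roughly like the sketch below (simplified from `stream_compaction/efficient.cu` further down in this diff; error checking is trimmed, the last-element clear is done with `cudaMemset` here instead of the project's small kernel, and the wrapper name `workEfficientScanDevice` is only for this sketch):
+
+```cuda
+#define blockSize 128
+
+// One upsweep level: each active thread adds its left child into its right
+// child. `offset` is the stride between sibling nodes at this level.
+__global__ void kernUpsweep(int d, int offset, int n, int* idata) {
+    int index = threadIdx.x + blockIdx.x * blockDim.x;
+    if (index >= n) return;
+    if (index < d) {
+        int a = offset * (2 * index + 1) - 1;
+        int b = offset * (2 * index + 2) - 1;
+        idata[b] += idata[a];
+    }
+}
+
+// One downsweep level: swap the left child with its parent's running total and
+// add the old left child into the right child, building the scan in place.
+__global__ void kernDownsweep(int d, int offset, int n, int* idata) {
+    int index = threadIdx.x + blockIdx.x * blockDim.x;
+    if (index >= n) return;
+    if (index < d) {
+        int a = offset * (2 * index + 1) - 1;
+        int b = offset * (2 * index + 2) - 1;
+        int t = idata[a];
+        idata[a] = idata[b];
+        idata[b] += t;
+    }
+}
+
+// dev holds N ints on the device, where N is a power of two and any slots past
+// the real input length have already been zero-padded.
+void workEfficientScanDevice(int N, int* dev) {
+    dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);
+    int offset = 1;
+    for (int d = N >> 1; d > 0; d >>= 1) {            // upsweep (reduction)
+        kernUpsweep<<<fullBlocksPerGrid, blockSize>>>(d, offset, N, dev);
+        offset *= 2;
+    }
+    cudaMemset(dev + N - 1, 0, sizeof(int));          // clear the root
+    for (int d = 1; d < N; d *= 2) {                  // downsweep
+        offset >>= 1;
+        kernDownsweep<<<fullBlocksPerGrid, blockSize>>>(d, offset, N, dev);
+    }
+}
+```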
+
+# PERFORMANCE ANALYSIS
+
+## Analysis
+
+* Performance comparison of the GPU scan implementations (Naive, Work-Efficient, and Thrust) against the serial CPU scan, with block size on the x-axis.
+Array size: 2^12; time is in milliseconds (lower is better). The optimal block size for my system is 128.
+
+![Scan Implementations Performance Block Comparison Bar chart](img/graph_blocksizecomp.png)
+
+* Performance comparison of the GPU scan implementations (Naive, Work-Efficient, and Thrust) against the serial CPU scan, with array size on the x-axis.
+Block size: 128; time is in milliseconds.
+
+![Scan Implementations Performance Arr Comparison Line Graph](img/graph_scancomp.png)
+
+![Scan Implementations Performance Arr Comparison Bar chart](img/bar_scancomp.png)
+
+* A brief explanation of the phenomena seen here:
+
+Comparing the four approaches (Naive, Work-Efficient, Thrust, and CPU), the Thrust method clearly gives the best results.
+The CPU implementation is very efficient for small input arrays, but
+as the size grows (2^19) it becomes the worst performer of the four. Thrust is very consistent across the input sizes used,
+staying around 0.003-0.004 ms even at an array size of 2^19. The naive implementation beats the work-efficient implementation for
+smaller arrays, but the work-efficient version is the better of the two for larger ones: with small arrays many of its threads go unused, while with larger inputs
+its advantage grows. The naive scan, in contrast, performs O(n log n) additions in total, so its cost grows faster than linearly with the input size.
+
+* Can you find the performance bottlenecks? Is it memory I/O? Computation? Is
+it different for each implementation?
+
+A performance bottleneck for the GPU implementations in general is the use of global memory. Global memory reads and writes are expensive, with latencies on the order of hundreds of cycles per access.
+One optimization would be shared memory: instead of working entirely out of device buffers in global memory, staging the input data in the much faster on-chip shared memory
+should improve the time taken by the work-efficient scan. For the naive scan, the extra O(n log n) additions become an additional bottleneck as the input size increases.
+
+## Output
+
+Array size: 2^12. Block size: 128.
+
+```
+****************
+** SCAN TESTS **
+****************
+ [ 6 33 12 21 45 1 15 6 16 34 19 17 46 ... 47 0 ]
+==== cpu scan, power-of-two ====
+ elapsed time: 0.0089ms (std::chrono Measured)
+ [ 0 6 39 51 72 117 118 133 139 155 189 208 225 ... 100600 100647 ]
+==== cpu scan, non-power-of-two ====
+ elapsed time: 0.0087ms (std::chrono Measured)
+ [ 0 6 39 51 72 117 118 133 139 155 189 208 225 ... 100491 100522 ]
+ passed
+==== naive scan, power-of-two ====
+ elapsed time: 0.06496ms (CUDA Measured)
+ passed
+==== naive scan, non-power-of-two ====
+ elapsed time: 0.063104ms (CUDA Measured)
+ passed
+==== work-efficient scan, power-of-two ====
+ elapsed time: 0.129024ms (CUDA Measured)
+ [ 0 6 39 51 72 117 118 133 139 155 189 208 225 ... 
100600 100647 ] + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 0.129024ms (CUDA Measured) + [ 0 6 39 51 72 117 118 133 139 155 189 208 225 ... 100491 100522 ] + passed +==== thrust scan, power-of-two ==== + elapsed time: 0.003328ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.00336ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 3 0 2 2 2 3 2 3 2 2 1 3 3 ... 0 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 0.0197ms (std::chrono Measured) + [ 3 2 2 2 3 2 3 2 2 1 3 3 3 ... 2 2 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.0195ms (std::chrono Measured) + [ 3 2 2 2 3 2 3 2 2 1 3 3 3 ... 2 2 ] + passed +==== cpu compact with scan ==== + elapsed time: 0.0535ms (std::chrono Measured) + [ 3 2 2 2 3 2 3 2 2 1 3 3 3 ... 2 2 ] + passed +==== work-efficient compact, power-of-two ==== +Work Efficient SC count is 3091 + elapsed time: 0.374784ms (CUDA Measured) + [ 3 2 2 2 3 2 3 2 2 1 3 3 3 ... 2 2 ] + passed +==== work-efficient compact, non-power-of-two ==== +Work Efficient SC count is 3091 + elapsed time: 0.433312ms (CUDA Measured) + [ 3 2 2 2 3 2 3 2 2 1 3 3 3 ... 2 2 ] + passed +``` + diff --git a/img/Logos/chat.png b/img/Logos/chat.png new file mode 100644 index 0000000..622d9b5 Binary files /dev/null and b/img/Logos/chat.png differ diff --git a/img/Logos/facebook.png b/img/Logos/facebook.png new file mode 100644 index 0000000..14d5f18 Binary files /dev/null and b/img/Logos/facebook.png differ diff --git a/img/Logos/linkedin.png b/img/Logos/linkedin.png new file mode 100644 index 0000000..3fbdd62 Binary files /dev/null and b/img/Logos/linkedin.png differ diff --git a/img/Logos/mail.png b/img/Logos/mail.png new file mode 100644 index 0000000..a3faa8a Binary files /dev/null and b/img/Logos/mail.png differ diff --git a/img/algvals.png b/img/algvals.png new file mode 100644 index 0000000..b77aecc Binary files /dev/null and b/img/algvals.png differ diff --git a/img/allprefixsums.png b/img/allprefixsums.png new file mode 100644 index 0000000..e536e1d Binary files /dev/null and b/img/allprefixsums.png differ diff --git a/img/bar_scancomp.png b/img/bar_scancomp.png new file mode 100644 index 0000000..e9ff4da Binary files /dev/null and b/img/bar_scancomp.png differ diff --git a/img/blocksizevals.png b/img/blocksizevals.png new file mode 100644 index 0000000..1f4c899 Binary files /dev/null and b/img/blocksizevals.png differ diff --git a/img/downsweep.png b/img/downsweep.png new file mode 100644 index 0000000..5de0d50 Binary files /dev/null and b/img/downsweep.png differ diff --git a/img/exandinc.png b/img/exandinc.png new file mode 100644 index 0000000..98b1263 Binary files /dev/null and b/img/exandinc.png differ diff --git a/img/graph_blocksizecomp.png b/img/graph_blocksizecomp.png new file mode 100644 index 0000000..68a2545 Binary files /dev/null and b/img/graph_blocksizecomp.png differ diff --git a/img/graph_scancomp.png b/img/graph_scancomp.png new file mode 100644 index 0000000..15ecc24 Binary files /dev/null and b/img/graph_scancomp.png differ diff --git a/img/naive.png b/img/naive.png new file mode 100644 index 0000000..dabb2fc Binary files /dev/null and b/img/naive.png differ diff --git a/img/sc.png b/img/sc.png new file mode 100644 index 0000000..464ea98 Binary files /dev/null and b/img/sc.png differ diff --git a/img/scanres.png b/img/scanres.png new file mode 100644 index 0000000..dc28ca1 Binary 
files /dev/null and b/img/scanres.png differ diff --git a/img/scres.png b/img/scres.png new file mode 100644 index 0000000..b0641f9 Binary files /dev/null and b/img/scres.png differ diff --git a/img/upsweep.png b/img/upsweep.png new file mode 100644 index 0000000..2aba540 Binary files /dev/null and b/img/upsweep.png differ diff --git a/img/wescan.png b/img/wescan.png new file mode 100644 index 0000000..fa695f1 Binary files /dev/null and b/img/wescan.png differ diff --git a/src/main.cpp b/src/main.cpp index 896ac2b..9282003 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -11,9 +11,12 @@ #include #include #include +#include +#include #include "testing_helpers.hpp" +//#define SHARED_MEMORY 1 -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE = 1 << 10; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int *a = new int[SIZE]; int *b = new int[SIZE]; @@ -71,14 +74,14 @@ int main(int argc, char* argv[]) { printDesc("work-efficient scan, power-of-two"); StreamCompaction::Efficient::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, non-power-of-two"); StreamCompaction::Efficient::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); @@ -137,16 +140,35 @@ int main(int argc, char* argv[]) { printDesc("work-efficient compact, power-of-two"); count = StreamCompaction::Efficient::compact(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedCount, b, c); zeroArray(SIZE, c); printDesc("work-efficient compact, non-power-of-two"); count = StreamCompaction::Efficient::compact(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); + +#ifdef SHARED_MEMORY + //Added by Sireesha + zeroArray(SIZE, c); + printDesc("work-efficient compact using Shared Memory, power-of-two"); + count = StreamCompaction::Efficient::compactSharedMemory(SIZE, c, a); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + printArray(count, c, true); + printCmpLenResult(count, expectedCount, b, c); + + zeroArray(SIZE, c); + printDesc("work-efficient compact using Shared Memory, non-power-of-two"); + count = StreamCompaction::Efficient::compactSharedMemory(NPOT, c, a); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + printArray(count, c, true); + printCmpLenResult(count, expectedNPOT, b, c); + +#endif SHARED_MEMORY + system("pause"); // stop Win32 console from closing on exit delete[] a; delete[] b; diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 2ed6d63..97d4830 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -24,6 +24,18 @@ namespace StreamCompaction { */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { // 
TODO + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } + if (idata[index] != 0) + { + bools[index] = 1; + } + else + { + bools[index] = 0; + } } /** @@ -33,6 +45,15 @@ namespace StreamCompaction { __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { // TODO + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } + if (bools[index] == 1) + { + odata[indices[index]] = idata[index]; + } + } } diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 719fa11..f459f3e 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -4,47 +4,135 @@ #include "common.h" namespace StreamCompaction { - namespace CPU { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; - } - - /** - * CPU scan (prefix sum). - * For performance analysis, this is supposed to be a simple for loop. - * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first. - */ - void scan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); - } - - /** - * CPU stream compaction without using the scan function. - * - * @returns the number of elements remaining after compaction. - */ - int compactWithoutScan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); - return -1; - } - - /** - * CPU stream compaction using scan and scatter, like the parallel version. - * - * @returns the number of elements remaining after compaction. - */ - int compactWithScan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); - return -1; - } - } + namespace CPU { + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; + } + + /** + * CPU scan (prefix sum). + * For performance analysis, this is supposed to be a simple for loop. + * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first. + */ + void scan(int n, int* odata, const int* idata) { + timer().startCpuTimer(); + // TODO + //Exclusive Scan + if (odata && idata && n > 0) + { + odata[0] = 0; + for (int i = 1; i < n; ++i) + { + odata[i] = odata[i - 1] + idata[i - 1]; + } + } + timer().endCpuTimer(); + } + + /** + * CPU stream compaction without using the scan function. + * + * @returns the number of elements remaining after compaction. + */ + int compactWithoutScan(int n, int* odata, const int* idata) { + timer().startCpuTimer(); + // TODO + int count = 0; + if (odata && idata && n > 0) + { + for (int i = 0; i < n; ++i) + { + if (idata[i] != 0) + { + odata[count] = idata[i]; + count++; + } + } + } + timer().endCpuTimer(); + return count; + } + + //Scan Helper function for compactWithScan + void scanHelper(int n, int* odata, const int* idata) { + // TODO + //Exclusive Scan + if (odata && idata && n > 0) + { + odata[0] = 0; + for (int i = 1; i < n; ++i) + { + odata[i] = odata[i - 1] + idata[i - 1]; + } + } + } + + /** + * CPU stream compaction using scan and scatter, like the parallel version. + * + * @returns the number of elements remaining after compaction. 
+ */ + int compactWithScan(int n, int* odata, const int* idata) { + timer().startCpuTimer(); + + // TODO + int count = 0; + int* tempArray = new int[n]; + int* scanResult = new int[n]; + + if (odata && idata && n > 0) + { + //Compute temp array with 1s and 0s + for (int i = 0; i < n; ++i) + { + if (idata[i]) + { + tempArray[i] = 1; + } + else + { + tempArray[i] = 0; + } + } + + //Scan + scanHelper(n, scanResult, tempArray); + + + //printf("CPU Steam Compaction scanResult Output is ["); + //for (int i = 0; i < count; i++) + //{ + // printf(" %d ", scanResult[i]); + //} + //printf("] \n"); + + //Scatter + for (int i = 0; i < n; ++i) + { + if (tempArray[i] == 1) + { + odata[scanResult[i]] = idata[i]; + count++; + } + } + + //printf("CPU SC Count is %d \n", count); + //printf("CPU Steam Compaction Output is ["); + //for (int i = 0; i < count; i++) + //{ + // printf(" %d ", odata[i]); + //} + //printf("] \n"); + } + timer().endCpuTimer(); + + delete[] tempArray; + delete[] scanResult; + + return count++; + } + } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 2db346e..8e7ac9b 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -2,6 +2,9 @@ #include #include "common.h" #include "efficient.h" +#define GLM_FORCE_CUDA +#define blockSize 128 +#define SHARED_MEMORY 1 namespace StreamCompaction { namespace Efficient { @@ -12,15 +15,223 @@ namespace StreamCompaction { return timer; } + //For scan + int* dev; + + //For Stream compaction + int* dev_SC; + int* dev_tempArray; + int* dev_scanResult; + int* dev_finalData; + + + //-----------------------------------------------------------------// + //-----------------------SCAN HELPERS -----------------------------// + //-----------------------------------------------------------------// + + __global__ void kernFillExtraZeros(int POT, int n, int* idata) + { + int index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= n) + return; + if (index > n && index < POT) + { + idata[index] = 0; + } + } + + __global__ void kernUpsweep(int d, int offset, int n, int* idata) + { + int index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= n) + return; + + if (index < d) + { + int a = offset * (2 * index + 1) - 1; + int b = offset * (2 * index + 2) - 1; + idata[b] += idata[a]; + } + + } + + __global__ void kernClearLastElem(int n, int* idata) + { + int index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= n) + return; + + if (index == 0) + { + idata[n - 1] = 0; + } + } + + __global__ void kernDownsweep(int d, int offset, int n, int* idata) + { + int index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= n) + return; + if (index < d) + { + int a = offset * (2 * index + 1) - 1; + int b = offset * (2 * index + 2) - 1; + float t = idata[a]; + idata[a] = idata[b]; + idata[b] += t; + } + + } + + //--------------------------------------------------------// + //-----------------------SCAN ----------------------------// + //--------------------------------------------------------// + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ - void scan(int n, int *odata, const int *idata) { + void scan(int n, int *odata, const int *idata) { + // TODO + int POT = 0; + int N = 0; + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + + //If n is not a power of 2, get closest power of 2 and fill the rest of the array with zeroes. 
+ if (((1 << ilog2ceil(n)) - n) > 0) + { + POT = 1 << ilog2ceil(n); + cudaMalloc((void**)&dev, POT * sizeof(int)); + checkCUDAErrorFn("cudaMalloc dev failed!"); + + cudaMemcpy(dev, idata, n * sizeof(int), cudaMemcpyHostToDevice); + checkCUDAErrorFn("cudaMemcpy dev failed!"); + + kernFillExtraZeros << > > (POT, n, dev); + N = POT; + } + //Else just allocate the device buffer on the gpu + else + { + cudaMalloc((void**)&dev, n * sizeof(int)); + checkCUDAErrorFn("cudaMalloc dev failed!"); + + cudaMemcpy(dev, idata, n * sizeof(int), cudaMemcpyHostToDevice); + checkCUDAErrorFn("cudaMemcpy dev failed!"); + N = n; + } + timer().startGpuTimer(); - // TODO + + //Upsweep + int offset = 1; + for (int d = N >> 1; d > 0; d >>= 1) + { + kernUpsweep << > > (d, offset, N, dev); + checkCUDAErrorFn("kernUpsweep failed!"); + offset *= 2; + } + + //Clear Last Element + kernClearLastElem << > > (N, dev); + checkCUDAErrorFn("kernClearLastElem failed!"); + + //DownSweep + //traverse down tree & build scan + for (int d = 1; d < N; d *= 2) + { + offset >>= 1; + kernDownsweep << > > (d, offset, N, dev); + checkCUDAErrorFn("kernDownsweep failed!"); + } timer().endGpuTimer(); + + //Write back values to host from device + cudaMemcpy(odata, dev, n * sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAErrorFn("cudaMemcpy dev to odata failed!"); + + cudaFree(dev); + + } + + __global__ void kernComputeTempArray(int n, int* idata, int* tempArray) + { + int index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= n) + return; + + if (idata[index] != 0) + { + tempArray[index] = 1; + } + else + { + tempArray[index] = 0; + } + } + + void scanHelper(int n, int* odata, const int* idata) { + // TODO + int POT = 0; + int N = 0; + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + + //If n is not a power of 2, get closest power of 2 and fill the rest of the array with zeroes. + if (((1 << ilog2ceil(n)) - n) > 0) + { + POT = 1 << ilog2ceil(n); + cudaMalloc((void**)&dev, POT * sizeof(int)); + checkCUDAErrorFn("cudaMalloc dev failed!"); + + cudaMemcpy(dev, idata, n * sizeof(int), cudaMemcpyHostToDevice); + checkCUDAErrorFn("cudaMemcpy dev failed!"); + + kernFillExtraZeros << > > (POT, n, dev); + N = POT; + } + //Else just allocate the device buffer on the gpu + else + { + cudaMalloc((void**)&dev, n * sizeof(int)); + checkCUDAErrorFn("cudaMalloc dev failed!"); + + cudaMemcpy(dev, idata, n * sizeof(int), cudaMemcpyHostToDevice); + checkCUDAErrorFn("cudaMemcpy dev failed!"); + N = n; + } + + //Upsweep + int offset = 1; + for (int d = N >> 1; d > 0; d >>= 1) + { + kernUpsweep << > > (d, offset, N, dev); + checkCUDAErrorFn("kernUpsweep failed!"); + offset *= 2; + } + + //Clear Last Element + kernClearLastElem << > > (N, dev); + checkCUDAErrorFn("kernClearLastElem failed!"); + + //DownSweep + //traverse down tree & build scan + for (int d = 1; d < N; d *= 2) + { + offset >>= 1; + kernDownsweep << > > (d, offset, N, dev); + checkCUDAErrorFn("kernDownsweep failed!"); + } + + //Write back values to host from device + cudaMemcpy(odata, dev, n * sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAErrorFn("cudaMemcpy dev to odata failed!"); + + cudaFree(dev); } + //----------------------------------------------------------// + //-----------------------COMPACT----------------------------// + //----------------------------------------------------------// + /** * Performs stream compaction on idata, storing the result into odata. * All zeroes are discarded. 
@@ -30,11 +241,318 @@ namespace StreamCompaction { * @param idata The array of elements to compact. * @returns The number of elements remaining after compaction. */ + int compact(int n, int *odata, const int *idata) { - timer().startGpuTimer(); + // TODO + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + + cudaMalloc((void**)&dev_SC, n * sizeof(int)); + checkCUDAErrorFn("cudaMalloc dev_SC failed!"); + + cudaMalloc((void**)&dev_tempArray, n * sizeof(int)); + checkCUDAErrorFn("cudaMalloc dev_tempArray failed!"); + + cudaMalloc((void**)&dev_scanResult, n * sizeof(int)); + checkCUDAErrorFn("cudaMalloc dev_scanResult failed!"); + + cudaMalloc((void**)&dev_finalData, n * sizeof(int)); + checkCUDAErrorFn("cudaMalloc dev_finalData failed!"); + + //Will these be needed? + int* tempArray = new int[n]; + int* scanResult = new int[n]; + + cudaMemcpy(dev_SC, idata, n * sizeof(int), cudaMemcpyHostToDevice); + checkCUDAErrorFn("cudaMemcpy idata to dev_SC failed!"); + + timer().startGpuTimer(); + + //Compute a temp array with 1s and 0s + Common::kernMapToBoolean << > > (n, dev_tempArray, dev_SC); + checkCUDAErrorFn("kernComputeTempArray failed!"); + + //Copy the tempArray to the host Temparray + cudaMemcpy(tempArray, dev_tempArray, n * sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAErrorFn("cudaMemcpy tempArray to dev_tempArray failed!"); + + //Scan + scanHelper(n, scanResult, tempArray); + + //Copy the scan result into the device scanResult array + cudaMemcpy(dev_scanResult, scanResult, n * sizeof(int), cudaMemcpyHostToDevice); + checkCUDAErrorFn("cudaMemcpy scanResult to dev_scanResult failed!"); + + //Scatter + Common::kernScatter << > > (n, dev_finalData, dev_SC, dev_tempArray, dev_scanResult); + checkCUDAErrorFn("kernScatter scanResult to dev_scanResult failed!"); + timer().endGpuTimer(); - return -1; + + int count = 0; + if (idata[n - 1] != 0) + { + count = scanResult[n - 1] + 1; + } + else + { + count = scanResult[n - 1]; + } + + //Copy data back from device to host + cudaMemcpy(odata, dev_finalData, count * sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAErrorFn("cudaMemcpy dev_finalData to odata failed!"); + + printf("Work Efficient SC count is %d \n", count); + //printf("Work Efficient Steam Compaction Output is ["); + //for (int i = 0; i < 15; i++) + //{ + // if (odata[i]) + // { + // printf(" %d ,", odata[i]); + // } + //} + //printf("] \n"); + + //Delete the heap allocated host memory and free the cuda device memory + delete[] tempArray; + delete[] scanResult; + cudaFree(dev_SC); + cudaFree(dev_tempArray); + cudaFree(dev_scanResult); + cudaFree(dev_finalData); + + return count; } + + //-----------------------------------------------------------------------------// + //-----------------------S-H-A-R-E-D--M-E-M-O-R-Y------------------------------// + //-----------------------------------------------------------------------------// + +#ifdef SHARED_MEMORY + + __device__ void dev_kernFillExtraZeros(int POT, int n, int* idata) + { + int index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= n) + return; + if (index > n && index < POT) + { + idata[index] = 0; + } + } + + __device__ void dev_kernUpsweep(int d, int offset, int n, int* idata) + { + int index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= n) + return; + + if (index < d) + { + int a = offset * (2 * index + 1) - 1; + int b = offset * (2 * index + 2) - 1; + idata[b] += idata[a]; + } + + } + + __device__ void dev_kernClearLastElem(int n, int* idata) + { + int index = threadIdx.x + blockIdx.x * blockDim.x; + if 
(index >= n) + return; + + if (index == 0) + { + idata[n - 1] = 0; + } + } + + __device__ void dev_kernDownsweep(int d, int offset, int n, int* idata) + { + int index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= n) + return; + if (index < d) + { + int a = offset * (2 * index + 1) - 1; + int b = offset * (2 * index + 2) - 1; + float t = idata[a]; + idata[a] = idata[b]; + idata[b] += t; + } + + } + + __device__ void dev_kernWriteToSharedMemory(int n, int* shMem, const int* ip) + { + int index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= n) + return; + shMem[index] = ip[index]; + } + + __device__ void dev_kernWriteFromSharedMemory(int n, int* shMem, int* op) + { + int index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= n) + return; + op[index] = shMem[index]; + } + + __device__ int ilog2(int x) { + int lg = 0; + while (x >>= 1) { + ++lg; + } + return lg; + } + + __device__ int ilog2ceil(int x) { + return x == 1 ? 0 : ilog2(x - 1) + 1; + } + + __global__ void kernScanHelperSharedMemory(int n, int* odata, const int* idata) + { + int POT = 0; + int N = 0; + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + extern __shared__ int temp[]; + + //Write to Shared Memory + dev_kernWriteToSharedMemory(n, temp, idata); + + //If n is not a power of 2, get closest power of 2 and fill the rest of the array with zeroes. + if (((1 << ilog2ceil(n)) - n) > 0) + { + POT = 1 << ilog2ceil(n); + dev_kernFillExtraZeros(POT, n, temp); + N = POT; + } + else + { + N = n; + } + + //Upsweep + int offset = 1; + for (int d = N >> 1; d > 0; d >>= 1) + { + __syncthreads(); + dev_kernUpsweep(d, offset, N, temp); + offset *= 2; + } + + //Clear Last Element + dev_kernClearLastElem(N, temp); + + //DownSweep + //traverse down tree & build scan + for (int d = 1; d < N; d *= 2) + { + offset >>= 1; + __syncthreads(); + dev_kernDownsweep(d, offset, N, temp); + } + __syncthreads(); + + //Write back values to host from device + dev_kernWriteFromSharedMemory(n, temp, odata); + + } + + //Work efficient Stream Compaction using shared memory + int compactSharedMemory(int n, int* odata, const int* idata) + { + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + + cudaMalloc((void**)&dev_SC, n * sizeof(int)); + checkCUDAErrorFn("cudaMalloc dev_SC failed!"); + + cudaMalloc((void**)&dev_tempArray, n * sizeof(int)); + checkCUDAErrorFn("cudaMalloc dev_tempArray failed!"); + + cudaMalloc((void**)&dev_scanResult, n * sizeof(int)); + checkCUDAErrorFn("cudaMalloc dev_scanResult failed!"); + + cudaMalloc((void**)&dev_finalData, n * sizeof(int)); + checkCUDAErrorFn("cudaMalloc dev_finalData failed!"); + + + cudaMemcpy(dev_SC, idata, n * sizeof(int), cudaMemcpyHostToDevice); + checkCUDAErrorFn("cudaMemcpy idata to dev_SC failed!"); + + timer().startGpuTimer(); + + //Compute a temp array with 1s and 0s + Common::kernMapToBoolean << > > (n, dev_tempArray, dev_SC); + checkCUDAErrorFn("kernComputeTempArray failed!"); + + //Copy the tempArray to the host Temparray + //cudaMemcpy(tempArray, dev_tempArray, n * sizeof(int), cudaMemcpyDeviceToHost); + //checkCUDAErrorFn("cudaMemcpy tempArray to dev_tempArray failed!"); + + //Scan + //Calling a device function with __global__ signature + kernScanHelperSharedMemory << > > (n, dev_scanResult, dev_tempArray); + checkCUDAErrorFn("kernScanHelperSharedMemory failed!"); + + //Copy the scan result into the device scanResult array + //cudaMemcpy(dev_scanResult, scanResult, n * sizeof(int), cudaMemcpyHostToDevice); + //checkCUDAErrorFn("cudaMemcpy scanResult to 
dev_scanResult failed!"); + + //Scatter + Common::kernScatter << > > (n, dev_finalData, dev_SC, dev_tempArray, dev_scanResult); + checkCUDAErrorFn("kernScatter scanResult to dev_scanResult failed!"); + + timer().endGpuTimer(); + + int* scanRes = new int[n]; + cudaMemcpy(scanRes, dev_scanResult, n * sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAErrorFn("cudaMemCpy dev_scanres to scanres failed!"); + + printf("\n Scan Result is : \n ["); + for (int i = 0; i < n; i++) + { + printf("%d ,", scanRes[i]); + } + printf("] \n"); + + int count = 0; + if (idata[n - 1] != 0) + { + count = dev_scanResult[n - 1] + 1; + } + else + { + count = dev_scanResult[n - 1]; + } + + //Copy data back from device to host + cudaMemcpy(odata, dev_finalData, count * sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAErrorFn("cudaMemcpy dev_finalData to odata failed!"); + + printf("Work Efficient SC count is %d \n", count); + + //Delete the heap allocated host memory and free the cuda device memory + delete[] scanRes; + cudaFree(dev_SC); + cudaFree(dev_tempArray); + cudaFree(dev_scanResult); + cudaFree(dev_finalData); + + return count; + } + +#else + +int compactSharedMemory(int n, int* odata, const int* idata) +{ + return 1; +} + +#endif // SHARED_MEMORY } } + diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h index 803cb4f..da47c04 100644 --- a/stream_compaction/efficient.h +++ b/stream_compaction/efficient.h @@ -9,5 +9,6 @@ namespace StreamCompaction { void scan(int n, int *odata, const int *idata); int compact(int n, int *odata, const int *idata); + int compactSharedMemory(int n, int* odata, const int* idata); } } diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 4308876..a4873f1 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -2,6 +2,9 @@ #include #include "common.h" #include "naive.h" +#include +#define GLM_FORCE_CUDA +#define blockSize 128 namespace StreamCompaction { namespace Naive { @@ -11,15 +14,123 @@ namespace StreamCompaction { static PerformanceTimer timer; return timer; } - // TODO: __global__ + + int* dev_A; + int* dev_B; + + // TODO: + //This kernal performs an inclusive scan on the in array and stores it in the out array + __global__ void kernNaiveParallelScanInclusive(int offset, int n, int* in, int* out) + { + int index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= n) + return; + + if (index >= offset) + { + out[index] = in[index - offset] + in[index]; + } + else + { + out[index] = in[index]; + } + } + + //This kernal shifts the elements to right and sets the first elem to 0 + //Inclusive to exclusive + __global__ void kernConvertToExclusive(int n, int* in, int* out) + { + int index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= n) + return; + if (index == 0) + { + out[0] = 0; + } + else + { + out[index] = in[index - 1]; + } + } /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
*/ void scan(int n, int *odata, const int *idata) { + + //Allocate memory on device + cudaMalloc((void**)&dev_A, n * sizeof(int)); + checkCUDAErrorFn("cudaMalloc A failed!"); + cudaMalloc((void**)&dev_B, n * sizeof(int)); + checkCUDAErrorFn("cudaMalloc B failed!"); + + //Copy idata from host into device A and device B + cudaMemcpy(dev_A, idata, n * sizeof(int), cudaMemcpyHostToDevice); + checkCUDAErrorFn("cudaMemcpy host to device A failed!"); + cudaMemcpy(dev_B, idata, n * sizeof(int), cudaMemcpyHostToDevice); + checkCUDAErrorFn("cudaMemcpy host to device B failed!"); + + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + timer().startGpuTimer(); // TODO + //Loop similar to the implementation in + //http://users.umiacs.umd.edu/~ramani/cmsc828e_gpusci/ScanTalk.pdf + //Look at Page 10 + for (int d = 1; d < n; d <<= 1) + { + kernNaiveParallelScanInclusive << > > (d, n, dev_A, dev_B); + checkCUDAErrorFn("kernNaiveParallelScanInclusive failed!"); + std::swap(dev_A, dev_B); + } + kernConvertToExclusive << > > (n, dev_A, dev_B); + checkCUDAErrorFn("kernConvertToExclusive failed!"); + timer().endGpuTimer(); + + //Copy back device B into host odata + cudaMemcpy(odata, dev_B, n * sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAErrorFn("cudaMemcpy device B to host failed!"); + + //for (int i = 0; i < 10; ++i) + //{ + // printf(" odata is %d \n", odata[i]); + //} + + //Free the device memory + cudaFree(dev_A); + cudaFree(dev_B); + } + + //This version can handle arrays only as large as can be processed by a single thread block running on + //one multiprocessor of a GPU. + //__global__ void scanSharedMem(float* g_odata, float* g_idata, int n) + //{ + // extern __shared__ float temp[]; // allocated on invocation + // int thid = threadIdx.x; + // int pout = 0, pin = 1; // Load input into shared memory. + // // This is exclusive scan, so shift right by one + // // and set first element to 0 + + // temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0; + // __syncthreads(); + + // for (int offset = 1; offset < n; offset *= 2) + // { + // pout = 1 - pout; // swap double buffer indices + // pin = 1 - pout; + // if (thid >= offset) + // { + // temp[pout * n + thid] += temp[pin * n + thid - offset]; + // } + // else + // { + // temp[pout * n + thid] = temp[pin * n + thid]; + // } + // __syncthreads(); + // } + // g_odata[thid] = temp[pout*n+thid]; // write output + //} } } diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index 1def45e..fc5b629 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -18,11 +18,26 @@ namespace StreamCompaction { * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + + thrust::host_vector dv_in(idata, idata + n); + thrust::host_vector dv_out(odata, odata + n); + //thrust::device_vector devin(dv_in); + //thrust::device_vector devout(dv_out); + + timer().startGpuTimer(); + // TODO use `thrust::exclusive_scan` // example: for device_vectors dv_in and dv_out: - // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + + thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + //This is causing an exception + //thrust::exclusive_scan(devin.begin(), devin.end(), devout.begin()); + timer().endGpuTimer(); + + thrust::copy(dv_out.begin(), dv_out.end(), odata); } } }
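
A note on the Thrust path above: `StreamCompaction::Thrust::scan` builds `thrust::host_vector`s, and its comments mention that the `device_vector` variant raised an exception. For reference, a device-side version would look roughly like the following hypothetical sketch (not the project's code; the `<int>` template arguments and the wrapper name `thrustScanDevice` are assumptions):

```cpp
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/copy.h>

// Copy the input to the GPU, run the exclusive scan there, and copy back.
void thrustScanDevice(int n, int* odata, const int* idata) {
    thrust::device_vector<int> dv_in(idata, idata + n);   // host -> device copy
    thrust::device_vector<int> dv_out(n);
    thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
    thrust::copy(dv_out.begin(), dv_out.end(), odata);    // device -> host copy
}
```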