CIS565-Fall-2020 · giaosame · Sep 22, 2020 · Sep 22, 2020 · Sep 23, 2020 · Sep 23, 2020
diff --git a/README.md b/README.md
@@ -3,12 +3,145 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Qiaosen Chen
+  * [LinkedIn](https://www.linkedin.com/in/qiaosen-chen-725699141/), etc.
+* Tested on: Windows 10, i5-9400 @ 2.90GHz 16GB, GeForce RTX 2060 6GB (personal computer).
 
-### (TODO: Your README)
+## Implemented Features
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+This project includes several scan and stream compaction algorithms, and most of them are implemented with parallelism in CUDA. With the serial version of algorithms run on CPU as comparison, we can have a better view on the performance of parallel algorithm run on GPU. 
+
+- CPU Scan & Stream Compaction
+
+- Naive GPU Scan Algorithm
+
+- Work-Efficient GPU Scan & Stream Compaction
+
+- Thrust Scan
+
+- Radix Sort (Extra Credit)
+
+  Please see the details in the last part of the report.
+
+## Performance Analysis
+
+### Performances under Different Block Size 
+
+![Different Block Size](https://github.com/giaosame/Project2-Stream-Compaction/blob/master/img/different_blocksize_perf.png)
+
+In general, when ```blockSize = 128```, the parallel version of scan algorithms and compact algorithms could achieve a relative optimized performance.
+
+### Scan Algorithms Performances
+
+![Scan Algorithms Performances](https://github.com/giaosame/Project2-Stream-Compaction/blob/master/img/different_arraysize_scan_perf.png)
+
+This performance analysis tested when ```blockSize = 256```, and all algorithms taken into accounts are given the input with an power-of-two array size.
+
+When the input array size is small (```SIZE < 2^15```), the difference of performances is small and not obvious for all the scan algorithms, but the serial version of algorithm run on CPU performs better than those parallel version of algorithms run on GPU. 
+
+When the input array size is large enough (```SIZE > 2^17```), the difference of performances becomes larger and larger, and apparently at this time, ```Thrust::Scan``` performs best among all algorithms. As expected, the ```CPU::Scan``` algorithm performs much worse than ```Thrust::Scan```, it is even worse than ```Naive::Scan```algorithm. However, it is quite weird that, the naive scan algorithm always runs faster than the work-efficient scan algorithm, because the so-called "efficient" work-efficient scan algorithm can still get optimized.  
+
+### Compact Algorithm Performances
+
+![Compact Algorithm Performances](https://github.com/giaosame/Project2-Stream-Compaction/blob/master/img/different_arraysize_compact_perf.png)
+
+When the input array size is large enough (```SIZE > 2^17```), the difference of performances become more and more obvious, and both of the compact algorithms perform much worse as the input size increases, as expected. The serial version of compact algorithm run on CPU performs better than the parallel version run in GPU when the input size is small, however, when the input size is quite huge, such as ```SIZE = 2^20```, there is no doubt that the work-efficient compact algorithm run in parallel on GPU perform much better than the CPU version.
+
+### Output
+
+This output tests were based on an array ```SIZE = 1024``` and ```blockSize = 128```:
+
+```bash
+****************
+** SCAN TESTS **
+****************
+    [  46  46  37   6  29   0  28  22  25  23   3  11  29 ...  20   0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 0.0013ms    (std::chrono Measured)
+    [   0  46  92 129 135 164 164 192 214 239 262 265 276 ... 24775 24795 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 0.0007ms    (std::chrono Measured)
+    [   0  46  92 129 135 164 164 192 214 239 262 265 276 ... 24725 24751 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 0.02192ms    (CUDA Measured)
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 0.021952ms    (CUDA Measured)
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 0.046304ms    (CUDA Measured)
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 0.045856ms    (CUDA Measured)
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 0.038912ms    (CUDA Measured)
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 0.038176ms    (CUDA Measured)
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [   2   0   1   2   3   2   0   0   1   1   3   1   1 ...   2   0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 0.0026ms    (std::chrono Measured)
+    [   2   1   2   3   2   1   1   3   1   1   1   2   1 ...   2   2 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 0.0024ms    (std::chrono Measured)
+    [   2   1   2   3   2   1   1   3   1   1   1   2   1 ...   1   2 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 0.0054ms    (std::chrono Measured)
+    [   2   1   2   3   2   1   1   3   1   1   1   2   1 ...   2   2 ]
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 0.062976ms    (CUDA Measured)
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 0.052992ms    (CUDA Measured)
+    passed
+```
+
+## Extra Credit
+
+- **Radix sort**
+
+  I defined the several function used by Radix Sort in [radix.h](https://github.com/giaosame/Project2-Stream-Compaction/blob/master/stream_compaction/radix.h) and implemented it in [radix.cu](https://github.com/giaosame/Project2-Stream-Compaction/blob/master/stream_compaction/radix.cu) under the directory [/stream_compaction](https://github.com/giaosame/Project2-Stream-Compaction/tree/master/stream_compaction). In [main.cpp](https://github.com/giaosame/Project2-Stream-Compaction/blob/master/src/main.cpp), I called this function ```StreamCompaction::Radix::sort``` in the last of the ```main``` function.
+
+  ```c++
+  zeroArray(SIZE, c);
+  printDesc("radix sort, power-of-two");
+  StreamCompaction::CPU::sort(SIZE, b, a);
+  StreamCompaction::Radix::sort(SIZE, c, a);
+  printCmpResult(SIZE, b, c);
+
+  zeroArray(SIZE, c);
+  printDesc("radix sort, non-power-of-two");
+  StreamCompaction::CPU::sort(NPOT, b, a);
+  StreamCompaction::Radix::sort(NPOT, c, a);
+  printCmpResult(NPOT, b, c);
+  ```
+
+  Examples of output of Radix Sort:
+
+  ```bash
+  **********************
+  ** RADIX SORT TESTS **
+  **********************
+      [  10  31  19  93  79  96  60  46  46  85  44  56  52  53  85  39 ]
+  ==== radix sort, power-of-two ====
+      [  10  19  31  39  44  46  46  52  53  56  60  79  85  85  93  96 ]
+      passed
+  ==== radix sort, non-power-of-two ====
+      [  10  19  31  44  46  46  52  56  60  79  85  93  96 ]
+      passed
+  ```
+
+
+
+
 
diff --git a/img/different_arraysize_compact_perf.png b/img/different_arraysize_compact_perf.png
diff --git a/img/different_arraysize_scan_perf.png b/img/different_arraysize_scan_perf.png
diff --git a/img/different_blocksize_perf.png b/img/different_blocksize_perf.png
diff --git a/src/main.cpp b/src/main.cpp
@@ -10,10 +10,11 @@
 #include <stream_compaction/cpu.h>
 #include <stream_compaction/naive.h>
 #include <stream_compaction/efficient.h>
+#include <stream_compaction/radix.h>
 #include <stream_compaction/thrust.h>
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 5; // feel free to change the size of array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];
@@ -137,16 +138,36 @@ int main(int argc, char* argv[]) {
     printDesc("work-efficient compact, power-of-two");
     count = StreamCompaction::Efficient::compact(SIZE, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(count, c, true);
     printCmpLenResult(count, expectedCount, b, c);
 
     zeroArray(SIZE, c);
     printDesc("work-efficient compact, non-power-of-two");
     count = StreamCompaction::Efficient::compact(NPOT, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(count, c, true);
     printCmpLenResult(count, expectedNPOT, b, c);
 
+    printf("\n");
+    printf("**********************\n");
+    printf("** RADIX SORT TESTS **\n");
+    printf("**********************\n");
+
+    genArray(SIZE, a, 100);  // Leave a 0 at the end to test that edge case
+    printArray(SIZE, a, true);
+
+    zeroArray(SIZE, c);
+    printDesc("radix sort, power-of-two");
+    StreamCompaction::CPU::sort(SIZE, b, a);
+    StreamCompaction::Radix::sort(SIZE, c, a);
+    printArray(SIZE, c, true);
+    printCmpResult(SIZE, b, c);
+
+    zeroArray(SIZE, c);
+    printDesc("radix sort, non-power-of-two");
+    StreamCompaction::CPU::sort(NPOT, b, a);
+    StreamCompaction::Radix::sort(NPOT, c, a);
+    printArray(NPOT, c, true);
+    printCmpResult(NPOT, b, c);
+
     system("pause"); // stop Win32 console from closing on exit
     delete[] a;
     delete[] b;

diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt
@@ -4,6 +4,7 @@ set(headers
     "naive.h"
     "efficient.h"
     "thrust.h"
+    "radix.h"
     )
 
 set(sources
@@ -12,6 +13,7 @@ set(sources
     "naive.cu"
     "efficient.cu"
     "thrust.cu"
+    "radix.cu"
     )
 
 list(SORT headers)

diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
@@ -14,26 +14,48 @@ void checkCUDAErrorFn(const char *msg, const char *file, int line) {
     exit(EXIT_FAILURE);
 }
 
-
 namespace StreamCompaction {
     namespace Common {
 
+        // Initialize array in gpu
+        __global__ void kernInitializeArray(int n, int* a, int value)
+        {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index < n)
+            {
+                a[index] = value;
+            }
+        }
+
         /**
          * Maps an array to an array of 0s and 1s for stream compaction. Elements
          * which map to 0 will be removed, and elements which map to 1 will be kept.
          */
         __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
-            // TODO
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= n)
+            {
+                return;
+            }
+
+            bools[index] = idata[index] != 0 ? 1 : 0;
         }
 
         /**
          * Performs scatter on an array. That is, for each element in idata,
-         * if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
+         * if bools[index] == 1, it copies idata[index] to odata[indices[index]].
          */
-        __global__ void kernScatter(int n, int *odata,
-                const int *idata, const int *bools, const int *indices) {
-            // TODO
-        }
+        __global__ void kernScatter(int n, int* odata, const int* idata, const int* bools, const int* indices) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= n)
+            {
+                return;
+            }
 
+            if (bools[index])
+            {
+                odata[indices[index]] = idata[index];
+            }
+        }
     }
 }
diff --git a/stream_compaction/common.h b/stream_compaction/common.h
@@ -2,7 +2,6 @@
 
 #include <cuda.h>
 #include <cuda_runtime.h>
-
 #include <cstdio>
 #include <cstring>
 #include <cmath>
@@ -13,6 +12,9 @@
 #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
 #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
 
+// Block size used for CUDA kernel launch
+#define blockSize 256
+
 /**
  * Check for CUDA errors; print and exit if there was a problem.
  */
@@ -32,6 +34,8 @@ inline int ilog2ceil(int x) {
 
 namespace StreamCompaction {
     namespace Common {
+        __global__ void kernInitializeArray(int n, int* a, int value);
+
         __global__ void kernMapToBoolean(int n, int *bools, const int *idata);
 
         __global__ void kernScatter(int n, int *odata,

diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
@@ -19,7 +19,14 @@ namespace StreamCompaction {
          */
         void scan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+
+            int prefixSum = 0;
+            for (int i = 0; i < n; i++)
+            {
+                odata[i] = prefixSum;
+                prefixSum += idata[i];
+            }
+
             timer().endCpuTimer();
         }
 
@@ -30,9 +37,16 @@ namespace StreamCompaction {
          */
         int compactWithoutScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+
+            int cnt = 0;
+            for (int i = 0; i < n; i++)
+            {
+                if (idata[i] != 0)
+                    odata[cnt++] = idata[i];
+            }
+
             timer().endCpuTimer();
-            return -1;
+            return cnt;
         }
 
         /**
@@ -41,10 +55,40 @@ namespace StreamCompaction {
          * @returns the number of elements remaining after compaction.
          */
         int compactWithScan(int n, int *odata, const int *idata) {
+            // Compute temporary array and run exclusive scan on temporary array
+            int prefixSum = 0;
+            int* tdata = new int[n];
+            int* sdata = new int[n];
+
             timer().startCpuTimer();
-            // TODO
+            for (int i = 0; i < n; i++)
+            {
+                tdata[i] = idata[i] != 0 ? 1 : 0;
+                sdata[i] = prefixSum;
+                prefixSum += tdata[i];
+            }
+
+            // Scatter
+            int idx = 0;
+            for (int i = 0; i < n; i++)
+            {
+                if (tdata[i])
+                {
+                    idx = sdata[i];
+                    odata[idx] = idata[i];
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+            return idx + 1;
+        }
+
+        void sort(int n, int* odata, const int* idata)
+        {
+            for (int i = 0; i < n; i++)
+            {
+                odata[i] = idata[i];
+            }
+            std::sort(odata, odata + n);
         }
     }
 }
diff --git a/stream_compaction/cpu.h b/stream_compaction/cpu.h
@@ -8,6 +8,8 @@ namespace StreamCompaction {
 
         void scan(int n, int *odata, const int *idata);
 
+        void sort(int n, int* odata, const int* idata);
+
         int compactWithoutScan(int n, int *odata, const int *idata);
 
         int compactWithScan(int n, int *odata, const int *idata);