133 changes: 127 additions & 6 deletions README.md
@@ -3,12 +3,133 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Jilin Liu
* [LinkedIn](https://www.linkedin.com/in/jilin-liu97/), [twitter](https://twitter.com/Jilin18043110).
* Tested on: Windows 10, i7-8750H @ 2.20GHz, 16GB, GTX 1050Ti 4096MB (personal)

### (TODO: Your README)
## Features and Results

This project implements several parallel algorithms, including exclusive scan and stream compaction. These algorithms are useful as independent building blocks for later projects such as a GPU-based path tracer. A serial CPU version of each algorithm is also implemented and used as a baseline for performance comparison.

Features:
1. CPU Scan
2. CPU Stream Compact
3. Naive GPU Scan
4. Work-Efficient Scan
5. Work-Efficient Stream Compact

Extra Credit:
1. Faster GPU Implementation
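
The core of the naive GPU scan is a single kernel launched log2(n) times, doubling the add offset each pass. The sketch below is illustrative only; the kernel and buffer names (`kernNaiveScanStep`, `odata`, `idata`) are stand-ins rather than the exact identifiers used in this repository.

```
// One pass of the naive (Hillis-Steele) scan, launched once per offset
// d = 1, 2, 4, ..., ping-ponging between two device buffers on the host side.
// (Illustrative sketch; names are not taken from this repository.)
__global__ void kernNaiveScanStep(int n, int offset, int *odata, const int *idata) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index >= n) {
        return;
    }
    if (index >= offset) {
        odata[index] = idata[index - offset] + idata[index];
    } else {
        odata[index] = idata[index];
    }
}
```

This produces an inclusive scan; shifting the result right by one and inserting the identity (0) yields the exclusive scan that the tests below expect.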

```
****************
** SCAN TESTS **
****************
[ 30 44 27 31 9 34 15 27 24 39 44 33 38 ... 18 0 ]
==== cpu scan, power-of-two ====
elapsed time: 17.3105ms (std::chrono Measured)
[ 0 30 74 101 132 141 175 190 217 241 280 324 357 ... 102764441 102764459 ]
==== cpu scan, non-power-of-two ====
elapsed time: 2.102ms (std::chrono Measured)
[ 0 30 74 101 132 141 175 190 217 241 280 324 357 ... 102764362 102764372 ]
passed
==== naive scan, power-of-two ====
elapsed time: 8.62189ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 8.62202ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 3.79942ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 3.77037ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.560736ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.639808ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 0 0 1 3 3 0 1 1 2 3 0 3 2 ... 0 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 9.2661ms (std::chrono Measured)
[ 1 3 3 1 1 2 3 3 2 3 2 3 3 ... 2 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 9.253ms (std::chrono Measured)
[ 1 3 3 1 1 2 3 3 2 3 2 3 3 ... 2 2 ]
passed
==== cpu compact with scan ====
elapsed time: 24.8881ms (std::chrono Measured)
[ 1 3 3 1 1 2 3 3 2 3 2 3 3 ... 2 3 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 5.13638ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 5.15882ms (CUDA Measured)
passed
```

## Performance Analysis

A more detailed performance comparison of exclusive scan with respect to the array size is shown below.

![](./images/comp1.JPG)

![](./images/comp2.JPG)

The Thrust implementation scales best and runs about 10 times faster than my work-efficient scan, although its execution time still grows linearly with the array size. My hypothesis for its efficiency is that it uses shared memory.
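
For context, the Thrust baseline can be produced with `thrust::exclusive_scan`; the wrapper below is a minimal sketch, and the buffer and function names are mine rather than the repository's.

```
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Minimal Thrust baseline: copy to device, exclusive-scan, copy back.
// (Sketch only; thrustScan and the buffer names are illustrative.)
void thrustScan(int n, int *odata, const int *idata) {
    thrust::device_vector<int> dev_in(idata, idata + n);      // host -> device
    thrust::device_vector<int> dev_out(n);
    thrust::exclusive_scan(dev_in.begin(), dev_in.end(), dev_out.begin());
    thrust::copy(dev_out.begin(), dev_out.end(), odata);      // device -> host
}
```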

In the timeline below, we can easily spot the bottleneck of my implementation. Since the time measurement excludes device-to-host and host-to-device memory copies, the majority of the execution time comes from the up-sweep and down-sweep kernels, which together take about 5 times longer than scatter and 10 times longer than map-to-boolean.
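
The kernel-only timing follows the standard CUDA-event pattern: record events around the kernel launches and leave the `cudaMemcpy` calls outside the bracket. A sketch, assuming placeholder device buffer names:

```
// dev_data is a placeholder device buffer name, not the repository's.
cudaMemcpy(dev_data, idata, n * sizeof(int), cudaMemcpyHostToDevice);  // not timed

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
// ... launch the up-sweep and down-sweep kernels here ...
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // kernel time only, in milliseconds

cudaMemcpy(odata, dev_data, n * sizeof(int), cudaMemcpyDeviceToHost);  // not timed
```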

![](./images/t.JPG)

![](./images/t2.JPG)

The Thrust scan appears to have two stages as well, as you can see below, but there is a noticeable gap between the two kernel launches. My guess is that it stages data in shared memory so that subsequent accesses are much faster.
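
If that guess is right, each block would first scan its chunk entirely in shared memory, and a second kernel would then combine the per-block sums, which would explain both the two stages and the gap between them. The kernel below is only a sketch of that idea, not Thrust's actual implementation; `BLOCK_SIZE` and the kernel name are assumptions.

```
#define BLOCK_SIZE 256

// Per-block Hillis-Steele scan staged in shared memory (illustrative only;
// this is a guess at the kind of optimization Thrust applies).
__global__ void kernSharedScanBlock(int n, int *odata, const int *idata) {
    __shared__ int buf[2][BLOCK_SIZE];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // Load one element per thread; out-of-range threads load 0 so that
    // every thread in the block still reaches __syncthreads().
    buf[0][tid] = (gid < n) ? idata[gid] : 0;
    __syncthreads();

    int in = 0, out = 1;
    for (int offset = 1; offset < blockDim.x; offset <<= 1) {
        if (tid >= offset) {
            buf[out][tid] = buf[in][tid - offset] + buf[in][tid];
        } else {
            buf[out][tid] = buf[in][tid];
        }
        __syncthreads();
        in = out;
        out = 1 - in;
    }

    // Per-block inclusive result; a second kernel still has to add the
    // preceding blocks' sums, hence the two stages seen in the timeline.
    if (gid < n) {
        odata[gid] = buf[in][tid];
    }
}
```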

![](./images/thrust.JPG)

In each step of the up-sweep and down-sweep, I only launch the threads that the algorithm actually needs. Since each step accesses and modifies only a subset of the array elements, the number of threads required is proportional to the size of that subset. This trick eliminates a large number of idle threads and makes the program run noticeably faster.
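
Concretely, at the up-sweep level with stride `2^(d+1)` only `n / 2^(d+1)` threads do any work, so the kernel can be launched with exactly that many threads and each thread computes the tree node it owns. The kernel name and launch math below are illustrative, not copied from this repository.

```
// One up-sweep level: numActive = n / stride threads, each owning one node.
// (Illustrative kernel, not necessarily the repository's exact code.)
__global__ void kernUpSweepStep(int numActive, int stride, int *data) {
    int k = threadIdx.x + blockIdx.x * blockDim.x;
    if (k >= numActive) {
        return;
    }
    int index = (k + 1) * stride - 1;            // right child of this node
    data[index] += data[index - (stride >> 1)];  // add the left child's partial sum
}

// Host side: for stride = 2, 4, ..., n, launch ceil(numActive / blockSize)
// blocks instead of covering all n elements at every level.
```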

When the array is small, e.g. 256 elements, the CPU version runs faster (even faster than the Thrust implementation). The reason is that the GPU pays an overhead for launching and scheduling threads, which outweighs the benefit of parallelism when the array size is small, as you can see below.

```
****************
** SCAN TESTS **
****************
[ 38 13 31 10 45 3 41 44 26 12 44 28 12 ... 10 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.0005ms (std::chrono Measured)
[ 0 38 51 82 92 137 140 181 225 251 263 307 335 ... 6247 6257 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0.0004ms (std::chrono Measured)
[ 0 38 51 82 92 137 140 181 225 251 263 307 335 ... 6169 6198 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.017408ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.017408ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0.050176ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.044032ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.04448ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.041984ms (CUDA Measured)
passed
```

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)

Binary file added images/comp1.JPG
Binary file added images/comp2.JPG
Binary file added images/t.JPG
Binary file added images/t2.JPG
Binary file added images/thrust.JPG
3 changes: 2 additions & 1 deletion src/main.cpp
@@ -13,7 +13,7 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 22; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
@@ -34,6 +34,7 @@ int main(int argc, char* argv[]) {
    // initialize b using StreamCompaction::CPU::scan you implement
    // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
    // At first all cases passed because b && c are all zeroes.

    zeroArray(SIZE, b);
    printDesc("cpu scan, power-of-two");
    StreamCompaction::CPU::scan(SIZE, b, a);
12 changes: 12 additions & 0 deletions stream_compaction/common.cu
@@ -24,6 +24,11 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
    // Map each element to a boolean: 1 if it should be kept (non-zero), 0 otherwise.
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index >= n) {
        return;
    }
    bools[index] = idata[index] == 0 ? 0 : 1;
}

/**
@@ -33,6 +38,13 @@
__global__ void kernScatter(int n, int *odata,
        const int *idata, const int *bools, const int *indices) {
    // Scatter: each kept element (bools[index] == 1) is written to the
    // compacted position given by the exclusive scan of the boolean array.
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index >= n) {
        return;
    }
    if (bools[index] == 1) {
        odata[indices[index]] = idata[index];
    }
}

}
44 changes: 42 additions & 2 deletions stream_compaction/cpu.cu
@@ -20,6 +20,11 @@ namespace StreamCompaction {
void scan(int n, int *odata, const int *idata) {
    timer().startCpuTimer();
    // Exclusive prefix sum: odata[i] holds the sum of idata[0..i-1].
    int sum = 0; // identity
    for (int i = 0; i < n; i++) {
        odata[i] = sum;
        sum += idata[i];
    }
    timer().endCpuTimer();
}

@@ -31,8 +36,14 @@
int compactWithoutScan(int n, int *odata, const int *idata) {
    timer().startCpuTimer();
    // Keep only non-zero elements, writing them compactly to odata.
    int p = 0;
    for (int i = 0; i < n; i++) {
        if (idata[i] != 0) { // remove 0
            odata[p++] = idata[i];
        }
    }
    timer().endCpuTimer();
    return -1;
    return p;
}

/**
@@ -43,8 +54,37 @@
int compactWithScan(int n, int *odata, const int *idata) {
    timer().startCpuTimer();
    if (n < 1) {
        timer().endCpuTimer(); // stop the timer before the early exit
        return 0;
    }

    // Map to booleans: 1 for elements to keep, 0 for elements to discard.
    int* e = new int[n];
    for (int i = 0; i < n; i++) {
        e[i] = idata[i] != 0;
    }

    // Exclusive scan of the booleans gives each kept element's output index.
    int* prefix = new int[n];
    int sum = 0;
    for (int i = 0; i < n; i++) {
        prefix[i] = sum;
        sum += e[i];
    }

    // Scatter: a kept element is detected where the prefix sum increases.
    for (int i = 0; i < n - 1; i++) {
        if (prefix[i] != prefix[i + 1]) {
            odata[prefix[i]] = idata[i];
        }
    }
    int len = prefix[n - 1];
    if (e[n - 1] == 1) {
        len++;
        odata[prefix[n - 1]] = idata[n - 1];
    }

    delete[] e;
    delete[] prefix;
    timer().endCpuTimer();
    return -1;
    return len;
}
}
}