CIS565-Fall-2018 · Ninjajie · Sep 19, 2018 · Sep 19, 2018 · Sep 19, 2018
diff --git a/README.md b/README.md
@@ -3,12 +3,165 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Jie Meng
+  * [LinkedIn](https://www.linkedin.com/in/jie-meng/), [twitter](https://twitter.com/JieMeng6).
+* Tested on: Windows 10, i7-7700HQ @ 2.80GHz, 16GB, GTX 1050 4GB (My personal laptop)
 
-### (TODO: Your README)
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
 
+## Background
+### Scan 
+*Prefix Sum*, A.K.A. *Scan*, is a widely used basic algorithm that:
+given an array, for each index, compute the sum of array elements before it.
+It is trivial to implement this on CPU, but we want to parallelize it to run on GPU.
+On GPU, every thread will only access a small part of data. A naive parallel method is shown as the following image:
+
+![](img/prefixsum.png)
+
+Given this naive algorithm, a more efficient algorithm that uses two sweeps is shown as:
+
+Up-Sweep:
+
+![](img/upsweep.jpg)
+
+Down-Sweep:
+
+![](img/downsweep.jpg)
+
+This work-efficient algorithm reduced the complexity from *O(nlogn)* to *O(n)*
+
+
+### Stream Compaction 
+It is another widely used basic algorithm that:
+given an array and a condition, wipe out all the elements that fails the condition. 
+On CPU, it is also trivial to implement; on the GPU, we could also create a parallel version based on the previous scan algorithm:
+* First map each element to a 1 or 0 based on whether it meets the condition;
+* Then scan the mapped array using former parallel scan algoritm
+* Finally scatter the original array
+
+![](img/scatter.jpg)
+
+
+## Project Description
+In this project, I implemented:
+* CPU scan
+* Naive GPU scan
+* Work-Efficient scan
+and
+* CPU stream compaction
+* CPU stream compaction with scan
+* Work-Efficient GPU stream compaction
+Performance analysis on these algoritms are conducted.
+
+*Extra Credits*
+* Optimization of Work-efficient GPU scan and stream compact, and performance comparison against non-optimized version.
+
+### Performance Analysis 
+
+- #### SCAN Algorithm Running Time w.r.t. Array Size 
+  ##### Under Release x64, all @128 block size, work-efficient scan is *optimized*
+
+![](img/scanperformances.png)
+
+  ##### Analysis
+  1. As array size increases, all algorithms take more time to finish, where:
+    * CPU Scanrunning time increases linearly
+    * Naive methods perform good, running time increases slowly and substantially slower than CPU methods when array size goes up exponentially 
+    * Work-efficient methods have fluctuating performances, which I'm not sure why. But it still out-performs CPU methods when array goes large.
+  2. As for power-of-two against NPT performances:
+    * In the work-efficient methods case, NPT usually takes longer to finish scan.
+    * For other algorithms, NPT doesn't make noticable differences.
+  3. Thrust performs weiredly:
+    * Before 2^15 elements, thrust performs stably, and significantly faster than any other algoritms, but:
+    * At 2^15 elements, thrust suddenly takes much longer time to finish than expected, and remains stable afterwards.
+    * From NSight Performance Analysis, not much information could be gained pertaining mechanism behind thrust: you can only notice one memory allocation & free, one Device => Host and one Host => Device.
+      But I guess thrust treat data at different stages, like before/after 2^15 elements. Performances within each stage is stable and optimized, but drop significantly across stages.
+
+  4. When array size is at 2^20(not ploted on image), work-efficient methods are much better than naive methods:
+
+      ![](img/220.png)
+
+- #### SCAN Algorithms Performances w.r.t. Block Size
+  #### Under Release x64, all @2048 elements, work-efficient scan is *optimized*
+
+![](img/scanblocksize.png)
+
+   ##### Analysis
+   1. For Naive method, block size doesn't make any noticable differences
+   2. For Work-Efficient methods, they performs bad at specific block sizes. I'm not sure why, a reasonable guess is error.
+
+- #### Work-Efficient Method: optimized v.s. non-optimized
+  ##### Under Release x64, all @128 block size 
+
+![](img/optscan.png) 
+
+   ##### Analysis
+   As we can see from the data chart, after removal of branch statement in kernel, and only launchs the threads that actually work, the performance
+increases noticeably: I actually didn't expect this, since we are using just global memory here, and the constrain on performance should be memory access.
+
+- #### STREAM COMPACTION performances are similar to Scans:
+  #### @128 block size
+![](img/sc2.png) 
+
+- #### Console Output 
+```
+****************
+** SCAN TESTS **
+****************
+    [   0   8  40  26  45  29  26  49   4  49  38   5  14 ...   9   0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 0.002188ms    (std::chrono Measured)
+    [   0   0   8  48  74 119 148 174 223 227 276 314 319 ... 6397 6406 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 0.002188ms    (std::chrono Measured)
+    [   0   0   8  48  74 119 148 174 223 227 276 314 319 ... 6354 6355 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 0.022944ms    (CUDA Measured)
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 0.021824ms    (CUDA Measured)
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 0.0392ms    (CUDA Measured)
+    [   0   0   8  48  74 119 148 174 223 227 276 314 319 ... 6397 6406 ]
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 0.039936ms    (CUDA Measured)
+    [   0   0   8  48  74 119 148 174 223 227 276 314 319 ... 6354 6355 ]
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 0.009088ms    (CUDA Measured)
+    [   0   0   8  48  74 119 148 174 223 227 276 314 319 ... 6397 6406 ]
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 0.009152ms    (CUDA Measured)
+    [   0   0   8  48  74 119 148 174 223 227 276 314 319 ... 6354 6355 ]
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [   3   1   2   0   1   1   2   1   3   1   2   0   3 ...   0   0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 0.004376ms    (std::chrono Measured)
+    [   3   1   2   1   1   2   1   3   1   2   3   3   3 ...   1   1 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 0.004011ms    (std::chrono Measured)
+    [   3   1   2   1   1   2   1   3   1   2   3   3   3 ...   3   1 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 0.006564ms    (std::chrono Measured)
+    [   3   1   2   1   1   2   1   3   1   2   3   3   3 ...   1   1 ]
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 0.047104ms    (CUDA Measured)
+    [   3   1   2   1   1   2   1   3   1   2   3   3   3 ...   1   1 ]
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 0.047104ms    (CUDA Measured)
+    [   3   1   2   1   1   2   1   3   1   2   3   3   3 ...   3   1 ]
+    passed
+Press any key to continue . . .
+```
diff --git a/img/220.png b/img/220.png
diff --git a/img/downsweep.jpg b/img/downsweep.jpg
diff --git a/img/optscan.png b/img/optscan.png
diff --git a/img/prefixsum.png b/img/prefixsum.png
diff --git a/img/sc2.png b/img/sc2.png
diff --git a/img/scanblocksize.png b/img/scanblocksize.png
diff --git a/img/scanperformances.png b/img/scanperformances.png
diff --git a/img/scatter.jpg b/img/scatter.jpg
diff --git a/img/upsweep.jpg b/img/upsweep.jpg
diff --git a/src/main.cpp b/src/main.cpp
@@ -13,7 +13,7 @@
 #include <stream_compaction/thrust.h>
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 20; // feel free to change the size of array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];
@@ -54,11 +54,11 @@ int main(int argc, char* argv[]) {
     //printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
-	/* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
-	onesArray(SIZE, c);
-	printDesc("1s array for finding bugs");
-	StreamCompaction::Naive::scan(SIZE, c, a);
-	printArray(SIZE, c, true); */
+	///* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
+	//onesArray(SIZE, c);
+	//printDesc("1s array for finding bugs");
+	//StreamCompaction::Naive::scan(SIZE, c, a);
+	//printArray(SIZE, c, true); 
 
     zeroArray(SIZE, c);
     printDesc("naive scan, non-power-of-two");
@@ -71,28 +71,28 @@ int main(int argc, char* argv[]) {
     printDesc("work-efficient scan, power-of-two");
     StreamCompaction::Efficient::scan(SIZE, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
     zeroArray(SIZE, c);
     printDesc("work-efficient scan, non-power-of-two");
     StreamCompaction::Efficient::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(NPOT, c, true);
+    printArray(NPOT, c, true);
     printCmpResult(NPOT, b, c);
 
     zeroArray(SIZE, c);
     printDesc("thrust scan, power-of-two");
     StreamCompaction::Thrust::scan(SIZE, c, a);
     printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
     zeroArray(SIZE, c);
     printDesc("thrust scan, non-power-of-two");
     StreamCompaction::Thrust::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(NPOT, c, true);
+    printArray(NPOT, c, true);
     printCmpResult(NPOT, b, c);
 
     printf("\n");
@@ -117,6 +117,7 @@ int main(int argc, char* argv[]) {
     expectedCount = count;
     printArray(count, b, true);
     printCmpLenResult(count, expectedCount, b, b);
+
 
     zeroArray(SIZE, c);
     printDesc("cpu compact without scan, non-power-of-two");
@@ -137,14 +138,14 @@ int main(int argc, char* argv[]) {
     printDesc("work-efficient compact, power-of-two");
     count = StreamCompaction::Efficient::compact(SIZE, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(count, c, true);
+    printArray(count, c, true);
     printCmpLenResult(count, expectedCount, b, c);
 
     zeroArray(SIZE, c);
     printDesc("work-efficient compact, non-power-of-two");
     count = StreamCompaction::Efficient::compact(NPOT, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(count, c, true);
+    printArray(count, c, true);
     printCmpLenResult(count, expectedNPOT, b, c);
 
     system("pause"); // stop Win32 console from closing on exit

diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
@@ -1,4 +1,6 @@
 #include "common.h"
+#include <device_launch_parameters.h>
+#include <cuda_runtime_api.h>
 
 void checkCUDAErrorFn(const char *msg, const char *file, int line) {
     cudaError_t err = cudaGetLastError();
@@ -24,6 +26,13 @@ namespace StreamCompaction {
          */
         __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
             // TODO
+			int index = threadIdx.x + (blockDim.x * blockIdx.x);
+			if (index >= n)
+				return;
+			if (idata[index] == 0)
+				bools[index] = 0;
+			else
+				bools[index] = 1;
         }
 
         /**
@@ -33,6 +42,12 @@ namespace StreamCompaction {
         __global__ void kernScatter(int n, int *odata,
                 const int *idata, const int *bools, const int *indices) {
             // TODO
+			int index = threadIdx.x + (blockDim.x * blockIdx.x);
+			if (index >= n)
+				return;
+			if (bools[index])
+				odata[indices[index]] = idata[index];
+
         }
 
     }

diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
@@ -1,15 +1,15 @@
 #include <cstdio>
 #include "cpu.h"
 
-#include "common.h"
+#include "common.h"
 
 namespace StreamCompaction {
     namespace CPU {
-        using StreamCompaction::Common::PerformanceTimer;
-        PerformanceTimer& timer()
-        {
-	        static PerformanceTimer timer;
-	        return timer;
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+	        static PerformanceTimer timer;
+	        return timer;
         }
 
         /**
@@ -20,6 +20,11 @@ namespace StreamCompaction {
         void scan(int n, int *odata, const int *idata) {
 	        timer().startCpuTimer();
             // TODO
+			odata[0] = 0;
+			for (int ind = 1; ind < n; ind++)
+			{
+				odata[ind] = odata[ind - 1] + idata[ind - 1];
+			}
 	        timer().endCpuTimer();
         }
 
@@ -31,20 +36,63 @@ namespace StreamCompaction {
         int compactWithoutScan(int n, int *odata, const int *idata) {
 	        timer().startCpuTimer();
             // TODO
+			int nonZeros = 0;
+			for (int ind = 0; ind < n; ind++)
+			{
+				if (idata[ind] != 0)
+				{
+					odata[nonZeros] = idata[ind];
+					nonZeros++;
+				}
+			}
 	        timer().endCpuTimer();
-            return -1;
+            return nonZeros;
         }
 
+		/*
+		 * CPU scatter function
+		 */
+		void cpuScatter(int n, int* odata, const int* idata, int* indicator, int* scanned)
+		{
+			for (int ind = 0; ind < n; ind++)
+			{
+				if (indicator[ind] != 0)
+				{
+					odata[scanned[ind]] = idata[ind];
+				}
+			}
+		}
         /**
          * CPU stream compaction using scan and scatter, like the parallel version.
          *
          * @returns the number of elements remaining after compaction.
          */
         int compactWithScan(int n, int *odata, const int *idata) {
+			int* indicator= new int[n];
+			int* scanned = new int[n];
 	        timer().startCpuTimer();
 	        // TODO
+			for (int ind = 0; ind < n; ind++)
+			{
+				if (idata[ind] == 0)
+					indicator[ind] = 0;
+				else
+					indicator[ind] = 1;
+			}
+			scanned[0] = 0;
+			for (int ind = 1; ind < n; ind++)
+			{
+				scanned[ind] = scanned[ind - 1] + indicator[ind - 1];
+			}
+			cpuScatter(n, odata, idata, indicator, scanned);
 	        timer().endCpuTimer();
-            return -1;
+
+			int num = scanned[n - 1];
+			if (indicator[n - 1] != 0)
+				num++;
+			delete[] indicator;
+			delete[] scanned;
+            return num;
         }
     }
 }