CIS565-Fall-2020 · JerryYan97 · Sep 17, 2020 · Sep 19, 2020 · Sep 21, 2020
diff --git a/README.md b/README.md
@@ -1,14 +1,121 @@
 CUDA Stream Compaction
 ======================
 
-**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
+**University of Pennsylvania, CIS 565: GPU Programming and Architecture**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Jiarui Yan
+  * [LinkedIn](https://www.linkedin.com/in/jiarui-yan-a06bb5197?lipi=urn%3Ali%3Apage%3Ad_flagship3_profile_view_base_contact_details%3BvRlITiOMSt%2B9Mgg6SZFKDQ%3D%3D), [personal website](https://jiaruiyan.pb.online/), [twitter](https://twitter.com/JerryYan1997), etc.
+* Tested on: Windows 10 Home, i7-9700K @ 3.60GHz 16GB DDR4 RAM, RTX 2070 SUPER 8GB Dedicated GPU memory (Personal desktop)
 
-### (TODO: Your README)
+## Features
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+* CPU Scan & Stream Compaction
 
+* Naive GPU Scan Algorithm
+
+* Work-Efficient GPU Scan & Stream Compaction
+
+* Thrust scan
+
+* Compact threads (Part 5 Extra Credit -- +5)
+
+* Parallel Radix Sort (Part 6 Extra Credit -- +10)
+
+## Roughly optimize for each implementations
+
+In this part, in order to determine the suitable blocksize for each implementations, I run different scenarios under several discrete blocksizes: 16, 32, 64, 128, 256, 512. Then, I collect their average running time and plot them respectively. They are shown below:
+
+![Experiment 1](./img/Naive_GPU_scan_blocksize.PNG)
+
+![Experiment 2](./img/work_efficient_GPU_scan_blocksize.PNG)
+
+![Experiment 3](./img/efficient_compaction_blocksize.PNG)
+
+![Experiment 4](./img/parallel_radix_sort_blocksize.PNG)
+
+As you can see from graphs above, a blocksize that is 32 is suitable for all the scenarios. I think this maybe caused by the fact that the input array size is not long enough. I control the input size for power-of-two scenarios to be 256. Besides, 32 is also the number of threads in a warp. As a result, 32 is a preferable size for all of them. Subsequently, I will fix their blocksize to 32 and do analysis under this blocksize. 
+
+## Performance comparsion
+
+![Experiment 5](./img/scan_time_for_different_array_size.PNG)
+
+As we can see from the graph above, we can figure out that GPU's advantages can be shown when the length of input array is long enough. 
+
+As for Thrust implementation, I guess it implement shared memory and all sorts of tricks to improve the proformence and try their best to exclude operations like memory copy.
+
+## Brief explanation of phenomena
+
+From my observation of the NSight timeline, all of my code are bounded by memory I/O, because the gap between each computation is relatively bigger than computation parts. Besides, each implementation is different. For instance, Naive scan would hand in lots of CUDA kernels, while other implementation hands in less kernels. 
+
+### Output of the test program
+
+Here is the output by inputing an 2^20 array. As for the 'parallel radix sort', please take a look at the thrid part of the ouput below.
+
+```
+****************
+** SCAN TESTS **
+****************
+    [   3  16  44  30   6   8  15  16  11  48  38  32  12 ...  46   0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 3.6953ms    (std::chrono Measured)
+    [   0   3  19  63  93  99 107 122 138 149 197 235 267 ... 25674322 25674368 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 1.3371ms    (std::chrono Measured)
+    [   0   3  19  63  93  99 107 122 138 149 197 235 267 ... 25674263 25674285 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 0.999616ms    (CUDA Measured)
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 0.999424ms    (CUDA Measured)
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 0.227328ms    (CUDA Measured)
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 0.223232ms    (CUDA Measured)
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 0.245824ms    (CUDA Measured)
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 0.309088ms    (CUDA Measured)
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [   1   2   0   0   2   0   3   0   3   0   0   2   0 ...   2   0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 2.0501ms    (std::chrono Measured)
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 2.0319ms    (std::chrono Measured)
+    passed
+==== cpu compact with scan ====
+   elapsed time: 7.6485ms    (std::chrono Measured)
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 0.7328ms    (CUDA Measured)
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 0.733216ms    (CUDA Measured)
+    passed
+
+*****************************
+** PARALLEL RADIX SORT TESTS **
+*****************************
+    [   3  16  44  30   6   8  15  16  11  48  38  32  12 ...  46   0 ]
+==== parallel radix sort, power-of-two correct result ====
+    [   0   0   0   0   0   0   0   0   0   0   0   0   0 ...  49  49 ]
+==== parallel radix sort, power-of-two correct result ====
+    [   0   0   0   0   0   0   0   0   0   0   0   0   0 ...  49  49 ]
+==== parallel radix sort, power-of-two ====
+    [   0   0   0   0   0   0   0   0   0   0   0   0   0 ...  49  49 ]
+   elapsed time: 0.733216ms    (CUDA Measured)
+    passed
+==== parallel radix sort, non-power-of-two ====
+    [   0   0   0   0   0   0   0   0   0   0   0   0   0 ...  49  49 ]
+   elapsed time: 0.733216ms    (CUDA Measured)
+    passed
+```
diff --git a/img/Naive_GPU_scan_blocksize.PNG b/img/Naive_GPU_scan_blocksize.PNG
diff --git a/img/data.xlsx b/img/data.xlsx
diff --git a/img/efficient_compaction_blocksize.PNG b/img/efficient_compaction_blocksize.PNG
diff --git a/img/example-1.png b/img/example-1.png
diff --git a/img/example-2.jpg b/img/example-2.jpg
diff --git a/img/figure-39-2.jpg b/img/figure-39-2.jpg
diff --git a/img/figure-39-4.jpg b/img/figure-39-4.jpg
diff --git a/img/parallel_radix_sort_blocksize.PNG b/img/parallel_radix_sort_blocksize.PNG
diff --git a/img/scan trend.xlsx b/img/scan trend.xlsx
diff --git a/img/scan_time_for_different_array_size.PNG b/img/scan_time_for_different_array_size.PNG
diff --git a/img/work_efficient_GPU_scan_blocksize.PNG b/img/work_efficient_GPU_scan_blocksize.PNG
diff --git a/src/main.cpp b/src/main.cpp
@@ -11,13 +11,17 @@
 #include <stream_compaction/naive.h>
 #include <stream_compaction/efficient.h>
 #include <stream_compaction/thrust.h>
+#include <stream_compaction/radix.h>
+#include <thrust/sort.h>
+#include <thrust/execution_policy.h>
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 20; // feel free to change the size of array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];
 int *c = new int[SIZE];
+int *d = new int[SIZE];
 
 int main(int argc, char* argv[]) {
     // Scan tests
@@ -47,19 +51,19 @@ int main(int argc, char* argv[]) {
     printArray(NPOT, b, true);
     printCmpResult(NPOT, b, c);
 
+    /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
+    onesArray(SIZE, c);
+    printDesc("1s array for finding bugs");
+    StreamCompaction::Naive::scan(SIZE, c, c);
+    printArray(SIZE, c, true);*/
+
     zeroArray(SIZE, c);
     printDesc("naive scan, power-of-two");
     StreamCompaction::Naive::scan(SIZE, c, a);
     printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    // printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
-    /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
-    onesArray(SIZE, c);
-    printDesc("1s array for finding bugs");
-    StreamCompaction::Naive::scan(SIZE, c, a);
-    printArray(SIZE, c, true); */
-
     zeroArray(SIZE, c);
     printDesc("naive scan, non-power-of-two");
     StreamCompaction::Naive::scan(NPOT, c, a);
@@ -71,14 +75,14 @@ int main(int argc, char* argv[]) {
     printDesc("work-efficient scan, power-of-two");
     StreamCompaction::Efficient::scan(SIZE, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    // printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
     zeroArray(SIZE, c);
     printDesc("work-efficient scan, non-power-of-two");
     StreamCompaction::Efficient::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(NPOT, c, true);
+    // printArray(NPOT, c, true);
     printCmpResult(NPOT, b, c);
 
     zeroArray(SIZE, c);
@@ -94,6 +98,7 @@ int main(int argc, char* argv[]) {
     printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
     //printArray(NPOT, c, true);
     printCmpResult(NPOT, b, c);
+
 
     printf("\n");
     printf("*****************************\n");
@@ -115,22 +120,22 @@ int main(int argc, char* argv[]) {
     count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
     printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
     expectedCount = count;
-    printArray(count, b, true);
+    // printArray(count, b, true);
     printCmpLenResult(count, expectedCount, b, b);
 
     zeroArray(SIZE, c);
     printDesc("cpu compact without scan, non-power-of-two");
     count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a);
     printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
     expectedNPOT = count;
-    printArray(count, c, true);
+    // printArray(count, c, true);
     printCmpLenResult(count, expectedNPOT, b, c);
 
     zeroArray(SIZE, c);
     printDesc("cpu compact with scan");
     count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
     printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
-    printArray(count, c, true);
+    // printArray(count, c, true);
     printCmpLenResult(count, expectedCount, b, c);
 
     zeroArray(SIZE, c);
@@ -147,6 +152,42 @@ int main(int argc, char* argv[]) {
     //printArray(count, c, true);
     printCmpLenResult(count, expectedNPOT, b, c);
 
+
+    printf("\n");
+    printf("*****************************\n");
+    printf("** PARALLEL RADIX SORT TESTS **\n");
+    printf("*****************************\n");
+
+    // Radix Sort Tests
+    genArray(SIZE - 1, a, 50);  // Leave a 0 at the end to test that edge case
+    a[SIZE - 1] = 0;
+    printArray(SIZE, a, true);
+
+    zeroArray(SIZE, b);
+    printDesc("parallel radix sort, power-of-two correct result");
+    memcpy(b, a, SIZE * sizeof(int));
+    thrust::sort(thrust::host, b, b + SIZE);
+    printArray(SIZE, b, true);
+
+    printDesc("parallel radix sort, power-of-two correct result");
+    memcpy(d, a, NPOT * sizeof(int));
+    thrust::sort(thrust::host, d, d + NPOT);
+    printArray(NPOT, d, true);
+
+    zeroArray(SIZE, c);
+    printDesc("parallel radix sort, power-of-two");
+    StreamCompaction::RadixSort::radix_sort(SIZE, c, a);
+    printArray(SIZE, c, true);
+    printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    printCmpResult(SIZE, b, c);
+
+    zeroArray(SIZE, c);
+    printDesc("parallel radix sort, non-power-of-two");
+    StreamCompaction::RadixSort::radix_sort(NPOT, c, a);
+    printArray(NPOT, c, true);
+    printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    printCmpResult(NPOT, d, c);
+
     system("pause"); // stop Win32 console from closing on exit
     delete[] a;
     delete[] b;

diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt
@@ -4,6 +4,7 @@ set(headers
     "naive.h"
     "efficient.h"
     "thrust.h"
+    "radix.h"
     )
 
 set(sources
@@ -12,6 +13,7 @@ set(sources
     "naive.cu"
     "efficient.cu"
     "thrust.cu"
+    "radix.cu"
     )
 
 list(SORT headers)

diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
@@ -24,6 +24,16 @@ namespace StreamCompaction {
          */
         __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
             // TODO
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+            if (idata[index] == 0) {
+                bools[index] = 0;
+            }
+            else {
+                bools[index] = 1;
+            }
         }
 
         /**
@@ -33,6 +43,14 @@ namespace StreamCompaction {
         __global__ void kernScatter(int n, int *odata,
                 const int *idata, const int *bools, const int *indices) {
             // TODO
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+            if (bools[index] == 1) {
+                int final_index = indices[index];
+                odata[final_index] = idata[index];
+            }
         }
 
     }

diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
@@ -20,6 +20,11 @@ namespace StreamCompaction {
         void scan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
             // TODO
+            odata[0] = 0;
+            for (int k = 1; k < n; ++k)
+            {
+                odata[k] = odata[k - 1] + idata[k - 1];
+            }
             timer().endCpuTimer();
         }
 
@@ -31,8 +36,15 @@ namespace StreamCompaction {
         int compactWithoutScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
             // TODO
+            int o_idx = 0;
+            for (int i_idx = 0; i_idx < n; ++i_idx) {
+                if (idata[i_idx] != 0) {
+                    odata[o_idx] = idata[i_idx];
+                    ++o_idx;
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+            return o_idx;
         }
 
         /**
@@ -43,8 +55,35 @@ namespace StreamCompaction {
         int compactWithScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
             // TODO
+            int* temp_array = new int[n];
+            int* scan_array = new int[n];
+            // Compute temporary array:
+            for (int i_idx = 0; i_idx < n; ++i_idx) {
+                if (idata[i_idx] != 0) {
+                    temp_array[i_idx] = 1;
+                }
+                else {
+                    temp_array[i_idx] = 0;
+                }
+            }
+            // Exclusive scan:
+            scan_array[0] = 0;
+            for (int k = 1; k < n; ++k)
+            {
+                scan_array[k] = scan_array[k - 1] + temp_array[k - 1];
+            }
+            // Scatter:
+            int o_counter = 0;
+            for (int i_idx = 0; i_idx < n; ++i_idx) 
+            {
+                if (temp_array[i_idx] == 1) {
+                    int o_idx = scan_array[i_idx];
+                    odata[o_idx] = idata[i_idx];
+                    ++o_counter;
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+            return o_counter;
         }
     }
 }