CIS565-Fall-2020 · swebsterbass · Sep 21, 2020 · Sep 22, 2020 · Sep 22, 2020 · Sep 22, 2020
diff --git a/README.md b/README.md
@@ -3,12 +3,171 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* SPENCER WEBSTER-BASS
+  * [LinkedIn](https://www.linkedin.com/in/spencer-webster-bass/)
+* Tested on: Windows 10, i7-6700 @ 3.40GHz 16GB, Quadro P1000 222MB (MOR100B-19)
 
-### (TODO: Your README)
+### DESCRIPTION
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+This project is an implementation of the stream compaction parallel algorithm on the GPU using CUDA and C++.
+
+Features:
+* Serial implementation of scan and stream compaction algorithms on the CPU
+* Naive, parallel implementation of scan and stream compaction algorithms on the GPU
+* Atepted work-efficient, parallel implementation of scan and stream compaction algorithms on the GPU
+* Comparison between my implementations' efficiency and thrust's implementation of exclusive scan algorithm
+
+TODOs:
+Include Analysis
+
+****************
+** SCAN TESTS **
+****************
+    [  49   8   2  27  20  44  21  27  49   3  20   3  16   3  31   0 ]
+
+==== cpu scan, power-of-two ====
+
+   elapsed time: 0ms    (std::chrono Measured)
+
+    [   0  49  57  59  86 106 150 171 198 247 250 270 273 289 292 323 ]
+
+==== cpu scan, non-power-of-two ====
+
+   elapsed time: 0ms    (std::chrono Measured)
+
+    [   0  49  57  59  86 106 150 171 198 247 250 270 273 ]
+
+    passed
+
+==== naive scan, power-of-two ====
+
+    [  49   8   2  27  20  44  21  27  49   3  20   3  16   3  31   0 ]
+
+    [ 49 8 2 27 20 44 21 27 49 3 20 3 16 3 31 0 ]
+
+    [ 0 49 57 10 29 47 64 65 48 76 52 23 23 19 19 34 ]
+
+    [ 0 49 57 59 86 57 93 112 112 141 100 99 75 42 42 53 ]
+
+    [ 0 49 57 59 86 106 150 171 198 198 193 211 187 183 142 152 ]
+
+    [ 0 49 57 59 86 106 150 171 198 247 250 270 273 289 292 323 ]
+
+   elapsed time: 7.58922ms    (CUDA Measured)
+
+    [   0  49  57  59  86 106 150 171 198 247 250 270 273 289 292 323 ]
+
+    passed
+
+==== 1s array for finding bugs ====
+
+    [  49   8   2  27  20  44  21  27  49   3  20   3  16   3  31   0 ]
+
+    [ 49 8 2 27 20 44 21 27 49 3 20 3 16 3 31 0 ]
+
+    [ 0 49 57 10 29 47 64 65 48 76 52 23 23 19 19 34 ]
+
+    [ 0 49 57 59 86 57 93 112 112 141 100 99 75 42 42 53 ]
+
+    [ 0 49 57 59 86 106 150 171 198 198 193 211 187 183 142 152 ]
+
+    [ 0 49 57 59 86 106 150 171 198 247 250 270 273 289 292 323 ]
+
+    [   0  49  57  59  86 106 150 171 198 247 250 270 273 289 292 323 ]
+
+==== naive scan, non-power-of-two ====
+
+    [  49   8   2  27  20  44  21  27  49   3  20   3  16   3  31   0 ]
+
+    [ 49 8 2 27 20 44 21 27 49 3 20 3 16 ]
+
+    [ 0 49 57 10 29 47 64 65 48 76 52 23 23 3 0 0 ]
+
+    [ 0 49 57 59 86 57 93 112 112 141 100 99 75 26 23 3 ]
+
+    [ 0 49 57 59 86 106 150 171 198 198 193 211 187 167 123 102 ]
+
+    [ 0 49 57 59 86 106 150 171 198 247 250 270 273 273 273 273 ]
+
+   elapsed time: 15.0825ms    (CUDA Measured)
+
+    [   0  49  57  59  86 106 150 171 198 247 250 270 273   0   0   0 ]
+
+    passed
+
+==== work-efficient scan, power-of-two ====
+
+   elapsed time: 0ms    (CUDA Measured)
+
+    a[1] = 49, b[1] = 0
+
+    FAIL VALUE
+
+==== work-efficient scan, non-power-of-two ====
+
+   elapsed time: 0ms    (CUDA Measured)
+
+    a[1] = 49, b[1] = 0
+
+    FAIL VALUE
+
+==== thrust scan, power-of-two ====
+
+   elapsed time: 0.083008ms    (CUDA Measured)
+
+    passed
+
+==== thrust scan, non-power-of-two ====
+
+   elapsed time: 0.069632ms    (CUDA Measured)
+
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+
+    [   1   2   2   1   2   0   1   3   1   3   0   3   0   3   3   0 ]
+
+==== cpu compact without scan, power-of-two ====
+
+   elapsed time: 0.0034ms    (std::chrono Measured)
+
+    [   0   0   0   0   0   0   0   0   0   0   0   0 ]
+
+    passed
+
+==== cpu compact without scan, non-power-of-two ====
+
+   elapsed time: 0.004ms    (std::chrono Measured)
+
+    [   0   0   0   0   0   0   0   0   0   0 ]
+
+    passed
+
+==== cpu compact with scan ====
+
+   elapsed time: 0.0023ms    (std::chrono Measured)
+
+    [ ]
+
+    expected 12 elements, got -1
+
+    FAIL COUNT
+
+==== work-efficient compact, power-of-two ====
+
+   elapsed time: 0ms    (CUDA Measured)
+
+    expected 12 elements, got -1
+
+    FAIL COUNT
+
+==== work-efficient compact, non-power-of-two ====
+
+   elapsed time: 0ms    (CUDA Measured)
+
+    expected 10 elements, got -1
+
+    FAIL COUNT
 
diff --git a/notes.txt b/notes.txt
@@ -0,0 +1,34 @@
+How does dim3 work? and dimensions in CUDA. Does dim3 set unspecified arguments to 1 or 0?
+Does CUDA expect unused dimensions to be 1 or 0?
+
+
+
+How do these checkCUDAError work? They sometimes say that an entire is occurring at an incorrect location if 
+I dont have one at every CUDA function call.
+
+
+
+Do we need to include new functions in header files?
+
+
+
+Inside of CUDA files are we using c or C++
+
+Dont have classes. The coding style might be closer to C instead of C++. You can pass structs to CUDA kernels.
+
+When using the memory window is it showing you gpu or cpu memory when you copy and paste an address
+from the locals or autos window?
+
+Can memcpy from the device back to the host.
+And start with a smaller sized buffer so that you can check the values in the buffers by hand.
+
+
+dev_data1, dev_data2;
+// cudamalloc, memcpy, etc
+// for eah iteration, launch kernels on dev_data1 and dev_data2
+int* temp = dev_data1
+dev_dta1 = dev_data2
+dev_data2 = temp
+
+
+The weird alternating thing where every other value was zero was due to the book's funky way of ping ponging.
diff --git a/src/main.cpp b/src/main.cpp
@@ -13,7 +13,7 @@
 #include <stream_compaction/thrust.h>
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 4; //8; // feel free to change the size of array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];
@@ -49,22 +49,25 @@ int main(int argc, char* argv[]) {
 
     zeroArray(SIZE, c);
     printDesc("naive scan, power-of-two");
+    printArray(SIZE, a, true);
     StreamCompaction::Naive::scan(SIZE, c, a);
     printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
-    /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
+    /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan */
     onesArray(SIZE, c);
     printDesc("1s array for finding bugs");
+    printArray(SIZE, a, true);
     StreamCompaction::Naive::scan(SIZE, c, a);
-    printArray(SIZE, c, true); */
+    printArray(SIZE, c, true);
 
     zeroArray(SIZE, c);
     printDesc("naive scan, non-power-of-two");
+    printArray(SIZE, a, true);
     StreamCompaction::Naive::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    printArray(SIZE, c, true);
     printCmpResult(NPOT, b, c);
 
     zeroArray(SIZE, c);

diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
@@ -24,6 +24,13 @@ namespace StreamCompaction {
          */
         __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
             // TODO
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n)
+                return;
+            if (idata[index])
+                bools[index] = 1;
+            else
+                bools[index] = 0;
         }
 
         /**
@@ -33,6 +40,11 @@ namespace StreamCompaction {
         __global__ void kernScatter(int n, int *odata,
                 const int *idata, const int *bools, const int *indices) {
             // TODO
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n)
+                return;
+            if (bools[index])
+                odata[indices[index]] = idata[index];
         }
 
     }

diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
@@ -1,4 +1,5 @@
 #include <cstdio>
+#include <vector>
 #include "cpu.h"
 
 #include "common.h"
@@ -18,9 +19,12 @@ namespace StreamCompaction {
          * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
          */
         void scan(int n, int *odata, const int *idata) {
-            timer().startCpuTimer();
+            // timer().startCpuTimer();
             // TODO
-            timer().endCpuTimer();
+            odata[0] = 0;
+            for (int i = 1; i < n; i++)
+                odata[i] = odata[i - 1] + idata[i - 1];
+            // timer().endCpuTimer();
         }
 
         /**
@@ -31,8 +35,13 @@ namespace StreamCompaction {
         int compactWithoutScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
             // TODO
+            std::vector<int> o = std::vector<int>();
+            for (int i = 0; i < n; i++)
+                if (idata[i])
+                    o.push_back(idata[i]);
+            odata = o.data();
             timer().endCpuTimer();
-            return -1;
+            return o.size();
         }
 
         /**
@@ -43,8 +52,38 @@ namespace StreamCompaction {
         int compactWithScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
             // TODO
+            // step 1: compute bit mask
+            std::vector<int> mask(n);
+            for (int i = 0; i < n; i++) {
+                if (idata[i] == 0) {
+                    mask.at(i) = 0;
+                }
+                else {
+                    mask.at(i) = 1;
+                }
+            }
+
+            // step 2: exclusive scan 
+            scan(n, odata, mask.data());
+
             timer().endCpuTimer();
             return -1;
+
+            // step 3: scatter
+            int m = odata[n - 1];
+            std::vector<int> ovec(m);
+            m = 0;
+            for (int i = 0; i < n; i++) {
+                if (mask[i]) {
+                    ovec[odata[i]] = idata[i];
+                    m++;
+                }
+            }
+
+            odata = ovec.data();
+
+            timer().endCpuTimer();
+            return m;
         }
     }
 }