diff --git a/README.md b/README.md
index 0e38ddb..3d144b8 100644
--- a/README.md
+++ b/README.md
@@ -1,14 +1,142 @@
 CUDA Stream Compaction
 ======================
-**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
+**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
+Project 2 - Stream Compaction**
+
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Vasu Mahesh
+  * [LinkedIn](http://linkedin.com/in/vasumahesh)
+  * [Code Blog](http://www.codeplaysleep.com)
+
-### (TODO: Your README)
+* Tested on a Laptop:
+  * Windows 10
+  * i7-8650U @ 1.90GHz
+  * 16GB RAM
+  * GTX 1060 6GB
+  * Visual Studio 2017 (with v140 toolset)
+  * CUDA v8.0
+
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+![](img/intro.PNG)
+
+## Build
+
+Build Command:
+```
+cmake -G "Visual Studio 15 2017 Win64" -DCUDA_TOOLKIT_ROOT_DIR="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0" -T v140,cuda=8.0 ..
+```
+
+I tried to implement the shared-memory and thread-count optimizations, but I ran into issues with the DownSweep and ran out of time.
+
+## Results
+
+### Scan Performance
+
+The barebones GPU scan implementations turn out to be quite slow. One point to note: the first Thrust call took a fairly long time, around 4 ms, while subsequent Thrust calls were very fast. I suspect Thrust lazily loads some modules, and since the Thrust scan was the first such call, it skewed the reading. To avoid this, I call scan a second time and report that measurement. All timings in the following tables are in milliseconds.
+
+![](img/scanPO2.PNG)
+
+![](img/scanNPO2.PNG)
+
+| Elements in Array | CPU Scan (Power of 2) | CPU Scan (Non Power of 2) | Naive Scan (Power of 2) | Naive Scan (Non Power of 2) | Work Efficient Scan (Power of 2) | Work Efficient Scan (Non Power of 2) | Thrust Scan (Power of 2) | Thrust Scan (Non Power of 2) |
+| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
+| 2^8 | 0.000485 | 0.000485 | 0.027648 | 0.02672 | 0.062464 | 0.053248 | 0.012288 | 0.012288 |
+| 2^9 | 0.00097 | 0.001455 | 0.07168 | 0.068608 | 0.16384 | 0.159744 | 0.037888 | 0.065536 |
+| 2^10 | 0.001454 | 0.001455 | 0.121856 | 0.120832 | 0.262144 | 0.260096 | 0.066592 | 0.065536 |
+| 2^11 | 0.002909 | 0.00291 | 0.138208 | 0.13824 | 0.292864 | 0.29696 | 0.099328 | 0.098304 |
+| 2^12 | 0.006303 | 0.007272 | 0.094208 | 0.093184 | 0.195584 | 0.195584 | 0.068608 | 0.068608 |
+| 2^13 | 0.011636 | 0.012121 | 0.104416 | 0.104448 | 0.227264 | 0.22528 | 0.114688 | 0.114656 |
+| 2^14 | 0.024727 | 0.024242 | 0.134176 | 0.123904 | 0.32256 | 0.31744 | 0.19248 | 0.191456 |
+| 2^15 | 0.049455 | 0.055273 | 0.16688 | 0.166912 | 0.40336 | 0.423904 | 0.224224 | 0.22016 |
+| 2^16 | 0.090182 | 0.100364 | 0.42864 | 0.427808 | 1.12845 | 1.12813 | 0.39424 | 0.343072 |
+| 2^17 | 0.188121 | 0.27103 | 0.390816 | 0.390144 | 1.0687 | 1.05984 | 0.55088 | 0.451584 |
+| 2^18 | 1.07976 | 0.442182 | 1.35885 | 1.34733 | 3.74992 | 3.77731 | 0.605184 | 0.628736 |
+| 2^19 | 2.2177 | 0.796606 | 2.67267 | 2.64694 | 7.54173 | 7.57734 | 0.754688 | 0.709632 |
+| 2^20 | 5.36679 | 1.74352 | 3.00848 | 2.95424 | 15.3444 | 15.3364 | 1.1745 | 1.1223 |
+
+The large discrepancy in my work-efficient scan comes from it being highly inefficient (see the shared-memory comparison below). It even performed worse than my CPU scan, likely because the active threads are not grouped together, so they waste the SM's utilization. After class I realized the implementation could be much faster: I was launching N threads for N numbers (in total) at every level, which was later reduced to N/2. The values shown above are for this "naive" version of the work-efficient algorithm.
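+
+For reference, below is a minimal sketch of the thread-count fix described above. It is an illustration written for this README, not the kernel committed in this repo; `kernUpSweepCompacted`, `dev_data`, and the host loop are hypothetical names, while `blockSize` and `ilog2ceil` are assumed to behave as they do elsewhere in this project. At depth `d` it launches only `n / 2^(d+1)` contiguous threads instead of N, so whole warps stay busy rather than idling on a modulo check.
+
+```
+__global__ void kernUpSweepCompacted(int numNodes, int stride, int* data)
+{
+    // One thread per active node at this depth; threads are contiguous,
+    // so warps retire early instead of idling on (index % stride) checks.
+    const int tid = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (tid >= numNodes) {
+        return;
+    }
+
+    const int right = (tid + 1) * stride - 1;  // x[k + 2^(d+1) - 1]
+    const int left  = right - (stride >> 1);   // x[k + 2^d - 1]
+    data[right] += data[left];
+}
+
+// Host-side loop (n padded to a power of two):
+// for (int d = 0; d < ilog2ceil(n); ++d) {
+//     const int stride   = 1 << (d + 1);
+//     const int numNodes = n >> (d + 1);      // N/2, N/4, ... threads per level
+//     dim3 blocks((numNodes + blockSize - 1) / blockSize);
+//     kernUpSweepCompacted<<<blocks, blockSize>>>(numNodes, stride, dev_data);
+// }
+```
+
+The DownSweep kernel can be remapped with the same compacted indexing.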
+
+### Scan with Shared Memory
+
+I implemented a partial shared-memory version of the work-efficient scan (UpSweep only). It gave a significant boost because only the required number of threads is launched for the entries in the array. However, this implementation is a little different from, and I believe less efficient than, the one shown in class, because its threads are not clumped together. Grouping threads is better because it allows early warp termination.
+
+![](img/shared.PNG)
+
+### Compaction Performance
+
+The work-efficient compaction suffers from the same performance issues as the work-efficient scan.
+
+![](img/compactPO2.PNG)
+
+![](img/compactNPO2.PNG)
+
+| Elements in Array | CPU Compact without scan (Power of 2) | CPU Compact without scan (Non Power of 2) | CPU Compact with Scan | Work Efficient Compact (Power of 2) | Work Efficient Compact (Non Power of 2) |
+| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
+| 2^8 | 0.000969 | 0.000969 | 0.00097 | 0.06144 | 0.072704 |
+| 2^9 | 0.020849 | 0.001455 | 0.002909 | 0.291872 | 0.274432 |
+| 2^10 | 0.003394 | 0.002424 | 0.004849 | 0.296928 | 0.296928 |
+| 2^11 | 0.007758 | 0.005819 | 0.010182 | 0.330784 | 0.338912 |
+| 2^12 | 0.011636 | 0.016 | 0.050909 | 0.234464 | 0.238592 |
+| 2^13 | 0.054304 | 0.020848 | 0.055272 | 0.442304 | 0.413696 |
+| 2^14 | 0.077091 | 0.067878 | 0.092121 | 0.557024 | 0.5816 |
+| 2^15 | 0.123152 | 0.112 | 0.157576 | 0.785088 | 0.767744 |
+| 2^16 | 0.246788 | 0.269576 | 0.343758 | 1.21734 | 1.23866 |
+| 2^17 | 0.390788 | 0.402424 | 0.555637 | 2.16781 | 2.13808 |
+| 2^18 | 1.06182 | 0.744727 | 1.53648 | 4.00554 | 4.0591 |
+| 2^19 | 1.488 | 1.48994 | 2.63418 | 7.96368 | 7.97901 |
+| 2^20 | 2.69624 | 2.80679 | 4.72727 | 16.1515 | 16.1321 |
+
+
+### Output Log
+
+```
+****************
+** SCAN TESTS **
+****************
+ [ 16 29 39 25 25 14 11 20 0 34 28 11 33 ... 4 0 ]
+==== cpu scan, power-of-two ====
+ elapsed time: 7.32024ms (std::chrono Measured)
+==== cpu scan, non-power-of-two ====
+ elapsed time: 2.39758ms (std::chrono Measured)
+ passed
+==== naive scan, power-of-two ====
+ elapsed time: 1.45482ms (CUDA Measured)
+ passed
+==== naive scan, non-power-of-two ====
+ elapsed time: 1.45306ms (CUDA Measured)
+ passed
+==== work-efficient scan, power-of-two ====
+ elapsed time: 1.80938ms (CUDA Measured)
+ passed
+==== work-efficient scan, non-power-of-two ====
+ elapsed time: 1.8143ms (CUDA Measured)
+ passed
+==== thrust scan, power-of-two ====
+ elapsed time: 4.02115ms (CUDA Measured)
+ passed
+==== thrust scan, power-of-two ====
+ elapsed time: 0.503808ms (CUDA Measured)
+ passed
+==== thrust scan, non-power-of-two ====
+ elapsed time: 0.471008ms (CUDA Measured)
+ passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+ [ 0 1 3 3 3 0 1 2 0 0 0 1 1 ... 
2 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 4.38158ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 3.77454ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + elapsed time: 12.5959ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 2.10064ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 2.10413ms (CUDA Measured) + passed +Press any key to continue . . . +``` \ No newline at end of file diff --git a/img/compactNPO2.PNG b/img/compactNPO2.PNG new file mode 100644 index 0000000..b40e02e Binary files /dev/null and b/img/compactNPO2.PNG differ diff --git a/img/compactPO2.PNG b/img/compactPO2.PNG new file mode 100644 index 0000000..afb113f Binary files /dev/null and b/img/compactPO2.PNG differ diff --git a/img/intro.PNG b/img/intro.PNG new file mode 100644 index 0000000..0b7daec Binary files /dev/null and b/img/intro.PNG differ diff --git a/img/scanNPO2.PNG b/img/scanNPO2.PNG new file mode 100644 index 0000000..7dba75b Binary files /dev/null and b/img/scanNPO2.PNG differ diff --git a/img/scanPO2.PNG b/img/scanPO2.PNG new file mode 100644 index 0000000..f188838 Binary files /dev/null and b/img/scanPO2.PNG differ diff --git a/img/shared.PNG b/img/shared.PNG new file mode 100644 index 0000000..db22ca3 Binary files /dev/null and b/img/shared.PNG differ diff --git a/src/main.cpp b/src/main.cpp index 1850161..d2c8fed 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -13,142 +13,157 @@ #include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE = 1 << 20; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two -int *a = new int[SIZE]; -int *b = new int[SIZE]; -int *c = new int[SIZE]; - -int main(int argc, char* argv[]) { - // Scan tests - - printf("\n"); - printf("****************\n"); - printf("** SCAN TESTS **\n"); - printf("****************\n"); - - genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case - a[SIZE - 1] = 0; - printArray(SIZE, a, true); - - // initialize b using StreamCompaction::CPU::scan you implement - // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct. - // At first all cases passed because b && c are all zeroes. 
- zeroArray(SIZE, b); - printDesc("cpu scan, power-of-two"); - StreamCompaction::CPU::scan(SIZE, b, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(SIZE, b, true); - - zeroArray(SIZE, c); - printDesc("cpu scan, non-power-of-two"); - StreamCompaction::CPU::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(NPOT, b, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("naive scan, power-of-two"); - StreamCompaction::Naive::scan(SIZE, c, a); - printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan - onesArray(SIZE, c); - printDesc("1s array for finding bugs"); - StreamCompaction::Naive::scan(SIZE, c, a); - printArray(SIZE, c, true); */ - - zeroArray(SIZE, c); - printDesc("naive scan, non-power-of-two"); - StreamCompaction::Naive::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient scan, power-of-two"); - StreamCompaction::Efficient::scan(SIZE, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient scan, non-power-of-two"); - StreamCompaction::Efficient::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("thrust scan, power-of-two"); - StreamCompaction::Thrust::scan(SIZE, c, a); - printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - zeroArray(SIZE, c); - printDesc("thrust scan, non-power-of-two"); - StreamCompaction::Thrust::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); - printCmpResult(NPOT, b, c); - - printf("\n"); - printf("*****************************\n"); - printf("** STREAM COMPACTION TESTS **\n"); - printf("*****************************\n"); - - // Compaction tests - - genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case - a[SIZE - 1] = 0; - printArray(SIZE, a, true); - - int count, expectedCount, expectedNPOT; - - // initialize b using StreamCompaction::CPU::compactWithoutScan you implement - // We use b for further comparison. Make sure your StreamCompaction::CPU::compactWithoutScan is correct. 
- zeroArray(SIZE, b); - printDesc("cpu compact without scan, power-of-two"); - count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - expectedCount = count; - printArray(count, b, true); - printCmpLenResult(count, expectedCount, b, b); - - zeroArray(SIZE, c); - printDesc("cpu compact without scan, non-power-of-two"); - count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - expectedNPOT = count; - printArray(count, c, true); - printCmpLenResult(count, expectedNPOT, b, c); - - zeroArray(SIZE, c); - printDesc("cpu compact with scan"); - count = StreamCompaction::CPU::compactWithScan(SIZE, c, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(count, c, true); - printCmpLenResult(count, expectedCount, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient compact, power-of-two"); - count = StreamCompaction::Efficient::compact(SIZE, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); - printCmpLenResult(count, expectedCount, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient compact, non-power-of-two"); - count = StreamCompaction::Efficient::compact(NPOT, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); - printCmpLenResult(count, expectedNPOT, b, c); - - system("pause"); // stop Win32 console from closing on exit - delete[] a; - delete[] b; - delete[] c; +int* a = new int[SIZE]; +int* b = new int[SIZE]; +int* c = new int[SIZE]; + +int main(int argc, char* argv[]) +{ + // Scan tests + + printf("\n"); + printf("****************\n"); + printf("** SCAN TESTS **\n"); + printf("****************\n"); + + genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case + a[SIZE - 1] = 0; + printArray(SIZE, a, true); + + // initialize b using StreamCompaction::CPU::scan you implement + // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct. + // At first all cases passed because b && c are all zeroes. 
+ zeroArray(SIZE, b); + printDesc("cpu scan, power-of-two"); + StreamCompaction::CPU::scan(SIZE, b, a); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); + // printArray(SIZE, b, true); + + zeroArray(SIZE, c); + printDesc("cpu scan, non-power-of-two"); + StreamCompaction::CPU::scan(NPOT, c, a); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); + //printArray(NPOT, b, true); + printCmpResult(NPOT, b, c); + + zeroArray(SIZE, c); + printDesc("naive scan, power-of-two"); + StreamCompaction::Naive::scan(SIZE, c, a); + printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + // printArray(SIZE, c, true); + printCmpResult(SIZE, b, c); + + /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan + onesArray(SIZE, c); + printDesc("1s array for finding bugs"); + StreamCompaction::Naive::scan(SIZE, c, a); + printArray(SIZE, c, true); */ + + zeroArray(SIZE, c); + printDesc("naive scan, non-power-of-two"); + StreamCompaction::Naive::scan(NPOT, c, a); + printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + // printArray(SIZE, c, true); + printCmpResult(NPOT, b, c); + + zeroArray(SIZE, c); + printDesc("work-efficient scan, power-of-two"); + StreamCompaction::Efficient::scan(SIZE, c, a); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + // printArray(SIZE, c, true); + printCmpResult(SIZE, b, c); + + zeroArray(SIZE, c); + printDesc("work-efficient scan, non-power-of-two"); + StreamCompaction::Efficient::scan(NPOT, c, a); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + // printArray(NPOT, c, true); + printCmpResult(NPOT, b, c); + + // zeroArray(SIZE, c); + // printDesc("work-efficient scan optimized v1, power-of-two"); + // StreamCompaction::Efficient::scanOptimized_v1(SIZE, c, a); + // printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + // printArray(SIZE, c, true); + // printCmpResult(SIZE, b, c); + + zeroArray(SIZE, c); + printDesc("thrust scan, power-of-two"); + StreamCompaction::Thrust::scan(SIZE, c, a); + printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + // printArray(SIZE, c, true); + printCmpResult(SIZE, b, c); + + zeroArray(SIZE, c); + printDesc("thrust scan, power-of-two"); + StreamCompaction::Thrust::scan(SIZE, c, a); + printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + // printArray(SIZE, c, true); + printCmpResult(SIZE, b, c); + + zeroArray(SIZE, c); + printDesc("thrust scan, non-power-of-two"); + StreamCompaction::Thrust::scan(NPOT, c, a); + printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + // printArray(NPOT, c, true); + printCmpResult(NPOT, b, c); + + printf("\n"); + printf("*****************************\n"); + printf("** STREAM COMPACTION TESTS **\n"); + printf("*****************************\n"); + + // Compaction tests + + genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case + a[SIZE - 1] = 0; + printArray(SIZE, a, true); + + int count, expectedCount, expectedNPOT; + + // initialize b using 
StreamCompaction::CPU::compactWithoutScan you implement + // We use b for further comparison. Make sure your StreamCompaction::CPU::compactWithoutScan is correct. + zeroArray(SIZE, b); + printDesc("cpu compact without scan, power-of-two"); + count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); + expectedCount = count; + // printArray(count, b, true); + printCmpLenResult(count, expectedCount, b, b); + + zeroArray(SIZE, c); + printDesc("cpu compact without scan, non-power-of-two"); + count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); + expectedNPOT = count; + // printArray(count, c, true); + printCmpLenResult(count, expectedNPOT, b, c); + + zeroArray(SIZE, c); + printDesc("cpu compact with scan"); + count = StreamCompaction::CPU::compactWithScan(SIZE, c, a); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); + // printArray(count, c, true); + printCmpLenResult(count, expectedCount, b, c); + + zeroArray(SIZE, c); + printDesc("work-efficient compact, power-of-two"); + count = StreamCompaction::Efficient::compact(SIZE, c, a); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + //printArray(count, c, true); + printCmpLenResult(count, expectedCount, b, c); + + zeroArray(SIZE, c); + printDesc("work-efficient compact, non-power-of-two"); + count = StreamCompaction::Efficient::compact(NPOT, c, a); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + //printArray(count, c, true); + printCmpLenResult(count, expectedNPOT, b, c); + + system("pause"); // stop Win32 console from closing on exit + delete[] a; + delete[] b; + delete[] c; } diff --git a/src/testing_helpers.hpp b/src/testing_helpers.hpp index 46337ab..87028e5 100644 --- a/src/testing_helpers.hpp +++ b/src/testing_helpers.hpp @@ -50,7 +50,8 @@ void onesArray(int n, int *a) { } void genArray(int n, int *a, int maxval) { - srand(time(nullptr)); + // srand(time(nullptr)); + srand(1231412); for (int i = 0; i < n; i++) { a[i] = rand() % maxval; diff --git a/stats.txt b/stats.txt new file mode 100644 index 0000000..d261f0c --- /dev/null +++ b/stats.txt @@ -0,0 +1,683 @@ +2^8 +**************** +** SCAN TESTS ** +**************** + [ 16 29 39 25 25 14 11 20 0 34 28 11 33 ... 
42 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0.000485ms (std::chrono Measured) +==== cpu scan, non-power-of-two ==== + elapsed time: 0.000485ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + elapsed time: 0.027648ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 0.02672ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 0.062464ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 0.053248ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 2.9655ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 0.012288ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.012288ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 0 1 3 3 3 0 1 2 0 0 0 1 1 ... 2 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 0.000969ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.000969ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + elapsed time: 0.00097ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 0.06144ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 0.072704ms (CUDA Measured) + passed + + +2^9 +**************** +** SCAN TESTS ** +**************** + [ 16 29 39 25 25 14 11 20 0 34 28 11 33 ... 42 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0.00097ms (std::chrono Measured) +==== cpu scan, non-power-of-two ==== + elapsed time: 0.001455ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + elapsed time: 0.07168ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 0.068608ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 0.16384ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 0.159744ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 5.3432ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 0.037888ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.065536ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 0 1 3 3 3 0 1 2 0 0 0 1 1 ... 2 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 0.020849ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.001455ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + elapsed time: 0.002909ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 0.291872ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 0.274432ms (CUDA Measured) + passed +Press any key to continue . . . + +2^10 +**************** +** SCAN TESTS ** +**************** + [ 16 29 39 25 25 14 11 20 0 34 28 11 33 ... 
33 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0.001454ms (std::chrono Measured) +==== cpu scan, non-power-of-two ==== + elapsed time: 0.001455ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + elapsed time: 0.121856ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 0.120832ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 0.262144ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 0.260096ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 4.86912ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 0.066592ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.065536ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 0 1 3 3 3 0 1 2 0 0 0 1 1 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 0.003394ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.002424ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + elapsed time: 0.004849ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 0.296928ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 0.296928ms (CUDA Measured) + passed +Press any key to continue . . . + +2^11 +**************** +** SCAN TESTS ** +**************** + [ 16 29 39 25 25 14 11 20 0 34 28 11 33 ... 32 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0.002909ms (std::chrono Measured) +==== cpu scan, non-power-of-two ==== + elapsed time: 0.00291ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + elapsed time: 0.138208ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 0.13824ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 0.292864ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 0.29696ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 4.81174ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 0.099328ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.098304ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 0 1 3 3 3 0 1 2 0 0 0 1 1 ... 0 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 0.007758ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.005819ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + elapsed time: 0.010182ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 0.330784ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 0.338912ms (CUDA Measured) + passed +Press any key to continue . . . + +2^12 +**************** +** SCAN TESTS ** +**************** + [ 16 29 39 25 25 14 11 20 0 34 28 11 33 ... 
34 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0.006303ms (std::chrono Measured) +==== cpu scan, non-power-of-two ==== + elapsed time: 0.007272ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + elapsed time: 0.094208ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 0.093184ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 0.195584ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 0.195584ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 4.67661ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 0.068608ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.068608ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 0 1 3 3 3 0 1 2 0 0 0 1 1 ... 2 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 0.011636ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.016ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + elapsed time: 0.050909ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 0.234464ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 0.238592ms (CUDA Measured) + passed +Press any key to continue . . . + +2^13 +**************** +** SCAN TESTS ** +**************** + [ 16 29 39 25 25 14 11 20 0 34 28 11 33 ... 37 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0.011636ms (std::chrono Measured) +==== cpu scan, non-power-of-two ==== + elapsed time: 0.012121ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + elapsed time: 0.104416ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 0.104448ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 0.227264ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 0.22528ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 5.12ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 0.114688ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.114656ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 0 1 3 3 3 0 1 2 0 0 0 1 1 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 0.054304ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.020848ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + elapsed time: 0.055272ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 0.442304ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 0.413696ms (CUDA Measured) + passed + +2^14 +**************** +** SCAN TESTS ** +**************** + [ 16 29 39 25 25 14 11 20 0 34 28 11 33 ... 
39 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0.024727ms (std::chrono Measured) +==== cpu scan, non-power-of-two ==== + elapsed time: 0.024242ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + elapsed time: 0.134176ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 0.123904ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 0.32256ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 0.31744ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 5.11894ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 0.19248ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.191456ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 0 1 3 3 3 0 1 2 0 0 0 1 1 ... 1 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 0.077091ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.067878ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + elapsed time: 0.092121ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 0.557024ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 0.5816ms (CUDA Measured) + passed +Press any key to continue . . . + +2^15 +**************** +** SCAN TESTS ** +**************** + [ 16 29 39 25 25 14 11 20 0 34 28 11 33 ... 30 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0.049455ms (std::chrono Measured) +==== cpu scan, non-power-of-two ==== + elapsed time: 0.055273ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + elapsed time: 0.16688ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 0.166912ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 0.40336ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 0.423904ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 4.97254ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 0.224224ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.22016ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 0 1 3 3 3 0 1 2 0 0 0 1 1 ... 0 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 0.123152ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.112ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + elapsed time: 0.157576ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 0.785088ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 0.767744ms (CUDA Measured) + passed +Press any key to continue . . . + +2^16 +**************** +** SCAN TESTS ** +**************** + [ 16 29 39 25 25 14 11 20 0 34 28 11 33 ... 
15 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0.090182ms (std::chrono Measured) +==== cpu scan, non-power-of-two ==== + elapsed time: 0.100364ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + elapsed time: 0.42864ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 0.427808ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 1.12845ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 1.12813ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 5.9095ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 0.39424ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.343072ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 0 1 3 3 3 0 1 2 0 0 0 1 1 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 0.246788ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.269576ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + elapsed time: 0.343758ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 1.21734ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 1.23866ms (CUDA Measured) + passed + +2^17 + +**************** +** SCAN TESTS ** +**************** + [ 16 29 39 25 25 14 11 20 0 34 28 11 33 ... 48 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0.188121ms (std::chrono Measured) +==== cpu scan, non-power-of-two ==== + elapsed time: 0.27103ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + elapsed time: 0.390816ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 0.390144ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 1.0687ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 1.05984ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 5.56237ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 0.55088ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.451584ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 0 1 3 3 3 0 1 2 0 0 0 1 1 ... 0 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 0.390788ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.402424ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + elapsed time: 0.555637ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 2.16781ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 2.13808ms (CUDA Measured) + passed + +2^18 +**************** +** SCAN TESTS ** +**************** + [ 16 29 39 25 25 14 11 20 0 34 28 11 33 ... 
32 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 1.07976ms (std::chrono Measured) +==== cpu scan, non-power-of-two ==== + elapsed time: 0.442182ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + elapsed time: 1.35885ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 1.34733ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 3.74992ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 3.77731ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 5.58698ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 0.605184ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.628736ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 0 1 3 3 3 0 1 2 0 0 0 1 1 ... 2 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 1.06182ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.744727ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + elapsed time: 1.53648ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 4.00554ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 4.0591ms (CUDA Measured) + passed + +2^19 +**************** +** SCAN TESTS ** +**************** + [ 16 29 39 25 25 14 11 20 0 34 28 11 33 ... 18 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 2.2177ms (std::chrono Measured) +==== cpu scan, non-power-of-two ==== + elapsed time: 0.796606ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + elapsed time: 2.67267ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 2.64694ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 7.54173ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 7.57734ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 6.18493ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 0.754688ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.709632ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 0 1 3 3 3 0 1 2 0 0 0 1 1 ... 2 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 1.488ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 1.48994ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + elapsed time: 2.63418ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 7.96368ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 7.97901ms (CUDA Measured) + passed + +2^20 +**************** +** SCAN TESTS ** +**************** + [ 16 29 39 25 25 14 11 20 0 34 28 11 33 ... 
4 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 5.36679ms (std::chrono Measured) +==== cpu scan, non-power-of-two ==== + elapsed time: 1.74352ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + elapsed time: 3.00848ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 2.95424ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 15.3444ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 15.3364ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 6.64576ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 1.1745ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 1.1223ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 0 1 3 3 3 0 1 2 0 0 0 1 1 ... 2 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 2.69624ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 2.80679ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + elapsed time: 4.72727ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 16.1515ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 16.1321ms (CUDA Measured) + passed \ No newline at end of file diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt index cdbef77..c8709e7 100644 --- a/stream_compaction/CMakeLists.txt +++ b/stream_compaction/CMakeLists.txt @@ -13,5 +13,5 @@ set(SOURCE_FILES cuda_add_library(stream_compaction ${SOURCE_FILES} - OPTIONS -arch=sm_20 + OPTIONS -arch=sm_50 ) diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 8fc0211..e189baa 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -22,17 +22,34 @@ namespace StreamCompaction { * Maps an array to an array of 0s and 1s for stream compaction. Elements * which map to 0 will be removed, and elements which map to 1 will be kept. */ - __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO + __global__ void kernMapToBoolean(int N, int *bools, const int *idata) { + + const int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + bools[index] = int(idata[index] != 0); } /** * Performs scatter on an array. That is, for each element in idata, * if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]]. */ - __global__ void kernScatter(int n, int *odata, + __global__ void kernScatter(int N, int *odata, const int *idata, const int *bools, const int *indices) { - // TODO + + const int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + const int inputData = idata[index]; + if (bools[index] == 1) + { + const int outIdx = indices[index]; + odata[outIdx] = inputData; + } } } diff --git a/stream_compaction/common.h b/stream_compaction/common.h index 99a1b04..8839022 100644 --- a/stream_compaction/common.h +++ b/stream_compaction/common.h @@ -16,117 +16,132 @@ /** * Check for CUDA errors; print and exit if there was a problem. 
*/ -void checkCUDAErrorFn(const char *msg, const char *file = NULL, int line = -1); - -inline int ilog2(int x) { - int lg = 0; - while (x >>= 1) { - ++lg; - } - return lg; +void checkCUDAErrorFn(const char* msg, const char* file = NULL, int line = -1); + +inline int ilog2(int x) +{ + int lg = 0; + while (x >>= 1) + { + ++lg; + } + return lg; } -inline int ilog2ceil(int x) { - return x == 1 ? 0 : ilog2(x - 1) + 1; +inline int ilog2ceil(int x) +{ + return x == 1 ? 0 : ilog2(x - 1) + 1; } -namespace StreamCompaction { - namespace Common { - __global__ void kernMapToBoolean(int n, int *bools, const int *idata); - - __global__ void kernScatter(int n, int *odata, - const int *idata, const int *bools, const int *indices); - - /** - * This class is used for timing the performance - * Uncopyable and unmovable - * - * Adapted from WindyDarian(https://github.com/WindyDarian) - */ - class PerformanceTimer - { - public: - PerformanceTimer() - { - cudaEventCreate(&event_start); - cudaEventCreate(&event_end); - } - - ~PerformanceTimer() - { - cudaEventDestroy(event_start); - cudaEventDestroy(event_end); - } - - void startCpuTimer() - { - if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); } - cpu_timer_started = true; - - time_start_cpu = std::chrono::high_resolution_clock::now(); - } - - void endCpuTimer() - { - time_end_cpu = std::chrono::high_resolution_clock::now(); - - if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); } - - std::chrono::duration duro = time_end_cpu - time_start_cpu; - prev_elapsed_time_cpu_milliseconds = - static_cast(duro.count()); - - cpu_timer_started = false; - } - - void startGpuTimer() - { - if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); } - gpu_timer_started = true; - - cudaEventRecord(event_start); - } - - void endGpuTimer() - { - cudaEventRecord(event_end); - cudaEventSynchronize(event_end); - - if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); } - - cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end); - gpu_timer_started = false; - } - - float getCpuElapsedTimeForPreviousOperation() //noexcept //(damn I need VS 2015 - { - return prev_elapsed_time_cpu_milliseconds; - } - - float getGpuElapsedTimeForPreviousOperation() //noexcept - { - return prev_elapsed_time_gpu_milliseconds; - } - - // remove copy and move functions - PerformanceTimer(const PerformanceTimer&) = delete; - PerformanceTimer(PerformanceTimer&&) = delete; - PerformanceTimer& operator=(const PerformanceTimer&) = delete; - PerformanceTimer& operator=(PerformanceTimer&&) = delete; - - private: - cudaEvent_t event_start = nullptr; - cudaEvent_t event_end = nullptr; - - using time_point_t = std::chrono::high_resolution_clock::time_point; - time_point_t time_start_cpu; - time_point_t time_end_cpu; - - bool cpu_timer_started = false; - bool gpu_timer_started = false; - - float prev_elapsed_time_cpu_milliseconds = 0.f; - float prev_elapsed_time_gpu_milliseconds = 0.f; - }; - } +namespace StreamCompaction +{ + namespace Common + { + __global__ void kernMapToBoolean(int n, int* bools, const int* idata); + + __global__ void kernScatter(int n, int* odata, + const int* idata, const int* bools, const int* indices); + + /** + * This class is used for timing the performance + * Uncopyable and unmovable + * + * Adapted from WindyDarian(https://github.com/WindyDarian) + */ + class PerformanceTimer + { + public: + PerformanceTimer() + { + cudaEventCreate(&event_start); + 
cudaEventCreate(&event_end); + } + + ~PerformanceTimer() + { + cudaEventDestroy(event_start); + cudaEventDestroy(event_end); + } + + void startCpuTimer() + { + if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); } + cpu_timer_started = true; + + time_start_cpu = std::chrono::high_resolution_clock::now(); + } + + void endCpuTimer() + { + time_end_cpu = std::chrono::high_resolution_clock::now(); + + if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); } + + std::chrono::duration duro = time_end_cpu - time_start_cpu; + prev_elapsed_time_cpu_milliseconds = + static_cast(duro.count()); + + cpu_timer_started = false; + + accumulated_cpu_time_milliseconds += prev_elapsed_time_cpu_milliseconds; + } + + void startGpuTimer() + { + if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); } + gpu_timer_started = true; + + cudaEventRecord(event_start); + } + + void endGpuTimer() + { + cudaEventRecord(event_end); + cudaEventSynchronize(event_end); + + if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); } + + cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end); + gpu_timer_started = false; + + accumulated_gpu_time_milliseconds += prev_elapsed_time_gpu_milliseconds; + } + + float getCpuElapsedTimeForPreviousOperation() //noexcept + { + const float returnTime = accumulated_cpu_time_milliseconds; + accumulated_cpu_time_milliseconds = 0; + return returnTime; + } + + float getGpuElapsedTimeForPreviousOperation() //noexcept + { + const float returnTime = accumulated_gpu_time_milliseconds; + accumulated_gpu_time_milliseconds = 0; + return returnTime; + } + + // remove copy and move functions + PerformanceTimer(const PerformanceTimer&) = delete; + PerformanceTimer(PerformanceTimer&&) = delete; + PerformanceTimer& operator=(const PerformanceTimer&) = delete; + PerformanceTimer& operator=(PerformanceTimer&&) = delete; + + private: + cudaEvent_t event_start = nullptr; + cudaEvent_t event_end = nullptr; + + using time_point_t = std::chrono::high_resolution_clock::time_point; + time_point_t time_start_cpu; + time_point_t time_end_cpu; + + bool cpu_timer_started = false; + bool gpu_timer_started = false; + + float prev_elapsed_time_cpu_milliseconds = 0.f; + float prev_elapsed_time_gpu_milliseconds = 0.f; + float accumulated_cpu_time_milliseconds = 0.f; + float accumulated_gpu_time_milliseconds = 0.f; + }; + } } diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 05ce667..b430dad 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -1,15 +1,16 @@ #include #include "cpu.h" -#include "common.h" +#include "common.h" +#include namespace StreamCompaction { namespace CPU { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; } /** @@ -19,7 +20,12 @@ namespace StreamCompaction { */ void scan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + odata[0] = 0; + for (int idx = 1; idx < n; ++idx) + { + odata[idx] = odata[idx - 1] + idata[idx - 1]; + } + timer().endCpuTimer(); } @@ -30,9 +36,20 @@ namespace StreamCompaction { */ int compactWithoutScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + + int outIdx = 0; + for (int idx = 0; idx < n; ++idx) + { + const int inputData = idata[idx]; + if (inputData 
!= 0) + { + odata[outIdx] = inputData; + ++outIdx; + } + } + timer().endCpuTimer(); - return -1; + return outIdx; } /** @@ -41,10 +58,36 @@ namespace StreamCompaction { * @returns the number of elements remaining after compaction. */ int compactWithScan(int n, int *odata, const int *idata) { + const std::unique_ptr conditionArray = std::make_unique(n); + const std::unique_ptr scanArray = std::make_unique(n); + timer().startCpuTimer(); - // TODO + + for (int idx = 0; idx < n; ++idx) + { + conditionArray[idx] = idata[idx] != 0 ? 1 : 0; + } + + scanArray[0] = 0; + for (int idx = 1; idx < n; ++idx) + { + scanArray[idx] = scanArray[idx - 1] + conditionArray[idx - 1]; + } + + int outIdx = 0; + for (int idx = 0; idx < n; ++idx) + { + const int inputData = idata[idx]; + if (conditionArray[idx] == 1) + { + outIdx = scanArray[idx]; + odata[outIdx] = inputData; + } + } + timer().endCpuTimer(); - return -1; + + return outIdx + 1; } } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 36c5ef2..a06a881 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -2,39 +2,359 @@ #include #include "common.h" #include "efficient.h" +#include "naive.h" +#include -namespace StreamCompaction { - namespace Efficient { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; +#define blockSize 512 + +namespace StreamCompaction +{ + namespace Efficient + { + int* device_idata; + int* device_bools; + int* device_scannedBools; + int* device_odata; + int numObjects; + + void printArray(int n, int *a, bool abridged = false) { + printf(" [ "); + for (int i = 0; i < n; i++) { + if (abridged && i + 2 == 15 && n > 16) { + i = n - 2; + printf("... "); } + printf("%3d ", a[i]); + } + printf("]\n"); + } + + using StreamCompaction::Common::PerformanceTimer; + + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; + } + + __global__ void kernel_UpSweep(int N, int powDP1, int* idata) + { + const int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + if (index % powDP1 != 0) + { + return; + } + + // x[k + 2d+1 – 1] += x[k + 2d – 1]; + idata[index + powDP1 - 1] += idata[index + (powDP1 / 2) - 1]; + } - /** - * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
- */ - void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); + __global__ void kernel_UpSweepOptimized_v1(int N, int numThreads, int powD, int* idata) + { + extern __shared__ int temp[]; + + const int threadID = threadIdx.x; + const int threadID2X = 2 * threadIdx.x; + const int staticIdx = threadID2X + (blockIdx.x * N) + (powD - 1); + + int offset = 1; + + temp[threadID2X] = idata[staticIdx]; + temp[threadID2X + 1] = idata[staticIdx + powD]; + + // build sum in place up the tree + for (int d = numThreads; d > 0; d >>= 1) + { + __syncthreads(); + if (threadID < d) + { + const int ai = offset * (threadID2X + 1) - 1; + const int bi = offset * (threadID2X + 2) - 1; + temp[bi] += temp[ai]; } + + offset *= 2; + } + + __syncthreads(); + + idata[staticIdx] = temp[threadID2X]; + idata[staticIdx + powD] = temp[threadID2X + 1]; + } + + __global__ void kernel_DownSweepOptimized_v1(int N, int numThreads, int powD, int* idata, int* odata) + { + extern __shared__ int temp[]; + + const int threadID = threadIdx.x; + const int threadID2X = 2 * threadIdx.x; + const int staticIdx = threadID2X + (blockIdx.x * N) + (powD - 1); - /** - * Performs stream compaction on idata, storing the result into odata. - * All zeroes are discarded. - * - * @param n The number of elements in idata. - * @param odata The array into which to store elements. - * @param idata The array of elements to compact. - * @returns The number of elements remaining after compaction. - */ - int compact(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); - return -1; + int offset = N / 2; + + temp[threadID2X] = idata[staticIdx]; + temp[threadID2X + 1] = idata[staticIdx + powD]; + + // traverse down tree & build scan + for (int d = 1; d <= numThreads; d *= 2) + { + offset >>= 1; + + __syncthreads(); + + if (threadID < d) + { + const int ai = (threadID2X + 1) - 1; + const int bi = (threadID2X + 2) - 1; + + const int t = temp[ai]; + temp[ai] = temp[bi]; + temp[bi] += t; } + } + + __syncthreads(); + + idata[staticIdx] = temp[threadID2X]; + idata[staticIdx + powD] = temp[threadID2X + 1]; + } + + __global__ void kernel_DownSweep(int N, int powDP1, int* idata) + { + const int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + if (index % powDP1 != 0) + { + return; + } + + // Calculate some indices + const int leftChildIdx = index + (powDP1 / 2) - 1; + const int rightChildIdx = index + powDP1 - 1; + + // Save the left child + const int leftChild = idata[leftChildIdx]; + + // Set Left Child to Current Node's Value + idata[leftChildIdx] = idata[rightChildIdx]; + + // Set Right Child to Left + Right + idata[rightChildIdx] += leftChild; + } + + /** + * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
+ */ + void scan(int n, int* odata, const int* idata) + { + numObjects = n; + const int logN = ilog2ceil(numObjects); + const int nearestPower2 = std::pow(2, logN); + + cudaMalloc((void**)&device_idata, nearestPower2 * sizeof(int)); + checkCUDAError("cudaMalloc device_idata failed!"); + + cudaMemcpy(device_idata, idata, sizeof(int) * numObjects, cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy failed!"); + + const int numBlocks = (numObjects + blockSize - 1) / blockSize; + dim3 fullBlocksPerGrid(numBlocks); + + int* loopInputBuffer = device_idata; + + // Up Sweep + timer().startGpuTimer(); + for (int d = 0; d < logN; ++d) + { + const int powDP1 = std::pow(2, d + 1); + kernel_UpSweep<<>>(numObjects, powDP1, loopInputBuffer); + } + timer().endGpuTimer(); + + // Set x[n-1] = 0 + // This seems really weird that we need to copy a 0 from host to the device. + // Might need to find a more efficient way. + const int lastValue = 0; + cudaMemcpy(&loopInputBuffer[nearestPower2 - 1], &lastValue, sizeof(int), cudaMemcpyHostToDevice); + + // Down Sweep + timer().startGpuTimer(); + for (int d = logN - 1; d >= 0; --d) + { + const int powDP1 = std::pow(2, d + 1); + kernel_DownSweep<<>>(numObjects, powDP1, loopInputBuffer); + } + timer().endGpuTimer(); + + cudaMemcpy(odata, loopInputBuffer, sizeof(int) * (numObjects), cudaMemcpyDeviceToHost); + + cudaFree(device_idata); + } + + void scanOptimized_v1(int n, int* odata, const int* idata) + { + numObjects = n; + const int logN = ilog2ceil(numObjects); + const int nearestPower2 = std::pow(2, logN); + + cudaMalloc((void**)&device_idata, nearestPower2 * sizeof(int)); + checkCUDAError("cudaMalloc device_idata failed!"); + + cudaMalloc((void**)&device_odata, nearestPower2 * sizeof(int)); + checkCUDAError("cudaMalloc device_idata failed!"); + + cudaMemcpy(device_idata, idata, sizeof(int) * numObjects, cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy failed!"); + + const int numBlocks = (numObjects + blockSize - 1) / blockSize; + dim3 fullBlocksPerGrid(numBlocks); + + int* loopInputBuffer = device_idata; + + const int numCount = nearestPower2; + + int upSweepBlockCount = (numCount + blockSize - 1) / blockSize; + const int downSweepBlockCount = (numCount + blockSize - 1) / blockSize; + + int depth = 0; + + // Up Sweep + timer().startGpuTimer(); + while(upSweepBlockCount > 0) + { + const int powD = std::pow(2, depth); + const int powD1 = std::pow(2, depth + 1); + + dim3 upSweepBlocks(upSweepBlockCount); + + const int numObjectsPerBlock = numCount / upSweepBlockCount; + + const int threadsPerBlock = numObjectsPerBlock / powD1; + + kernel_UpSweepOptimized_v1<<>>(numObjectsPerBlock, threadsPerBlock, powD, loopInputBuffer); + + upSweepBlockCount /= 2; + depth = ilog2ceil(numObjectsPerBlock); + } + timer().endGpuTimer(); + + // Set x[n-1] = 0 + // This seems really weird that we need to copy a 0 from host to the device. + // Might need to find a more efficient way. + const int lastValue = 0; + cudaMemcpy(&loopInputBuffer[nearestPower2 - 1], &lastValue, sizeof(int), cudaMemcpyHostToDevice); + + // Down Sweep + timer().startGpuTimer(); + for (int d = logN - 1; d >= 0; --d) + { + const int powDP1 = std::pow(2, d + 1); + kernel_DownSweep<<>>(numObjects, powDP1, loopInputBuffer); + } + timer().endGpuTimer(); + + cudaMemcpy(odata, device_odata, sizeof(int) * (numObjects), cudaMemcpyDeviceToHost); + + cudaFree(device_idata); + } + + /** + * Performs stream compaction on idata, storing the result into odata. + * All zeroes are discarded. 
+        /**
+         * Performs stream compaction on idata, storing the result into odata.
+         * All zeroes are discarded.
+         *
+         * @param n The number of elements in idata.
+         * @param odata The array into which to store elements.
+         * @param idata The array of elements to compact.
+         * @returns The number of elements remaining after compaction.
+         */
+        int compact(int n, int* odata, const int* idata)
+        {
+            numObjects = n;
+            const int logN = ilog2ceil(numObjects);
+            const int nearestPower2 = std::pow(2, logN);
+
+            cudaMalloc((void**)&device_idata, nearestPower2 * sizeof(int));
+            checkCUDAError("cudaMalloc device_idata failed!");
+
+            cudaMalloc((void**)&device_odata, nearestPower2 * sizeof(int));
+            checkCUDAError("cudaMalloc device_odata failed!");
+
+            cudaMalloc((void**)&device_bools, nearestPower2 * sizeof(int));
+            checkCUDAError("cudaMalloc device_bools failed!");
+
+            cudaMalloc((void**)&device_scannedBools, nearestPower2 * sizeof(int));
+            checkCUDAError("cudaMalloc device_scannedBools failed!");
+
+            // Zero the bool array so the padded tail stays zero through the scan
+            cudaMemset(device_bools, 0, nearestPower2 * sizeof(int));
+
+            cudaMemcpy(device_idata, idata, sizeof(int) * numObjects, cudaMemcpyHostToDevice);
+
+            const int numBlocks = (numObjects + blockSize - 1) / blockSize;
+            dim3 fullBlocksPerGrid(numBlocks);
+
+            // 1. Map the input to a bool array
+            timer().startGpuTimer();
+            Common::kernMapToBoolean<<<fullBlocksPerGrid, blockSize>>>(numObjects, device_bools, device_idata);
+            cudaMemcpy(device_scannedBools, device_bools, sizeof(int) * nearestPower2, cudaMemcpyDeviceToDevice);
+
+            // 2. Scan the bool array
+            int* loopInputBuffer = device_scannedBools;
+
+            // Up sweep
+            for (int d = 0; d < logN; ++d)
+            {
+                const int powDP1 = std::pow(2, d + 1);
+                kernel_UpSweep<<<fullBlocksPerGrid, blockSize>>>(numObjects, powDP1, loopInputBuffer);
+            }
+            timer().endGpuTimer();
+
+            // Set x[n-1] = 0
+            // This seems really weird that we need to copy a 0 from host to the device.
+            // Might need to find a more efficient way.
+            const int lastValue = 0;
+            cudaMemcpy(&loopInputBuffer[nearestPower2 - 1], &lastValue, sizeof(int), cudaMemcpyHostToDevice);
+
+            // Down sweep
+            timer().startGpuTimer();
+            for (int d = logN - 1; d >= 0; --d)
+            {
+                const int powDP1 = std::pow(2, d + 1);
+                kernel_DownSweep<<<fullBlocksPerGrid, blockSize>>>(numObjects, powDP1, loopInputBuffer);
+            }
+
+            // 3. Scatter the non-zero elements into odata
+            Common::kernScatter<<<fullBlocksPerGrid, blockSize>>>(numObjects, device_odata, device_idata, device_bools, device_scannedBools);
+            timer().endGpuTimer();
+
+            int boolArrayLast = 0;
+            cudaMemcpy(&boolArrayLast, &device_bools[nearestPower2 - 1], sizeof(int), cudaMemcpyDeviceToHost);
+
+            int scannedLast = 0;
+            cudaMemcpy(&scannedLast, &device_scannedBools[nearestPower2 - 1], sizeof(int), cudaMemcpyDeviceToHost);
+
+            // E.g.:
+            // 0101 is our bool array
+            // Scanned: 0 0 1 1 (exclusive)
+            //
+            // 01010 is our bool array
+            // Scanned: 0 0 1 1 2
+            //
+            // So we add bools[last] + scanned[last] to get the final count.
+            // In the 1st case: 1 + 1 = 2 entries (final compaction count)
+            // In the 2nd case: 2 + 0 = 2 entries (final compaction count)
+            const int totalEntries = scannedLast + boolArrayLast;
+            cudaMemcpy(odata, device_odata, sizeof(int) * totalEntries, cudaMemcpyDeviceToHost);
+
+            cudaFree(device_idata);
+            cudaFree(device_odata);
+            cudaFree(device_bools);
+            cudaFree(device_scannedBools);
+            return totalEntries;
+        }
     }
 }
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
index 803cb4f..338a2b9 100644
--- a/stream_compaction/efficient.h
+++ b/stream_compaction/efficient.h
@@ -7,6 +7,7 @@ namespace StreamCompaction {
         StreamCompaction::Common::PerformanceTimer& timer();
 
         void scan(int n, int *odata, const int *idata);
+        void scanOptimized_v1(int n, int *odata, const int *idata);
 
         int compact(int n, int *odata, const int *idata);
     }
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 9218f8e..39049e1 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -2,24 +2,132 @@
 #include <cuda_runtime.h>
 #include "common.h"
 #include "naive.h"
+#include <cmath>
 
-namespace StreamCompaction {
-    namespace Naive {
-        using StreamCompaction::Common::PerformanceTimer;
-        PerformanceTimer& timer()
-        {
-            static PerformanceTimer timer;
-            return timer;
-        }
+#define blockSize 512
+
+namespace StreamCompaction
+{
+    namespace Naive
+    {
+        int* device_idata;
+        int* device_odata;
+        int numObjects;
+
+        using StreamCompaction::Common::PerformanceTimer;
+
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
+        }
+
+        __global__ void kernel_NaiveParallelScan(int N, int powD, int* odata, int* idata)
+        {
+            const int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= N) {
+                return;
+            }
+
+            if (index < powD)
+            {
+                // No partner at this offset; just carry the value forward
+                odata[index] = idata[index];
+                return;
+            }
+
+            odata[index] = idata[index - powD] + idata[index];
+        }
+
+        __global__ void kernel_NaiveSharedParallelScan(int N, int* odata, int* idata)
+        {
+            // Double-buffered (Hillis-Steele) scan in shared memory; handles a single block only
+            extern __shared__ int temp[];
+            const int index = threadIdx.x;
+            int pout = 0;
+            int pin = 1;
+
+            // Shift right by one element to produce an exclusive scan
+            temp[pout * N + index] = (index > 0) ? idata[index - 1] : 0;
+            __syncthreads();
+
+            for (int offset = 1; offset < N; offset *= 2)
+            {
+                // swap double buffer indices
+                pout = 1 - pout;
+                pin = 1 - pout;
+
+                if (index >= offset) {
+                    temp[pout * N + index] = temp[pin * N + index] + temp[pin * N + index - offset];
+                }
-        // TODO: __global__
-
-        /**
-         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
-         */
-        void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
-        }
+                else {
+                    temp[pout * N + index] = temp[pin * N + index];
+                }
+                __syncthreads();
+            }
+
+            odata[index] = temp[pout * N + index]; // write output
+        }
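+
+        // For reference, how the double-buffered scan above evolves on a small example,
+        // after the initial shift that makes the result exclusive:
+        //   shifted input: [0 3 1 7 0 4 1 6]   (original input [3 1 7 0 4 1 6 3])
+        //   offset = 1:    [0 3 4 8 7 4 5 7]
+        //   offset = 2:    [0 3 4 11 11 12 12 11]
+        //   offset = 4:    [0 3 4 11 11 15 16 22]  (exclusive prefix sums)
+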
+        /**
+         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+         */
+        void scan(int n, int* odata, const int* idata)
+        {
+            numObjects = n;
+            cudaMalloc((void**)&device_idata, numObjects * sizeof(int));
+            checkCUDAError("cudaMalloc device_idata failed!");
+
+            cudaMalloc((void**)&device_odata, numObjects * sizeof(int));
+            checkCUDAError("cudaMalloc device_odata failed!");
+
+            cudaMemcpy(device_idata, idata, sizeof(int) * numObjects, cudaMemcpyHostToDevice);
+
+            const int numBlocks = (numObjects + blockSize - 1) / blockSize;
+            dim3 fullBlocksPerGrid(numBlocks);
+
+            const int logN = ilog2ceil(numObjects);
+
+            int* loopInputBuffer = device_idata;
+            int* loopOutputBuffer = device_odata;
+
+            timer().startGpuTimer();
+            for (int d = 1; d <= logN; ++d)
+            {
+                const int powD = std::pow(2, d - 1);
+                kernel_NaiveParallelScan<<<fullBlocksPerGrid, blockSize>>>(numObjects - 1, powD, loopOutputBuffer, loopInputBuffer);
+
+                // Ping-pong the buffers for the next pass
+                int* temp = loopInputBuffer;
+                loopInputBuffer = loopOutputBuffer;
+                loopOutputBuffer = temp;
+            }
+            timer().endGpuTimer();
+
+            // Shift right by one (and write the identity) to turn the inclusive scan into an exclusive one
+            odata[0] = 0;
+            cudaMemcpy((odata + 1), loopInputBuffer, sizeof(int) * (numObjects - 1), cudaMemcpyDeviceToHost);
+
+            cudaFree(device_idata);
+            cudaFree(device_odata);
+        }
+
+        void scanShared(int n, int* odata, const int* idata)
+        {
+            numObjects = n;
+            cudaMalloc((void**)&device_idata, numObjects * sizeof(int));
+            checkCUDAError("cudaMalloc device_idata failed!");
+
+            cudaMalloc((void**)&device_odata, numObjects * sizeof(int));
+            checkCUDAError("cudaMalloc device_odata failed!");
+
+            cudaMemcpy(device_idata, idata, sizeof(int) * numObjects, cudaMemcpyHostToDevice);
+
+            // The shared-memory kernel handles a single block, so launch one block of n
+            // threads with a double-buffered shared array of 2 * n ints.
+            timer().startGpuTimer();
+            kernel_NaiveSharedParallelScan<<<1, numObjects, 2 * numObjects * sizeof(int)>>>(numObjects, device_odata, device_idata);
+            timer().endGpuTimer();
+
+            cudaMemcpy(odata, device_odata, sizeof(int) * numObjects, cudaMemcpyDeviceToHost);
+
+            cudaFree(device_idata);
+            cudaFree(device_odata);
+        }
     }
 }
diff --git a/stream_compaction/naive.h b/stream_compaction/naive.h
index 37dcb06..3c7a284 100644
--- a/stream_compaction/naive.h
+++ b/stream_compaction/naive.h
@@ -7,5 +7,6 @@ namespace StreamCompaction {
         StreamCompaction::Common::PerformanceTimer& timer();
 
         void scan(int n, int *odata, const int *idata);
+        void scanShared(int n, int *odata, const int *idata);
     }
 }
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 36b732d..ff7ca6a 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -6,23 +6,35 @@
 #include "common.h"
 #include "thrust.h"
 
-namespace StreamCompaction {
-    namespace Thrust {
-        using StreamCompaction::Common::PerformanceTimer;
-        PerformanceTimer& timer()
-        {
-            static PerformanceTimer timer;
-            return timer;
-        }
-        /**
-         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
-         */
-        void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO use `thrust::exclusive_scan`
-            // example: for device_vectors dv_in and dv_out:
-            // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
-            timer().endGpuTimer();
-        }
+namespace StreamCompaction
+{
+    namespace Thrust
+    {
+        using StreamCompaction::Common::PerformanceTimer;
+
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
+        }
+
+        /**
+         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+         */
+        void scan(int n, int* odata, const int* idata)
+        {
+            // Uses thrust::exclusive_scan on device_vectors built from the host data;
+            // the host-device copies stay outside the timed region.
+            thrust::device_vector<int> thrust_device_idata(idata, idata + n);
+            thrust::device_vector<int> thrust_device_odata(n, 0);
+
+            timer().startGpuTimer();
+            thrust::exclusive_scan(thrust_device_idata.begin(), thrust_device_idata.end(), thrust_device_odata.begin());
+            timer().endGpuTimer();
+
+            thrust::copy(thrust_device_odata.begin(), thrust_device_odata.end(), odata);
+        }
     }
 }
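
A side note on the Thrust path above: `thrust::exclusive_scan` also works directly on raw device pointers through `thrust::device_ptr`, so the temporary `device_vector` allocations and copies can be skipped. A minimal sketch, assuming the data already sits in buffers from `cudaMalloc` (`scanRaw`, `dev_in`, and `dev_out` are illustrative names, not part of the code above):

```
#include <thrust/device_ptr.h>
#include <thrust/scan.h>

// Exclusive scan of n ints that already live on the device.
// dev_in / dev_out are raw device pointers obtained from cudaMalloc.
void scanRaw(int n, int* dev_out, const int* dev_in)
{
    thrust::device_ptr<const int> in = thrust::device_pointer_cast(dev_in);
    thrust::device_ptr<int> out = thrust::device_pointer_cast(dev_out);
    thrust::exclusive_scan(in, in + n, out);
}
```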