diff --git a/README.md b/README.md
index d63a6a1..ded11bd 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,89 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking**
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* John Marcao
+  * [LinkedIn](https://www.linkedin.com/in/jmarcao/)
+* Tested on: Windows 10, i5-4690K @ 3.50GHz, 8GB DDR3, RTX 2080 TI 3071MB (Personal)
+
+# Boids - Writing Efficient CUDA Code
+
+## Overview
+
+This project covered parallel algorithms, CUDA kernel programming, kernel debugging, and performance analysis.
+
+| 5,000 Boids | 50,000 Boids | 200,000 Boids |
+| ------------- | ----------- | ----------- |
+| ![](images/5000_boids.gif) | ![](images/50000_boids.gif) | ![](images/200000_boids.gif) |
+
+This project runs a simple simulation of bird-like Boids that follow a few simple rules for movement. The rules are summarized as **separation**, **alignment**, and **cohesion**.
+
+* Separation - Boids want to fly away from other nearby boids.
+* Cohesion - Boids want to fly towards the center of mass of nearby boids.
+* Alignment - Boids want to fly in the same direction as other nearby boids.
+
+Each of these rules is given a weight and applied to each boid based on its neighbors. The influence of nearby boids can be tuned through two parameters per rule: a distance and a scale. In this implementation, each rule applies only within an absolute distance, and its contribution is scaled by a weight value. Although helpful in understanding the simulation, these values are not varied in the tests below.
+
+The focus of this project is to implement three variations of a Boids simulation.
+
+* Naive - Each boid looks at all other boids in the simulation space and applies each rule.
+* Uniform Spatial Grid (USG) - Each boid looks only at boids located in neighboring cells. This reduces the search space for each boid by eliminating boids that would be ignored anyway under the distance rules.
+* Uniform Spatial Grid (USG) with Coherent Memory - Boids behave the same as in the Uniform Grid, but their data is rearranged by the GPU before neighbors are read. This allows memory accesses to be localized.
+
+Beyond this point the abstract Boids start to fail us and we move to more concrete terms. Below is an example of a Spatial Grid containing some boids. A naive implementation has each boid (each thread executing on the GPU) check every other boid in memory. This leads to a lot of wasted memory accesses. For example, Boid 4 will read the position of Boid 6 and then decide that the boid is too far away for any rules to apply. The time spent reaching out to memory for information on Boid 6 is wasted.
+
+![](images/Boids%20Ugrid%20base.png)
+
+We introduce an additional step so that boids only inspect other boids in neighboring cells. The neighborhood is defined by the maximum distance at which any of the boid's rules apply. This can be seen in the picture below. Boid 9 will only read information about Boids 0, 1, and 4. Boid 4 is still outside the range of any rules, but this is still a great improvement over analyzing all boids.
+
+![](images/Boids%20Ugrid%20neighbor%20search%20shown.png)
+
+To access memory on a cell basis, we can set up multiple arrays of sorted indices and leverage the parallel execution of the GPU to quickly sort our data. We first look at each boid, calculate which cell it is in, and store this information in an array. We then sort that array by cell index. Finally, we walk over the sorted array and store the beginning and ending indices for each cell. This allows a boid to determine its own cell and then find the indices of its neighbors by inspecting the start and stop indices of each nearby cell.
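The index/sort/scan pipeline described above can be sketched on the host as follows. This is a minimal C++ sketch of the same logic the GPU kernels perform in parallel; the names `Vec3`, `cellIndex`, and `buildCellRanges` are illustrative, not the project's actual identifiers (the device-side equivalents are `kernComputeIndices` and `kernIdentifyCellStartEnd`):

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

struct Vec3 { float x, y, z; };

// Map a position to a flat cell index; x is the fastest-varying
// dimension, as in the project's gridIndex3Dto1D helper.
int cellIndex(const Vec3& p, float gridMin, float inverseCellWidth, int res) {
    int cx = static_cast<int>(std::floor((p.x - gridMin) * inverseCellWidth));
    int cy = static_cast<int>(std::floor((p.y - gridMin) * inverseCellWidth));
    int cz = static_cast<int>(std::floor((p.z - gridMin) * inverseCellWidth));
    return cx + cy * res + cz * res * res;
}

// Sort boid indices by cell, then record where each cell's run of
// boids starts and ends (the "start and stop indices" above).
void buildCellRanges(const std::vector<int>& boidCell, int cellCount,
                     std::vector<int>& order,
                     std::vector<int>& start, std::vector<int>& end) {
    order.resize(boidCell.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return boidCell[a] < boidCell[b]; });
    start.assign(cellCount, -1);  // -1 marks an empty cell
    end.assign(cellCount, -1);
    for (int i = 0; i < static_cast<int>(order.size()); ++i) {
        int c = boidCell[order[i]];
        if (start[c] == -1) start[c] = i;
        end[c] = i;  // inclusive end index
    }
}
```

On the GPU the sort is done with `thrust::sort_by_key` and the start/end scan with one thread per boid comparing its cell index against its neighbor's.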
+
+![](images/Boids%20Ugrids%20buffers%20naive.png)
+
+This can be further improved by sorting the position and velocity data for each boid to line up with the same indices used in the grid cell array. This eliminates one more memory read from each thread while also keeping all boid data coherent. It gives some major performance improvements that we will explore in the next section.
+
+![](images/Boids%20Ugrids%20buffers%20data%20coherent.png)
+
+## Performance Analysis
+
+To get a deeper understanding of CUDA programming, I ran several performance tests over a variety of settings and options. Some of the results are obvious, while others take a bit more knowledge of GPU architecture to fully explain. Data was collected using the CUDA Event API.
+
+First, I modified the number of boids simulated by each algorithm. This is the most straightforward test. Increasing the number of boids increases both the number of threads that must be executed and the amount of data each thread has to read. We can see here that coherent memory access has a huge benefit as we move up to 500,000 boids. The USG (Scattered) implementation spends most of its time waiting on reads from global memory. By using coherent data, we allow the GPU to optimize memory accesses.
+
+| | |
+| ------------- | ----------- |
+| ![](images/fps_by_boid.JPG) | ![](images/fps_by_boid_table.JPG) |
+
+Next, I modified the block size for each algorithm, running the test under both 5,000 and 50,000 boid simulations to better understand it. Looking at the data, block sizes up to 256 have a negligible effect on performance. Block size changes how the GPU organizes threads for processing, so these results tell me that, at least up to a block size of 256, thread organization is not the bottleneck.
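The block-size sweep only changes the kernel launch configuration; the grid dimension follows the usual ceiling-division pattern (a sketch; `fullBlocksPerGrid` here is a plain function, whereas the project computes the same quantity inline when launching):

```cpp
// Enough blocks so that every boid gets a thread. The last block may
// contain up to blockSize - 1 idle threads, which the kernels guard
// against with an `if (index >= N) return;` check.
int fullBlocksPerGrid(int n, int blockSize) {
    return (n + blockSize - 1) / blockSize;
}
```

For 50,000 boids this gives 1,563 blocks at a block size of 32 and 196 blocks at 256; the total thread count is nearly identical, which is consistent with the flat results above.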
+
+| | |
+| ------------- | ----------- |
+| ![](images/fps_by_block.JPG) | ![](images/fps_by_block_table.JPG) |
+
+Lastly, I looked at the effects of changing the grid cell size used by each algorithm. Modifying the cell width adds more or fewer cells to the search space of each boid. Looking at the collected data, a cell width of 2 times the maximum rule distance provides the best performance; going below or above that value carries a penalty. With a smaller cell width, each boid must inspect more cells. Since the data within a cell is stored coherently, checking within a cell is cheap, but moving between cells carries a penalty. Conversely, increasing the cell width to 3 or 4 times the Max Rule Distance stores more data coherently but also forces more boids to be checked, and any boid not already covered at the 2X width will ultimately be ignored because it falls outside the Max Rule Distance. A cell width of 2X the Max Rule Distance is ideal because a boid never has to inspect a cell in which it has no chance of finding a neighbor.
+
+| | |
+| ------------- | ----------- |
+| ![](images/fps_by_gridsize.JPG) | ![](images/fps_by_gridsize_table.JPG) |
+
+## Questions
+
+**For each implementation, how does changing the number of boids affect performance? Why do you think this is?**
+
+Increasing the number of boids reduces performance for every implementation, although the effect is greatest for the Naive implementation. With a naive search, every boid must be checked, so adding any number of boids increases the amount of work each thread has to do. Adding more boids to the USG implementations has a smaller effect when jumping from 5,000 to 10,000 boids: each thread has a much smaller search space, and 5,000 additional boids spread randomly through the simulation space don't cause a noticeable performance loss. However, at 50,000 boids and above, a larger loss is measured.
At those levels, the Coherent USG implementation performs better: since neighboring boids are located together in memory, the GPU can optimize the accesses.
+
+**For each implementation, how does changing the block count and block size affect performance? Why do you think this is?**
+
+I did not see a large effect from modifying the block count and size, which suggests that the performance bottleneck is unrelated to block configuration. Reducing the block size much further, however, can cause poor resource utilization on the GPU.
+
+**For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?**
+
+Yes, as can be seen from the performance data above, having coherent data gave a big boost to performance, especially at high boid counts. When data is located nearby, the GPU can perform more efficient memory reads. Also, since the algorithm uses the dev_coherent_pos array, indices are adjacent, so the threads of a warp access similar data. The GPU can perform one large read and reuse the data stored in the SM's local memory.
+
+**Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not?**
+From the analysis section above:
+
+> By modifying the cell width, we can add more or fewer cells to the search space of each boid. Looking at the collected data, a cell width of 2 times the maximum rule distance provides the best performance; going below or above that value carries a penalty. With a smaller cell width, each boid has to inspect more cells. Since the data within a cell is stored coherently, checking within a cell is cheap, but moving between cells carries a penalty. Conversely, increasing the cell width to 3 or 4 times the Max Rule Distance stores more data coherently but also forces more boids to be checked, and any boid not already covered at the 2X width will ultimately be ignored because it falls outside the Max Rule Distance.
> A cell width of 2X the Max Rule Distance is ideal because a boid never has to inspect a cell in which it has no chance of finding a neighbor.
-### (TODO: Your README)
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
diff --git a/images/200000_boids.gif b/images/200000_boids.gif new file mode 100644 index 0000000..1551224 Binary files /dev/null and b/images/200000_boids.gif differ diff --git a/images/50000_boids.gif b/images/50000_boids.gif new file mode 100644 index 0000000..ff14480 Binary files /dev/null and b/images/50000_boids.gif differ diff --git a/images/5000_boids.gif b/images/5000_boids.gif new file mode 100644 index 0000000..1cd4d9d Binary files /dev/null and b/images/5000_boids.gif differ diff --git a/images/fps_by_block.JPG b/images/fps_by_block.JPG new file mode 100644 index 0000000..ef4764d Binary files /dev/null and b/images/fps_by_block.JPG differ diff --git a/images/fps_by_block_table.JPG b/images/fps_by_block_table.JPG new file mode 100644 index 0000000..694bf38 Binary files /dev/null and b/images/fps_by_block_table.JPG differ diff --git a/images/fps_by_boid.JPG b/images/fps_by_boid.JPG new file mode 100644 index 0000000..d29d14b Binary files /dev/null and b/images/fps_by_boid.JPG differ diff --git a/images/fps_by_boid_table.JPG b/images/fps_by_boid_table.JPG new file mode 100644 index 0000000..21f419c Binary files /dev/null and b/images/fps_by_boid_table.JPG differ diff --git a/images/fps_by_gridsize.JPG b/images/fps_by_gridsize.JPG new file mode 100644 index 0000000..8f620c5 Binary files /dev/null and b/images/fps_by_gridsize.JPG differ diff --git a/images/fps_by_gridsize_table.JPG b/images/fps_by_gridsize_table.JPG new file mode 100644 index 0000000..6ab74c7 Binary files /dev/null and b/images/fps_by_gridsize_table.JPG differ diff --git a/images/naive_impl.gif b/images/naive_impl.gif new file mode 100644 index 0000000..82b0f48 Binary files /dev/null and
b/images/naive_impl.gif differ diff --git a/measurements/README.txt b/measurements/README.txt new file mode 100644 index 0000000..0b0a3c2 --- /dev/null +++ b/measurements/README.txt @@ -0,0 +1,2 @@ +Naive Boids Implementation +Avg. 7ms/F (142FPS) \ No newline at end of file diff --git a/measurements/naive_initial/cis565_boids190906_001_Capture_000.nvact b/measurements/naive_initial/cis565_boids190906_001_Capture_000.nvact new file mode 100644 index 0000000..6cfc3e4 --- /dev/null +++ b/measurements/naive_initial/cis565_boids190906_001_Capture_000.nvact @@ -0,0 +1,128 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/measurements/naive_initial/cis565_boids190906_001_Capture_000.nvevents b/measurements/naive_initial/cis565_boids190906_001_Capture_000.nvevents new file mode 100644 index 0000000..1661c7a Binary files /dev/null and b/measurements/naive_initial/cis565_boids190906_001_Capture_000.nvevents differ diff --git a/measurements/naive_initial/cis565_boids190906_001_Capture_000.nvreport b/measurements/naive_initial/cis565_boids190906_001_Capture_000.nvreport new file mode 100644 index 0000000..373179f --- /dev/null +++ b/measurements/naive_initial/cis565_boids190906_001_Capture_000.nvreport @@ -0,0 +1,16 @@ + + + 80272 + 2019-09-06T21:51:24.0976159-04:00 + c:\temp\cis565_boids190906_001\cis565_boids190906_001_Capture_000 + c:\temp\cis565_boids190906_001\cis565_boids190906_001_Capture_000\cis565_boids190906_001_Capture_000.nvreport + 7734727040782 + 2707187408000426 + 3500000000 + 7735810591628 + 2707566650164692 + 2707187408000480 + 2707566650165055 + 132122946842638128 + 132122947926197821 + \ No newline at end of file diff --git a/measurements/proj1_performance.xlsx b/measurements/proj1_performance.xlsx new file mode 
100644 index 0000000..2be2ef4 Binary files /dev/null and b/measurements/proj1_performance.xlsx differ diff --git a/src/kernel.cu b/src/kernel.cu index 74dffcb..f949c6c 100644 --- a/src/kernel.cu +++ b/src/kernel.cu @@ -6,6 +6,9 @@ #include "utilityCore.hpp" #include "kernel.h" +#include <thrust/device_ptr.h> +#include <thrust/sort.h> + // LOOK-2.1 potentially useful for doing grid-based neighbor search #ifndef imax #define imax( a, b ) ( ((a) > (b)) ? (a) : (b) ) @@ -74,17 +77,23 @@ glm::vec3 *dev_vel2; // pointers on your own too. // For efficient sorting and the uniform grid. These should always be parallel. +// Buffer containing an index for each boid into its data in dev_pos, dev_vel1, and dev_vel2 int *dev_particleArrayIndices; // What index in dev_pos and dev_velX represents this particle? +// Buffer containing the grid cell index for each boid. int *dev_particleGridIndices; // What grid cell is this particle in? // needed for use with thrust thrust::device_ptr<int> dev_thrust_particleArrayIndices; thrust::device_ptr<int> dev_thrust_particleGridIndices; +// Buffer containing, for each cell, the index of the beginning of its span of dev_particleArrayIndices int *dev_gridCellStartIndices; // What part of dev_particleArrayIndices belongs +// Buffer containing, for each cell, the index of the end of its span of dev_particleArrayIndices int *dev_gridCellEndIndices; // to this cell? // TODO-2.3 - consider what additional buffers you might need to reshuffle // the position and velocity data to be coherent within cells. +glm::vec3 *dev_coherent_pos; +glm::vec3 *dev_coherent_vel1; // LOOK-2.1 - Grid parameters based on simulation parameters. // These are automatically computed for you in Boids::initSimulation @@ -153,7 +162,7 @@ void Boids::initSimulation(int N) { // LOOK-1.2 - This is a typical CUDA kernel invocation.
kernGenerateRandomPosArray<<<fullBlocksPerGrid, blockSize>>>(1, numObjects, - dev_pos, scene_scale); + dev_pos, scene_scale); checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!"); // LOOK-2.1 computing grid params @@ -169,6 +178,30 @@ void Boids::initSimulation(int N) { gridMinimum.z -= halfGridWidth; // TODO-2.1 TODO-2.3 - Allocate additional buffers here. + // Uniform Grid Buffers + cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!"); + + cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!"); + + cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!"); + + cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!"); + + // Coherent Buffers + cudaMalloc((void**)&dev_coherent_pos, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_coherent_pos failed!"); + + cudaMalloc((void**)&dev_coherent_vel1, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_coherent_vel1 failed!"); + + // Thrust pointers, used for parallel sorting + dev_thrust_particleArrayIndices = thrust::device_pointer_cast(dev_particleArrayIndices); + dev_thrust_particleGridIndices = thrust::device_pointer_cast(dev_particleGridIndices); + cudaDeviceSynchronize(); } @@ -223,6 +256,190 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) * stepSimulation * ******************/ +/** +* boidCohesionRuleNaive() +* boids move towards the perceived center of mass of their neighbors +* +* Cohesion depends solely on the position of each boid, so we want to calculate the center of mass. +* Assuming each boid weighs the same, the center of mass is simply the average position. +* Therefore, we add each component of each boid and divide.
+* NOTE: For the Naive implementation, this means each thread is going to be doing the same exact work. +* That is super bad and goes against the idea of distributing work. This becomes a non-issue +* when each boid looks only at its local neighbors, as each boid will have a different subset +* to look at. +*/ +__device__ glm::vec3 boidCohesionRuleNaive(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) { + glm::vec3 result(0.0f); + glm::vec3 perceived_center(0.0f); + glm::vec3 selfPos = pos[iSelf]; + float neighbors = 0.0f; + + for (int i = 0; i < N; i++) { + if ((i != iSelf) && (glm::distance(selfPos, pos[i]) < rule1Distance)) { + perceived_center += pos[i]; + neighbors++; + } + } + + if (neighbors) { + perceived_center /= neighbors; + result = (perceived_center - selfPos) * rule1Scale; + } + + return result; +} + + +/** +* boidCohesionRuleGrid() +* boids move towards the perceived center of mass of their neighbors +* +* Cohesion depends solely on the position of each boid, so we want to calculate the center of mass. +* Assuming each boid weighs the same, the center of mass is simply the average position. +* Therefore, we add each component of each boid and divide. +*/ +__device__ glm::vec3 boidCohesionRuleGrid(int N, int iSelf, const int *boidIndices, int b_start, int b_end, glm::vec3 *pos, const glm::vec3 *vel) { + glm::vec3 result(0.0f); + glm::vec3 perceived_center(0.0f); + glm::vec3 selfPos = pos[iSelf]; + float neighbors = 0.0f; + + for (int i = b_start; i < b_end; i++) { + int boid_idx = boidIndices[i]; + if ((boid_idx != iSelf) && (glm::distance(selfPos, pos[boid_idx]) < rule1Distance)) { + perceived_center += pos[boid_idx]; + neighbors++; + } + } + + if (neighbors) { + perceived_center /= neighbors; + result = (perceived_center - selfPos) * rule1Scale; + } + + return result; +} + +/** +* boidSeperationRuleNaive() +* boids avoid getting too close to their neighbors +* +* In this rule, the boid is repulsed by nearby boids.
To represent that, we take the distance +* between the boid and the neighbor boids and add the distance between the two as a scaled negative velocity. +* This has the effect of pushing each boid away from its neighbors. Note that a boid on either side will contribute +* to opposite directions. +*/ +__device__ glm::vec3 boidSeperationRuleNaive(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) { + glm::vec3 result(0.0f); + glm::vec3 seperation(0.0f); + glm::vec3 selfPos = pos[iSelf]; + float neighbors = 0.0f; + + for (int i = 0; i < N; i++) { + if ((i != iSelf) && (glm::distance(selfPos, pos[i]) < rule2Distance)) { + seperation -= pos[i] - selfPos; + neighbors++; + } + } + + if (neighbors) { + result = seperation * rule2Scale; + } + + return result; +} + +/** +* boidSeperationRuleGrid() +* boids avoid getting too close to their neighbors +* +* In this rule, the boid is repulsed by nearby boids. To represent that, we take the distance +* between the boid and the neighbor boids and add the distance between the two as a scaled negative velocity. +* This has the effect of pushing each boid away from its neighbors. Note that a boid on either side will contribute +* to opposite directions.
+*/ +__device__ glm::vec3 boidSeperationRuleGrid(int N, int iSelf, const int *boidIndices, int b_start, int b_end, glm::vec3 *pos, const glm::vec3 *vel) { + glm::vec3 result(0.0f); + glm::vec3 seperation(0.0f); + glm::vec3 selfPos = pos[iSelf]; + float neighbors = 0.0f; + + for (int i = b_start; i < b_end; i++) { + int boid_idx = boidIndices[i]; + if ((boid_idx != iSelf) && (glm::distance(selfPos, pos[boid_idx]) < rule2Distance)) { + seperation -= pos[boid_idx] - selfPos; + neighbors++; + } + } + + if (neighbors) { + result = seperation * rule2Scale; + } + + return result; +} + + +/** +* boidAlignmentRuleNaive() +* boids generally try to move with the same direction and speed as their neighbors +* +* Boids want to match the velocity of their neighbors at t=a, so they will adjust their velocity accordingly. +* After each round, at t=a+dt, each boid will apply their change. +*/ +__device__ glm::vec3 boidAlignmentRuleNaive(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) { + glm::vec3 result(0.0f); + glm::vec3 perceived_velocity(0.0f); + glm::vec3 selfPos = pos[iSelf]; + glm::vec3 selfVelocity = vel[iSelf]; + float neighbors = 0.0f; + + for (int i = 0; i < N; i++) { + if ((i != iSelf) && (glm::distance(selfPos, pos[i]) < rule3Distance)) { + perceived_velocity += vel[i]; + neighbors++; + } + } + + if (neighbors) { + perceived_velocity /= neighbors; + result = perceived_velocity * rule3Scale; + } + + return result; +} + +/** +* boidAlignmentRuleGrid() +* boids generally try to move with the same direction and speed as their neighbors +* +* Boids want to match the velocity of their neighbors at t=a, so they will adjust their velocity accordingly. +* After each round, at t=a+dt, each boid will apply their change.
+*/ +__device__ glm::vec3 boidAlignmentRuleGrid(int N, int iSelf, const int *boidIndices, int b_start, int b_end, glm::vec3 *pos, const glm::vec3 *vel) { + glm::vec3 result(0.0f); + glm::vec3 perceived_velocity(0.0f); + glm::vec3 selfPos = pos[iSelf]; + glm::vec3 selfVelocity = vel[iSelf]; + float neighbors = 0.0f; + + for (int i = b_start; i < b_end; i++) { + int boid_idx = boidIndices[i]; + if ((boid_idx != iSelf) && (glm::distance(selfPos, pos[boid_idx]) < rule3Distance)) { + perceived_velocity += vel[boid_idx]; + neighbors++; + } + } + + if (neighbors) { + perceived_velocity /= neighbors; + result = perceived_velocity * rule3Scale; + } + + return result; +} + + /** * LOOK-1.2 You can use this as a helper for kernUpdateVelocityBruteForce. * __device__ code can be called from a __global__ context @@ -233,7 +450,15 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves // Rule 2: boids try to stay a distance d away from each other // Rule 3: boids try to match the speed of surrounding boids - return glm::vec3(0.0f, 0.0f, 0.0f); + + glm::vec3 delta(0.0f); + + // Apply each rule. + delta += boidCohesionRuleNaive(N, iSelf, pos, vel); + delta += boidSeperationRuleNaive(N, iSelf, pos, vel); + delta += boidAlignmentRuleNaive(N, iSelf, pos, vel); + + return delta; } /** @@ -244,7 +469,25 @@ __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { // Compute a new velocity based on pos and vel1 // Clamp the speed - // Record the new velocity into vel2. Question: why NOT vel1? + // Record the new velocity into vel2. Question: why NOT vel1? + // Answer: Other threads may still be reading vel1!! 
+ + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + glm::vec3 curntV = vel1[index]; + glm::vec3 deltaV = computeVelocityChange(N, index, pos, vel1); + glm::vec3 newV = curntV + deltaV; + + // Clamp the speed. We do it this way to ensure that the total velocity is clamped, + // not just the velocity in each direction (otherwise glm::clamp would be nice). + if (glm::length(newV) > maxSpeed) { + newV = glm::normalize(newV) * maxSpeed; + } + + vel2[index] = newV; } /** @@ -289,6 +532,28 @@ __global__ void kernComputeIndices(int N, int gridResolution, // - Label each boid with the index of its grid cell. // - Set up a parallel array of integer indices as pointers to the actual // boid data in pos and vel1/vel2 + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index < N) { + glm::vec3 myGridPos = pos[index]; + int myGridIdx = 0; + + // Add grid minimum to move +/- halfWidth range to 0 - fullWidth range. + myGridPos -= gridMin; + // Cells are cubes, so all dimensions are identical, divide each pos by cell width + myGridPos *= inverseCellWidth; + // Round down to throw away float garbage! + myGridPos = glm::floor(myGridPos); + + // Compute a 1D index from the 3D index + myGridIdx = gridIndex3Dto1D(myGridPos.x, myGridPos.y, myGridPos.z, gridResolution); + + // Store the grid index in the indices buffer using the boid IDX as the key + // and the index in and index buffer. These two will be sorted in parallel. + // The end result will be that indices will be sorted by myGridIdx so consecutive + // boids will have their indices colocated in memory. + gridIndices[index] = myGridIdx; + indices[index] = index; + } } // LOOK-2.1 Consider how this could be useful for indicating that a cell @@ -306,6 +571,21 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices, // Identify the start point of each cell in the gridIndices array. 
// This is basically a parallel unrolling of a loop that goes // "this index doesn't match the one before it, must be a new cell!" + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index < N) { + // Collect some information + int myGridIdx = particleGridIndices[index]; + + // Always Start if first cell + if ((index == 0) || particleGridIndices[index] != particleGridIndices[index - 1]) { + gridCellStartIndices[myGridIdx] = index; // Start of grid myGridIdx is boid at index + } + + // Always End if last cell + if ((index == (N - 1)) || (particleGridIndices[index] != particleGridIndices[index + 1])) { + gridCellEndIndices[myGridIdx] = index; + } + } } __global__ void kernUpdateVelNeighborSearchScattered( @@ -322,6 +602,133 @@ __global__ void kernUpdateVelNeighborSearchScattered( // - Access each boid in the cell and compute velocity change from // the boids rules, if this boid is within the neighborhood distance. // - Clamp the speed change before putting the new speed in vel2 + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index < N) { + // Identify the grid cell that this particle is in. + glm::vec3 myPos = pos[index]; + glm::vec3 myGridPos = myPos - gridMin; + myGridPos *= inverseCellWidth; // Just like in kernComputeIndices() + myGridPos = glm::floor(myGridPos); // Just like in kernComputeIndices() + + // Identify which cells contain neighbors + // Want to find each grid where one of our rules can apply. Therefore, we need to know the distance of our rules. + float neighbor_distance = glm::max(glm::max(rule1Distance, rule2Distance), rule3Distance); + // Create a vector with this value. + glm::vec3 neighbor_distance_vec(neighbor_distance); + // Remember, everything is a cube! If we check the corners of the cube, we will know + // what the rest of the cube looks like. ie, if min is {0, 0, 0} and max is {1, 1, 1}, + // we know where each of the other corners lie. 
(Also true for rectangluar cubes) + glm::vec3 min_neighbor_cell = glm::floor((myPos - gridMin - neighbor_distance) * inverseCellWidth); + glm::vec3 max_neighbor_cell = glm::floor((myPos - gridMin + neighbor_distance) * inverseCellWidth); + // Now, if myPos is situated on any of the axis of the cell, then min and/or max may not change. + // This is clear from the case where myPos is in the middle of the cell. In that case, if the cellWidth + // is equal to the neighbor_disance, then the cube will consist of only the cell. + // So we don't need to check the cell interior edge conditions, those are covered already. + + // Issue: What about the edge of the grid? We need to wrap around. + // We can handle this during the search. If any cell value exceeds the resolution of the grid, + // then we loop around. + glm::vec3 min_cell_search = glm::clamp(min_neighbor_cell, glm::vec3(0), glm::vec3(gridResolution)); + glm::vec3 max_cell_search = glm::clamp(max_neighbor_cell, glm::vec3(0), glm::vec3(gridResolution)); + + // After all that work, we now start applying rules! Instead of searching over N boids, we will search over the boids + // in each cell between min_cell_search and max_cell_search only. + // I can already see how making the boids cohenernt in memory helps simplify this, but that's for later. + glm::vec3 velocity_change(0.0f); + glm::vec3 selfPos = pos[index]; + glm::vec3 selfVelocity = vel1[index]; + glm::vec3 alignment_perceived_velocity(0.0f); + glm::vec3 cohesion_perceived_center(0.0f); + glm::vec3 seperation(0.0f); + int alignment_neighbors = 0.0; + int cohesion_neighbors = 0.0; + int seperation_neighbors = 0.0; + + for (float z = min_cell_search.z; z <= max_cell_search.z; z++) { + for (float y = min_cell_search.y; y <= max_cell_search.y; y++) { + for (float x = min_cell_search.x; x <= max_cell_search.x; x++) { + // Yikes! Triple for loop??? Not that bad, min and max differ by at most one. 
+ // Calculate grid index of the grid under inspection + int gridIdx = gridIndex3Dto1D(x, y, z, gridResolution); + + // Get the start and end indices for the boids + int b_start = gridCellStartIndices[gridIdx]; + int b_end = gridCellEndIndices[gridIdx]; + + // Check if any values are -1, meaning an empty cell. + // Also check if less than N, don't want to cause a segfault. + if (b_start < 0 || b_start > N || b_end < 0 || b_end > N) { + continue; + } + + // We now have the boids we need. Run each rule over the range of boids. + // WOW: Unrolling these calls gave a 2.27X performance boost!!!! Cache benefits? + //velocity_change += boidAlignmentRuleGrid(N, index, particleArrayIndices, b_start, b_end, pos, vel1); + //velocity_change += boidCohesionRuleGrid(N, index, particleArrayIndices, b_start, b_end, pos, vel1); + //velocity_change += boidSeperationRuleGrid(N, index, particleArrayIndices, b_start, b_end, pos, vel1); + + for (int i = b_start; i <= b_end; i++) { + int boid_idx = particleArrayIndices[i]; + if (index == boid_idx) { // Dip out early. 
+ continue; + } + + // Get relevant data + glm::vec3 boid_pos = pos[boid_idx]; + float distance = glm::distance(selfPos, boid_pos); + + // Cohesion + if (distance < rule1Distance) { + cohesion_perceived_center += boid_pos; + cohesion_neighbors++; + } + // Seperation + if (distance < rule2Distance) { + seperation -= boid_pos - selfPos; + seperation_neighbors++; + } + // Alignment + if (distance < rule3Distance) { + alignment_perceived_velocity += vel1[boid_idx]; + alignment_neighbors++; + } + } + } + } + } + + // Finalize Cohesion values + if (cohesion_neighbors) { + cohesion_perceived_center /= cohesion_neighbors; + velocity_change += (cohesion_perceived_center - selfPos) * rule1Scale; + } + // Finalize Seperation Values + if (seperation_neighbors) { + velocity_change += seperation * rule2Scale; + } + // Finalize Alignment Values + if (alignment_neighbors) { + alignment_perceived_velocity /= alignment_neighbors; + velocity_change += alignment_perceived_velocity * rule3Scale; + } + + // Calculated total velocity change! Now apply, clamp, and store. + glm::vec3 newV = selfVelocity + velocity_change; + if (glm::length(newV) > maxSpeed) { + newV = glm::normalize(newV) * maxSpeed; + } + vel2[index] = newV; + } +} + +__global__ void kernRearrangeCoherentData(int N, int* indicies, glm::vec3 *scattered_pos, glm::vec3 *coherent_pos, glm::vec3 *scattered_vel1, glm::vec3 *coherent_vel1) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index < N) { + int bIdx = indicies[index]; + + coherent_pos[index] = scattered_pos[bIdx]; + coherent_vel1[index] = scattered_vel1[bIdx]; + } } __global__ void kernUpdateVelNeighborSearchCoherent( @@ -341,14 +748,138 @@ __global__ void kernUpdateVelNeighborSearchCoherent( // - Access each boid in the cell and compute velocity change from // the boids rules, if this boid is within the neighborhood distance. 
  // - Clamp the speed change before putting the new speed in vel2
+
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index < N) {
+    // Identify the grid cell that this particle is in.
+    glm::vec3 myPos = pos[index];
+
+    // Identify which cells may contain neighbors. We want every cell where one
+    // of our rules can apply, so we take the largest of the rule distances.
+    float neighbor_distance = glm::max(glm::max(rule1Distance, rule2Distance), rule3Distance);
+    // Remember, everything is a cube! If we find two opposite corners of the
+    // neighborhood cube, we know where the rest of it lies. ie, if min is
+    // {0, 0, 0} and max is {1, 1, 1}, we know where every other corner lies.
+    // (Also true for rectangular boxes.)
+    glm::vec3 min_neighbor_cell = glm::floor((myPos - gridMin - neighbor_distance) * inverseCellWidth);
+    glm::vec3 max_neighbor_cell = glm::floor((myPos - gridMin + neighbor_distance) * inverseCellWidth);
+    // Note that if myPos sits in the middle of its cell and the cellWidth equals
+    // neighbor_distance, min and max collapse to this one cell, so the
+    // cell-interior edge cases are already covered.
+
+    // Issue: what about the edge of the grid? We handle this during the search:
+    // the search range is clamped to the grid bounds, so we never inspect a
+    // cell outside the grid.
+    glm::vec3 min_cell_search = glm::clamp(min_neighbor_cell, glm::vec3(0), glm::vec3(gridResolution - 1));
+    glm::vec3 max_cell_search = glm::clamp(max_neighbor_cell, glm::vec3(0), glm::vec3(gridResolution - 1));
+
+    // After all that work, we now start applying rules! Instead of searching
+    // over all N boids, we only search the boids in the cells between
+    // min_cell_search and max_cell_search.
+    // I can already see how making the boids coherent in memory helps simplify
+    // this, but that's for later.
+    glm::vec3 velocity_change(0.0f);
+    glm::vec3 selfPos = pos[index];
+    glm::vec3 selfVelocity = vel1[index];
+    glm::vec3 alignment_perceived_velocity(0.0f);
+    glm::vec3 cohesion_perceived_center(0.0f);
+    glm::vec3 seperation(0.0f);
+    int alignment_neighbors = 0;
+    int cohesion_neighbors = 0;
+    int seperation_neighbors = 0;
+
+    for (int z = (int)min_cell_search.z; z <= (int)max_cell_search.z; z++) {
+      for (int y = (int)min_cell_search.y; y <= (int)max_cell_search.y; y++) {
+        for (int x = (int)min_cell_search.x; x <= (int)max_cell_search.x; x++) {
+          // Yikes! A triple for loop??? Not that bad: min and max differ by at
+          // most one cell per axis.
+          // Calculate the 1D index of the grid cell under inspection
+          int gridIdx = gridIndex3Dto1D(x, y, z, gridResolution);
+
+          // Get the start and end indices for the boids in this cell
+          int b_start = gridCellStartIndices[gridIdx];
+          int b_end = gridCellEndIndices[gridIdx];
+
+          // A value of -1 marks an empty cell. Also bounds-check against N so
+          // we never index past the end of the boid arrays.
+          if (b_start < 0 || b_start >= N || b_end < 0 || b_end >= N) {
+            continue;
+          }
+
+          // We now have the boids we need. Run each rule over the range of boids.
+          // WOW: Manually inlining these helper calls gave a 2.27x performance boost! Cache benefits?
+          for (int i = b_start; i <= b_end; i++) {
+            if (index == i) { // Skip ourselves.
+              continue;
+            }
+
+            // Get relevant data. The boid data is coherent in memory, so i
+            // indexes pos/vel1 directly; no particleArrayIndices indirection.
+            glm::vec3 boid_pos = pos[i];
+            float distance = glm::distance(selfPos, boid_pos);
+
+            // Cohesion
+            if (distance < rule1Distance) {
+              cohesion_perceived_center += boid_pos;
+              cohesion_neighbors++;
+            }
+            // Separation
+            if (distance < rule2Distance) {
+              seperation -= boid_pos - selfPos;
+              seperation_neighbors++;
+            }
+            // Alignment
+            if (distance < rule3Distance) {
+              alignment_perceived_velocity += vel1[i];
+              alignment_neighbors++;
+            }
+          }
+        }
+      }
+    }
+
+    // Finalize Cohesion values
+    if (cohesion_neighbors) {
+      cohesion_perceived_center /= cohesion_neighbors;
+      velocity_change += (cohesion_perceived_center - selfPos) * rule1Scale;
+    }
+    // Finalize Separation values
+    if (seperation_neighbors) {
+      velocity_change += seperation * rule2Scale;
+    }
+    // Finalize Alignment values
+    if (alignment_neighbors) {
+      alignment_perceived_velocity /= alignment_neighbors;
+      velocity_change += alignment_perceived_velocity * rule3Scale;
+    }
+
+    // Total velocity change calculated! Now apply, clamp, and store.
+    glm::vec3 newV = selfVelocity + velocity_change;
+    if (glm::length(newV) > maxSpeed) {
+      newV = glm::normalize(newV) * maxSpeed;
+    }
+    vel2[index] = newV;
+  }
 }
 
 /**
 * Step the entire N-body simulation by `dt` seconds.
 */
 void Boids::stepSimulationNaive(float dt) {
-  // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
-  // TODO-1.2 ping-pong the velocity buffers
+  // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
+  // TODO-1.2 ping-pong the velocity buffers
+
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2);
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2); // Use the new velocity!
+
+  // Swap buffers, ping pong!
+  std::swap(dev_vel1, dev_vel2);
 }
 
 void Boids::stepSimulationScatteredGrid(float dt) {
@@ -364,6 +895,68 @@ void Boids::stepSimulationScatteredGrid(float dt) {
   // - Perform velocity updates using neighbor search
   // - Update positions
   // - Ping-pong buffers as needed
+
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+
+  // Run the index labeling kernel
+  kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(
+    numObjects,
+    gridSideCount,
+    gridMinimum,
+    gridInverseCellWidth,
+    dev_pos,
+    dev_particleArrayIndices,
+    dev_particleGridIndices
+  );
+  checkCUDAErrorWithLine("kernComputeIndices failed!");
+
+  // Use the Thrust API to sort the boid indices by grid cell index...
+  thrust::sort_by_key(
+    dev_thrust_particleGridIndices,
+    dev_thrust_particleGridIndices + numObjects,
+    dev_thrust_particleArrayIndices
+  );
+
+  // Locate start and stop indices. The reset kernels run over the grid
+  // cells, not the boids, so they get their own launch dimensions.
+  dim3 fullBlocksPerGridCells((gridCellCount + blockSize - 1) / blockSize);
+  kernResetIntBuffer<<<fullBlocksPerGridCells, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+  checkCUDAErrorWithLine("kernResetIntBuffer1 failed!");
+  kernResetIntBuffer<<<fullBlocksPerGridCells, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1);
+  checkCUDAErrorWithLine("kernResetIntBuffer2 failed!");
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(
+    numObjects,
+    dev_particleGridIndices,
+    dev_gridCellStartIndices,
+    dev_gridCellEndIndices
+  );
+  checkCUDAErrorWithLine("kernIdentifyCellStartEnd failed!");
+
+  // The fun part! Calculate velocity changes
+  kernUpdateVelNeighborSearchScattered<<<fullBlocksPerGrid, blockSize>>>(
+    numObjects,
+    gridSideCount,
+    gridMinimum,
+    gridInverseCellWidth,
+    gridCellWidth,
+    dev_gridCellStartIndices,
+    dev_gridCellEndIndices,
+    dev_particleArrayIndices,
+    dev_pos,
+    dev_vel1,
+    dev_vel2
+  );
+  checkCUDAErrorWithLine("kernUpdateVelNeighborSearchScattered failed!");
+
+  // Update positions
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(
+    numObjects,
+    dt,
+    dev_pos,
+    dev_vel2
+  );
+  checkCUDAErrorWithLine("kernUpdatePos failed!");
+
+  // Swap buffers and you're done!
+  std::swap(dev_vel1, dev_vel2);
 }
 
 void Boids::stepSimulationCoherentGrid(float dt) {
@@ -382,6 +975,84 @@ void Boids::stepSimulationCoherentGrid(float dt) {
   // - Perform velocity updates using neighbor search
   // - Update positions
   // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+
+  // Run the index labeling kernel
+  kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(
+    numObjects,
+    gridSideCount,
+    gridMinimum,
+    gridInverseCellWidth,
+    dev_pos,
+    dev_particleArrayIndices,
+    dev_particleGridIndices
+  );
+  checkCUDAErrorWithLine("kernComputeIndices failed!");
+
+  // Use the Thrust API to sort the boid indices by grid cell index...
+  thrust::sort_by_key(
+    dev_thrust_particleGridIndices,
+    dev_thrust_particleGridIndices + numObjects,
+    dev_thrust_particleArrayIndices
+  );
+
+  // Locate start and stop indices. The reset kernels run over the grid
+  // cells, not the boids, so they get their own launch dimensions.
+  dim3 fullBlocksPerGridCells((gridCellCount + blockSize - 1) / blockSize);
+  kernResetIntBuffer<<<fullBlocksPerGridCells, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+  checkCUDAErrorWithLine("kernResetIntBuffer1 failed!");
+  kernResetIntBuffer<<<fullBlocksPerGridCells, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1);
+  checkCUDAErrorWithLine("kernResetIntBuffer2 failed!");
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(
+    numObjects,
+    dev_particleGridIndices,
+    dev_gridCellStartIndices,
+    dev_gridCellEndIndices
+  );
+  checkCUDAErrorWithLine("kernIdentifyCellStartEnd failed!");
+
+  // The difference is here. We want the ranges in gridCellStart and gridCellEnd
+  // to be the exact indices of each boid in the pos/vel arrays. This means we
+  // must rearrange the pos/vel arrays appropriately. We can do this in a kernel!
+  // Create additional coherent pos and vel1 arrays, then copy the old pos and
+  // vel1 data into them based on the sorted values in dev_particleArrayIndices.
+  kernRearrangeCoherentData<<<fullBlocksPerGrid, blockSize>>>(
+    numObjects,
+    dev_particleArrayIndices,
+    dev_pos,
+    dev_coherent_pos,
+    dev_vel1,
+    dev_coherent_vel1
+  );
+  checkCUDAErrorWithLine("kernRearrangeCoherentData failed!");
+
+  // Data has been rearranged coherently; copy it back to the original buffers.
+  cudaMemcpy(dev_pos, dev_coherent_pos, numObjects * sizeof(glm::vec3), cudaMemcpyDeviceToDevice);
+  cudaMemcpy(dev_vel1, dev_coherent_vel1, numObjects * sizeof(glm::vec3), cudaMemcpyDeviceToDevice);
+
+  // The fun part! Calculate velocity changes
+  kernUpdateVelNeighborSearchCoherent<<<fullBlocksPerGrid, blockSize>>>(
+    numObjects,
+    gridSideCount,
+    gridMinimum,
+    gridInverseCellWidth,
+    gridCellWidth,
+    dev_gridCellStartIndices,
+    dev_gridCellEndIndices,
+    dev_pos,
+    dev_vel1,
+    dev_vel2
+  );
+  checkCUDAErrorWithLine("kernUpdateVelNeighborSearchCoherent failed!");
+
+  // Update positions
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(
+    numObjects,
+    dt,
+    dev_pos,
+    dev_vel2
+  );
+  checkCUDAErrorWithLine("kernUpdatePos failed!");
+
+  // Swap buffers and you're done!
+  std::swap(dev_vel1, dev_vel2);
 }
 
 void Boids::endSimulation() {
@@ -390,6 +1061,10 @@ void Boids::endSimulation() {
   cudaFree(dev_pos);
 
   // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_particleGridIndices);
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_gridCellEndIndices);
+  cudaFree(dev_coherent_pos);
+  cudaFree(dev_coherent_vel1);
 }
 
 void Boids::unitTest() {
diff --git a/src/main.cpp b/src/main.cpp
index b82c8c6..803c9c6 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -13,9 +13,10 @@
 // ================
 
 // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
-#define VISUALIZE 1
-#define UNIFORM_GRID 0
-#define COHERENT_GRID 0
+#define VISUALIZE 0
+#define UNIFORM_GRID 1
+#define COHERENT_GRID 1
+#define GET_METRICS 1
 
 // LOOK-1.2 - change this to adjust particle count in the simulation
 const int N_FOR_VIS = 5000;
@@ -219,8 +220,21 @@ void initShaders(GLuint * program) {
   Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure
                      // your CUDA development setup is ready to go.
-
-  while (!glfwWindowShouldClose(window)) {
+
+  // For recording purposes, it is sometimes helpful to wait for user input before starting.
+  //getchar();
+
+  // Begin performance metrics.
+  int frames = 0;
+  cudaEvent_t start, stop;
+  cudaEventCreate(&start);
+  cudaEventCreate(&stop);
+
+  cudaEventRecord(start);
+  while (!glfwWindowShouldClose(window) && frames < 10000) {
+#ifdef GET_METRICS
+    frames++;
+#endif
     glfwPollEvents();
 
     frame++;
@@ -256,6 +270,17 @@ void initShaders(GLuint * program) {
     glfwSwapBuffers(window);
 #endif
   }
+  // Metrics
+  cudaEventRecord(stop);
+  cudaEventSynchronize(stop);
+  float milliseconds = 0;
+  cudaEventElapsedTime(&milliseconds, start, stop);
+
+  printf("Boids: %d\n", N_FOR_VIS);
+  printf("Uniform Grid? %d\nCoherent Grid? %d\n", UNIFORM_GRID, COHERENT_GRID);
+  printf("Average milliseconds/frame: %f\n", milliseconds / frames);
+  printf("Average FPS: %f\n", frames / (milliseconds / 1000.0f));
+
   glfwDestroyWindow(window);
   glfwTerminate();
 }
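For readers skimming the diff: the grid bookkeeping that `kernComputeIndices`, `thrust::sort_by_key`, and `kernIdentifyCellStartEnd` implement on the device can be sketched on the host in a few lines. This is a minimal CPU sketch of the same technique; `CellRanges` and `buildCellRanges` are illustrative names, not identifiers from this project.

```cpp
#include <algorithm>
#include <vector>

// Given each boid's cell label, produce (a) boid ids sorted by cell and
// (b) per-cell inclusive [start, end] ranges into that sorted order.
struct CellRanges {
  std::vector<int> sortedBoids;        // boid ids, grouped by cell
  std::vector<int> cellStart, cellEnd; // -1 marks an empty cell
};

CellRanges buildCellRanges(const std::vector<int>& gridIndices, int cellCount) {
  CellRanges r;
  r.sortedBoids.resize(gridIndices.size());
  for (int i = 0; i < (int)gridIndices.size(); i++) r.sortedBoids[i] = i;

  // Sort boid ids by cell label (thrust::sort_by_key does this on the GPU).
  std::stable_sort(r.sortedBoids.begin(), r.sortedBoids.end(),
                   [&](int a, int b) { return gridIndices[a] < gridIndices[b]; });

  // Walk the sorted order once, recording where each cell's run begins/ends.
  r.cellStart.assign(cellCount, -1);
  r.cellEnd.assign(cellCount, -1);
  for (int i = 0; i < (int)r.sortedBoids.size(); i++) {
    int cell = gridIndices[r.sortedBoids[i]];
    if (r.cellStart[cell] < 0) r.cellStart[cell] = i; // first boid in this cell
    r.cellEnd[cell] = i;                              // last boid seen so far
  }
  return r;
}
```

With cell labels `{2, 0, 2, 1, 0, 2}` and 3 cells, the sorted boid order is `{1, 4, 3, 0, 2, 5}` and the inclusive ranges are cell 0 → [0, 1], cell 1 → [2, 2], cell 2 → [3, 5]: exactly the start/end pairs a boid dereferences to find its neighbors.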