diff --git a/README.md b/README.md
index d63a6a1..d6beffb 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,131 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture,
-Project 1 - Flocking**
+Boids Flocking**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Grace Gilbert
+  * gracelgilbert.com
+* Tested on: Windows 10, i9-9900K @ 3.60GHz 64GB, GeForce RTX 2080 40860MB
 
-### (TODO: Your README)
+![](/images/CoverGif.gif)
 
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+## Overview
+This project implements a flocking simulation using the Reynolds Boids algorithm. A flocking simulation is an artificial life model in which particles (boids) behave and interact with each other in a natural way. The boids travel around the simulation space according to three rules:
+### Rule 1 - Cohesion
+This rule causes boids to move towards the other boids in their neighborhood: each boid moves towards the center of mass of the boids within a certain radius.
+
+Pseudocode for cohesion:
+```
+function rule1(Boid boid)
+
+    Vector perceived_center
+
+    foreach Boid b:
+        if b != boid and distance(b, boid) < rule1Distance then
+            perceived_center += b.position
+        endif
+    end
+
+    perceived_center /= number_of_neighbors
+
+    return (perceived_center - boid.position) * rule1Scale
+end
+```
+
+Below is the simulation with only cohesion activated. Notice how the boids clump together in small neighborhoods.
+![](/images/cohesionOnly.gif)
+
+### Rule 2 - Separation
+The separation rule keeps boids some distance away from other objects, in this case other boids. For each neighboring object within a certain radius, a boid adjusts its velocity to move away from that object and avoid it. This rule prevents the boids from continually colliding with each other.
+
+Pseudocode for separation:
+```
+function rule2(Boid boid)
+
+    Vector c = 0
+
+    foreach Boid b
+        if b != boid and distance(b, boid) < rule2Distance then
+            c -= (b.position - boid.position)
+        endif
+    end
+
+    return c * rule2Scale
+end
+```
+### Rule 3 - Alignment
+Boids try to match the direction and speed of their neighbors. Each boid looks at its neighbors within a certain radius and takes on an average of their velocities, so that nearby boids move together with similar velocities.
+
+Pseudocode for alignment:
+```
+function rule3(Boid boid)
+
+    Vector perceived_velocity
+
+    foreach Boid b
+        if b != boid and distance(b, boid) < rule3Distance then
+            perceived_velocity += b.velocity
+        endif
+    end
+
+    perceived_velocity /= number_of_neighbors
+
+    return perceived_velocity * rule3Scale
+end
+```
+### Rule integration
+Each of the three rules produces a velocity change for the boid. To obtain the boid's final velocity, we scale each of the three contributions, sum them, and add the result to the boid's velocity from the previous frame, as in the sketch below. The scale factor and neighborhood radius of each rule alter the way the boids move with each other.
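+
+To make this concrete, here is a minimal sketch of the integration step, assuming glm is available and that `rule1`, `rule2`, and `rule3` already return their scaled velocity changes; the function and parameter names are illustrative rather than the exact project code.
+```
+#include <glm/glm.hpp>
+
+// Combine the already-scaled rule outputs with the previous frame's velocity,
+// then clamp the resulting speed to maxSpeed.
+glm::vec3 integrate(glm::vec3 prevVel, glm::vec3 cohesion,
+                    glm::vec3 separation, glm::vec3 alignment, float maxSpeed) {
+    glm::vec3 v = prevVel + cohesion + separation + alignment;
+    float speed = glm::length(v);
+    return speed > maxSpeed ? v * (maxSpeed / speed) : v;
+}
+```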
+
+## Naive Approach
+The first implementation uses a naive, brute force approach and is the least efficient of the three. For each boid, we iterate over every other boid in the simulation, check whether it falls within the radius of each rule, and apply the rules accordingly. This is very inefficient: most boids are nowhere near the neighborhood of the current boid, yet we still iterate through all of them. A more efficient approach avoids iterating over boids that cannot possibly be close enough to affect the current boid.
+
+## Uniform Grid
+In this approach, we divide the simulation space into a uniform grid, so that each boid's position falls in exactly one grid cell. The ID of a grid cell is determined by transforming the position into a space where each cell has dimension 1 and the grid starts at the origin, then flooring this transformed position to get an integer 3-dimensional index of the cell. We can then use the grid to narrow down the neighborhood search space.
+### Scattered
+The first uniform grid implementation uses a scattered grid. When determining the change in velocity of a boid, we limit the search area to a box of grid cells surrounding the current boid's cell. The bounds of this box depend on the radii of the rules. To ensure that we cover all cells that could contain neighbors, we first find the maximum rule radius, i.e. the maximum of the cohesion, separation, and alignment radii. For the lower bound of the search box, we subtract this max radius from each dimension of the boid's position, convert it to grid cell space, and floor the value to include the lower bound grid cells. For the upper bound, we add the max radius, convert to cell space, and take the ceiling, so that we do not round down and miss potential neighbors. (A sketch of this computation appears after the diagram below.)
+
+This system yields a tight search box. A hard-coded search box size could overestimate and search through more grid cells than necessary, or underestimate and miss grid cells entirely.
+
+To implement this method, we start by computing the grid index of each boid, creating a map from boid index to grid cell index. We then sort this map by grid cell index, which lets us iterate through the grid cell indices and determine which boids belong to each cell: when we first hit a new grid cell index, we know we have entered a new cell, and the following boid indices belong to that cell until the next new cell index appears. Therefore, once we determine which grid cells are in our search box, we can find the boid indices to iterate over. A separate map from boid indices to their positions and velocities provides the data needed to apply the behavioral rules.
+
+The diagram below illustrates this flow of data:
+
+![buffers for generating a uniform grid using index sort](images/Boids%20Ugrids%20buffers%20naive.png)
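+
+Below is a minimal sketch of the cell-index and search-bounds computations described above, assuming glm and mirroring the quantities used in the kernels (`gridMin`, `inverseCellWidth`, `gridResolution`); it is simplified for illustration rather than copied verbatim from the project code.
+```
+#include <glm/glm.hpp>
+
+// Cell ID: shift into grid space, scale so each cell has width 1, then floor.
+int gridCellIndex(glm::vec3 pos, glm::vec3 gridMin,
+                  float inverseCellWidth, int gridResolution) {
+    glm::ivec3 cell = glm::ivec3(glm::floor((pos - gridMin) * inverseCellWidth));
+    return cell.x + cell.y * gridResolution + cell.z * gridResolution * gridResolution;
+}
+
+// Search box: pad the position by the largest rule radius, flooring the lower
+// bound and ceiling the upper bound so no cell that could hold a neighbor is missed.
+void searchBounds(glm::vec3 pos, glm::vec3 gridMin, float inverseCellWidth,
+                  float maxRadius, glm::ivec3 &minCell, glm::ivec3 &maxCell) {
+    minCell = glm::ivec3(glm::floor((pos - glm::vec3(maxRadius) - gridMin) * inverseCellWidth));
+    maxCell = glm::ivec3(glm::ceil((pos + glm::vec3(maxRadius) - gridMin) * inverseCellWidth));
+}
+```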
+
+### Coherent
+The second uniform grid implementation shares the same search algorithm as the scattered grid, but the index data is organized more efficiently. In the previous approach, once we figured out which grid cells to search, we found the boids in those cells but then had to go through another buffer, particleArrayIndices, to determine the index at which to access the position and velocity data. The coherent approach cuts out this middle step: we reshuffle the position and velocity data to match the way the boid indices were reshuffled when we sorted by grid cell index. This gives a direct mapping from grid cell start and end values to positions and velocities.
+
+The diagram below illustrates this flow of data for the coherent method:
+
+![buffers for generating a uniform grid using index sort, then making the boid data coherent](images/Boids%20Ugrids%20buffers%20data%20coherent.png)
+
+## Performance Analysis
+The following are plots of the frames per second in relation to both the number of boids in the simulation and the block size.
+
+![](/images/FPStoNumBoids.PNG)
+
+![](/images/FPStoBlockSize.PNG)
+
+## Bugs and Errors
+With some numbers of boids, the simulation behaves incorrectly. The following is the simulation run with 160000 boids:
+
+![](/images/bugUpDown.gif)
+
+I am unsure exactly what is happening here, but it seems that at some boid counts my velocity calculation breaks. At other counts, many of the boids disappear, which suggests NaN or extremely large velocity values, but I was unable to determine where these bugs come from or how to fix them.
+
+## Questions
+### For each implementation, how does changing the number of boids affect performance? Why do you think this is?
+
+For the naive implementation, increasing the number of boids slows the framerate quadratically. Every boid must iterate through every other boid, so the number of comparisons is O(n^2): doubling n quadruples the computation.
+
+For the uniform grid implementations, the number of boids affects performance more gradually and less predictably. The runtime of this method depends on how many neighbors each boid has, so adding more boids may not drastically change how many neighbors must be iterated over. At some point, however, more boids produce denser flocking, leaving more nearby boids to search through. The framerate also drops as the boids flock together, since boids that are closer together have more neighbors. For the coherent version, the framerate does not start at its peak and instead increases at first with more boids, perhaps because the overhead of the reshuffling does not pay off until there are more boids.
+
+### For each implementation, how does changing the block count and block size affect performance? Why do you think this is?
+
+Changing the block count did not dramatically affect performance. For the naive implementation, the framerate dropped at a block size of 512, perhaps because the blocks were larger than necessary and some threads went unused. I noticed a similar pattern in the scattered grid implementation. For the coherent grid, the behavior was more erratic, possibly because the framerate depended so heavily on the behavior of the boids.
+
+If I were to redo the performance analysis, I would measure performance in a way that depends less on the specific simulation run, averaging the framerate over the course of the simulation, as sketched below.
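+
+One way to do this (an illustrative sketch, not part of the project code) is to time each step with CUDA events and keep a running average; `totalMs` and `frameCount` are assumed accumulators declared elsewhere in the main loop:
+```
+cudaEvent_t start, stop;
+cudaEventCreate(&start);
+cudaEventCreate(&stop);
+
+cudaEventRecord(start);
+Boids::stepSimulationCoherentGrid(DT);   // the step being measured
+cudaEventRecord(stop);
+cudaEventSynchronize(stop);
+
+float ms = 0.0f;
+cudaEventElapsedTime(&ms, start, stop);  // elapsed milliseconds for this step
+totalMs += ms;                           // running average = totalMs / frameCount
+frameCount++;
+```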
+
+### For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?
+
+I did experience performance improvements with the more coherent uniform grid. It was not dramatically faster than the scattered grid, but it was consistently faster in every test. This is what I expected overall, as the optimization cuts out the middle indexing step. However, I did expect there to be some cases, perhaps at smaller boid counts, where the overhead of the shuffling would cost more than the middle indexing step saves, making the scattered version faster in some tests. Something about my setup may keep that overhead cheap, or I may not have tested low enough boid counts to see this effect.
+
+### Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not?
+
+Changing the cell width does affect performance. Increasing the size of the cells means fewer cells must be checked, since each encompasses more volume. However, it also means the maximum rule radius might only just touch a larger cell, and the boid then has to search through every boid within it, potentially adding many more iterations. There must therefore be a balance between the number of cells searched and the size of the searched cells. As a concrete comparison, with neighborhood radius r and cells of width 2r, a boid's search box overlaps at most 2 cells per axis, i.e. 8 cells covering a volume of (4r)^3 = 64r^3; with cells of width r it overlaps at most 3 cells per axis, i.e. 27 cells covering only (3r)^3 = 27r^3. The extra iterations that come with larger cells can build up faster than the extra cells to check, so the optimal cell width tends to be smaller rather than larger. As for checking 27 vs 8 neighboring cells, my implementation creates a bounding box around the boid that depends on its maximum rule radius, which ensures that we neither search too wide a box nor miss surrounding cells.
diff --git a/images/CoverGif.gif b/images/CoverGif.gif
new file mode 100644
index 0000000..7619c26
Binary files /dev/null and b/images/CoverGif.gif differ
diff --git a/images/FPMtoBlockSize.PNG b/images/FPMtoBlockSize.PNG
new file mode 100644
index 0000000..e7791ae
Binary files /dev/null and b/images/FPMtoBlockSize.PNG differ
diff --git a/images/FPMtoNumBoids.PNG b/images/FPMtoNumBoids.PNG
new file mode 100644
index 0000000..3d482a8
Binary files /dev/null and b/images/FPMtoNumBoids.PNG differ
diff --git a/images/FPStoBlockSize.PNG b/images/FPStoBlockSize.PNG
new file mode 100644
index 0000000..50a6a06
Binary files /dev/null and b/images/FPStoBlockSize.PNG differ
diff --git a/images/FPStoNumBoids.PNG b/images/FPStoNumBoids.PNG
new file mode 100644
index 0000000..2295200
Binary files /dev/null and b/images/FPStoNumBoids.PNG differ
diff --git a/images/TestRecording.gif b/images/TestRecording.gif
new file mode 100644
index 0000000..0478df9
Binary files /dev/null and b/images/TestRecording.gif differ
diff --git a/images/boids30000Coherent.PNG b/images/boids30000Coherent.PNG
new file mode 100644
index 0000000..37a3256
Binary files /dev/null and b/images/boids30000Coherent.PNG differ
diff --git a/images/bugUpDown.gif b/images/bugUpDown.gif
new file mode 100644
index 0000000..a243615
Binary files /dev/null and b/images/bugUpDown.gif differ
diff --git a/images/cohesionOnly.gif b/images/cohesionOnly.gif
new file mode 100644
index 0000000..6ac26a2
Binary files /dev/null and b/images/cohesionOnly.gif differ
diff --git a/images/whiteScreenBug.PNG b/images/whiteScreenBug.PNG
new file mode 100644
index 0000000..8c03fec
Binary files /dev/null and b/images/whiteScreenBug.PNG differ
diff --git a/src/kernel.cu b/src/kernel.cu
index 74dffcb..e0a9a45 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -21,14 +21,14 @@
  * Check for CUDA errors; print and exit if there was a problem.
  */
 void checkCUDAError(const char *msg, int line = -1) {
-  cudaError_t err = cudaGetLastError();
-  if (cudaSuccess != err) {
-    if (line >= 0) {
-      fprintf(stderr, "Line %d: ", line);
-    }
-    fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err));
-    exit(EXIT_FAILURE);
-  }
+  cudaError_t err = cudaGetLastError();
+  if (cudaSuccess != err) {
+    if (line >= 0) {
+      fprintf(stderr, "Line %d: ", line);
+    }
+    fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err));
+    exit(EXIT_FAILURE);
+  }
 }
@@ -85,6 +85,8 @@ int *dev_gridCellEndIndices;   // to this cell?
 // TODO-2.3 - consider what additional buffers you might need to reshuffle
 // the position and velocity data to be coherent within cells.
+glm::vec3 *dev_reshuffledPos;
+glm::vec3 *dev_reshuffledVel1;
 
 // LOOK-2.1 - Grid parameters based on simulation parameters.
 // These are automatically computed for you in Boids::initSimulation
@@ -99,13 +101,13 @@ glm::vec3 gridMinimum;
 ******************/
 
 __host__ __device__ unsigned int hash(unsigned int a) {
-  a = (a + 0x7ed55d16) + (a << 12);
-  a = (a ^ 0xc761c23c) ^ (a >> 19);
-  a = (a + 0x165667b1) + (a << 5);
-  a = (a + 0xd3a2646c) ^ (a << 9);
-  a = (a + 0xfd7046c5) + (a << 3);
-  a = (a ^ 0xb55a4f09) ^ (a >> 16);
-  return a;
+  a = (a + 0x7ed55d16) + (a << 12);
+  a = (a ^ 0xc761c23c) ^ (a >> 19);
+  a = (a + 0x165667b1) + (a << 5);
+  a = (a + 0xd3a2646c) ^ (a << 9);
+  a = (a + 0xfd7046c5) + (a << 3);
+  a = (a ^ 0xb55a4f09) ^ (a >> 16);
+  return a;
 }
 
 /**
@@ -113,10 +115,10 @@ __host__ __device__ unsigned int hash(unsigned int a) {
 * Function for generating a random vec3.
 */
 __host__ __device__ glm::vec3 generateRandomVec3(float time, int index) {
-  thrust::default_random_engine rng(hash((int)(index * time)));
-  thrust::uniform_real_distribution<float> unitDistrib(-1, 1);
+  thrust::default_random_engine rng(hash((int)(index * time)));
+  thrust::uniform_real_distribution<float> unitDistrib(-1, 1);
 
-  return glm::vec3((float)unitDistrib(rng), (float)unitDistrib(rng), (float)unitDistrib(rng));
+  return glm::vec3((float)unitDistrib(rng), (float)unitDistrib(rng), (float)unitDistrib(rng));
 }
 
 /**
@@ -124,52 +126,77 @@ __host__ __device__ glm::vec3 generateRandomVec3(float time, int index) {
 * CUDA kernel for generating boids with a specified mass randomly around the star.
 */
 __global__ void kernGenerateRandomPosArray(int time, int N, glm::vec3 * arr, float scale) {
-  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
-  if (index < N) {
-    glm::vec3 rand = generateRandomVec3(time, index);
-    arr[index].x = scale * rand.x;
-    arr[index].y = scale * rand.y;
-    arr[index].z = scale * rand.z;
-  }
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index < N) {
+    glm::vec3 rand = generateRandomVec3(time, index);
+    arr[index].x = scale * rand.x;
+    arr[index].y = scale * rand.y;
+    arr[index].z = scale * rand.z;
+  }
 }
 
 /**
 * Initialize memory, update some globals
 */
 void Boids::initSimulation(int N) {
-  numObjects = N;
-  dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);
-
-  // LOOK-1.2 - This is basic CUDA memory management and error checking.
-  // Don't forget to cudaFree in Boids::endSimulation.
-  cudaMalloc((void**)&dev_pos, N * sizeof(glm::vec3));
-  checkCUDAErrorWithLine("cudaMalloc dev_pos failed!");
-
-  cudaMalloc((void**)&dev_vel1, N * sizeof(glm::vec3));
-  checkCUDAErrorWithLine("cudaMalloc dev_vel1 failed!");
-
-  cudaMalloc((void**)&dev_vel2, N * sizeof(glm::vec3));
-  checkCUDAErrorWithLine("cudaMalloc dev_vel2 failed!");
-
-  // LOOK-1.2 - This is a typical CUDA kernel invocation.
-  kernGenerateRandomPosArray<<<fullBlocksPerGrid, blockSize>>>(1, numObjects,
-    dev_pos, scene_scale);
-  checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!");
-
-  // LOOK-2.1 computing grid params
-  gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
-  int halfSideCount = (int)(scene_scale / gridCellWidth) + 1;
-  gridSideCount = 2 * halfSideCount;
-
-  gridCellCount = gridSideCount * gridSideCount * gridSideCount;
-  gridInverseCellWidth = 1.0f / gridCellWidth;
-  float halfGridWidth = gridCellWidth * halfSideCount;
-  gridMinimum.x -= halfGridWidth;
-  gridMinimum.y -= halfGridWidth;
-  gridMinimum.z -= halfGridWidth;
-
-  // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
-  cudaDeviceSynchronize();
+  numObjects = N;
+  dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);
+
+  // LOOK-1.2 - This is basic CUDA memory management and error checking.
+  // Don't forget to cudaFree in Boids::endSimulation.
+  cudaMalloc((void**)&dev_pos, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_pos failed!");
+
+  cudaMalloc((void**)&dev_vel1, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_vel1 failed!");
+
+  cudaMalloc((void**)&dev_vel2, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_vel2 failed!");
+
+  // LOOK-1.2 - This is a typical CUDA kernel invocation.
+  kernGenerateRandomPosArray<<<fullBlocksPerGrid, blockSize>>>(1, numObjects,
+    dev_pos, scene_scale);
+  checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!");
+
+  // LOOK-2.1 computing grid params
+  gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
+  int halfSideCount = (int)(scene_scale / gridCellWidth) + 1;
+  gridSideCount = 2 * halfSideCount;
+
+  gridCellCount = gridSideCount * gridSideCount * gridSideCount;
+  gridInverseCellWidth = 1.0f / gridCellWidth;
+  float halfGridWidth = gridCellWidth * halfSideCount;
+  gridMinimum.x -= halfGridWidth;
+  gridMinimum.y -= halfGridWidth;
+  gridMinimum.z -= halfGridWidth;
+
+  // TODO-2.1 - Allocate additional buffers here.
+  cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+  cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+  dev_thrust_particleArrayIndices = thrust::device_pointer_cast(dev_particleArrayIndices);
+  dev_thrust_particleGridIndices = thrust::device_pointer_cast(dev_particleGridIndices);
+
+  // TODO-2.3 - Allocate additional buffers here.
+  cudaMalloc((void**)&dev_reshuffledPos, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_reshuffledPos failed!");
+
+  cudaMalloc((void**)&dev_reshuffledVel1, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_reshuffledVel1 failed!");
+
+  cudaDeviceSynchronize();
 }
@@ -181,41 +208,41 @@ void Boids::initSimulation(int N) {
 * Copy the boid positions into the VBO so that they can be drawn by OpenGL.
 */
__global__ void kernCopyPositionsToVBO(int N, glm::vec3 *pos, float *vbo, float s_scale) {
-  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
 
-  float c_scale = -1.0f / s_scale;
+  float c_scale = -1.0f / s_scale;
 
-  if (index < N) {
-    vbo[4 * index + 0] = pos[index].x * c_scale;
-    vbo[4 * index + 1] = pos[index].y * c_scale;
-    vbo[4 * index + 2] = pos[index].z * c_scale;
-    vbo[4 * index + 3] = 1.0f;
-  }
+  if (index < N) {
+    vbo[4 * index + 0] = pos[index].x * c_scale;
+    vbo[4 * index + 1] = pos[index].y * c_scale;
+    vbo[4 * index + 2] = pos[index].z * c_scale;
+    vbo[4 * index + 3] = 1.0f;
+  }
 }
 
 __global__ void kernCopyVelocitiesToVBO(int N, glm::vec3 *vel, float *vbo, float s_scale) {
-  int index = threadIdx.x + (blockIdx.x * blockDim.x);
-
-  if (index < N) {
-    vbo[4 * index + 0] = vel[index].x + 0.3f;
-    vbo[4 * index + 1] = vel[index].y + 0.3f;
-    vbo[4 * index + 2] = vel[index].z + 0.3f;
-    vbo[4 * index + 3] = 1.0f;
-  }
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+
+  if (index < N) {
+    vbo[4 * index + 0] = vel[index].x + 0.3f;
+    vbo[4 * index + 1] = vel[index].y + 0.3f;
+    vbo[4 * index + 2] = vel[index].z + 0.3f;
+    vbo[4 * index + 3] = 1.0f;
+  }
 }
 
 /**
 * Wrapper for call to the kernCopyboidsToVBO CUDA kernel.
 */
 void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) {
-  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
 
-  kernCopyPositionsToVBO<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, vbodptr_positions, scene_scale);
-  kernCopyVelocitiesToVBO<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_vel1, vbodptr_velocities, scene_scale);
+  kernCopyPositionsToVBO<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, vbodptr_positions, scene_scale);
+  kernCopyVelocitiesToVBO<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_vel1, vbodptr_velocities, scene_scale);
 
-  checkCUDAErrorWithLine("copyBoidsToVBO failed!");
+  checkCUDAErrorWithLine("copyBoidsToVBO failed!");
 
-  cudaDeviceSynchronize();
+  cudaDeviceSynchronize();
 }
 
@@ -223,6 +250,10 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
 * stepSimulation *
 ******************/
 
+__device__ float distanceSq(const glm::vec3 p1, const glm::vec3 p2) {
+  // Multiply components rather than calling pow(x, 2): pow is a
+  // double-precision routine on the device and is needlessly expensive here.
+  glm::vec3 d = p1 - p2;
+  return d.x * d.x + d.y * d.y + d.z * d.z;
+}
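+
+// distanceSq is compared against squared rule distances in the kernels below,
+// which avoids taking a square root for every neighbor test.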
 
 /**
 * LOOK-1.2 You can use this as a helper for kernUpdateVelocityBruteForce.
 * __device__ code can be called from a __global__ context
@@ -230,10 +261,67 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
 * in the `pos` and `vel` arrays.
 */
__device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) {
-  // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
-  // Rule 2: boids try to stay a distance d away from each other
-  // Rule 3: boids try to match the speed of surrounding boids
-  return glm::vec3(0.0f, 0.0f, 0.0f);
+  const glm::vec3 thisPos = pos[iSelf];
+
+  glm::vec3 cohVel = glm::vec3(0.f);
+  glm::vec3 sepVel = glm::vec3(0.f);
+  glm::vec3 alignVel = glm::vec3(0.f);
+
+  glm::vec3 perceived_center = glm::vec3(0.f);
+  glm::vec3 perceived_velocity = glm::vec3(0.f);
+
+  float rule1Neighbors = 0.f;
+  float rule3Neighbors = 0.f;
+
+  for (int i = 0; i < N; ++i) {
+    if (i == iSelf) {
+      continue;
+    }
+    glm::vec3 bPos = pos[i];
+    float dSq = distanceSq(bPos, thisPos);
+
+    // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+    if (dSq < rule1Distance * rule1Distance) {
+      perceived_center += bPos;
+      rule1Neighbors++;
+    }
+
+    // Rule 2: boids try to stay a distance d away from each other
+    if (dSq < rule2Distance * rule2Distance) {
+      sepVel -= (bPos - thisPos);
+    }
+
+    // Rule 3: boids try to match the speed of surrounding boids
+    if (dSq < rule3Distance * rule3Distance) {
+      perceived_velocity += vel[i];
+      rule3Neighbors++;
+    }
+  }
+
+  if (rule1Neighbors > 0) {
+    perceived_center /= rule1Neighbors;
+    cohVel = (perceived_center - thisPos) * rule1Scale;
+  }
+
+  sepVel *= rule2Scale;
+
+  if (rule3Neighbors > 0) {
+    perceived_velocity /= rule3Neighbors;
+    // Bug fix: alignment was scaled by rule2Scale, which skews the rule
+    // weighting; the alignment contribution must use rule3Scale.
+    alignVel = perceived_velocity * rule3Scale;
+  }
+
+  return cohVel + sepVel + alignVel;
 }
 
 /**
@@ -241,10 +329,32 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po
 * For each of the `N` bodies, update its position based on its current velocity.
 */
 __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
-  glm::vec3 *vel1, glm::vec3 *vel2) {
-  // Compute a new velocity based on pos and vel1
-  // Clamp the speed
-  // Record the new velocity into vel2. Question: why NOT vel1?
+  glm::vec3 *vel1, glm::vec3 *vel2) {
+
+  // Compute a new velocity based on pos and vel1
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
+
+  glm::vec3 outVel = computeVelocityChange(N, index, pos, vel1);
+  outVel += vel1[index];
+
+  // Clamp the speed
+  float speedSq = distanceSq(outVel, glm::vec3(0.0));
+  if (speedSq > maxSpeed * maxSpeed) {
+    outVel /= sqrt(speedSq);
+    outVel *= maxSpeed;
+  }
+
+  // Record the new velocity into vel2. Question: why NOT vel1?
+  vel2[index] = outVel;
 }
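+
+// Why vel2 and not vel1: other threads are still reading their neighbors'
+// velocities out of vel1 while this kernel runs, so writing results back into
+// vel1 would race with those reads. New velocities go into vel2, and the host
+// ping-pongs the two buffers after the kernel completes.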
 
 /**
@@ -252,24 +362,24 @@ __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
 * For each of the `N` bodies, update its position based on its current velocity.
 */
 __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) {
-  // Update position by velocity
-  int index = threadIdx.x + (blockIdx.x * blockDim.x);
-  if (index >= N) {
-    return;
-  }
-  glm::vec3 thisPos = pos[index];
-  thisPos += vel[index] * dt;
-
-  // Wrap the boids around so we don't lose them
-  thisPos.x = thisPos.x < -scene_scale ? scene_scale : thisPos.x;
-  thisPos.y = thisPos.y < -scene_scale ? scene_scale : thisPos.y;
-  thisPos.z = thisPos.z < -scene_scale ? scene_scale : thisPos.z;
-
-  thisPos.x = thisPos.x > scene_scale ? -scene_scale : thisPos.x;
-  thisPos.y = thisPos.y > scene_scale ? -scene_scale : thisPos.y;
-  thisPos.z = thisPos.z > scene_scale ? -scene_scale : thisPos.z;
-
-  pos[index] = thisPos;
+  // Update position by velocity
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
+  glm::vec3 thisPos = pos[index];
+  thisPos += vel[index] * dt;
+
+  // Wrap the boids around so we don't lose them
+  thisPos.x = thisPos.x < -scene_scale ? scene_scale : thisPos.x;
+  thisPos.y = thisPos.y < -scene_scale ? scene_scale : thisPos.y;
+  thisPos.z = thisPos.z < -scene_scale ? scene_scale : thisPos.z;
+
+  thisPos.x = thisPos.x > scene_scale ? -scene_scale : thisPos.x;
+  thisPos.y = thisPos.y > scene_scale ? -scene_scale : thisPos.y;
+  thisPos.z = thisPos.z > scene_scale ? -scene_scale : thisPos.z;
+
+  pos[index] = thisPos;
 }
 
 // LOOK-2.1 Consider this method of computing a 1D index from a 3D grid index.
@@ -279,179 +389,509 @@ __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) {
 // for(y)
 // for(z)? Or some other order?
 __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) {
-  return x + y * gridResolution + z * gridResolution * gridResolution;
+  return x + y * gridResolution + z * gridResolution * gridResolution;
 }
 
 __global__ void kernComputeIndices(int N, int gridResolution,
-  glm::vec3 gridMin, float inverseCellWidth,
-  glm::vec3 *pos, int *indices, int *gridIndices) {
-    // TODO-2.1
-    // - Label each boid with the index of its grid cell.
-    // - Set up a parallel array of integer indices as pointers to the actual
-    //   boid data in pos and vel1/vel2
+  glm::vec3 gridMin, float inverseCellWidth,
+  glm::vec3 *pos, int *indices, int *gridIndices) {
+  // TODO-2.1
+
+  // Get index of current thread
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= N) {
+    return;
+  }
+
+  // - Label each boid with the index of its grid cell.
+  glm::vec3 thisPos = pos[index];
+  glm::ivec3 cellValue = glm::floor((thisPos - gridMin) * inverseCellWidth);
+  gridIndices[index] = gridIndex3Dto1D(cellValue.x, cellValue.y, cellValue.z, gridResolution);
+
+  // - Set up a parallel array of integer indices as pointers to the actual
+  //   boid data in pos and vel1/vel2
+  indices[index] = index;
 }
 
 // LOOK-2.1 Consider how this could be useful for indicating that a cell
 // does not enclose any boids
 __global__ void kernResetIntBuffer(int N, int *intBuffer, int value) {
-  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
-  if (index < N) {
-    intBuffer[index] = value;
-  }
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index < N) {
+    intBuffer[index] = value;
+  }
 }
 
 __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
-  int *gridCellStartIndices, int *gridCellEndIndices) {
-  // TODO-2.1
-  // Identify the start point of each cell in the gridIndices array.
-  // This is basically a parallel unrolling of a loop that goes
-  // "this index doesn't match the one before it, must be a new cell!"
+  int *gridCellStartIndices, int *gridCellEndIndices) {
+  // TODO-2.1
+  // Identify the start point of each cell in the gridIndices array.
+  // This is basically a parallel unrolling of a loop that goes
+  // "this index doesn't match the one before it, must be a new cell!"
+
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= N) {
+    return;
+  }
+
+  int currGridIndex = particleGridIndices[index];
+
+  // If first index, set start index of currGridIndex to 0
+  if (index == 0) {
+    gridCellStartIndices[currGridIndex] = index;  // index = 0
+    return;
+  }
+
+  int prevGridIndex = particleGridIndices[index - 1];
+  if (prevGridIndex != currGridIndex) {
+    gridCellStartIndices[currGridIndex] = index;
+    gridCellEndIndices[prevGridIndex] = index;
+    // Note: the cell does not include the end index value, only end index - 1
+  }
+
+  // If last index, set the end index of the currGridIndex to N
+  if (index == N - 1) {
+    gridCellEndIndices[currGridIndex] = index + 1;  // index + 1 = N
+  }
 }
 
 __global__ void kernUpdateVelNeighborSearchScattered(
-  int N, int gridResolution, glm::vec3 gridMin,
-  float inverseCellWidth, float cellWidth,
-  int *gridCellStartIndices, int *gridCellEndIndices,
-  int *particleArrayIndices,
-  glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
-  // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce
-  // the number of boids that need to be checked.
-  // - Identify the grid cell that this particle is in
-  // - Identify which cells may contain neighbors. This isn't always 8.
-  // - For each cell, read the start/end indices in the boid pointer array.
-  // - Access each boid in the cell and compute velocity change from
-  //   the boids rules, if this boid is within the neighborhood distance.
-  // - Clamp the speed change before putting the new speed in vel2
+  int N, int gridResolution, glm::vec3 gridMin,
+  float inverseCellWidth, float cellWidth,
+  int *gridCellStartIndices, int *gridCellEndIndices,
+  int *particleArrayIndices,
+  glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
+
+  // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce
+  // the number of boids that need to be checked.
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= N) {
+    return;
+  }
+
+  // - Identify the grid cell that this particle is in
+  glm::vec3 thisPos = pos[index];
+
+  // - Identify which cells may contain neighbors. This isn't always 8.
+  float maxRuleDist = glm::max(rule1Distance, glm::max(rule2Distance, rule3Distance));
+  glm::vec3 maxRuleDistVec(maxRuleDist);
+  // Bounding indices, clamped to the valid cell range 0 .. gridResolution - 1.
+  // Bug fix: the upper clamp was gridResolution, one past the last valid cell,
+  // which let gridIndex3Dto1D read out of bounds at the grid edge.
+  glm::ivec3 minPosCellIndex3D = glm::clamp(glm::floor((thisPos - maxRuleDistVec - gridMin) * inverseCellWidth),
+    glm::vec3(0.0), glm::vec3(gridResolution - 1));
+  glm::ivec3 maxPosCellIndex3D = glm::clamp(glm::ceil((thisPos + maxRuleDistVec - gridMin) * inverseCellWidth),
+    glm::vec3(0.0), glm::vec3(gridResolution - 1));
+
+  // - For each cell, read the start/end indices in the boid pointer array.
+
+  glm::vec3 cohVel = glm::vec3(0.f);
+  glm::vec3 sepVel = glm::vec3(0.f);
+  glm::vec3 alignVel = glm::vec3(0.f);
+
+  glm::vec3 perceived_center = glm::vec3(0.f);
+  glm::vec3 perceived_velocity = glm::vec3(0.f);
+
+  float rule1Neighbors = 0.f;
+  float rule3Neighbors = 0.f;
+
+  for (int z = minPosCellIndex3D.z; z <= maxPosCellIndex3D.z; ++z) {
+    for (int y = minPosCellIndex3D.y; y <= maxPosCellIndex3D.y; ++y) {
+      for (int x = minPosCellIndex3D.x; x <= maxPosCellIndex3D.x; ++x) {
+        // Get boids in the current cell:
+        int currCell = gridIndex3Dto1D(x, y, z, gridResolution);
+        int currCellStart = gridCellStartIndices[currCell];
+        int currCellEnd = gridCellEndIndices[currCell];
+
+        // Skip empty cells (start/end stay -1 after the reset kernel).
+        // Bug fix: the end index is exclusive and equals N for the last cell,
+        // so the old check currCellEnd >= N wrongly skipped that cell.
+        if (currCellStart < 0 || currCellEnd < 0) {
+          continue;
+        }
+
+        // - Access each boid in the cell and compute velocity change from
+        //   the boids rules, if this boid is within the neighborhood distance.
+        for (int b = currCellStart; b < currCellEnd; ++b) {
+          // This loop gets all the boid indices in the current cell
+          int currBoid = particleArrayIndices[b];
+          if (currBoid == index) {
+            continue;
+          }
+
+          glm::vec3 bPos = pos[currBoid];
+          float distSq = distanceSq(bPos, thisPos);
+          // Rule 1:
+          if (distSq < rule1Distance * rule1Distance) {
+            perceived_center += bPos;
+            rule1Neighbors++;
+          }
+          // Rule 2:
+          if (distSq < rule2Distance * rule2Distance) {
+            sepVel -= (bPos - thisPos);
+          }
+          // Rule 3:
+          if (distSq < rule3Distance * rule3Distance) {
+            perceived_velocity += vel1[currBoid];
+            rule3Neighbors++;
+          }
+        }
+      }
+    }
+  }
+
+  if (rule1Neighbors > 0) {
+    perceived_center /= rule1Neighbors;
+    cohVel = (perceived_center - thisPos) * rule1Scale;
+  }
+
+  sepVel *= rule2Scale;
+  if (rule3Neighbors > 0) {
+    perceived_velocity /= rule3Neighbors;
+    alignVel = perceived_velocity * rule3Scale;
+  }
+
+  glm::vec3 outVel = vel1[index] + cohVel + sepVel + alignVel;
+
+  // - Clamp the speed change before putting the new speed in vel2
+  float speedSq = distanceSq(outVel, glm::vec3(0.f));
+  if (speedSq > maxSpeed * maxSpeed) {
+    outVel = glm::normalize(outVel) * maxSpeed;
+  }
+
+  // Record the new velocity into vel2 (ping-ponged by the host afterwards).
+  vel2[index] = outVel;
 }
 
 __global__ void kernUpdateVelNeighborSearchCoherent(
-  int N, int gridResolution, glm::vec3 gridMin,
-  float inverseCellWidth, float cellWidth,
-  int *gridCellStartIndices, int *gridCellEndIndices,
-  glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
-  // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered,
-  // except with one less level of indirection.
-  // This should expect gridCellStartIndices and gridCellEndIndices to refer
-  // directly to pos and vel1.
-  // - Identify the grid cell that this particle is in
-  // - Identify which cells may contain neighbors. This isn't always 8.
-  // - For each cell, read the start/end indices in the boid pointer array.
-  //   DIFFERENCE: For best results, consider what order the cells should be
-  //   checked in to maximize the memory benefits of reordering the boids data.
-  // - Access each boid in the cell and compute velocity change from
-  //   the boids rules, if this boid is within the neighborhood distance.
-  // - Clamp the speed change before putting the new speed in vel2
+  int N, int gridResolution, glm::vec3 gridMin,
+  float inverseCellWidth, float cellWidth,
+  int *gridCellStartIndices, int *gridCellEndIndices,
+  glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
+  // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered,
+  // except with one less level of indirection.
+  // This should expect gridCellStartIndices and gridCellEndIndices to refer
+  // directly to pos and vel1.
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= N) {
+    return;
+  }
+
+  // - Identify the grid cell that this particle is in
+  glm::vec3 thisPos = pos[index];
+
+  // - Identify which cells may contain neighbors. This isn't always 8.
+  float maxRuleDist = glm::max(rule1Distance, glm::max(rule2Distance, rule3Distance));
+  glm::vec3 maxRuleDistVec(maxRuleDist);
+  // Bounding indices:
+  glm::ivec3 minPosCellIndex3D = glm::floor((thisPos - maxRuleDistVec - gridMin) * inverseCellWidth);
+  glm::ivec3 maxPosCellIndex3D = glm::ceil((thisPos + maxRuleDistVec - gridMin) * inverseCellWidth);
+
+  // - For each cell, read the start/end indices in the boid pointer array.
+  //   DIFFERENCE: For best results, consider what order the cells should be
+  //   checked in to maximize the memory benefits of reordering the boids data.
+  glm::vec3 cohVel = glm::vec3(0.f);
+  glm::vec3 sepVel = glm::vec3(0.f);
+  glm::vec3 alignVel = glm::vec3(0.f);
+
+  glm::vec3 perceived_center = glm::vec3(0.f);
+  glm::vec3 perceived_velocity = glm::vec3(0.f);
+
+  float rule1Neighbors = 0.f;
+  float rule3Neighbors = 0.f;
+
+  // x varies fastest, so consecutive cells are adjacent in gridIndex3Dto1D order.
+  // Bug fix: the loop bounds were capped at gridResolution, one past the last
+  // valid cell index; cap at gridResolution - 1 instead.
+  for (int z = glm::max(minPosCellIndex3D.z, 0); z <= glm::min(maxPosCellIndex3D.z, gridResolution - 1); ++z) {
+    for (int y = glm::max(minPosCellIndex3D.y, 0); y <= glm::min(maxPosCellIndex3D.y, gridResolution - 1); ++y) {
+      for (int x = glm::max(minPosCellIndex3D.x, 0); x <= glm::min(maxPosCellIndex3D.x, gridResolution - 1); ++x) {
+        // Get boids in the current cell:
+        int currCell = gridIndex3Dto1D(x, y, z, gridResolution);
+        int currCellStart = gridCellStartIndices[currCell];
+        int currCellEnd = gridCellEndIndices[currCell];
+
+        // Skip empty cells (start/end stay -1 after the reset kernel); the
+        // exclusive end index may legitimately equal N, so don't reject it.
+        if (currCellStart < 0 || currCellEnd < 0) {
+          continue;
+        }
+
+        // - Access each boid in the cell and compute velocity change from
+        //   the boids rules, if this boid is within the neighborhood distance.
+        for (int b = currCellStart; b < currCellEnd; ++b) {
+          // pos and vel1 are already reshuffled into cell order, so b indexes
+          // them directly; no particleArrayIndices lookup is needed.
+          if (b == index) {
+            continue;
+          }
+
+          glm::vec3 bPos = pos[b];
+          float distSq = distanceSq(bPos, thisPos);
+          // Rule 1:
+          if (distSq < rule1Distance * rule1Distance) {
+            perceived_center += bPos;
+            rule1Neighbors++;
+          }
+          // Rule 2:
+          if (distSq < rule2Distance * rule2Distance) {
+            sepVel -= (bPos - thisPos);
+          }
+          // Rule 3:
+          if (distSq < rule3Distance * rule3Distance) {
+            perceived_velocity += vel1[b];
+            rule3Neighbors++;
+          }
+        }
+      }
+    }
+  }
+
+  if (rule1Neighbors > 0) {
+    perceived_center /= rule1Neighbors;
+    cohVel = (perceived_center - thisPos) * rule1Scale;
+  }
+
+  sepVel *= rule2Scale;
+  if (rule3Neighbors > 0) {
+    perceived_velocity /= rule3Neighbors;
+    alignVel = perceived_velocity * rule3Scale;
+  }
+
+  glm::vec3 outVel = vel1[index] + cohVel + sepVel + alignVel;
+
+  // - Clamp the speed change before putting the new speed in vel2
+  float speedSq = distanceSq(outVel, glm::vec3(0.f));
+  if (speedSq > maxSpeed * maxSpeed) {
+    outVel = (outVel / sqrt(speedSq)) * maxSpeed;
+  }
+
+  vel2[index] = outVel;
 }
 
 /**
 * Step the entire N-body simulation by `dt` seconds.
 */
 void Boids::stepSimulationNaive(float dt) {
-  // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
-  // TODO-1.2 ping-pong the velocity buffers
+  // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
+  dim3 blocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  kernUpdateVelocityBruteForce<<<blocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2);
+  checkCUDAErrorWithLine("Brute force velocity errored!");
+  kernUpdatePos<<<blocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel1);
+  checkCUDAErrorWithLine("update pos errored!");
+
+  // TODO-1.2 ping-pong the velocity buffers
+  cudaMemcpy(dev_vel1, dev_vel2, numObjects * sizeof(glm::vec3), cudaMemcpyDeviceToDevice);
 }
 
 void Boids::stepSimulationScatteredGrid(float dt) {
-  // TODO-2.1
-  // Uniform Grid Neighbor search using Thrust sort.
-  // In Parallel:
-  // - label each particle with its array index as well as its grid index.
-  //   Use 2x width grids.
-  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
-  //   are welcome to do a performance comparison.
-  // - Naively unroll the loop for finding the start and end indices of each
-  //   cell's data pointers in the array of boid indices
-  // - Perform velocity updates using neighbor search
-  // - Update positions
-  // - Ping-pong buffers as needed
+  // TODO-2.1
+  // Uniform Grid Neighbor search using Thrust sort.
+  // In Parallel:
+  // - label each particle with its array index as well as its grid index.
+  //   Use 2x width grids.
+  dim3 blocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  kernComputeIndices<<<blocksPerGrid, blockSize>>>(
+    numObjects,
+    gridSideCount,
+    gridMinimum,
+    gridInverseCellWidth,
+    dev_pos,
+    dev_particleArrayIndices,
+    dev_particleGridIndices);
+  checkCUDAErrorWithLine("failed at kernComputeIndices!");
+
+  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
+  //   are welcome to do a performance comparison.
+  thrust::sort_by_key(dev_thrust_particleGridIndices,
+    dev_thrust_particleGridIndices + numObjects,
+    dev_thrust_particleArrayIndices);
+
+  // - Naively unroll the loop for finding the start and end indices of each
+  //   cell's data pointers in the array of boid indices
+  // Default all cells to -1 so empty cells remain -1.
+  // Bug fix: the start/end buffers hold gridCellCount entries, one per cell,
+  // so the reset must cover gridCellCount, not numObjects; otherwise cells
+  // beyond the first numObjects keep stale indices from earlier frames.
+  dim3 cellBlocksPerGrid((gridCellCount + blockSize - 1) / blockSize);
+  kernResetIntBuffer<<<cellBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+  checkCUDAErrorWithLine("failed at kernResetIntBuffer");
+  kernResetIntBuffer<<<cellBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1);
+  checkCUDAErrorWithLine("failed at kernResetIntBuffer");
+  kernIdentifyCellStartEnd<<<blocksPerGrid, blockSize>>>(
+    numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+  checkCUDAErrorWithLine("failed at kernIdentifyCellStartEnd!");
+
+  // - Perform velocity updates using neighbor search
+  kernUpdateVelNeighborSearchScattered<<<blocksPerGrid, blockSize>>>(
+    numObjects, gridSideCount, gridMinimum,
+    gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices,
+    dev_gridCellEndIndices, dev_particleArrayIndices,
+    dev_pos, dev_vel1, dev_vel2);
+  checkCUDAErrorWithLine("failed at kernUpdateVelNeighborSearchScattered");
+
+  // - Update positions
+  kernUpdatePos<<<blocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel1);
+  checkCUDAErrorWithLine("failed at kernUpdatePos");
+
+  // - Ping-pong buffers as needed
+  cudaMemcpy(dev_vel1, dev_vel2, numObjects * sizeof(glm::vec3), cudaMemcpyDeviceToDevice);
 }
 
+__global__ void kernShuffle(int N, glm::vec3 *inputOrder, glm::vec3 *shuffledOrder, int *particleArrayIndices) {
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= N) {
+    return;
+  }
+
+  // Note: In updateVelScattered, we found the pos and vel at particleArrayIndices[index].
+  // This takes that value and puts it in the same index position in the shuffled order.
+  shuffledOrder[index] = inputOrder[particleArrayIndices[index]];
+}
 
 void Boids::stepSimulationCoherentGrid(float dt) {
-  // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
-  // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
-  // In Parallel:
-  // - Label each particle with its array index as well as its grid index.
-  //   Use 2x width grids
-  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
-  //   are welcome to do a performance comparison.
-  // - Naively unroll the loop for finding the start and end indices of each
-  //   cell's data pointers in the array of boid indices
-  // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
-  //   the particle data in the simulation array.
-  //   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
-  // - Perform velocity updates using neighbor search
-  // - Update positions
-  // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+  // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
+  // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
+  // In Parallel:
+  // - Label each particle with its array index as well as its grid index.
+  //   Use 2x width grids
+  dim3 blocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  kernComputeIndices<<<blocksPerGrid, blockSize>>>(
+    numObjects,
+    gridSideCount,
+    gridMinimum,
+    gridInverseCellWidth,
+    dev_pos,
+    dev_particleArrayIndices,
+    dev_particleGridIndices);
+  checkCUDAErrorWithLine("failed at kernComputeIndices!");
+
+  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
+  //   are welcome to do a performance comparison.
+  dev_thrust_particleArrayIndices = thrust::device_pointer_cast(dev_particleArrayIndices);
+  dev_thrust_particleGridIndices = thrust::device_pointer_cast(dev_particleGridIndices);
+  thrust::sort_by_key(dev_thrust_particleGridIndices,
+    dev_thrust_particleGridIndices + numObjects,
+    dev_thrust_particleArrayIndices);
+
+  // - Naively unroll the loop for finding the start and end indices of each
+  //   cell's data pointers in the array of boid indices
+  // Default all cells to -1 so empty cells remain -1 (reset over gridCellCount,
+  // one entry per cell, as in the scattered step).
+  dim3 cellBlocksPerGrid((gridCellCount + blockSize - 1) / blockSize);
+  kernResetIntBuffer<<<cellBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+  kernResetIntBuffer<<<cellBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1);
+  kernIdentifyCellStartEnd<<<blocksPerGrid, blockSize>>>(
+    numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+  checkCUDAErrorWithLine("failed at kernIdentifyCellStartEnd!");
+
+  // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
+  //   the particle data in the simulation array.
+  //   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
+  kernShuffle<<<blocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_reshuffledPos, dev_particleArrayIndices);
+  kernShuffle<<<blocksPerGrid, blockSize>>>(numObjects, dev_vel1, dev_reshuffledVel1, dev_particleArrayIndices);
+
+  // - Perform velocity updates using neighbor search
+  // Need to get the shuffled data into the original buffers:
+  cudaDeviceSynchronize();  // sync to make sure everything is done shuffling
+  cudaMemcpy(dev_pos, dev_reshuffledPos, numObjects * sizeof(glm::vec3), cudaMemcpyDeviceToDevice);
+  cudaMemcpy(dev_vel1, dev_reshuffledVel1, numObjects * sizeof(glm::vec3), cudaMemcpyDeviceToDevice);
+
+  kernUpdateVelNeighborSearchCoherent<<<blocksPerGrid, blockSize>>>(
+    numObjects, gridSideCount, gridMinimum,
+    gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices,
+    dev_gridCellEndIndices,
+    dev_pos, dev_vel1, dev_vel2);  // the reshuffled data is in dev_pos and dev_vel1 now
+
+  // - Update positions
+  kernUpdatePos<<<blocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel1);
+
+  // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE: here vel2
+  //   holds velocities in the reshuffled (cell-sorted) order, matching dev_pos,
+  //   so copying it into vel1 keeps positions and velocities consistent.
+  cudaMemcpy(dev_vel1, dev_vel2, numObjects * sizeof(glm::vec3), cudaMemcpyDeviceToDevice);
 }
 
 void Boids::endSimulation() {
-  cudaFree(dev_vel1);
-  cudaFree(dev_vel2);
-  cudaFree(dev_pos);
-
-  // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  cudaFree(dev_vel1);
+  cudaFree(dev_vel2);
+  cudaFree(dev_pos);
+
+  // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_particleGridIndices);
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_gridCellEndIndices);
+
+  cudaFree(dev_reshuffledPos);
+  cudaFree(dev_reshuffledVel1);
 }
 
 void Boids::unitTest() {
-  // LOOK-1.2 Feel free to write additional tests here.
-
-  // test unstable sort
-  int *dev_intKeys;
-  int *dev_intValues;
-  int N = 10;
-
-  std::unique_ptr<int[]> intKeys{ new int[N] };
-  std::unique_ptr<int[]> intValues{ new int[N] };
-
-  intKeys[0] = 0; intValues[0] = 0;
-  intKeys[1] = 1; intValues[1] = 1;
-  intKeys[2] = 0; intValues[2] = 2;
-  intKeys[3] = 3; intValues[3] = 3;
-  intKeys[4] = 0; intValues[4] = 4;
-  intKeys[5] = 2; intValues[5] = 5;
-  intKeys[6] = 2; intValues[6] = 6;
-  intKeys[7] = 0; intValues[7] = 7;
-  intKeys[8] = 5; intValues[8] = 8;
-  intKeys[9] = 6; intValues[9] = 9;
-
-  cudaMalloc((void**)&dev_intKeys, N * sizeof(int));
-  checkCUDAErrorWithLine("cudaMalloc dev_intKeys failed!");
-
-  cudaMalloc((void**)&dev_intValues, N * sizeof(int));
-  checkCUDAErrorWithLine("cudaMalloc dev_intValues failed!");
-
-  dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);
-
-  std::cout << "before unstable sort: " << std::endl;
-  for (int i = 0; i < N; i++) {
-    std::cout << "  key: " << intKeys[i];
-    std::cout << " value: " << intValues[i] << std::endl;
-  }
-
-  // How to copy data to the GPU
-  cudaMemcpy(dev_intKeys, intKeys.get(), sizeof(int) * N, cudaMemcpyHostToDevice);
-  cudaMemcpy(dev_intValues, intValues.get(), sizeof(int) * N, cudaMemcpyHostToDevice);
-
-  // Wrap device vectors in thrust iterators for use with thrust.
-  thrust::device_ptr<int> dev_thrust_keys(dev_intKeys);
-  thrust::device_ptr<int> dev_thrust_values(dev_intValues);
-  // LOOK-2.1 Example for using thrust::sort_by_key
-  thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + N, dev_thrust_values);
-
-  // How to copy data back to the CPU side from the GPU
-  cudaMemcpy(intKeys.get(), dev_intKeys, sizeof(int) * N, cudaMemcpyDeviceToHost);
-  cudaMemcpy(intValues.get(), dev_intValues, sizeof(int) * N, cudaMemcpyDeviceToHost);
-  checkCUDAErrorWithLine("memcpy back failed!");
-
-  std::cout << "after unstable sort: " << std::endl;
-  for (int i = 0; i < N; i++) {
-    std::cout << "  key: " << intKeys[i];
-    std::cout << " value: " << intValues[i] << std::endl;
-  }
-
-  // cleanup
-  cudaFree(dev_intKeys);
-  cudaFree(dev_intValues);
-  checkCUDAErrorWithLine("cudaFree failed!");
-  return;
+  // LOOK-1.2 Feel free to write additional tests here.
+
+  // test unstable sort
+  int *dev_intKeys;
+  int *dev_intValues;
+  int N = 10;
+
+  std::unique_ptr<int[]> intKeys{ new int[N] };
+  std::unique_ptr<int[]> intValues{ new int[N] };
+
+  intKeys[0] = 0; intValues[0] = 0;
+  intKeys[1] = 1; intValues[1] = 1;
+  intKeys[2] = 0; intValues[2] = 2;
+  intKeys[3] = 3; intValues[3] = 3;
+  intKeys[4] = 0; intValues[4] = 4;
+  intKeys[5] = 2; intValues[5] = 5;
+  intKeys[6] = 2; intValues[6] = 6;
+  intKeys[7] = 0; intValues[7] = 7;
+  intKeys[8] = 5; intValues[8] = 8;
+  intKeys[9] = 6; intValues[9] = 9;
+
+  cudaMalloc((void**)&dev_intKeys, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_intKeys failed!");
+
+  cudaMalloc((void**)&dev_intValues, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_intValues failed!");
+
+  dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);
+
+  std::cout << "before unstable sort: " << std::endl;
+  for (int i = 0; i < N; i++) {
+    std::cout << "  key: " << intKeys[i];
+    std::cout << " value: " << intValues[i] << std::endl;
+  }
+
+  // How to copy data to the GPU
+  cudaMemcpy(dev_intKeys, intKeys.get(), sizeof(int) * N, cudaMemcpyHostToDevice);
+  cudaMemcpy(dev_intValues, intValues.get(), sizeof(int) * N, cudaMemcpyHostToDevice);
+
+  // Wrap device vectors in thrust iterators for use with thrust.
+  thrust::device_ptr<int> dev_thrust_keys(dev_intKeys);
+  thrust::device_ptr<int> dev_thrust_values(dev_intValues);
+  // LOOK-2.1 Example for using thrust::sort_by_key
+  thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + N, dev_thrust_values);
+
+  // How to copy data back to the CPU side from the GPU
+  cudaMemcpy(intKeys.get(), dev_intKeys, sizeof(int) * N, cudaMemcpyDeviceToHost);
+  cudaMemcpy(intValues.get(), dev_intValues, sizeof(int) * N, cudaMemcpyDeviceToHost);
+  checkCUDAErrorWithLine("memcpy back failed!");
+
+  std::cout << "after unstable sort: " << std::endl;
+  for (int i = 0; i < N; i++) {
+    std::cout << "  key: " << intKeys[i];
+    std::cout << " value: " << intValues[i] << std::endl;
+  }
+
+  // cleanup
+  cudaFree(dev_intKeys);
+  cudaFree(dev_intValues);
+  checkCUDAErrorWithLine("cudaFree failed!");
+  return;
 }
diff --git a/src/main.cpp b/src/main.cpp
index b82c8c6..e3559ec 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -7,18 +7,18 @@
 */
 
 #include "main.hpp"
-
+#include <Windows.h>  // for Sleep
 // ================
 // Configuration
 // ================
 
 // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
 #define VISUALIZE 1
-#define UNIFORM_GRID 0
-#define COHERENT_GRID 0
+#define UNIFORM_GRID 1
+#define COHERENT_GRID 1
 
 // LOOK-1.2 - change this to adjust particle count in the simulation
-const int N_FOR_VIS = 5000;
+const int N_FOR_VIS = 10000;
 const float DT = 0.2f;
 
 /**
@@ -26,7 +26,7 @@ const float DT = 0.2f;
 */
 int main(int argc, char* argv[]) {
   projectName = "565 CUDA Intro: Boids";
-
+  Sleep(40);
   if (init(argc, argv)) {
     mainLoop();
     Boids::endSimulation();