diff --git a/README.md b/README.md
index d63a6a1..f366056 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,119 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture,
 Project 1 - Flocking**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Dewang Sultania
+  * [LinkedIn](https://www.linkedin.com/in/dewang-sultania)
+* Tested on: Windows 10, Intel Xeon E-2176M @ 2.70GHz 16GB, Quadro P2000 4GB (Personal Computer)
+* Cuda Compute Capability: 6.1
 
-### (TODO: Your README)
+![](images/boids.gif)
 
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+## Boids with Naive Neighbor Search
+
+In the Boids flocking simulation, particles representing boids move around in simulation space according to three rules:
+- cohesion - boids move towards the perceived center of mass of their neighbors
+- separation - boids avoid getting to close to their neighbors 
+- alignment - boids generally try to move with the same direction and speed as their neighbors
+
+These three rules specify a boid's velocity change in a timestep . At every timestep, a boid thus has to look at each of its neighboring boids and compute the velocity change contribution from each of the three rules. Thus, a bare-bones boids implementation has each boid check every other boid in the simulation.  The following is the pseudocode for the rules:
+
+#### Rule 1: Boids try to fly towards the centre of mass of neighbouring boids
+
+```
+function rule1_adhesion(Boid boid)
+
+    Vector perceived_center
+
+    foreach Boid b:
+        if b != boid and distance(b, boid) < rule1Distance then
+            perceived_center += b.position
+        endif
+    end
+
+    perceived_center /= number_of_neighbors
+
+    return (perceived_center - boid.position) * rule1Scale
+end
+```
+
+#### Rule 2: Boids try to keep a small distance away from other objects (including other boids).
+
+```
+function rule2_avoidance(Boid boid)
+
+    Vector c = 0
+
+    foreach Boid b
+        if b != boid and distance(b, boid) < rule2Distance then
+            c -= (b.position - boid.position)
+        endif
+    end
+
+    return c * rule2Scale
+end
+```
+
+#### Rule 3: Boids try to match velocity with near boids.
+
+```
+function rule3_cohesion(Boid boid)
+
+    Vector perceived_velocity
+
+    foreach Boid b
+        if b != boid and distance(b, boid) < rule3Distance then
+            perceived_velocity += b.velocity
+        endif
+    end
+
+    perceived_velocity /= number_of_neighbors
+
+    return perceived_velocity * rule3Scale
+end
+```
+The final updated velocity is = current_velocity + rule1_adhesion + rule2_avoidance+ rule3_cohesion.
+
+The CUDA implementation of this pseudocode can be found in the kernel.cu file in the functions ```__global__ void kernUpdateVelocityBruteForce``` and ``void Boids::stepSimulationNaive``
+
+## Uniform Grids for Better Flocking
+
+Having each boid check every other boid is inefficient, especially if the number of boids is large and neighborhood distance is much smaller than full simulation space. We can cull a lot of neighbor checks using a datastructure called uniform spacial grid. A uniform grid is made up of cells that are at least as wide as the neighborhood distance and covers the entire simulation domain. Before computing the new velocities of the boids, we "bin" them into the grid in a preprocess step.
+
+![](images/Boids%20Ugrid%20base.png)
+
+If we label each boid with an index representing its enclosing cell and then sort the list of boids by these indices, we can ensure that the pointers to boids in the same cell are contiguous in memory. 
+
+Then, we can walk over the array of sorted uniform grid indices and look at every pair of values. If the values differ, we know that we are at the border of the representation of two different cells. Storing these locations in a table with an entry for each cell gives us a complete representation of the uniform grid. This "table" can just be an array with as much space as there are cells. This process is data parallel and can be naively parallelized.
+
+![](images/Boids%20Ugrids%20buffers%20naive.png)
+
+The CUDA implementation of this step can be found in the kernel.cu file in the functions ``Boids::stepSimulationScatteredGrid`` and ``kernUpdateVelNeighborSearchScattered``.  To toggle this mode on set the UNIFORM_GRID variable as 1 in main.cpp
+
+To further improve the performance we can cut out the middleman, In the previous step  pointers to boids in a single cell are contiguous in memory, but the boid data itself (velocities and positions) is scattered all over the place. We can rearrange the boid itself so the velocities and positions of boids in the cell are also contiguous in memory so the data can directly be accessed using the cell data pointers with no requirement for pointers to particle array indices.
+
+![](images/Boids%20Ugrids%20buffers%20data%20coherent.png)
+
+## Performance Analysis
+
+### Number of Boids vs Frames per second (Visualization on)
+
+![](images/visualize_on.JPG)
+
+![](images/trend_visualize_on.JPG)
+
+### Number of Boids vs Frames per second (Visualization off)
+
+![](images/visualize_off.JPG)
+
+![](images/trend_visualize_off.JPG)
+
+### Number of Blocks vs Frames per second (Number of boids as 50,000)
+
+![](images/block.JPG)
+
+### Some insights:
+
+- In case of Naive neighbor search the drop with increase in number of boids is exponential because as the number of boids increase the search space becomes huge and updating the positions and velocities become more costly. Scattered neighbor search and coherent neighbor search have a linear decrease in performance. When it comes to comparison in performance coherent is atleast as good as scattered in the initial cases and gives atleast 3 times better performance than scattered when the number of boids are large. 
+- For block size less than 32, the performance increase is significant as block size increases because the warp size of GPU is 32 and block sizes less than that are not able to utilize a warp, after 32, for multiples of 32 the performance is better than the block sizes that are not multiples of 32 because the warps are not fully utilized in the latter case.
+- When the number of boids grows to more than 50,000 the other two algorithms fail to perform whereas coherent grid is still able to maintain a FPS of around 40. This is because it has to maintain and manage less pointers than its counterparts.
+- The way, I had implemented the project, it was not trivial to change 8 neighbors to 27, so to test this out, I changed my one loop of 8 neighbors and bitwise operations to 3 loops and checked it. I did not push that 3 loop code because I found the bitwise operation code to be more elegant although it cannot be easily extended to include more neighbors. On changing the number of neighbors from 8 to 27, the performance was not always worse. This was because if we carefully manipulate the cell width there won't be any significant increase in the number of boids we check.
\ No newline at end of file
diff --git a/images/block.JPG b/images/block.JPG
new file mode 100644
index 0000000..c6e0749
Binary files /dev/null and b/images/block.JPG differ
diff --git a/images/boids.gif b/images/boids.gif
new file mode 100644
index 0000000..eeb8b0b
Binary files /dev/null and b/images/boids.gif differ
diff --git a/images/trend_visualize_off.JPG b/images/trend_visualize_off.JPG
new file mode 100644
index 0000000..e784831
Binary files /dev/null and b/images/trend_visualize_off.JPG differ
diff --git a/images/trend_visualize_on.JPG b/images/trend_visualize_on.JPG
new file mode 100644
index 0000000..8379e86
Binary files /dev/null and b/images/trend_visualize_on.JPG differ
diff --git a/images/visualize_off.JPG b/images/visualize_off.JPG
new file mode 100644
index 0000000..2bf8cb5
Binary files /dev/null and b/images/visualize_off.JPG differ
diff --git a/images/visualize_on.JPG b/images/visualize_on.JPG
new file mode 100644
index 0000000..d9ac9d5
Binary files /dev/null and b/images/visualize_on.JPG differ
diff --git a/src/kernel.cu b/src/kernel.cu
index 74dffcb..1e085fb 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -69,6 +69,8 @@ dim3 threadsPerBlock(blockSize);
 glm::vec3 *dev_pos;
 glm::vec3 *dev_vel1;
 glm::vec3 *dev_vel2;
+glm::vec3 *dev_sortedVel;
+glm::vec3 *dev_sortedPos;
 
 // LOOK-2.1 - these are NOT allocated for you. You'll have to set up the thrust
 // pointers on your own too.
@@ -85,7 +87,6 @@ int *dev_gridCellEndIndices;   // to this cell?
 
 // TODO-2.3 - consider what additional buffers you might need to reshuffle
 // the position and velocity data to be coherent within cells.
-
 // LOOK-2.1 - Grid parameters based on simulation parameters.
 // These are automatically computed for you in Boids::initSimulation
 int gridCellCount;
@@ -169,7 +170,25 @@ void Boids::initSimulation(int N) {
   gridMinimum.z -= halfGridWidth;
 
   // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+  cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+  cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+  dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+  dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+  cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+  cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+  cudaMalloc((void**)&dev_sortedVel, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_sortedVel failed!");
+  cudaMalloc((void**)&dev_sortedPos, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_sortedPos failed!");
+
+
+
   cudaDeviceSynchronize();
+
 }
 
 
@@ -230,10 +249,44 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
 * in the `pos` and `vel` arrays.
 */
 __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) {
-  // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
-  // Rule 2: boids try to stay a distance d away from each other
-  // Rule 3: boids try to match the speed of surrounding boids
-  return glm::vec3(0.0f, 0.0f, 0.0f);
+  
+ 
+ 
+	glm::vec3 perceived_center(0, 0, 0);
+	glm::vec3 c(0, 0, 0);
+	glm::vec3 perceived_velocity(0, 0, 0);
+	glm::vec3 velISelf = vel[iSelf];
+	int n_rule1 = 0, n_rule3 = 0;
+	for (int i = 0; i < N; i++) {
+		if (i != iSelf)
+		{
+			float distance = glm::distance(pos[iSelf], pos[i]);
+			// Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+			if (distance < rule1Distance){
+				perceived_center += pos[i];
+				n_rule1++;
+			}
+			// Rule 2: boids try to stay a distance d away from each other
+			if (distance < rule2Distance)
+				c -= pos[i] - pos[iSelf];
+			// Rule 3: boids try to match the speed of surrounding boids
+			if (distance < rule3Distance) {
+				perceived_velocity += vel[i];
+				n_rule3++;
+			}
+		}
+	}
+	if (n_rule1 > 0) {
+		perceived_center /= n_rule1;
+		velISelf = velISelf + (perceived_center - pos[iSelf])*rule1Scale;
+	}
+	if (n_rule3 > 0) {
+		perceived_velocity /= n_rule3;
+		velISelf = velISelf + perceived_velocity * rule3Scale;
+	}
+	velISelf = velISelf + c * rule2Scale;
+	
+	return velISelf;
 }
 
 /**
@@ -242,9 +295,19 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po
 */
 __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
   glm::vec3 *vel1, glm::vec3 *vel2) {
-  // Compute a new velocity based on pos and vel1
-  // Clamp the speed
+  
   // Record the new velocity into vel2. Question: why NOT vel1?
+	int index = threadIdx.x + (blockIdx.x * blockDim.x);
+	if (index >= N) {
+		return;
+	}
+	// Compute a new velocity based on pos and vel1
+	glm::vec3 newVelocity = computeVelocityChange(N, index, pos, vel1);
+	// Clamp the speed
+	if (glm::length(newVelocity) > maxSpeed)
+		newVelocity = glm::normalize(newVelocity)* maxSpeed;
+	vel2[index] = newVelocity;
+	
 }
 
 /**
@@ -287,8 +350,18 @@ __global__ void kernComputeIndices(int N, int gridResolution,
   glm::vec3 *pos, int *indices, int *gridIndices) {
     // TODO-2.1
     // - Label each boid with the index of its grid cell.
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+	if (index >= N) {
+		return;
+	}
+	glm::vec3 boidpos3D = (pos[index] - gridMin)*inverseCellWidth;
+	int gridCellIndex = gridIndex3Dto1D(boidpos3D.x, boidpos3D.y, boidpos3D.z, gridResolution);
+	gridIndices[index] = gridCellIndex;
+	
+
     // - Set up a parallel array of integer indices as pointers to the actual
     //   boid data in pos and vel1/vel2
+	indices[index] = index;
 }
 
 // LOOK-2.1 Consider how this could be useful for indicating that a cell
@@ -298,6 +371,7 @@ __global__ void kernResetIntBuffer(int N, int *intBuffer, int value) {
   if (index < N) {
     intBuffer[index] = value;
   }
+
 }
 
 __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
@@ -306,6 +380,19 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
   // Identify the start point of each cell in the gridIndices array.
   // This is basically a parallel unrolling of a loop that goes
   // "this index doesn't match the one before it, must be a new cell!"
+	int index = (blockIdx.x*blockDim.x) + threadIdx.x;
+	if (index >= N) {
+		return;
+	}
+	if (index == 0) {
+		gridCellStartIndices[particleGridIndices[index]] = 0;
+	}
+	else if (particleGridIndices[index] != particleGridIndices[index - 1]) {
+		gridCellStartIndices[particleGridIndices[index]] = index;
+		gridCellEndIndices[particleGridIndices[index-1]] = index-1;
+	}
+	if (index == N - 1)
+		gridCellEndIndices[particleGridIndices[index]] = index;
 }
 
 __global__ void kernUpdateVelNeighborSearchScattered(
@@ -316,32 +403,146 @@ __global__ void kernUpdateVelNeighborSearchScattered(
   glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
   // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce
   // the number of boids that need to be checked.
-  // - Identify the grid cell that this particle is in
+	int index = (blockIdx.x*blockDim.x) + threadIdx.x;
+	if (index >= N) {
+		return;
+	}
+	glm::vec3 perceived_center(0, 0, 0);
+	glm::vec3 c(0, 0, 0);
+	glm::vec3 perceived_velocity(0, 0, 0);
+	glm::vec3 velocity = vel1[index];
+	int n_rule1 = 0, n_rule3 = 0;
+	// - Identify the grid cell that this particle is in
+	glm::vec3 gridCellIndexOfTheParticle = (pos[index]-gridMin)*inverseCellWidth;
   // - Identify which cells may contain neighbors. This isn't always 8.
-  // - For each cell, read the start/end indices in the boid pointer array.
-  // - Access each boid in the cell and compute velocity change from
-  //   the boids rules, if this boid is within the neighborhood distance.
+
+	for (int i = 0; i < 8; i++) {
+		int x = imax((int)gridCellIndexOfTheParticle.x - ((i & 1)?1:0), 0);
+		int y = imax((int)gridCellIndexOfTheParticle.y - ((i & 2)?1:0), 0);
+		int z = imax((int)gridCellIndexOfTheParticle.z - ((i & 4)?1:0), 0);
+		int gridCellIndex1d = gridIndex3Dto1D(x, y, z, gridResolution);
+		if (gridCellStartIndices[gridCellIndex1d] >= 0) {
+			// - For each cell, read the start/end indices in the boid pointer array.
+			int start = gridCellStartIndices[gridCellIndex1d];
+			int end = gridCellEndIndices[gridCellIndex1d];
+			// - Access each boid in the cell and compute velocity change from
+			//   the boids rules, if this boid is within the neighborhood distance.
+			for (int j = start; j <= end; j++) {
+				int boidIndex = particleArrayIndices[j];
+				if (boidIndex != index) {
+					float distance = glm::distance(pos[index], pos[boidIndex]);
+					// Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+					if (distance < rule1Distance) {
+						perceived_center += pos[boidIndex];
+						n_rule1++;
+					}
+					// Rule 2: boids try to stay a distance d away from each other
+					if (distance < rule2Distance)
+						c -= pos[boidIndex] - pos[index];
+					// Rule 3: boids try to match the speed of surrounding boids
+					if (distance < rule3Distance) {
+						perceived_velocity += vel1[boidIndex];
+						n_rule3++;
+					}
+				}
+			}
+		}
+	}
+	if (n_rule1 > 0) {
+		perceived_center /= n_rule1;
+		velocity = velocity + (perceived_center - pos[index])*rule1Scale;
+	}
+	if (n_rule3 > 0) {
+		perceived_velocity /= n_rule3;
+		velocity = velocity + perceived_velocity * rule3Scale;
+	}
+	velocity = velocity + c * rule2Scale;
   // - Clamp the speed change before putting the new speed in vel2
+	if (glm::length(velocity) > maxSpeed)
+		velocity = glm::normalize(velocity)* maxSpeed;
+	vel2[index] = velocity;
 }
 
 __global__ void kernUpdateVelNeighborSearchCoherent(
-  int N, int gridResolution, glm::vec3 gridMin,
-  float inverseCellWidth, float cellWidth,
-  int *gridCellStartIndices, int *gridCellEndIndices,
-  glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
-  // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered,
-  // except with one less level of indirection.
-  // This should expect gridCellStartIndices and gridCellEndIndices to refer
-  // directly to pos and vel1.
-  // - Identify the grid cell that this particle is in
-  // - Identify which cells may contain neighbors. This isn't always 8.
-  // - For each cell, read the start/end indices in the boid pointer array.
-  //   DIFFERENCE: For best results, consider what order the cells should be
-  //   checked in to maximize the memory benefits of reordering the boids data.
-  // - Access each boid in the cell and compute velocity change from
-  //   the boids rules, if this boid is within the neighborhood distance.
-  // - Clamp the speed change before putting the new speed in vel2
-}
+	int N, int gridResolution, glm::vec3 gridMin,
+	float inverseCellWidth, float cellWidth,
+	int *gridCellStartIndices, int *gridCellEndIndices,
+	glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
+		// TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered,
+		// except with one less level of indirection.
+		// This should expect gridCellStartIndices and gridCellEndIndices to refer
+		// directly to pos and vel1.
+		int index = (blockIdx.x*blockDim.x) + threadIdx.x;
+		if (index >= N) {
+			return;
+		}
+		glm::vec3 perceived_center(0, 0, 0);
+		glm::vec3 c(0, 0, 0);
+		glm::vec3 perceived_velocity(0, 0, 0);
+		glm::vec3 velocity = vel1[index];
+		int n_rule1 = 0, n_rule3 = 0;
+		// - Identify the grid cell that this particle is in
+		glm::vec3 gridCellIndexOfTheParticle = (pos[index] - gridMin)*inverseCellWidth;
+		// - Identify which cells may contain neighbors. This isn't always 8.
+
+		for (int i = 0; i < 8; i++) {
+			int x = imax((int)gridCellIndexOfTheParticle.x - ((i & 1) ? 1 : 0), 0);
+			int y = imax((int)gridCellIndexOfTheParticle.y - ((i & 2) ? 1 : 0), 0);
+			int z = imax((int)gridCellIndexOfTheParticle.z - ((i & 4) ? 1 : 0), 0);
+			int gridCellIndex1d = gridIndex3Dto1D(x, y, z, gridResolution);
+			if (gridCellStartIndices[gridCellIndex1d] >= 0) {
+				// - For each cell, read the start/end indices in the boid pointer array.
+				int start = gridCellStartIndices[gridCellIndex1d];
+				int end = gridCellEndIndices[gridCellIndex1d];
+				// - Access each boid in the cell and compute velocity change from
+				//   the boids rules, if this boid is within the neighborhood distance.
+				//   DIFFERENCE: For best results, consider what order the cells should be
+
+				for (int j = start; j <= end; j++) {
+					if (j != index) {
+						float distance = glm::distance(pos[index], pos[j]);
+						// Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+						if (distance < rule1Distance) {
+							perceived_center += pos[j];
+							n_rule1++;
+						}
+						// Rule 2: boids try to stay a distance d away from each other
+						if (distance < rule2Distance)
+							c -= pos[j] - pos[index];
+						// Rule 3: boids try to match the speed of surrounding boids
+						if (distance < rule3Distance) {
+							perceived_velocity += vel1[j];
+							n_rule3++;
+						}
+					}
+				}
+			}
+		}
+		if (n_rule1 > 0) {
+			perceived_center /= n_rule1;
+			velocity = velocity + (perceived_center - pos[index])*rule1Scale;
+		}
+		if (n_rule3 > 0) {
+			perceived_velocity /= n_rule3;
+			velocity = velocity + perceived_velocity * rule3Scale;
+		}
+		velocity = velocity + c * rule2Scale;
+		// - Clamp the speed change before putting the new speed in vel2
+		if (glm::length(velocity) > maxSpeed)
+			velocity = glm::normalize(velocity)* maxSpeed;
+		vel2[index] = velocity;
+
+	}
+
+__global__ void kernSortPositionVelocity(int N, int *particleArrayIndices, glm::vec3 *pos, glm::vec3 *vel1,
+	glm::vec3 *sorted_vel, glm::vec3 *sorted_pos) {
+		int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+		if (index >= N) {
+			return;
+		}
+		sorted_pos[index] = pos[particleArrayIndices[index]];
+		sorted_vel[index] = vel1[particleArrayIndices[index]];
+	}
 
 /**
 * Step the entire N-body simulation by `dt` seconds.
@@ -349,39 +550,82 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
 void Boids::stepSimulationNaive(float dt) {
   // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
   // TODO-1.2 ping-pong the velocity buffers
+	dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+	kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2);
+	kernUpdatePos <<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+	std::swap(dev_vel1, dev_vel2);
 }
 
 void Boids::stepSimulationScatteredGrid(float dt) {
   // TODO-2.1
+	dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
   // Uniform Grid Neighbor search using Thrust sort.
   // In Parallel:
-  // - label each particle with its array index as well as its grid index.
-  //   Use 2x width grids.
+	// - label each particle with its array index as well as its grid index.
+	//   Use 2x width grids.
+	kernResetIntBuffer <<<fullBlocksPerGrid, blockSize>>> (numObjects, dev_gridCellStartIndices, -1);
+	kernResetIntBuffer <<<fullBlocksPerGrid, blockSize>>> (numObjects, dev_gridCellEndIndices, -1);
+	kernComputeIndices <<<fullBlocksPerGrid, blockSize>>> (numObjects, gridSideCount, gridMinimum, 
+		gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+	
   // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
   //   are welcome to do a performance comparison.
+
+	dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+	dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+	thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, 
+		dev_thrust_particleArrayIndices);
   // - Naively unroll the loop for finding the start and end indices of each
+
+	kernIdentifyCellStartEnd <<<fullBlocksPerGrid, blockSize>>> (numObjects, dev_particleGridIndices,
+			dev_gridCellStartIndices, dev_gridCellEndIndices);
   //   cell's data pointers in the array of boid indices
   // - Perform velocity updates using neighbor search
+	kernUpdateVelNeighborSearchScattered <<<fullBlocksPerGrid, blockSize>>> (numObjects, gridSideCount, gridMinimum
+			, gridInverseCellWidth, gridCellWidth ,dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices,
+			dev_pos, dev_vel1, dev_vel2);
   // - Update positions
+	kernUpdatePos <<<fullBlocksPerGrid, blockSize>>> (numObjects, dt, dev_pos, dev_vel2);
   // - Ping-pong buffers as needed
+	std::swap(dev_vel1, dev_vel2);
 }
 
 void Boids::stepSimulationCoherentGrid(float dt) {
-  // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
-  // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
-  // In Parallel:
+	// TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
+	 // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
+	dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+	// In Parallel:
   // - Label each particle with its array index as well as its grid index.
   //   Use 2x width grids
-  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
-  //   are welcome to do a performance comparison.
-  // - Naively unroll the loop for finding the start and end indices of each
-  //   cell's data pointers in the array of boid indices
-  // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
-  //   the particle data in the simulation array.
-  //   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
-  // - Perform velocity updates using neighbor search
-  // - Update positions
-  // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+	kernResetIntBuffer << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_gridCellStartIndices, -1);
+	kernResetIntBuffer << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_gridCellEndIndices, -1);
+	kernComputeIndices << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum,
+		gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+	// - Unstable key sort using Thrust. A stable sort isn't necessary, but you
+	//   are welcome to do a performance comparison.
+
+	dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+	dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+	thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects,
+		dev_thrust_particleArrayIndices);
+	// - Naively unroll the loop for finding the start and end indices of each
+	//   cell's data pointers in the array of boid indices
+	kernIdentifyCellStartEnd << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleGridIndices,
+		dev_gridCellStartIndices, dev_gridCellEndIndices);
+	// - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
+	//   the particle data in the simulation array.
+	//   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
+	// - Perform velocity updates using neighbor search
+	kernSortPositionVelocity << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleArrayIndices, dev_pos
+		, dev_vel1, dev_sortedVel, dev_sortedPos);
+	kernUpdateVelNeighborSearchCoherent << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum
+			, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices,
+			dev_sortedPos, dev_sortedVel, dev_vel2);
+	// - Update positions
+	kernUpdatePos << <fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_sortedPos, dev_vel2);
+	// - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.	
+	std::swap(dev_vel1, dev_vel2);
+	std::swap(dev_pos, dev_sortedPos);
 }
 
 void Boids::endSimulation() {
@@ -390,6 +634,12 @@ void Boids::endSimulation() {
   cudaFree(dev_pos);
 
   // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_particleGridIndices);
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_gridCellEndIndices);
+  cudaFree(dev_sortedPos);
+  cudaFree(dev_sortedVel);
 }
 
 void Boids::unitTest() {
diff --git a/src/main.cpp b/src/main.cpp
index b82c8c6..ddd0e3b 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -14,8 +14,8 @@
 
 // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
 #define VISUALIZE 1
-#define UNIFORM_GRID 0
-#define COHERENT_GRID 0
+#define UNIFORM_GRID 1
+#define COHERENT_GRID 1
 
 // LOOK-1.2 - change this to adjust particle count in the simulation
 const int N_FOR_VIS = 5000;