diff --git a/README.md b/README.md
index d63a6a1..397b1de 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,222 @@
**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
Project 1 - Flocking**
-* (TODO) YOUR NAME HERE
- * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Author: Kushagra
+  * [LinkedIn](https://www.linkedin.com/in/kushagragoel/)
+* Tested on: Windows 10, i7-9750H @ 2.60GHz, 16GB RAM, GTX 1650 4GB (personal computer)
-### (TODO: Your README)
+____________________________________________________________________________________
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+# Boids
+
+## What are Boids?
+Boids is an artificial life program developed by Craig W. Reynolds that simulates the flocking behavior commonly found in nature among animals such as birds and fish. Each boid exhibits emergent behavior that depends only on its neighbors and arises from three simple rules:
+
+* **separation**: steer to avoid crowding local flockmates
+
+* **alignment**: steer towards the average heading of local flockmates
+
+* **cohesion**: steer to move towards the average position (center of mass) of local flockmates
+
+Further rules, such as obstacle avoidance and goal seeking, can be added. In this project we simulate boid flocking with three different implementations:
+
+* **Naive**: Each boid checks every other boid to compute its behavior.
+* **Uniform Grid**: Each boid checks only the boids in surrounding grid cells.
+* **Uniform Grid with coherent memory access**: Same as the uniform grid, but the boid data is reshuffled so that boids in the same cell are contiguous in memory, removing an extra level of indirection.
+
+Example visuals of the uniform grid with coherent memory access, the most efficient implementation:
+
+| 5,000 Boids | 50,000 Boids |
+| ------------- | ----------- |
+|  |  |
+
+## Rules
+In the Boids flocking simulation, particles representing birds or fish (boids) move around the simulation space according to three rules:
+
+* Cohesion - boids move towards the perceived center of mass of their neighbors
+
+* Separation - boids avoid getting too close to their neighbors
+
+* Alignment - boids generally try to move with the same direction and speed as
+ their neighbors
+
+These three rules specify a boid's velocity change in a timestep.
+At every timestep, a boid thus has to look at each of its neighboring boids
+and compute the velocity change contribution from each of the three rules.
+Thus, a bare-bones boids implementation has each boid check every other boid in
+the simulation.
+
+#### Rule 1: Boids try to fly towards the centre of mass of neighbouring boids
+
+```
+function rule1(Boid boid)
+
+ Vector perceived_center
+
+ foreach Boid b:
+ if b != boid and distance(b, boid) < rule1Distance then
+ perceived_center += b.position
+ endif
+ end
+
+ perceived_center /= number_of_neighbors
+
+ return (perceived_center - boid.position) * rule1Scale
+end
+```
+
+#### Rule 2: Boids try to keep a small distance away from other objects (including other boids).
+
+```
+function rule2(Boid boid)
+
+ Vector c = 0
+
+ foreach Boid b
+ if b != boid and distance(b, boid) < rule2Distance then
+ c -= (b.position - boid.position)
+ endif
+ end
+
+ return c * rule2Scale
+end
+```
+
+#### Rule 3: Boids try to match velocity with near boids.
+
+```
+function rule3(Boid boid)
+
+ Vector perceived_velocity
+
+ foreach Boid b
+ if b != boid and distance(b, boid) < rule3Distance then
+ perceived_velocity += b.velocity
+ endif
+ end
+
+ perceived_velocity /= number_of_neighbors
+
+ return perceived_velocity * rule3Scale
+end
+```
+Based on [Conrad Parker's notes](http://www.vergenet.net/~conrad/boids/pseudocode.html) with slight adaptations. For the purposes of an interesting simulation,
+we will say that two boids only influence each other if they are
+within a certain **neighborhood distance** of each other.
+
+Since each rule only contributes a delta to the current velocity, the updated velocity is: `updated_vel = current_vel + rule1_cohesion + rule2_separation + rule3_alignment`
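+
+As a rough sketch (using GLM types and the project's `maxSpeed` clamp; the delta names here are illustrative), the combined update looks like:
+
+```
+// Sum the three rule contributions, then cap the speed while keeping the direction.
+glm::vec3 newVel = currentVel + rule1_cohesion + rule2_separation + rule3_alignment;
+if (glm::length(newVel) > maxSpeed) {
+    newVel = glm::normalize(newVel) * maxSpeed;
+}
+```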
+
+## Part 1: Boids with Naive Neighbor Search
+
+We simply check each boid against every other boid to see if they are within a certain neighborhood distance of each other.
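+
+A minimal sketch of that inner loop, mirroring `computeVelocityChange` in `src/kernel.cu` (the accumulators are assumed to be zero-initialized `glm::vec3`s and integer counters):
+
+```
+// For boid iSelf, scan all N boids once; only those inside a rule's radius contribute.
+for (int i = 0; i < N; i++) {
+    if (i == iSelf) continue;
+    float d = glm::distance(pos[i], pos[iSelf]);
+    if (d < rule1Distance) { perceivedCenter += pos[i]; cohesionCount++; }     // cohesion
+    if (d < rule2Distance) { separation -= pos[i] - pos[iSelf]; }              // separation
+    if (d < rule3Distance) { perceivedVelocity += vel[i]; alignmentCount++; }  // alignment
+}
+```
+
+Doing this for every boid makes the naive approach O(N^2) per timestep.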
+
+## Part 2: Let there be (better) flocking!
+
+### A quick explanation of uniform grids
+
+From Part 1, we observe that any two boids can only influence each other if they are
+within some *neighborhood distance* of each other.
+Based on this observation, we can see that having each boid check every
+other boid is very inefficient, especially if (as in our standard parameters)
+the number of boids is large and the neighborhood distance is much smaller than
+the full simulation space. We can cull a lot of neighbor checks using a
+data structure called a **uniform spatial grid**.
+
+A uniform grid is made up of cells that are at least as wide as the neighborhood
+distance and covers the entire simulation domain.
+Before computing the new velocities of the boids, we "bin" them into the grid in
+a preprocess step.
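+
+Binning reduces to converting each position into integer cell coordinates and flattening them into a single cell index; a sketch of the per-boid step, using the `gridMin`, `inverseCellWidth`, and `gridIndex3Dto1D` helpers from `src/kernel.cu`:
+
+```
+// Label boid `index` with the flattened index of its enclosing grid cell.
+glm::vec3 cell = glm::floor((pos[index] - gridMin) * inverseCellWidth);
+gridIndices[index] = gridIndex3Dto1D((int)cell.x, (int)cell.y, (int)cell.z, gridResolution);
+indices[index] = index;  // parallel array of pointers back into the boid data
+```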
+
+
+If the cell width is double the neighborhood distance, each boid only has to be
+checked against other boids in 8 cells, or 4 in the 2D case.
+
+
+
+We can build a uniform grid on the CPU by iterating over the boids, figuring out
+each one's enclosing cell, and then keeping a pointer to the boid in a resizable
+array representing the cell. However, this doesn't transfer well to the GPU
+because:
+
+1. We don't have resizable arrays on the GPU.
+2. Naively parallelizing the iteration may lead to race conditions, where two
+particles need to be written into the same bucket at the same time.
+
+Instead, we will construct the uniform grid by sorting. If we label each boid
+with an index representing its enclosing cell and then sort the list of
+boids by these indices, we can ensure that pointers to boids in the same cells
+are contiguous in memory.
+
+Then, we can walk over the array of sorted uniform grid indices and look at
+every pair of values. If the values differ, we know that we are at the border
+of the representation of two different cells. Storing these locations in a table
+with an entry for each cell gives us a complete representation of the uniform
+grid. This "table" can just be an array with as much space as there are cells.
+This process is data parallel and can be naively parallelized.
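+
+A sketch of that pipeline on the GPU, assuming Thrust device pointers wrapping the grid-index and boid-index buffers (as in `Boids::stepSimulationScatteredGrid`):
+
+```
+// Sort boid pointers by their cell index so each cell's boids form a contiguous run.
+thrust::sort_by_key(dev_thrust_particleGridIndices,
+                    dev_thrust_particleGridIndices + numObjects,
+                    dev_thrust_particleArrayIndices);
+// Then, in a kernel, thread i compares particleGridIndices[i] with particleGridIndices[i + 1]:
+// a mismatch means cell particleGridIndices[i] ends at i and cell particleGridIndices[i + 1] starts at i + 1.
+```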
+
+
+## Part 3: Cutting out the middleman - Coherent Memory Access
+Consider the uniform grid neighbor search: pointers to boids in
+a single cell are contiguous in memory, but the boid data itself (velocities and
+positions) is scattered all over the place. So we rearrange the boid data
+itself so that all the velocities and positions of boids in one cell are also
+contiguous in memory.
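+
+The rearrangement is a single gather pass driven by the sorted boid-pointer array; a sketch of the per-thread work done by the `kernReshuffle` kernel added in `src/kernel.cu`:
+
+```
+// Thread `index` gathers the data of the boid that sorted into slot `index`.
+int oldIndex = indices[index];          // sorted pointer into the original arrays
+posCoherent[index] = pos[oldIndex];     // position, now contiguous per cell
+vel1Coherent[index] = vel1[oldIndex];   // velocity, now contiguous per cell
+```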
+
+
+
+
+## Additional Optimizations
+
+**Spherical Optimization**
+
+We implemented an additional optimization based on the fact that the neighborhood of a boid is a sphere of the appropriate radius. Checking every surrounding cell wastes computation on cells that contain no part of that sphere. We can eliminate some of these cells by computing the distance from the boid to each cell's centroid and skipping the cell whenever that distance exceeds (neighborhood radius) + (half of the cell diagonal).
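+
+Concretely, a candidate cell is skipped whenever its centroid lies farther from the boid than the neighborhood radius plus half the cell diagonal; a sketch of the check used inside the neighbor-search kernels:
+
+```
+// Cull cells whose cube cannot intersect the neighborhood sphere.
+float cullDistance = radius + (sqrtf(3.0f) * cellWidth / 2.0f);  // radius + half of the cell diagonal
+glm::vec3 cellCentroid((x + 0.5f) * cellWidth, (y + 0.5f) * cellWidth, (z + 0.5f) * cellWidth);
+if (glm::distance(cellCentroid + gridMin, pos[index]) > cullDistance) {
+    continue;  // no part of the sphere can lie inside this cell
+}
+```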
+
+**Grid-Looping Optimization**
+
+Another optimization we implemented is support for a variable cell width. Instead of hard-coding the number of neighboring cells, the system computes which cells need to be checked in order to cover all possible neighbors. It does this by finding the axis-aligned cube that encloses the neighborhood sphere (whose radius is the maximum of the three rule radii), determining which cells overlap that cube, and searching only those cells for neighbors.
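+
+In code this means deriving the loop bounds from the neighborhood radius instead of hard-coding 8 or 27 cells; a sketch matching the neighbor-search kernels (with the project's `imax`/`imin` helpers clamping to the grid extents):
+
+```
+// Cells overlapped by the axis-aligned cube enclosing the neighborhood sphere.
+float radius = glm::max(glm::max(rule1Distance, rule2Distance), rule3Distance);
+glm::vec3 lo = glm::floor((pos[index] - gridMin - glm::vec3(radius)) * inverseCellWidth);
+glm::vec3 hi = glm::floor((pos[index] - gridMin + glm::vec3(radius)) * inverseCellWidth);
+for (int z = imax(0, (int)lo.z); z <= imin(gridResolution - 1, (int)hi.z); z++)
+    for (int y = imax(0, (int)lo.y); y <= imin(gridResolution - 1, (int)hi.y); y++)
+        for (int x = imax(0, (int)lo.x); x <= imin(gridResolution - 1, (int)hi.x); x++) {
+            // visit boids binned in cell (x, y, z) ...
+        }
+```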
+
+# Runtime Analysis
+
+**Abbreviations**
+
+* V: visualization enabled (VISUALIZE = 1)
+
+* NV: visualization disabled (VISUALIZE = 0)
+
+* UG: Uniform Grid
+
+* UGS: Uniform Grid with Spherical Optimization
+
+* Coherent: Uniform Grid with coherent memory access
+
+* CoherentS: Uniform Grid with coherent memory access and Spherical Optimization
+
+* **Effect of Number of Boids on Frames Per Second**
+
+ * **For each implementation, how does changing the number of boids affect performance? Why do you think this is?**
+ * We observe that the FPS goes down as the number of boids increases. This is due to the extra computation that comes with each new boid: since the density increases, the number of boids in each boid's neighborhood also increases.
+
+
+
+
+* **Effect of Block Size on Frames Per Second**
+
+ * **For each implementation, how does changing the block count and block size affect performance? Why do you think this is?**
+ * We observe that block size affects the FPS negligibly, although degraded performance was observed for block sizes that were not a multiple of 32. This is likely because the warp size is 32, so any block whose size is not a multiple of 32 launches a final warp with fewer than 32 active threads, which wastes throughput compared to fully occupied warps.
+
+
+
+
+* **Effect of Cell Width on Frames Per Second**
+
+ * **Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not? Be careful: it is insufficient (and possibly incorrect) to say that 27-cell is slower simply because there are more cells to check!**
+
+ * The effect of cell width was tested with 50,000 boids. We see an interesting peak when the cell width exactly equals the maximum rule radius. This is likely because, with this many boids, any cell width larger than one neighborhood radius increases the volume that must be searched, and therefore the number of boids that have to be distance-tested. So as the cell width grows, more and more of the boids examined lie outside the neighborhood sphere, leading to lower overall performance.
+
+
+
+* **For the coherent uniform grid: did you experience any performance improvements**
+ **with the more coherent uniform grid? Was this the outcome you expected?**
+ **Why or why not?**
+
+ * The performance depended on the number of boids. For low boid counts (tens and hundreds) the plain uniform grid actually performs better, but for high boid counts (hundreds of thousands) the coherent uniform grid was massively faster. This was expected: the reshuffling step in the coherent version adds overhead relative to the plain uniform grid, so it only pays off once there are enough boids. With many thousands of boids, the reduced memory latency from coherent accesses and the time saved not waiting on scattered memory fetches give the coherent version higher throughput than the plain uniform grid.
\ No newline at end of file
diff --git a/img/Coherent5000.gif b/img/Coherent5000.gif
new file mode 100644
index 0000000..2af7462
Binary files /dev/null and b/img/Coherent5000.gif differ
diff --git a/img/Coherent50000.gif b/img/Coherent50000.gif
new file mode 100644
index 0000000..f69a865
Binary files /dev/null and b/img/Coherent50000.gif differ
diff --git a/img/FPSvBlockSize.png b/img/FPSvBlockSize.png
new file mode 100644
index 0000000..3bb5da1
Binary files /dev/null and b/img/FPSvBlockSize.png differ
diff --git a/img/FPSvCellWidth.png b/img/FPSvCellWidth.png
new file mode 100644
index 0000000..7d8fc99
Binary files /dev/null and b/img/FPSvCellWidth.png differ
diff --git a/img/FPSvNumBoids.png b/img/FPSvNumBoids.png
new file mode 100644
index 0000000..e6a2a8d
Binary files /dev/null and b/img/FPSvNumBoids.png differ
diff --git a/src/kernel.cu b/src/kernel.cu
index 74dffcb..8ae9c0b 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -85,6 +85,9 @@ int *dev_gridCellEndIndices; // to this cell?
// TODO-2.3 - consider what additional buffers you might need to reshuffle
// the position and velocity data to be coherent within cells.
+glm::vec3 *coherent_pos;
+glm::vec3 *coherent_vel1;
+glm::vec3 *coherent_vel2;
// LOOK-2.1 - Grid parameters based on simulation parameters.
// These are automatically computed for you in Boids::initSimulation
@@ -145,12 +148,22 @@ void Boids::initSimulation(int N) {
cudaMalloc((void**)&dev_pos, N * sizeof(glm::vec3));
checkCUDAErrorWithLine("cudaMalloc dev_pos failed!");
+ cudaMalloc((void**)&coherent_pos, N * sizeof(glm::vec3));
+ checkCUDAErrorWithLine("cudaMalloc coherent_pos failed!");
+
cudaMalloc((void**)&dev_vel1, N * sizeof(glm::vec3));
checkCUDAErrorWithLine("cudaMalloc dev_vel1 failed!");
+ cudaMalloc((void**)&coherent_vel1, N * sizeof(glm::vec3));
+ checkCUDAErrorWithLine("cudaMalloc coherent_vel1 failed!");
+
cudaMalloc((void**)&dev_vel2, N * sizeof(glm::vec3));
checkCUDAErrorWithLine("cudaMalloc dev_vel2 failed!");
+ cudaMalloc((void**)&coherent_vel2, N * sizeof(glm::vec3));
+ checkCUDAErrorWithLine("cudaMalloc coherent_vel2 failed!");
+
+
// LOOK-1.2 - This is a typical CUDA kernel invocation.
kernGenerateRandomPosArray<<<fullBlocksPerGrid, blockSize>>>(1, numObjects,
dev_pos, scene_scale);
@@ -169,6 +182,22 @@ void Boids::initSimulation(int N) {
gridMinimum.z -= halfGridWidth;
// TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+
+ cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+ checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+ cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+ checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+ cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+ checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+ cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+ checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+ dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+ dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+
cudaDeviceSynchronize();
}
@@ -233,7 +262,54 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po
// Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
// Rule 2: boids try to stay a distance d away from each other
// Rule 3: boids try to match the speed of surrounding boids
- return glm::vec3(0.0f, 0.0f, 0.0f);
+
+ glm::vec3 rule1vel = glm::vec3(0.0f, 0.0f, 0.0f);
+ glm::vec3 rule2vel = glm::vec3(0.0f, 0.0f, 0.0f);
+ glm::vec3 rule3vel = glm::vec3(0.0f, 0.0f, 0.0f);
+ glm::vec3 rule1pos = glm::vec3(0.0f, 0.0f, 0.0f);
+
+ int rule1num = 0;
+ int rule3num = 0;
+
+ for (int i = 0; i < N; i++) {
+ // Skip the boid itself
+ if (iSelf == i) {
+ continue;
+ }
+ // Add position for calculating center of mass
+ // Keeping count of neighbours in given radius
+ if (glm::distance(pos[i], pos[iSelf]) < rule1Distance) {
+ rule1num++;
+ rule1pos += pos[i];
+ }
+ // Rule 2 (separation): steer away from neighbours that are too close
+ if (glm::distance(pos[i], pos[iSelf]) < rule2Distance) {
+ rule2vel -= (pos[i] - pos[iSelf]);
+ }
+ // Rule 3 (alignment): accumulate neighbour velocities and keep count
+ if (glm::distance(pos[i], pos[iSelf]) < rule3Distance) {
+ rule3num++;
+ rule3vel += vel[i];
+ }
+ }
+
+ // Finding center of mass and calculating velocity
+ if (rule1num > 0) {
+ rule1pos /= rule1num;
+ rule1vel = (rule1pos - pos[iSelf]) * rule1Scale;
+ }
+
+ // Velocity to keep distance from other boids
+ rule2vel *= rule2Scale;
+
+ // Calculating average velocity of neighbouring boids
+ if (rule3num > 0) {
+ rule3vel /= rule3num;
+ rule3vel *= rule3Scale;
+ }
+
+ return rule1vel + rule2vel + rule3vel;
}
/**
@@ -242,9 +318,21 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po
*/
__global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
glm::vec3 *vel1, glm::vec3 *vel2) {
+
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= N) {
+ return;
+ }
+
// Compute a new velocity based on pos and vel1
+ glm::vec3 newVelocity = vel1[index] + computeVelocityChange(N, index, pos, vel1);
+
// Clamp the speed
+ if (glm::length(newVelocity) > maxSpeed) {
+ newVelocity = glm::normalize(newVelocity)*maxSpeed;
+ }
// Record the new velocity into vel2. Question: why NOT vel1?
+ vel2[index] = newVelocity;
}
/**
@@ -286,9 +374,25 @@ __global__ void kernComputeIndices(int N, int gridResolution,
glm::vec3 gridMin, float inverseCellWidth,
glm::vec3 *pos, int *indices, int *gridIndices) {
// TODO-2.1
+
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= N) {
+ return;
+ }
+
// - Label each boid with the index of its grid cell.
+
+ glm::vec3 gridCellIndexFloat = (pos[index] - gridMin) * inverseCellWidth;
+ int xCoordinate = (int)gridCellIndexFloat.x;
+ int yCoordinate = (int)gridCellIndexFloat.y;
+ int zCoordinate = (int)gridCellIndexFloat.z;
+ int cellIndex = gridIndex3Dto1D(xCoordinate, yCoordinate, zCoordinate, gridResolution);
+ gridIndices[index] = cellIndex;
+
// - Set up a parallel array of integer indices as pointers to the actual
// boid data in pos and vel1/vel2
+
+ indices[index] = index;
}
// LOOK-2.1 Consider how this could be useful for indicating that a cell
@@ -303,7 +407,27 @@ __global__ void kernResetIntBuffer(int N, int *intBuffer, int value) {
__global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
int *gridCellStartIndices, int *gridCellEndIndices) {
// TODO-2.1
- // Identify the start point of each cell in the gridIndices array.
+
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (index >= N) {
+ return;
+ }
+ // Identify the start point of each cell in the gridIndices array.
+ if (index + 1 >= N) {
+ gridCellEndIndices[particleGridIndices[index]] = index;
+ return;
+ }
+
+ if (index == 0) {
+ gridCellStartIndices[particleGridIndices[index]] = index;
+ }
+
+ if (particleGridIndices[index] != particleGridIndices[index + 1]) {
+ gridCellStartIndices[particleGridIndices[index + 1]] = index + 1;
+ gridCellEndIndices[particleGridIndices[index]] = index;
+ }
+
+
// This is basically a parallel unrolling of a loop that goes
// "this index doesn't match the one before it, must be a new cell!"
}
@@ -314,14 +438,116 @@ __global__ void kernUpdateVelNeighborSearchScattered(
int *gridCellStartIndices, int *gridCellEndIndices,
int *particleArrayIndices,
glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
+
+
// TODO-2.1 - Update a boid's velocity using the uniform grid to reduce
// the number of boids that need to be checked.
+
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= N) {
+ return;
+ }
// - Identify the grid cell that this particle is in
+
+ glm::vec3 currentCellCoords = glm::floor(glm::vec3(pos[index] - gridMin) * inverseCellWidth);
+
// - Identify which cells may contain neighbors. This isn't always 8.
- // - For each cell, read the start/end indices in the boid pointer array.
- // - Access each boid in the cell and compute velocity change from
- // the boids rules, if this boid is within the neighborhood distance.
- // - Clamp the speed change before putting the new speed in vel2
+ const float radius = glm::max(glm::max(rule1Distance, rule2Distance), rule3Distance);
+ glm::vec3 lowerBoundCellCoord = glm::floor((pos[index] - gridMin - glm::vec3(radius)) * inverseCellWidth);
+ glm::vec3 upperBoundCellCoord = glm::floor((pos[index] - gridMin + glm::vec3(radius)) * inverseCellWidth);
+
+ glm::vec3 rule1vel = glm::vec3(0.0f, 0.0f, 0.0f);
+ glm::vec3 rule2vel = glm::vec3(0.0f, 0.0f, 0.0f);
+ glm::vec3 rule3vel = glm::vec3(0.0f, 0.0f, 0.0f);
+ glm::vec3 rule1pos = glm::vec3(0.0f, 0.0f, 0.0f);
+
+ int rule1num = 0;
+ int rule3num = 0;
+
+ float radiusPlusRootThreeByTwoCellWidth = radius + (sqrtf(3)*cellWidth / 2);
+
+ // - For each cell, read the start/end indices in the boid pointer array.
+ for (int z = imax(0, lowerBoundCellCoord.z); z <= imin(gridResolution - 1, upperBoundCellCoord.z); z++) {
+ for (int y = imax(0, lowerBoundCellCoord.y); y <= imin(gridResolution - 1, upperBoundCellCoord.y); y++) {
+ for (int x = imax(0, lowerBoundCellCoord.x); x <= imin(gridResolution - 1, upperBoundCellCoord.x); x++) {
+
+ // Skip the cell if its centroid is farther than radius + (sqrt(3)/2)*cellWidth from the boid,
+ // i.e. the cell cannot intersect the neighborhood sphere.
+ glm::vec3 cubeCentroid((x + 0.5)*cellWidth, (y + 0.5)*cellWidth, (z + 0.5)*cellWidth);
+ if (glm::distance(cubeCentroid + gridMin, pos[index]) > radiusPlusRootThreeByTwoCellWidth) {
+ continue;
+ }
+
+ int currentCellIndex = gridIndex3Dto1D(x, y, z, gridResolution);
+ int startIndex = gridCellStartIndices[currentCellIndex];
+ int endIndex = gridCellEndIndices[currentCellIndex];
+ if (endIndex < 0) {
+ continue;
+ }
+ // - Access each boid in the cell and compute velocity change from
+ // the boids rules, if this boid is within the neighborhood distance.
+ for (int neighbourArrayIndex = startIndex; neighbourArrayIndex <= endIndex; neighbourArrayIndex++) {
+ int currentNeighbour = particleArrayIndices[neighbourArrayIndex];
+
+ // Skip the boid itself
+ if (index == currentNeighbour) {
+ continue;
+ }
+ // Add position for calculating center of mass
+ // Keeping count of neighbours in given radius
+ if (glm::distance(pos[currentNeighbour], pos[index]) < rule1Distance) {
+ rule1num++;
+ rule1pos += pos[currentNeighbour];
+ }
+ // Rule 2 (separation): steer away from neighbours that are too close
+ if (glm::distance(pos[currentNeighbour], pos[index]) < rule2Distance) {
+ rule2vel -= (pos[currentNeighbour] - pos[index]);
+ }
+ // Rule 3 (alignment): accumulate neighbour velocities and keep count
+ if (glm::distance(pos[currentNeighbour], pos[index]) < rule3Distance) {
+ rule3num++;
+ rule3vel += vel1[currentNeighbour];
+ }
+ }
+ }
+ }
+ }
+
+ // Finding center of mass and calculating velocity
+ if (rule1num > 0) {
+ rule1pos /= rule1num;
+ rule1vel = (rule1pos - pos[index]) * rule1Scale;
+ }
+
+ // Velocity to keep distance from other boids
+ rule2vel *= rule2Scale;
+
+ // Calculating average velocity of neighbouring boids
+ if (rule3num > 0) {
+ rule3vel /= rule3num;
+ rule3vel *= rule3Scale;
+ }
+
+
+ // Compute a new velocity based on pos and vel1
+ glm::vec3 newVelocity = vel1[index] + rule1vel + rule2vel + rule3vel;
+ if (glm::length(newVelocity) > maxSpeed) {
+ newVelocity = glm::normalize(newVelocity)*maxSpeed;
+ }
+ // - Clamp the speed change before putting the new speed in vel2
+ vel2[index] = newVelocity;
+}
+
+
+__global__ void kernReshuffle(int N, int *indices, glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *posCoherent, glm::vec3 *vel1Coherent) {
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (index >= N) {
+ return;
+ }
+ int oldIndex = indices[index];
+ posCoherent[index] = pos[oldIndex];
+ vel1Coherent[index] = vel1[oldIndex];
}
__global__ void kernUpdateVelNeighborSearchCoherent(
@@ -329,18 +555,109 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
float inverseCellWidth, float cellWidth,
int *gridCellStartIndices, int *gridCellEndIndices,
glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
- // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered,
- // except with one less level of indirection.
// This should expect gridCellStartIndices and gridCellEndIndices to refer
// directly to pos and vel1.
- // - Identify the grid cell that this particle is in
- // - Identify which cells may contain neighbors. This isn't always 8.
- // - For each cell, read the start/end indices in the boid pointer array.
+
+
+ // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered,
+ // except with one less level of indirection.
+
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= N) {
+ return;
+ }
+ // - Identify the grid cell that this particle is in
+
+ glm::vec3 currentCellCoords = glm::floor(glm::vec3(pos[index] - gridMin) * inverseCellWidth);
+
+ // - Identify which cells may contain neighbors. This isn't always 8.
+ const float radius = glm::max(glm::max(rule1Distance, rule2Distance), rule3Distance);
+ glm::vec3 lowerBoundCellCoord = glm::floor((pos[index] - gridMin - glm::vec3(radius)) * inverseCellWidth);
+ glm::vec3 upperBoundCellCoord = glm::floor((pos[index] - gridMin + glm::vec3(radius)) * inverseCellWidth);
+
+
+ glm::vec3 rule1vel = glm::vec3(0.0f, 0.0f, 0.0f);
+ glm::vec3 rule2vel = glm::vec3(0.0f, 0.0f, 0.0f);
+ glm::vec3 rule3vel = glm::vec3(0.0f, 0.0f, 0.0f);
+ glm::vec3 rule1pos = glm::vec3(0.0f, 0.0f, 0.0f);
+
+ int rule1num = 0;
+ int rule3num = 0;
+
+ float radiusPlusRootThreeByTwoCellWidth = radius + (sqrtf(3)*cellWidth / 2);
+
+ // - For each cell, read the start/end indices in the boid pointer array.
// DIFFERENCE: For best results, consider what order the cells should be
// checked in to maximize the memory benefits of reordering the boids data.
- // - Access each boid in the cell and compute velocity change from
- // the boids rules, if this boid is within the neighborhood distance.
- // - Clamp the speed change before putting the new speed in vel2
+ for (int z = imax(0, lowerBoundCellCoord.z); z <= imin(gridResolution - 1, upperBoundCellCoord.z); z++) {
+ for (int y = imax(0, lowerBoundCellCoord.y); y <= imin(gridResolution - 1, upperBoundCellCoord.y); y++) {
+ for (int x = imax(0, lowerBoundCellCoord.x); x <= imin(gridResolution - 1, upperBoundCellCoord.x); x++) {
+
+ // Skip the cell if its centroid is farther than radius + (sqrt(3)/2)*cellWidth from the boid,
+ // i.e. the cell cannot intersect the neighborhood sphere.
+ glm::vec3 cubeCentroid((x + 0.5)*cellWidth, (y + 0.5)*cellWidth, (z + 0.5)*cellWidth);
+ if (glm::distance(cubeCentroid + gridMin, pos[index]) > radiusPlusRootThreeByTwoCellWidth) {
+ continue;
+ }
+
+ int currentCellIndex = gridIndex3Dto1D(x, y, z, gridResolution);
+ int startIndex = gridCellStartIndices[currentCellIndex];
+ int endIndex = gridCellEndIndices[currentCellIndex];
+ if (endIndex < 0) {
+ continue;
+ }
+ // - Access each boid in the cell and compute velocity change from
+ // the boids rules, if this boid is within the neighborhood distance.
+ for (int currentNeighbour = startIndex; currentNeighbour <= endIndex; currentNeighbour++) {
+
+ // Skip the boid itself
+ if (index == currentNeighbour) {
+ continue;
+ }
+ // Add position for calculating center of mass
+ // Keeping count of neighbours in given radius
+ if (glm::distance(pos[currentNeighbour], pos[index]) < rule1Distance) {
+ rule1num++;
+ rule1pos += pos[currentNeighbour];
+ }
+ // Rule 2 (separation): steer away from neighbours that are too close
+ if (glm::distance(pos[currentNeighbour], pos[index]) < rule2Distance) {
+ rule2vel -= (pos[currentNeighbour] - pos[index]);
+ }
+ // Rule 3 (alignment): accumulate neighbour velocities and keep count
+ if (glm::distance(pos[currentNeighbour], pos[index]) < rule3Distance) {
+ rule3num++;
+ rule3vel += vel1[currentNeighbour];
+ }
+ }
+ }
+ }
+ }
+
+ // Finding center of mass and calculating velocity
+ if (rule1num > 0) {
+ rule1pos /= rule1num;
+ rule1vel = (rule1pos - pos[index]) * rule1Scale;
+ }
+
+ // Velocity to keep distance from other boids
+ rule2vel *= rule2Scale;
+
+ // Calculating average velocity of neighbouring boids
+ if (rule3num > 0) {
+ rule3vel /= rule3num;
+ rule3vel *= rule3Scale;
+ }
+
+
+ // Compute a new velocity based on pos and vel1
+ glm::vec3 newVelocity = vel1[index] + rule1vel + rule2vel + rule3vel;
+ if (glm::length(newVelocity) > maxSpeed) {
+ newVelocity = glm::normalize(newVelocity)*maxSpeed;
+ }
+ // - Clamp the speed change before putting the new speed in vel2
+ vel2[index] = newVelocity;
}
/**
@@ -349,39 +666,80 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
void Boids::stepSimulationNaive(float dt) {
// TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
// TODO-1.2 ping-pong the velocity buffers
+
+ dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+
+ kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2);
+ kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+
+ //checkCUDAErrorWithLine("copyBoidsToVBO failed!");
+ //dev_vel1 = dev_vel2;
+ cudaMemcpy(dev_vel1, dev_vel2, sizeof(glm::vec3) * numObjects, cudaMemcpyDeviceToDevice);
+ //cudaDeviceSynchronize();
}
void Boids::stepSimulationScatteredGrid(float dt) {
// TODO-2.1
// Uniform Grid Neighbor search using Thrust sort.
+ dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
// In Parallel:
// - label each particle with its array index as well as its grid index.
// Use 2x width grids.
+ kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
// - Unstable key sort using Thrust. A stable sort isn't necessary, but you
// are welcome to do a performance comparison.
+ thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+ dim3 numOfGridBlocks((gridCellCount + blockSize - 1) / blockSize);
+ kernResetIntBuffer<<<numOfGridBlocks, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+ kernResetIntBuffer<<<numOfGridBlocks, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1);
// - Naively unroll the loop for finding the start and end indices of each
// cell's data pointers in the array of boid indices
+ kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+
// - Perform velocity updates using neighbor search
+ kernUpdateVelNeighborSearchScattered<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2);
+
// - Update positions
+ kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+
// - Ping-pong buffers as needed
+ cudaMemcpy(dev_vel1, dev_vel2, sizeof(glm::vec3) * numObjects, cudaMemcpyDeviceToDevice);
}
void Boids::stepSimulationCoherentGrid(float dt) {
// TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
// Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
- // In Parallel:
- // - Label each particle with its array index as well as its grid index.
- // Use 2x width grids
- // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
- // are welcome to do a performance comparison.
- // - Naively unroll the loop for finding the start and end indices of each
- // cell's data pointers in the array of boid indices
- // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
- // the particle data in the simulation array.
- // CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
- // - Perform velocity updates using neighbor search
- // - Update positions
- // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+ dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+ // In Parallel:
+ // - label each particle with its array index as well as its grid index.
+ // Use 2x width grids.
+ kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+ // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
+ // are welcome to do a performance comparison.
+ thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+ // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
+ // the particle data in the simulation array.
+ // CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
+
+ kernReshuffle<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleArrayIndices, dev_pos, dev_vel1, coherent_pos, coherent_vel1);
+
+
+ dim3 numOfGridBlocks((gridCellCount + blockSize - 1) / blockSize);
+ kernResetIntBuffer<<<numOfGridBlocks, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+ kernResetIntBuffer<<<numOfGridBlocks, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1);
+ // - Naively unroll the loop for finding the start and end indices of each
+ // cell's data pointers in the array of boid indices
+ kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+
+ // - Perform velocity updates using neighbor search
+ kernUpdateVelNeighborSearchCoherent<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, coherent_pos, coherent_vel1, coherent_vel2);
+
+ // - Update positions
+ kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, coherent_pos, coherent_vel2);
+
+ // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+ cudaMemcpy(dev_vel1, coherent_vel2, sizeof(glm::vec3) * numObjects, cudaMemcpyDeviceToDevice);
+ cudaMemcpy(dev_pos, coherent_pos, sizeof(glm::vec3) * numObjects, cudaMemcpyDeviceToDevice);
}
void Boids::endSimulation() {
diff --git a/src/main.cpp b/src/main.cpp
index b82c8c6..ddd0e3b 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -14,8 +14,8 @@
// LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
#define VISUALIZE 1
-#define UNIFORM_GRID 0
-#define COHERENT_GRID 0
+#define UNIFORM_GRID 1
+#define COHERENT_GRID 1
// LOOK-1.2 - change this to adjust particle count in the simulation
const int N_FOR_VIS = 5000;
diff --git a/src/main.hpp b/src/main.hpp
index 88e9df7..e6a1215 100644
--- a/src/main.hpp
+++ b/src/main.hpp
@@ -36,8 +36,9 @@ const float fovy = (float) (PI / 4);
const float zNear = 0.10f;
const float zFar = 10.0f;
// LOOK-1.2: for high DPI displays, you may want to double these settings.
-int width = 1280;
-int height = 720;
+int highResolutionMultiplicationFactor = 1;
+int width = 1280 * highResolutionMultiplicationFactor;
+int height = 720 * highResolutionMultiplicationFactor;
int pointSize = 2;
// For camera controls