diff --git a/README.md b/README.md
index d63a6a1..b05acd4 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,161 @@
**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
Project 1 - Flocking**
-* (TODO) YOUR NAME HERE
- * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Srinath Rajagopalan
+ * [LinkedIn](https://www.linkedin.com/in/srinath-rajagopalan-07a43155), [twitter](https://twitter.com/srinath132)
+* Tested on: Windows 10, i7-6700 @ 3.4GHz 16GB, Nvidia Quadro P1000 4GB (Moore 100B Lab)
+
+## Simulating Boids
+
+For this project, we simulate the motion of boids (objects representing birds/fish) by making them adhere to three rules governing their velocity:
+
+1. cohesion - boids move towards the perceived center of mass of their neighbors
+2. separation - boids avoid getting too close to their neighbors
+3. alignment - boids generally try to move with the same direction and speed as
+their neighbors
+
+The final simulation for `N=5000` boids and `N=20000` boids:
+
+![5000 boids](data/5k_boids.gif)
+
+![20000 boids](data/20k_boids.gif)
+
+### Velocity Update
+
+Initially, all the boids assume a random position in space. We advance the simulation by computing their velocities (derived from the three rules) and updating their positions. By introducing _parallelism_ we can enforce all the rules for the N boids with N threads, where each thread computes the velocity update for its boid. The naive approach is for each boid to check the other N-1 boids, find all of its neighbours, and compute the velocity update (a CUDA sketch of this per-boid thread mapping follows the three rules below). The rules would be implemented as follows:
+
+#### Rule 1: Boids try to fly towards the centre of mass of neighbouring boids
+
+```
+function rule1(Boid boid)
+
+    Vector perceived_center = 0
+    number_of_neighbors = 0
+
+    foreach Boid b:
+        if b != boid and distance(b, boid) < rule1Distance then
+            perceived_center += b.position
+            number_of_neighbors += 1
+        endif
+    end
+
+    if number_of_neighbors > 0 then
+        perceived_center /= number_of_neighbors
+        return (perceived_center - boid.position) * rule1Scale
+    endif
+
+    return 0
+end
+```
+
+#### Rule 2: Boids try to keep a small distance away from other objects (including other boids).
+
+```
+function rule2(Boid boid)
+
+ Vector c = 0
+ foreach Boid b
+ if b != boid and distance(b, boid) < rule2Distance then
+ c -= (b.position - boid.position)
+ endif
+ end
+
+ return c * rule2Scale
+end
+```
+
+#### Rule 3: Boids try to match velocity with near boids.
+
+```
+function rule3(Boid boid)
+
+    Vector perceived_velocity = 0
+    number_of_neighbors = 0
+
+    foreach Boid b:
+        if b != boid and distance(b, boid) < rule3Distance then
+            perceived_velocity += b.velocity
+            number_of_neighbors += 1
+        endif
+    end
+
+    if number_of_neighbors > 0 then
+        perceived_velocity /= number_of_neighbors
+        return perceived_velocity * rule3Scale
+    endif
+
+    return 0
+end
+```
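+
+Each of the N boids maps to one CUDA thread. A minimal sketch of this mapping (it mirrors `kernUpdateVelocityBruteForce` in `src/kernel.cu`, where `computeVelocityChange` accumulates the three rules above and `maxSpeed` is a project constant):
+
+```
+__global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
+    int boidId = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (boidId >= N) return;  // guard the last, partially filled block
+
+    // scan all other boids and accumulate the contributions of the three rules
+    glm::vec3 dv = computeVelocityChange(N, boidId, pos, vel1);
+
+    // write into vel2 (not vel1) so other threads still read a consistent "old" velocity
+    vel2[boidId] = glm::clamp(vel1[boidId] + dv, -maxSpeed, maxSpeed);
+}
+```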
+
+However, this approach is extremely inefficient because each boid performs N-1 checks when it really only has a small number of potential neighbours. A more efficient approach is to add a pre-processing step that bins the boids in space with the help of a uniform cell grid. Say the simulation space is a cube enclosing all the boids. Divide the cube into cells, each of dimension `(gridCellWidth, gridCellWidth, gridCellWidth)`. Now if the cell width is `2 * neighbourDistance`, each boid only has to check the boids present in the 8 surrounding cells (the 2x2x2 block of cells nearest to it). This drastically reduces the search space. By controlling the cell width we can limit the number of neighbouring cells we have to check. The following grid illustrates this.
+
+
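+In code, mapping a boid's position to its cell reduces to a floor division followed by flattening the 3D cell coordinate. A sketch (it mirrors `kernComputeIndices` and `gridIndex3Dto1D` in `src/kernel.cu`, where `gridMin` is the minimum corner of the simulation cube, `inverseCellWidth = 1 / gridCellWidth`, and `gridResolution` is the number of cells along one side):
+
+```
+// 3D cell coordinates of the boid
+glm::vec3 cell = glm::floor((pos[boidId] - gridMin) * inverseCellWidth);
+
+// flatten (x, y, z) into a single cell index, with x changing fastest
+int gridId = (int)cell.x
+           + (int)cell.y * gridResolution
+           + (int)cell.z * gridResolution * gridResolution;
+```
+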
+
+How do we construct the uniform grid in parallel?
+
+## Uniform Grid Cell Construction
+
+1. Construct an array of size N, `boidIndices[0..N-1]`, where `boidIndices[i] = i`.
+2. Construct an array of size N, `gridIndices[0..N-1]`, where `gridIndices[i]` is the grid cell to which boid `i` belongs.
+3. Sort the `boidIndices` and `gridIndices` arrays by the value of `gridIndices`. Now `gridIndices` is in ascending order, and `gridIndices[i]` is the grid cell to which boid `boidIndices[i]` belongs.
+4. Construct two arrays of size `gridCellCount` each, `gridCellStart` and `gridCellEnd`, where `gridCellStart[g]` and `gridCellEnd[g]` are the range of positions (in the `boidIndices`/`gridIndices` arrays) holding the boids belonging to cell `g`.
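+
+A sketch of how steps 3 and 4 are driven from the host with Thrust (mirroring the calls in `Boids::stepSimulationScatteredGrid` in `src/kernel.cu`; `numBlocks` is the usual `ceil(N / blockSize)` launch size):
+
+```
+// step 3: sort boidIndices (values) by gridIndices (keys) so that boids
+// belonging to the same cell end up contiguous in the sorted order
+thrust::sort_by_key(dev_thrust_particleGridIndices,
+                    dev_thrust_particleGridIndices + N,
+                    dev_thrust_particleArrayIndices);
+
+// step 4: walk the sorted gridIndices in parallel and record, for each cell,
+// the first and last sorted position that belongs to it
+kernIdentifyCellStartEnd<<<numBlocks, blockSize>>>(N, dev_particleGridIndices,
+    dev_gridCellStartIndices, dev_gridCellEndIndices);
+```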
+
+The above is represented by the following diagram:
+
+
+
+Once we have this, for each boid we figure out the potential neighbour grid cells (at most 8 if `gridCellWidth = 2 * neighbourDistance`). With the uniform grid constructed as above, we can then access all the boids present in the neighbouring cells and compute the velocity update.
+
+## Coherent Search
+
+In the uniform grid construction above, we access the position and velocity of neighbouring boids by getting their index from the `boidIndices` array and using it to look up `pos[boidIndices[i]]` and `vel[boidIndices[i]]`. This is a _scattered_ search because we are continuously accessing position and velocity values from non-contiguous memory locations. If `pos` and `vel` were _also_ aligned with `gridIndices` and `boidIndices`, we could directly access the positions of neighbouring boids in a grid cell using `gridCellStart` and `gridCellEnd`: `pos[gridCellStart[g]]` to `pos[gridCellEnd[g]]` gives the positions of all the potential neighbours. Therefore we also sort the position and velocity arrays by the value of `gridIndices`. Since we have already sorted `boidIndices`, a simple shuffle (gather) pass is enough to rearrange the `pos` and `vel` data to reflect the sort. Now that the data is stored in contiguous memory locations, we get more efficient memory access (blocks of data will be cached locally, so we may not have to hit global memory on every access). This is reflected in the following diagram.
+
+
+
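+The reshuffle itself is a simple gather kernel (this mirrors `kernShufflePosAndVel` in `src/kernel.cu`): each thread copies the data of one boid into a new, cell-contiguous buffer.
+
+```
+__global__ void kernShufflePosAndVel(int N, glm::vec3 *pos, glm::vec3 *vel,
+                                     glm::vec3 *shuffledPos, glm::vec3 *shuffledVel,
+                                     int *particleArrayIndices) {
+    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (index >= N) return;
+
+    // shuffledPos[i]/shuffledVel[i] now hold the data of the boid that is
+    // i-th in the cell-sorted order, so neighbours in a cell are contiguous
+    shuffledPos[index] = pos[particleArrayIndices[index]];
+    shuffledVel[index] = vel[particleArrayIndices[index]];
+}
+```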
+
+### Performance Analysis
+
+We measure the time (in milliseconds) taken for one complete simulation step using CUDA events. The analogous FPS is computed as `1000 / time_taken`. The time taken is averaged over 1000 simulation steps.
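+
+The measurement brackets the kernels of one step with CUDA events (this mirrors the timing code added to each `stepSimulation*` function in `src/kernel.cu`):
+
+```
+cudaEvent_t start, stop;
+cudaEventCreate(&start);
+cudaEventCreate(&stop);
+
+cudaEventRecord(start);
+// ... launch the kernels for one simulation step ...
+cudaEventRecord(stop);
+
+cudaEventSynchronize(stop);   // wait for the recorded work to finish
+float milliseconds = 0;
+cudaEventElapsedTime(&milliseconds, start, stop);
+```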
+
+1. How does changing the number of boids affect performance? Why do you think this is?
+
+![Naive search](data/naive_boids.png)
+
+![Scattered grid search](data/scattered_boids.png)
+
+![Coherent grid search](data/coherent_boids.png)
+
+In all cases, performance drops as the number of boids increases. This is expected: more boids mean more calculations, both because each boid has more potential neighbours (since the volume is kept constant, the density has to increase) and because there are more boids to compute a velocity update for. Theoretically, if we had a very large number of processors executing all the threads in a truly parallel way, only the neighbour density would affect performance, not the total number of velocity updates.
+
+2. For the coherent uniform grid: did you experience any performance improvements
+with the more coherent uniform grid? Was this the outcome you expected? Why or why not?
+
+From the graphs we see that the scattered search outperforms the naive search (unsurprisingly): the uniform grid construction drastically reduces the search space per boid, as explained in the previous sections. The coherent search improves upon the scattered search by making memory access more efficient. This is not apparent until the number of boids gets large. With the coherent grid we are able to simulate up to 1M boids, but only 500k boids with the scattered search. The naive search fails beyond 50k boids.
+
+3. How does changing the block count and block size affect performance? Why do you think this is?
+
+The following graph is plotted for scattered search with 50,000 boids.
+
+![Block size vs. performance](data/block_size.png)
+
+Since the number of blocks is computed as `numberOfBlocks = ceil(N / blockSize)`, increasing the block size decreases the number of blocks. So we only check a few block sizes (8, 128, 1024), including the extremes, to see if there is a trend. The performance impact is not noticeable beyond a block size of 128. But for an absurdly small number of threads per block (block size = 8), performance drops quite significantly. This is probably because of the overhead of launching and scheduling a much larger number of blocks, and because a block of 8 threads still occupies a full 32-thread warp, leaving most warp lanes idle.
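+
+For reference, with `N = 50000` the launch configurations work out as follows (a quick check of `numberOfBlocks = ceil(N / blockSize)`):
+
+```
+blockSize =    8  ->  numberOfBlocks = 6250
+blockSize =  128  ->  numberOfBlocks =  391
+blockSize = 1024  ->  numberOfBlocks =   49
+```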
+
+4. Did changing cell width and checking 27 vs 8 neighboring cells affect performance?
+Why or why not?
+
+The following graph is plotted for scattered search with 50,000 boids.
+
+![Cell width vs. performance](data/cell_width.png)
+
+As discussed before, the cell width controls the number of surrounding grid cells we have to search. In the extreme cases: with a very large cell width we check a single cell, but ALL the boids within it; with a small cell width we check MORE grid cells, but fewer boids within each of them. So there is a trade-off to strike. We see that a cell width of 2x the neighbourhood distance is the sweet spot.
+
+
+
+
+
+
+
-### (TODO: Your README)
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
diff --git a/data/20k_boids.gif b/data/20k_boids.gif
new file mode 100644
index 0000000..f87b5c3
Binary files /dev/null and b/data/20k_boids.gif differ
diff --git a/data/5k_boids.gif b/data/5k_boids.gif
new file mode 100644
index 0000000..55364d8
Binary files /dev/null and b/data/5k_boids.gif differ
diff --git a/data/block_size.png b/data/block_size.png
new file mode 100644
index 0000000..0c82617
Binary files /dev/null and b/data/block_size.png differ
diff --git a/data/cell_width.png b/data/cell_width.png
new file mode 100644
index 0000000..676b6ed
Binary files /dev/null and b/data/cell_width.png differ
diff --git a/data/coherent_boids.png b/data/coherent_boids.png
new file mode 100644
index 0000000..9107a62
Binary files /dev/null and b/data/coherent_boids.png differ
diff --git a/data/naive_boids.png b/data/naive_boids.png
new file mode 100644
index 0000000..05ca335
Binary files /dev/null and b/data/naive_boids.png differ
diff --git a/data/scattered_boids.png b/data/scattered_boids.png
new file mode 100644
index 0000000..7941edb
Binary files /dev/null and b/data/scattered_boids.png differ
diff --git a/perf_boids.xlsx b/perf_boids.xlsx
new file mode 100644
index 0000000..4dee1cb
Binary files /dev/null and b/perf_boids.xlsx differ
diff --git a/src/kernel.cu b/src/kernel.cu
index 74dffcb..f721a9c 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -85,6 +85,8 @@ int *dev_gridCellEndIndices; // to this cell?
// TODO-2.3 - consider what additional buffers you might need to reshuffle
// the position and velocity data to be coherent within cells.
+glm::vec3 *dev_shuffledPos;
+glm::vec3 *dev_shuffledVel;
// LOOK-2.1 - Grid parameters based on simulation parameters.
// These are automatically computed for you in Boids::initSimulation
@@ -94,6 +96,9 @@ float gridCellWidth;
float gridInverseCellWidth;
glm::vec3 gridMinimum;
+float avgTimeTaken = 0.0f;
+float countTimeTaken = 0;
+
/******************
* initSimulation *
******************/
@@ -157,7 +162,7 @@ void Boids::initSimulation(int N) {
checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!");
// LOOK-2.1 computing grid params
- gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
+ gridCellWidth = 2*std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
int halfSideCount = (int)(scene_scale / gridCellWidth) + 1;
gridSideCount = 2 * halfSideCount;
@@ -169,7 +174,29 @@ void Boids::initSimulation(int N) {
gridMinimum.z -= halfGridWidth;
// TODO-2.1 TODO-2.3 - Allocate additional buffers here.
- cudaDeviceSynchronize();
+ cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+ checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+ cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+ checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+ cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+ checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+ cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+ checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+ dev_thrust_particleArrayIndices = thrust::device_pointer_cast(dev_particleArrayIndices);
+ dev_thrust_particleGridIndices = thrust::device_pointer_cast(dev_particleGridIndices);
+
+ cudaMalloc((void**)&dev_shuffledPos, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_shuffledPos failed!");
+
+ cudaMalloc((void**)&dev_shuffledVel, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_shuffledVel failed!");
+
+
+ cudaDeviceSynchronize();
}
@@ -230,21 +257,70 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
* in the `pos` and `vel` arrays.
*/
__device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) {
- // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
- // Rule 2: boids try to stay a distance d away from each other
- // Rule 3: boids try to match the speed of surrounding boids
- return glm::vec3(0.0f, 0.0f, 0.0f);
+
+ // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+ // Rule 2: boids try to stay a distance d away from each other
+ // Rule 3: boids try to match the speed of surrounding boids
+
+ // Rule 1
+ glm::vec3 finalDir(0.0f, 0.0f, 0.0f);
+ int numOfNeighbours = 0;
+
+ for (int i = 0; i < N; i++) {
+ if (i != iSelf && glm::distance(pos[iSelf], pos[i]) < rule1Distance) {
+ finalDir += pos[i];
+ numOfNeighbours += 1;
+ }
+ }
+
+ if (numOfNeighbours != 0) {
+ finalDir /= numOfNeighbours;
+ finalDir = (finalDir - pos[iSelf]) * rule1Scale;
+ }
+
+ // Rule 2
+ glm::vec3 finalDir2(0.0f, 0.0f, 0.0f);
+ for (int i = 0; i < N; i++) {
+ if (i != iSelf && glm::distance(pos[iSelf], pos[i]) < rule2Distance) {
+ finalDir2 -= (pos[i] - pos[iSelf]);
+ }
+ }
+
+ finalDir2 = finalDir2 * rule2Scale;
+
+ glm::vec3 finalVel(0.0f, 0.0f, 0.0f);
+ numOfNeighbours = 0;
+ for (int i = 0; i < N; i++) {
+ if (i != iSelf && glm::distance(pos[iSelf], pos[i]) < rule3Distance) {
+ finalVel += vel[i];
+ numOfNeighbours += 1;
+ }
+ }
+
+ if (numOfNeighbours != 0) {
+ finalVel /= numOfNeighbours;
+ finalVel = finalVel * rule3Scale;
+ }
+
+ finalVel = finalVel + finalDir + finalDir2;
+
+ return finalVel;
}
/**
* TODO-1.2 implement basic flocking
* For each of the `N` bodies, update its position based on its current velocity.
*/
-__global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
- glm::vec3 *vel1, glm::vec3 *vel2) {
+__global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
// Compute a new velocity based on pos and vel1
// Clamp the speed
// Record the new velocity into vel2. Question: why NOT vel1?
+
+ int boidId = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (boidId >= N) return;
+
+ glm::vec3 finalVel = computeVelocityChange(N, boidId, pos, vel1);
+ vel2[boidId] = glm::clamp(vel1[boidId] + finalVel, -maxSpeed, maxSpeed);
}
/**
@@ -285,10 +361,21 @@ __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) {
__global__ void kernComputeIndices(int N, int gridResolution,
glm::vec3 gridMin, float inverseCellWidth,
glm::vec3 *pos, int *indices, int *gridIndices) {
- // TODO-2.1
- // - Label each boid with the index of its grid cell.
- // - Set up a parallel array of integer indices as pointers to the actual
- // boid data in pos and vel1/vel2
+ // TODO-2.1
+ //-Label each boid with the index of its grid cell.
+ //-Set up a parallel array of integer indices as pointers to the actual
+ // boid data in pos and vel1/vel2
+
+ int boidId = threadIdx.x + (blockIdx.x * blockDim.x);
+
+ if (boidId >= N) return;
+
+ glm::vec3 grid3d = glm::floor((pos[boidId] - gridMin) * inverseCellWidth);
+
+ int gridId = gridIndex3Dto1D((int) grid3d.x, (int) grid3d.y, (int) grid3d.z, gridResolution);
+
+ indices[boidId] = boidId;
+ gridIndices[boidId] = gridId;
}
// LOOK-2.1 Consider how this could be useful for indicating that a cell
@@ -301,13 +388,30 @@ __global__ void kernResetIntBuffer(int N, int *intBuffer, int value) {
}
__global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
- int *gridCellStartIndices, int *gridCellEndIndices) {
- // TODO-2.1
- // Identify the start point of each cell in the gridIndices array.
- // This is basically a parallel unrolling of a loop that goes
- // "this index doesn't match the one before it, must be a new cell!"
+ int *gridCellStartIndices, int *gridCellEndIndices) {
+ // TODO-2.1
+ // Identify the start point of each cell in the gridIndices array.
+ // This is basically a parallel unrolling of a loop that goes
+ // "this index doesn't match the one before it, must be a new cell!"
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (index >= N) return;
+
+  // The first boid always starts a new cell; otherwise, a boid starts a new cell
+  // whenever its grid index differs from that of the previous (sorted) boid.
+  if (index == 0) {
+    gridCellStartIndices[particleGridIndices[0]] = 0;
+  }
+  else if (particleGridIndices[index] != particleGridIndices[index - 1]) {
+    gridCellEndIndices[particleGridIndices[index - 1]] = index - 1;
+    gridCellStartIndices[particleGridIndices[index]] = index;
+  }
+  // The last boid always closes its cell (this can coincide with starting a new one).
+  if (index == N - 1) {
+    gridCellEndIndices[particleGridIndices[N - 1]] = N - 1;
+  }
+
}
+
__global__ void kernUpdateVelNeighborSearchScattered(
int N, int gridResolution, glm::vec3 gridMin,
float inverseCellWidth, float cellWidth,
@@ -322,8 +426,83 @@ __global__ void kernUpdateVelNeighborSearchScattered(
// - Access each boid in the cell and compute velocity change from
// the boids rules, if this boid is within the neighborhood distance.
// - Clamp the speed change before putting the new speed in vel2
+
+ int iSelf = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (iSelf >= N) return;
+
+ int countN1 = 0, countN3 = 0;
+
+ glm::vec3 finalVel(0.0f, 0.0f, 0.0f);
+ glm::vec3 finalDir(0.0f, 0.0f, 0.0f);
+ glm::vec3 finalCenter(0.0f, 0.0f, 0.0f);
+
+  float maxDistance = fmaxf(fmaxf(rule1Distance, rule2Distance), rule3Distance); // float max (imax would truncate)
+
+ glm::vec3 grid3d = glm::floor((pos[iSelf] - gridMin) * inverseCellWidth);
+ glm::vec3 grid3d_min = glm::floor((pos[iSelf] - gridMin - maxDistance) * inverseCellWidth);
+ glm::vec3 grid3d_max = glm::floor((pos[iSelf] - gridMin + maxDistance) * inverseCellWidth);
+
+ int x_g = (int)grid3d.x, y_g = (int)grid3d.y, z_g = (int)grid3d.z;
+ int x_g_min = (int)grid3d_min.x, y_g_min = (int)grid3d_min.y, z_g_min = (int)grid3d_min.z;
+ int x_g_max = (int)grid3d_max.x, y_g_max = (int)grid3d_max.y, z_g_max = (int)grid3d_max.z;
+
+ x_g_max = imin(x_g_max, gridResolution - 1), y_g_max = imin(y_g_max, gridResolution - 1), z_g_max = imin(z_g_max, gridResolution - 1);
+ x_g_min = imax(x_g_min, 0), y_g_min = imax(y_g_min, 0), z_g_min = imax(z_g_min, 0);
+
+
+
+ for (int x = x_g_min; x <= x_g_max; x++) {
+ for (int y = y_g_min; y <= y_g_max; y++) {
+ for (int z = z_g_min; z <= z_g_max; z++) {
+ if (x >= 0 && y >= 0 && z >= 0 && x < gridResolution && y < gridResolution && z < gridResolution) {
+ int gridId = gridIndex3Dto1D(x, y, z, gridResolution);
+
+ int start = gridCellStartIndices[gridId];
+ int end = gridCellEndIndices[gridId];
+ if (start < 0 || start >= N || end < 0 || end >= N) continue;
+
+ for (int idx = start; idx <= end; idx++) {
+ int bid = particleArrayIndices[idx];
+
+ if (bid != iSelf && glm::distance(pos[iSelf], pos[bid]) < rule1Distance) {
+ finalCenter += pos[bid];
+ countN1 += 1;
+ }
+
+ if (bid != iSelf && glm::distance(pos[iSelf], pos[bid]) < rule2Distance) {
+ finalDir -= (pos[bid] - pos[iSelf]);
+ }
+
+ if (bid != iSelf && glm::distance(pos[iSelf], pos[bid]) < rule3Distance) {
+ finalVel += vel1[bid];
+ countN3 += 1;
+ }
+ }
+ }
+ }
+ }
+ }
+
+ if (countN1 != 0) {
+ finalCenter /= countN1;
+ finalCenter = (finalCenter - pos[iSelf]) * rule1Scale;
+ }
+
+ finalDir = finalDir * rule2Scale;
+
+ if (countN3 != 0) {
+ finalVel /= countN3;
+ finalVel = finalVel * rule3Scale;
+ }
+
+ glm::vec3 resultVel = finalCenter + finalDir + finalVel;
+ vel2[iSelf] = glm::clamp(vel1[iSelf] + resultVel, -maxSpeed, maxSpeed);
}
+
+
__global__ void kernUpdateVelNeighborSearchCoherent(
int N, int gridResolution, glm::vec3 gridMin,
float inverseCellWidth, float cellWidth,
@@ -341,16 +520,134 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
// - Access each boid in the cell and compute velocity change from
// the boids rules, if this boid is within the neighborhood distance.
// - Clamp the speed change before putting the new speed in vel2
+
+ int iSelf = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (iSelf >= N) return;
+
+ int countN1 = 0, countN3 = 0;
+
+ glm::vec3 finalVel(0.0f, 0.0f, 0.0f);
+ glm::vec3 finalDir(0.0f, 0.0f, 0.0f);
+ glm::vec3 finalCenter(0.0f, 0.0f, 0.0f);
+
+  float maxDistance = fmaxf(fmaxf(rule1Distance, rule2Distance), rule3Distance); // float max (imax would truncate)
+
+ glm::vec3 grid3d = glm::floor((pos[iSelf] - gridMin) * inverseCellWidth); // cell coordinates of the boid
+ glm::vec3 grid3d_min = glm::floor((pos[iSelf] - gridMin - maxDistance) * inverseCellWidth);
+ glm::vec3 grid3d_max = glm::floor((pos[iSelf] - gridMin + maxDistance) * inverseCellWidth);
+
+ int x_g = (int)grid3d.x, y_g = (int)grid3d.y, z_g = (int)grid3d.z;
+ int x_g_min = (int)grid3d_min.x, y_g_min = (int)grid3d_min.y, z_g_min = (int)grid3d_min.z;
+ int x_g_max = (int)grid3d_max.x, y_g_max = (int)grid3d_max.y, z_g_max = (int)grid3d_max.z;
+
+ x_g_max = imin(x_g_max, gridResolution - 1), y_g_max = imin(y_g_max, gridResolution - 1), z_g_max = imin(z_g_max, gridResolution - 1);
+ x_g_min = imax(x_g_min, 0), y_g_min = imax(y_g_min, 0), z_g_min = imax(z_g_min, 0);
+
+ for (int x = x_g_min; x <= x_g_max; x++) {
+ for (int y = y_g_min; y <= y_g_max; y++) {
+ for (int z = z_g_min; z <= z_g_max; z++) {
+ if (x >= 0 && y >= 0 && z >= 0 && x < gridResolution && y < gridResolution && z < gridResolution) {
+ int gridId = gridIndex3Dto1D(x, y, z, gridResolution);
+
+ int start = gridCellStartIndices[gridId];
+ int end = gridCellEndIndices[gridId];
+ if (start < 0 || start >= N || end < 0 || end >= N) continue;
+
+ for (int idx = start; idx <= end; idx++) {
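+            // unlike the scattered search, pos/vel here are already sorted by cell,
+            // so idx indexes the boid data directly (no particleArrayIndices lookup)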
+ int bid = idx;
+
+ if (bid != iSelf && glm::distance(pos[iSelf], pos[bid]) < rule1Distance) {
+ finalCenter += pos[bid];
+ countN1 += 1;
+ }
+
+ if (bid != iSelf && glm::distance(pos[iSelf], pos[bid]) < rule2Distance) {
+ finalDir -= (pos[bid] - pos[iSelf]);
+ }
+
+ if (bid != iSelf && glm::distance(pos[iSelf], pos[bid]) < rule3Distance) {
+ finalVel += vel1[bid];
+ countN3 += 1;
+ }
+ }
+ }
+ }
+ }
+ }
+
+ if (countN1 != 0) {
+ finalCenter /= countN1;
+ finalCenter = (finalCenter - pos[iSelf]) * rule1Scale;
+ }
+
+ finalDir = finalDir * rule2Scale;
+
+ if (countN3 != 0) {
+ finalVel /= countN3;
+ finalVel = finalVel * rule3Scale;
+ }
+
+ glm::vec3 resultVel = finalCenter + finalDir + finalVel;
+ vel2[iSelf] = glm::clamp(vel1[iSelf] + resultVel, -maxSpeed, maxSpeed);
+}
+
+
+__global__ void kernShufflePosAndVel(int N, glm::vec3 *pos, glm::vec3 *vel, glm::vec3 *shuffledPos,
+ glm::vec3 *shuffledVel, int *particleArrayIndices) {
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (index >= N) return;
+
+ shuffledPos[index] = pos[particleArrayIndices[index]]; // now shuffledPos[index] is the position of boid at particleArrayIndices[i]
+ shuffledVel[index] = vel[particleArrayIndices[index]];
}
+__global__ void kernUnShuffleVel(int N, glm::vec3 *vel ,glm::vec3 *shuffledVel, int *particleArrayIndices) {
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (index >= N) return;
+
+ vel[particleArrayIndices[index]] = shuffledVel[index];
+}
+
+
/**
* Step the entire N-body simulation by `dt` seconds.
*/
void Boids::stepSimulationNaive(float dt) {
// TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
// TODO-1.2 ping-pong the velocity buffers
+
+ cudaEvent_t start, stop;
+ cudaEventCreate(&start);
+ cudaEventCreate(&stop);
+
+ cudaEventRecord(start);
+
+ dim3 numBlocks((numObjects + blockSize - 1) / blockSize);
+  kernUpdateVelocityBruteForce<<<numBlocks, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2);
+
+ checkCUDAErrorWithLine("Velocity update failed");
+
+  kernUpdatePos<<<numBlocks, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+
+ checkCUDAErrorWithLine("Position update failed");
+
+  // ping-pong the velocity buffers so dev_vel1 holds the latest velocities
+  glm::vec3 *temp = dev_vel1;
+  dev_vel1 = dev_vel2;
+  dev_vel2 = temp;
+
+ cudaEventRecord(stop);
+ cudaEventSynchronize(stop);
+ float milliseconds = 0;
+ cudaEventElapsedTime(&milliseconds, start, stop);
+ avgTimeTaken += milliseconds;
+ if (countTimeTaken == 100){
+ avgTimeTaken /= countTimeTaken;
+	  printf("Average time per simulation step: %.2f ms\n", avgTimeTaken);
+ }
+ countTimeTaken++;
}
+
+
+
void Boids::stepSimulationScatteredGrid(float dt) {
// TODO-2.1
// Uniform Grid Neighbor search using Thrust sort.
@@ -364,6 +661,54 @@ void Boids::stepSimulationScatteredGrid(float dt) {
// - Perform velocity updates using neighbor search
// - Update positions
// - Ping-pong buffers as needed
+
+ dim3 numBlocks((numObjects + blockSize - 1) / blockSize);
+ int N = numObjects;
+
+ cudaEvent_t start, stop;
+ cudaEventCreate(&start);
+ cudaEventCreate(&stop);
+
+ cudaEventRecord(start);
+
+  // enough blocks to cover every grid cell (gridCellCount can differ from numObjects)
+  dim3 numBlocksCells((gridCellCount + blockSize - 1) / blockSize);
+
+  kernResetIntBuffer<<<numBlocksCells, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+  checkCUDAErrorWithLine("Cell start reset failed");
+  kernResetIntBuffer<<<numBlocksCells, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1);
+  checkCUDAErrorWithLine("Cell end reset failed");
+
+  kernComputeIndices<<<numBlocks, blockSize>>>(N, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+ checkCUDAErrorWithLine("Compute indices failed");
+
+ thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + N, dev_thrust_particleArrayIndices);
+
+  kernIdentifyCellStartEnd<<<numBlocks, blockSize>>>(N, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+ checkCUDAErrorWithLine("Cell start/end failed");
+
+  kernUpdateVelNeighborSearchScattered<<<numBlocks, blockSize>>>(N, gridSideCount, gridMinimum,
+ gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices,
+ dev_gridCellEndIndices, dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2);
+ checkCUDAErrorWithLine("Velocity update failed");
+
+  kernUpdatePos<<<numBlocks, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+ checkCUDAErrorWithLine("Position update failed");
+
+  // ping-pong the velocity buffers so dev_vel1 holds the latest velocities
+  glm::vec3 *temp = dev_vel1;
+  dev_vel1 = dev_vel2;
+  dev_vel2 = temp;
+
+ cudaEventRecord(stop);
+ cudaEventSynchronize(stop);
+ float milliseconds = 0;
+ cudaEventElapsedTime(&milliseconds, start, stop);
+ avgTimeTaken += milliseconds;
+ if (countTimeTaken == 1000) {
+ avgTimeTaken /= countTimeTaken;
+	  printf("Average time per simulation step: %.2f ms\n", avgTimeTaken);
+ }
+ countTimeTaken++;
}
void Boids::stepSimulationCoherentGrid(float dt) {
@@ -382,6 +727,57 @@ void Boids::stepSimulationCoherentGrid(float dt) {
// - Perform velocity updates using neighbor search
// - Update positions
// - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+
+
+ cudaEvent_t start, stop;
+ cudaEventCreate(&start);
+ cudaEventCreate(&stop);
+
+ cudaEventRecord(start);
+
+ dim3 numBlocks((numObjects + blockSize - 1) / blockSize);
+ int N = numObjects;
+
+  // enough blocks to cover every grid cell (gridCellCount can differ from numObjects)
+  dim3 numBlocksCells((gridCellCount + blockSize - 1) / blockSize);
+
+  kernResetIntBuffer<<<numBlocksCells, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+  checkCUDAErrorWithLine("Cell start reset failed");
+  kernResetIntBuffer<<<numBlocksCells, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1);
+  checkCUDAErrorWithLine("Cell end reset failed");
+
+  kernComputeIndices<<<numBlocks, blockSize>>>(N, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+ checkCUDAErrorWithLine("Compute indices failed");
+
+ thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + N, dev_thrust_particleArrayIndices);
+
+  kernIdentifyCellStartEnd<<<numBlocks, blockSize>>>(N, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+ checkCUDAErrorWithLine("Cell start/end failed");
+
+ // shuffle position and velocity
+  kernShufflePosAndVel<<<numBlocks, blockSize>>>(N, dev_pos, dev_vel1, dev_shuffledPos, dev_shuffledVel, dev_particleArrayIndices);
+
+  kernUpdateVelNeighborSearchCoherent<<<numBlocks, blockSize>>>(N, gridSideCount, gridMinimum,
+ gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices,
+ dev_gridCellEndIndices, dev_shuffledPos, dev_shuffledVel, dev_vel2);
+ checkCUDAErrorWithLine("Velocity update failed");
+
+ // unshuffle velocity and put it back in dev_vel1
+  kernUnShuffleVel<<<numBlocks, blockSize>>>(N, dev_vel1, dev_vel2, dev_particleArrayIndices);
+
+  kernUpdatePos<<<numBlocks, blockSize>>>(numObjects, dt, dev_pos, dev_vel1);
+ checkCUDAErrorWithLine("Position update failed");
+
+ cudaEventRecord(stop);
+ cudaEventSynchronize(stop);
+ float milliseconds = 0;
+ cudaEventElapsedTime(&milliseconds, start, stop);
+ avgTimeTaken += milliseconds;
+ if (countTimeTaken == 1000) {
+ avgTimeTaken /= countTimeTaken;
+	  printf("Average time per simulation step: %.2f ms\n", avgTimeTaken);
+ }
+ countTimeTaken += 1;
}
void Boids::endSimulation() {
@@ -390,6 +786,13 @@ void Boids::endSimulation() {
cudaFree(dev_pos);
// TODO-2.1 TODO-2.3 - Free any additional buffers here.
+ cudaFree(dev_particleArrayIndices);
+ cudaFree(dev_particleGridIndices);
+ cudaFree(dev_gridCellStartIndices);
+ cudaFree(dev_gridCellEndIndices);
+
+ cudaFree(dev_shuffledPos);
+ cudaFree(dev_shuffledVel);
}
void Boids::unitTest() {
diff --git a/src/main.cpp b/src/main.cpp
index b82c8c6..fafcc7b 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -14,11 +14,11 @@
// LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
#define VISUALIZE 1
-#define UNIFORM_GRID 0
-#define COHERENT_GRID 0
+#define UNIFORM_GRID 1
+#define COHERENT_GRID 1
// LOOK-1.2 - change this to adjust particle count in the simulation
-const int N_FOR_VIS = 5000;
+const int N_FOR_VIS = 50000;
const float DT = 0.2f;
/**