diff --git a/README.md b/README.md
index d63a6a1..57d690f 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,155 @@
+
+# Project 1 - Flocking
+
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking**
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Klayton Wittler
+  * [LinkedIn](https://www.linkedin.com/in/klayton-wittler/)
+* Tested on: Windows 10 Pro, i7-7700K @ 4.20GHz 16GB, GTX 1070 8GB (my PC)
+
+## Sections
+
+* [Introduction](#introduction-to-boid-flocking)
+  * [Rules](#rules)
+  * [Approaches](#approaches)
+* [Performance Analysis](#performance-analysis)
+  * [Questions](#questions)
+* [Additional Optimization](#additional-optimization)
+
+![coherent 50k simulation](images/simulation_50k.gif)
+
+# Introduction to Boid Flocking
+This project implements the Reynolds boids algorithm in CUDA kernels, along with several layers of neighbor-search optimization: a uniform grid and a uniform grid with coherent memory access.
+
+![naive screenshot with 5k boids](images/boid_screenshot_5k.png)
+
+## Rules
+The algorithm follows three rules for simulating boid flocking:
+
+1. cohesion - boids move towards the perceived center of mass of their neighbors
+2. separation - boids avoid getting too close to their neighbors
+3. alignment - boids generally try to move with the same direction and speed as their neighbors
+
+Pseudocode for the algorithm goes as follows.
+```
+function velocityChange(Boid boid)
+    vector cohesive_velocity = 0
+    vector avoidance_velocity = 0
+    vector alignment_velocity = 0
+
+    float cohesive_neighbors = 0
+    float alignment_neighbors = 0
+
+    for each Boid b to check
+        if b != boid then
+            if distance(b, boid) < rule1Distance then
+                cohesive_velocity += b.position
+                ++cohesive_neighbors
+            end
+            if distance(b, boid) < rule2Distance then
+                avoidance_velocity += (boid.position - b.position)
+            end
+            if distance(b, boid) < rule3Distance then
+                alignment_velocity += b.velocity
+                ++alignment_neighbors
+            end
+        end
+    end
+
+    if alignment_neighbors > 0 then
+        alignment_velocity /= alignment_neighbors
+    end
+    if cohesive_neighbors > 0 then
+        cohesive_velocity /= cohesive_neighbors
+    else
+        cohesive_velocity = boid.position
+    end
+
+    vector rule1Change = (cohesive_velocity - boid.position) * rule1Scale
+    vector rule2Change = avoidance_velocity * rule2Scale
+    vector rule3Change = alignment_velocity * rule3Scale
+
+    return rule1Change + rule2Change + rule3Change
+end
+```
+
+## Approaches
+Each boid must search around itself and compute its velocity change from the boids that fall within each of the rule distances.
+
+### Naive
+The naive approach is for each boid to compare itself against every other boid. This quickly gets computationally expensive, as shown in the [Performance Analysis](#performance-analysis) section.
+
+### Uniform Grid
+To speed up the search we can take advantage of a uniform spatial grid data structure. As seen in 2D below, the environment is divided into cells, and each boid is binned into a cell during a preprocessing step.
+
+![a uniform grid in 2D](images/Boids%20Ugrid%20base.png)
+
+To narrow the search we only visit the cells within a boid's search radius and compare against the boids in those cells. For the 2D case where the cell width is twice the maximum rule distance, we only have to search 4 cells; in 3D this would be 8 cells. If the cell width equals the maximum rule distance, we would have to search 9 cells in the 2D case and 27 cells in the 3D case.
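+
+Concretely, binning is only a few lines per boid. As a rough sketch, this is a condensed version of `kernComputeIndices` from `src/kernel.cu` (shown in full in the kernel diff below), with the `gridIndex3Dto1D` helper inlined:
+
+```cuda
+__global__ void kernComputeIndices(int N, int gridResolution,
+    glm::vec3 gridMin, float inverseCellWidth,
+    glm::vec3 *pos, int *indices, int *gridIndices) {
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) return;
+  // Which cell does this boid fall into?
+  glm::ivec3 cell = glm::floor((pos[index] - gridMin) * inverseCellWidth);
+  // Flatten to a 1D cell index: x varies fastest in memory.
+  gridIndices[index] = cell.x + cell.y * gridResolution
+                     + cell.z * gridResolution * gridResolution;
+  // Remember which boid this entry refers to, for the sort that follows.
+  indices[index] = index;
+}
+```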
+
+![a uniform grid in 2D with neighborhood and cells to search for some particles shown](images/Boids%20Ugrid%20neighbor%20search%20shown.png)
+
+To keep track of cell membership we use an array of boid indices sorted with grid cell index as the key; the start and end of each cell's range in that array are marked in separate start/end buffers.
+
+![buffers for generating a uniform grid using index sort](images/Boids%20Ugrids%20buffers%20naive.png)
+
+### Coherent Memory
+We can further optimize by also sorting the arrays that contain the position and velocity data, so that all the members of a cell are contiguous in memory when we look into that cell.
+
+![buffers for generating a uniform grid using index sort, then making the boid data coherent](images/Boids%20Ugrids%20buffers%20data%20coherent.png)
+
+# Performance Analysis
+For the performance analysis, vertical sync is turned off so that the frame rate is not capped at the monitor's refresh rate, and the frames per second (fps) of each simulation are recorded.
+
+Initially visualization is turned off for a cleaner comparison of algorithm run times without the display code running. The figure below shows all approaches at two cell widths: twice the maximum rule distance (an 8-cell search) and equal to the maximum rule distance (a 27-cell search). Cell-width variations were not included for the naive approach since it does not divide the simulation into a grid. Below 20k boids the 27-cell search performs better than the 8-cell search, and it becomes better again above 100k boids. Sorting for coherent memory access also gives a fairly consistent performance increase.
+
+![boids noVisual](images/num_boids_noVisual.png)
+
+However, watching the boids flock is the fun part. The figure below shows all approaches with visualization on: the general trends are the same as without visuals, with some performance decrease from rendering.
+
+![boids visual](images/num_boids_visual.png)
+
+Similarly, we can see which block size gives the best performance. The algorithms in the figure below were run without visualization for the most consistent baseline, at 20k boids only. The differences between thread counts per block are negligible. Note that only multiples of 32 were tried, since the warp size is 32 and any other count would waste threads.
+
+![blocks noVisual](images/block_size_noVisual.png)
+
+## Questions
+* For each implementation, how does changing the number of boids affect
+performance? Why do you think this is?
+
+For the naive case, performance drops quadratically due to the ![](images/OnSquared.png) time complexity: each boid has to check itself against every other boid. The uniform grid and coherent uniform grid searches instead degrade in piecewise-linear segments. The cost of finding neighboring cells is constant with respect to the number of boids, but the checks against the boids inside those cells are still ![](images/OnSquared.png) in the cell population. As the total number of boids increases, the number of boids per cell increases, ultimately resulting in significant increases in computation.
+
+* For each implementation, how does changing the block count and block size
+affect performance? Why do you think this is?
+
+Changing the number of threads per block had a negligible performance impact across algorithms. This is because the block count scales inversely with the block size: if the threads per block increase, the number of blocks decreases by the same factor, and vice versa, so the total number of threads launched stays the same. Since only multiples of 32 were tried, there were no wasted threads to impact performance.
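+
+This inverse scaling comes straight from the launch configuration used throughout `src/kernel.cu`:
+
+```cuda
+// Block count scales inversely with block size, so roughly numObjects
+// threads are launched regardless of the block size chosen.
+dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+```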
+
+* For the coherent uniform grid: did you experience any performance improvements
+with the more coherent uniform grid? Was this the outcome you expected?
+Why or why not?
+
+There was a fairly constant performance increase from having a coherent grid. This was expected, since sorting typically becomes worth it as the number of lookups grows: the n per-boid indirections through particleArrayIndices to find each boid's correct index are removed by a preprocessing sort step.
+
+* Did changing cell width and checking 27 vs 8 neighboring cells affect performance?
+Why or why not? Be careful: it is insufficient (and possibly incorrect) to say
+that 27-cell is slower simply because there are more cells to check!
+
+Surprisingly, checking 27 cells versus 8 cells did not always result in worse performance. This is most likely due to the higher resolution that 27 smaller cells give: the searched volume fits the neighborhood more tightly, so fewer far-away boids are checked at the constant cost of visiting more cells. But past a threshold in boid count the resolution no longer provides a benefit, since there is a high probability that boids occupy all of the nearby cells anyway.
+
+# Additional Optimization
+
+Beyond the coherent uniform grid, there are some other ways to optimize performance.
+
+![coherent 200k simulation](images/simulation_200k.gif)
+
+Initially the uniform grid searches (coherent and not) iterated over the x-axis in the outer loop, the y-axis next, and the z-axis in the inner loop. Since the objective of the coherent grid is to keep nearby boids adjacent in memory, this order is poor: the index equation is ```x + y * gridResolution + z * gridResolution * gridResolution```, so consecutive x values are adjacent in memory and x should vary fastest. This led to reversing the loop order; the performance differences can be seen below. Surprisingly, at 20k and 50k boids this does not help. The comparison was not made for the scattered uniform grid search since, without sorted memory, the loop order would have no impact.
+
+![](images/XYZ_ZYXloop.png)
+
+Finally, to further trim the number of cells to search, I used the fact that an n-dimensional box can be described as ![](images/nDbox.png), where b is half the side length, c is the center of the box, and x is the point in question. With this in mind we can take a candidate cell, find its center c, grow the half-width b by the maximum rule distance, and check whether the boid x would belong to this enlarged cell: ![](images/nDbox_new.png). If so, the original cell may contain boids close enough to the current boid to affect its velocity. If not, the cell can be excluded from the search even though it is a neighbor. See cases 1 and 2 below.
+
+![](images/optimization_example.png)
-### (TODO: Your README)
+
+This seemed great in theory; however, the performance data below does not suggest it helps. The approach most likely beats comparing the boid against all 8 vertices of the cell, and there might be ways to speed up the geometric comparison even more. A flag at the top of ```src/kernel.cu``` called ```OPTIMIZE_SEARCH``` switches between the two: 0 is the normal search and 1 the 'optimized' one.
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
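+
+In code, the trimming reduces to a per-cell box test: every component of |x - c| must be at most b on the enlarged box. This is a condensed sketch of the ```boidNearCell``` helper from ```src/kernel.cu```:
+
+```cuda
+__device__ bool boidNearCell(float cellWidth, int x, int y, int z,
+    glm::vec3 gridMin, glm::vec3 boid, float radius) {
+  // Center c of cell (x, y, z) in world space.
+  glm::vec3 cellCenter = gridMin + (glm::vec3(x, y, z) + 0.5f) * cellWidth;
+  // Half-width b of the cell, grown by the neighborhood radius.
+  glm::vec3 b(radius + 0.5f * cellWidth);
+  // Inside the enlarged box <=> |boid - c| <= b componentwise.
+  return glm::all(glm::lessThanEqual(glm::abs(boid - cellCenter), b));
+}
+```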
+
+![](images/optimized_noVisual.png)
diff --git a/images/OnSquared.png b/images/OnSquared.png
new file mode 100644
index 0000000..75fb15f
Binary files /dev/null and b/images/OnSquared.png differ
diff --git a/images/XYZ_ZYXloop.png b/images/XYZ_ZYXloop.png
new file mode 100644
index 0000000..0145569
Binary files /dev/null and b/images/XYZ_ZYXloop.png differ
diff --git a/images/block_size_noVisual.png b/images/block_size_noVisual.png
new file mode 100644
index 0000000..cc8658b
Binary files /dev/null and b/images/block_size_noVisual.png differ
diff --git a/images/boid_screenshot_5k.png b/images/boid_screenshot_5k.png
new file mode 100644
index 0000000..1c45af4
Binary files /dev/null and b/images/boid_screenshot_5k.png differ
diff --git a/images/boid_screenshot_5k_zoom.png b/images/boid_screenshot_5k_zoom.png
new file mode 100644
index 0000000..83b16ca
Binary files /dev/null and b/images/boid_screenshot_5k_zoom.png differ
diff --git a/images/data/dataCollection.gsheet b/images/data/dataCollection.gsheet
new file mode 100644
index 0000000..2850759
--- /dev/null
+++ b/images/data/dataCollection.gsheet
@@ -0,0 +1 @@
+{"url": "https://docs.google.com/open?id=126_yFb2Q7nsmhk6HTyH-JrK7M1p4PByM4bEGDVSkAHA", "doc_id": "126_yFb2Q7nsmhk6HTyH-JrK7M1p4PByM4bEGDVSkAHA", "email": "kwittler@seas.upenn.edu"}
\ No newline at end of file
diff --git a/images/data/gridOptimization.gslides b/images/data/gridOptimization.gslides
new file mode 100644
index 0000000..85756b6
--- /dev/null
+++ b/images/data/gridOptimization.gslides
@@ -0,0 +1 @@
+{"url": "https://docs.google.com/open?id=1dNgsvgqCpJx8zPR-pVvHuhkoRXrmJ7zo9qibJKh3glM", "doc_id": "1dNgsvgqCpJx8zPR-pVvHuhkoRXrmJ7zo9qibJKh3glM", "email": "kwittler@seas.upenn.edu"}
\ No newline at end of file
diff --git a/images/nDbox.png b/images/nDbox.png
new file mode 100644
index 0000000..3dd0d14
Binary files /dev/null and b/images/nDbox.png differ
diff --git a/images/nDbox_new.png b/images/nDbox_new.png
new file mode 100644
index 0000000..28108e6
Binary files /dev/null and b/images/nDbox_new.png differ
diff --git a/images/num_boids_combined.png b/images/num_boids_combined.png
new file mode 100644
index 0000000..3febeb1
Binary files /dev/null and b/images/num_boids_combined.png differ
diff --git a/images/num_boids_noVisual.png b/images/num_boids_noVisual.png
new file mode 100644
index 0000000..6f5e2c9
Binary files /dev/null and b/images/num_boids_noVisual.png differ
diff --git a/images/num_boids_visual.png b/images/num_boids_visual.png
new file mode 100644
index 0000000..ba6dcc1
Binary files /dev/null and b/images/num_boids_visual.png differ
diff --git a/images/optimization_example.png b/images/optimization_example.png
new file mode 100644
index 0000000..a335681
Binary files /dev/null and b/images/optimization_example.png differ
diff --git a/images/optimized_noVisual.png b/images/optimized_noVisual.png
new file mode 100644
index 0000000..096c79f
Binary files /dev/null and b/images/optimized_noVisual.png differ
diff --git a/images/simulation_100k.gif b/images/simulation_100k.gif
new file mode 100644
index 0000000..4cee5da
Binary files /dev/null and b/images/simulation_100k.gif differ
diff --git a/images/simulation_200k.gif b/images/simulation_200k.gif
new file mode 100644
index 0000000..9118f66
Binary files /dev/null and b/images/simulation_200k.gif differ
diff --git a/images/simulation_50k.gif b/images/simulation_50k.gif
new file mode 100644
index 0000000..efaacb8
Binary files /dev/null and b/images/simulation_50k.gif differ
diff --git a/src/kernel.cu b/src/kernel.cu
index 74dffcb..14bb503 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -54,6 +54,29 @@ void checkCUDAError(const char *msg, int line = -1) {
 /*! Size of the starting area in simulation space. */
 #define scene_scale 100.0f
 
+/*****************************
+* Self-defined configuration *
+******************************/
+
+// When 1, call boidNearCell to trim cells from the neighbor search.
+#define OPTIMIZE_SEARCH 0
+// When 1, clamp each velocity component with glm::clamp instead of
+// rescaling the velocity vector to maxSpeed.
+#define GLM_CLAMP 0
+// Cell width as a multiple of the maximum rule distance
+// (2.0 -> 8-cell search, 1.0 -> 27-cell search).
+float nXradius = 1.0f;
+
+// Helper to view glm::vec3 contents in the debugger.
+struct DebugVector {
+  float x;
+  float y;
+  float z;
+};
+
+__device__ DebugVector debugVectorViewer(glm::vec3 v) {
+  return { v.x, v.y, v.z };
+}
+
+#define glmvec3(_name, _x, _y, _z) glm::vec3 _name(_x, _y, _z); \
+  DebugVector _name##_ = debugVectorViewer(_name);
+
 /***********************************************
 * Kernel state (pointers are device pointers) *
 ***********************************************/
@@ -85,6 +108,8 @@ int *dev_gridCellEndIndices;   // to this cell?
 
 // TODO-2.3 - consider what additional buffers you might need to reshuffle
 // the position and velocity data to be coherent within cells.
+glm::vec3 *dev_posSorted;
+glm::vec3 *dev_velSorted;
 
 // LOOK-2.1 - Grid parameters based on simulation parameters.
 // These are automatically computed for you in Boids::initSimulation
@@ -157,7 +182,7 @@ void Boids::initSimulation(int N) {
   checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!");
 
   // LOOK-2.1 computing grid params
-  gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
+  gridCellWidth = nXradius * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
   int halfSideCount = (int)(scene_scale / gridCellWidth) + 1;
   gridSideCount = 2 * halfSideCount;
@@ -168,7 +193,32 @@ void Boids::initSimulation(int N) {
   gridMinimum.y -= halfGridWidth;
   gridMinimum.z -= halfGridWidth;
 
-  // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+  // TODO-2.1 - Allocate additional buffers here.
+
+  // For efficient sorting and the uniform grid. These should always be parallel.
+  cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int)); // What index in dev_pos and dev_velX represents this particle?
+  checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+  cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int)); // What grid cell is this particle in?
+  checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+  // Thrust wrappers around the raw device pointers, needed for thrust::sort_by_key.
+  dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+  dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+
+  cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int)); // What part of dev_particleArrayIndices belongs
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int)); // to this cell?
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+  // TODO-2.3 - Allocate additional buffers here.
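+  // dev_posSorted / dev_velSorted hold the position and velocity data
+  // reshuffled into grid-cell order for the coherent-grid path.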
+  cudaMalloc((void**)&dev_posSorted, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_posSorted failed!");
+
+  cudaMalloc((void**)&dev_velSorted, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_velSorted failed!");
+
   cudaDeviceSynchronize();
 }
@@ -229,22 +279,67 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
 * Compute the new velocity on the body with index `iSelf` due to the `N` boids
 * in the `pos` and `vel` arrays.
 */
-__device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) {
+__device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel, int start = 0, int *particleArrayIndices = NULL) {
   // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
   // Rule 2: boids try to stay a distance d away from each other
   // Rule 3: boids try to match the speed of surrounding boids
-  return glm::vec3(0.0f, 0.0f, 0.0f);
+  glm::vec3 iBoid = pos[iSelf];
+  glm::vec3 perceived_com(0.0f);
+  glm::vec3 perceived_dist(0.0f);
+  glm::vec3 perceived_vel(0.0f);
+  int num_com = 0;
+  int num_vel = 0;
+
+  for (int b = start; b < N; b++) {
+    // With the scattered grid, b indexes into particleArrayIndices;
+    // otherwise b is the boid index itself.
+    int iCheck = particleArrayIndices == NULL ? b : particleArrayIndices[b];
+    if (iCheck != iSelf) {
+      glm::vec3 checkBoid = pos[iCheck];
+      float boidDistance = glm::distance(iBoid, checkBoid);
+      if (boidDistance < rule1Distance) {
+        perceived_com += checkBoid;
+        ++num_com;
+      }
+      if (boidDistance < rule2Distance) {
+        perceived_dist -= (checkBoid - iBoid);
+      }
+      if (boidDistance < rule3Distance) {
+        perceived_vel += vel[iCheck];
+        ++num_vel;
+      }
+    }
+  }
+
+  perceived_com = num_com > 0 ? perceived_com / (float)num_com : iBoid;
+  perceived_vel = num_vel > 0 ? perceived_vel / (float)num_vel : glm::vec3(0.0f);
+
+  glm::vec3 result1 = (perceived_com - iBoid) * rule1Scale;
+  glm::vec3 result2 = perceived_dist * rule2Scale;
+  glm::vec3 result3 = perceived_vel * rule3Scale;
+
+  return result1 + result2 + result3;
 }
 
 /**
 * TODO-1.2 implement basic flocking
-* For each of the `N` bodies, update its position based on its current velocity.
+* For each of the `N` bodies, update its velocity based on its current velocity.
 */
 __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
-  glm::vec3 *vel1, glm::vec3 *vel2) {
-  // Compute a new velocity based on pos and vel1
-  // Clamp the speed
-  // Record the new velocity into vel2. Question: why NOT vel1?
+  glm::vec3 *vel1, glm::vec3 *vel2) {
+  // Compute a new velocity based on pos and vel1
+  // Clamp the speed
+  // Record the new velocity into vel2. Question: why NOT vel1?
+  // We must not update velocities based on another boid's *new* velocity,
+  // so write into a separate buffer and ping-pong afterwards.
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
+  glm::vec3 newVel = computeVelocityChange(N, index, pos, vel1) + vel1[index];
+  #if GLM_CLAMP
+  vel2[index] = glm::clamp(newVel, -glm::vec3(maxSpeed), glm::vec3(maxSpeed));
+  #else
+  vel2[index] = glm::length(newVel) > maxSpeed ? glm::normalize(newVel) * maxSpeed : newVel;
+  #endif
 }
 
 /**
@@ -289,6 +384,14 @@ __global__ void kernComputeIndices(int N, int gridResolution,
   // - Label each boid with the index of its grid cell.
   // - Set up a parallel array of integer indices as pointers to the actual
   //   boid data in pos and vel1/vel2
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
+
+  glm::ivec3 gridIndex3D = glm::floor((pos[index] - gridMin) * inverseCellWidth);
+  gridIndices[index] = gridIndex3Dto1D(gridIndex3D.x, gridIndex3D.y, gridIndex3D.z, gridResolution);
+  indices[index] = index;
 }
 
 // LOOK-2.1 Consider how this could be useful for indicating that a cell
@@ -306,6 +409,78 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
   // Identify the start point of each cell in the gridIndices array.
   // This is basically a parallel unrolling of a loop that goes
   // "this index doesn't match the one before it, must be a new cell!"
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
+  if (index == 0 || particleGridIndices[index] != particleGridIndices[index - 1]) {
+    gridCellStartIndices[particleGridIndices[index]] = index;
+  }
+  // The end index is exclusive, hence index + 1.
+  if (index == N - 1 || particleGridIndices[index] != particleGridIndices[index + 1]) {
+    gridCellEndIndices[particleGridIndices[index]] = index + 1;
+  }
+}
+
+__device__ bool boidNearCell(float cellWidth, int x, int y, int z, glm::vec3 gridMin, glm::vec3 boid, float radius) {
+  // Grow the cell's half-width by the neighborhood radius and check whether
+  // the boid falls inside the enlarged box; only then must the cell be searched.
+  glm::vec3 cellCenter = gridMin + (glm::vec3(x, y, z) + 0.5f) * cellWidth;
+  glm::vec3 B(radius + cellWidth / 2.0f);
+  glm::vec3 d = boid - cellCenter;
+
+  return glm::all(glm::lessThanEqual(d, B)) && glm::all(glm::greaterThanEqual(d, -B));
+}
+
+__device__ void kernSearch(int N, int gridResolution, glm::vec3 gridMin,
+  float inverseCellWidth, float cellWidth,
+  int *gridCellStartIndices, int *gridCellEndIndices,
+  glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2, int *particleArrayIndices = NULL) {
+  // Shared neighbor search for both the scattered and coherent grids;
+  // particleArrayIndices == NULL signals the coherent path.
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
+
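+  // In the scattered grid, particleArrayIndices maps this thread's slot to the
+  // boid it owns; in the coherent grid the data is already sorted into cell
+  // order, so the thread index is the boid index.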
+  int iSelf = particleArrayIndices == NULL ? index : particleArrayIndices[index];
+  glm::vec3 boid = pos[iSelf];
+  float radius = glm::max(glm::max(rule1Distance, rule2Distance), rule3Distance);
+
+  glm::ivec3 gridIndex3D = glm::floor((boid - gridMin) * inverseCellWidth);
+  glm::vec3 min_grid3D = glm::floor((boid - gridMin - radius) * inverseCellWidth);
+  glm::vec3 max_grid3D = glm::ceil((boid - gridMin + radius) * inverseCellWidth);
+  glm::ivec3 min_gridIndex3D = glm::clamp(min_grid3D, glm::vec3(0), glm::vec3(gridResolution - 1));
+  glm::ivec3 max_gridIndex3D = glm::clamp(max_grid3D, glm::vec3(0), glm::vec3(gridResolution - 1));
+  int currGrid = gridIndex3Dto1D(gridIndex3D.x, gridIndex3D.y, gridIndex3D.z, gridResolution);
+  int numCells = gridResolution * gridResolution * gridResolution;
+
+  glm::vec3 result(0.0f);
+
+  // x varies fastest to match the memory layout x + y * R + z * R * R.
+  for (int z = min_gridIndex3D.z; z <= max_gridIndex3D.z; z++) {
+    for (int y = min_gridIndex3D.y; y <= max_gridIndex3D.y; y++) {
+      for (int x = min_gridIndex3D.x; x <= max_gridIndex3D.x; x++) {
+        int c = gridIndex3Dto1D(x, y, z, gridResolution);
+        // Skip out-of-range cells and empty cells (start/end still hold the sentinel).
+        if (c >= numCells || gridCellStartIndices[c] > N || gridCellEndIndices[c] > N) { continue; }
+        #if OPTIMIZE_SEARCH
+        if (c == currGrid || boidNearCell(cellWidth, x, y, z, gridMin, boid, radius)) {
+          result += computeVelocityChange(gridCellEndIndices[c], iSelf, pos, vel1, gridCellStartIndices[c], particleArrayIndices);
+        }
+        #else
+        result += computeVelocityChange(gridCellEndIndices[c], iSelf, pos, vel1, gridCellStartIndices[c], particleArrayIndices);
+        #endif
+      }
+    }
+  }
+
+  glm::vec3 newVel = result + vel1[iSelf];
+  #if GLM_CLAMP
+  vel2[iSelf] = glm::clamp(newVel, -glm::vec3(maxSpeed), glm::vec3(maxSpeed));
+  #else
+  vel2[iSelf] = glm::length(newVel) > maxSpeed ? glm::normalize(newVel) * maxSpeed : newVel;
+  #endif
+}
 
 __global__ void kernUpdateVelNeighborSearchScattered(
@@ -322,6 +497,16 @@ __global__ void kernUpdateVelNeighborSearchScattered(
   // - Access each boid in the cell and compute velocity change from
   //   the boids rules, if this boid is within the neighborhood distance.
   // - Clamp the speed change before putting the new speed in vel2
+  kernSearch(N, gridResolution, gridMin, inverseCellWidth, cellWidth, gridCellStartIndices, gridCellEndIndices, pos, vel1, vel2, particleArrayIndices);
+}
+
+__global__ void kernShuffleIndices(int N, int *particleArrayIndices, glm::vec3 *unsortedPos, glm::vec3 *unsortedVel, glm::vec3 *sortedPos, glm::vec3 *sortedVel) {
+  // Gather positions and velocities into cell order so that each cell's
+  // boids are contiguous in memory.
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
+  sortedPos[index] = unsortedPos[particleArrayIndices[index]];
+  sortedVel[index] = unsortedVel[particleArrayIndices[index]];
 }
 
 __global__ void kernUpdateVelNeighborSearchCoherent(
@@ -341,6 +526,7 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
   // - Access each boid in the cell and compute velocity change from
   //   the boids rules, if this boid is within the neighborhood distance.
   // - Clamp the speed change before putting the new speed in vel2
+  kernSearch(N, gridResolution, gridMin, inverseCellWidth, cellWidth, gridCellStartIndices, gridCellEndIndices, pos, vel1, vel2);
 }
 
 /**
@@ -348,40 +534,119 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
 */
 void Boids::stepSimulationNaive(float dt) {
   // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+
+  kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdateVelocityBruteForce failed!");
+
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdatePos failed!");
 
   // TODO-1.2 ping-pong the velocity buffers
+  glm::vec3 *temp = dev_vel1;
+  dev_vel1 = dev_vel2;
+  dev_vel2 = temp;
 }
 
 void Boids::stepSimulationScatteredGrid(float dt) {
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
   // TODO-2.1
   // Uniform Grid Neighbor search using Thrust sort.
   // In Parallel:
   // - label each particle with its array index as well as its grid index.
   //   Use 2x width grids.
+
+  kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum,
+    gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+  checkCUDAErrorWithLine("kernComputeIndices failed!");
+
   // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
   //   are welcome to do a performance comparison.
+
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
   // - Naively unroll the loop for finding the start and end indices of each
   //   cell's data pointers in the array of boid indices
+  // Reset all gridCellCount cells (not just numObjects of them) to an
+  // out-of-range sentinel so empty cells are skipped during the search.
+  dim3 cellBlocksPerGrid((gridCellCount + blockSize - 1) / blockSize);
+  kernResetIntBuffer<<<cellBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellStartIndices, numObjects + 10);
+  checkCUDAErrorWithLine("kernResetIntBuffer on dev_gridCellStartIndices failed!");
+  kernResetIntBuffer<<<cellBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellEndIndices, numObjects + 10);
+  checkCUDAErrorWithLine("kernResetIntBuffer on dev_gridCellEndIndices failed!");
+
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices,
+    dev_gridCellStartIndices, dev_gridCellEndIndices);
+  checkCUDAErrorWithLine("kernIdentifyCellStartEnd failed!");
+
   // - Perform velocity updates using neighbor search
+
+  kernUpdateVelNeighborSearchScattered<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum,
+    gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices,
+    dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdateVelNeighborSearchScattered failed!");
+
   // - Update positions
+
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdatePos failed!");
+
   // - Ping-pong buffers as needed
+  glm::vec3 *temp = dev_vel1;
+  dev_vel1 = dev_vel2;
+  dev_vel2 = temp;
 }
 
 void Boids::stepSimulationCoherentGrid(float dt) {
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
  // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
  // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
  // In Parallel:
  // - Label each particle with its array index as well as its grid index.
  //   Use 2x width grids
+
+  kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum,
+    gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+  checkCUDAErrorWithLine("kernComputeIndices failed!");
+
  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
  //   are welcome to do a performance comparison.
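+  // Sorting by cell key gathers each cell's boids into one contiguous run
+  // of the index array.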
+
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
  // - Naively unroll the loop for finding the start and end indices of each
  //   cell's data pointers in the array of boid indices
+
+  dim3 cellBlocksPerGrid((gridCellCount + blockSize - 1) / blockSize);
+  kernResetIntBuffer<<<cellBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellStartIndices, numObjects + 10);
+  checkCUDAErrorWithLine("kernResetIntBuffer on dev_gridCellStartIndices failed!");
+  kernResetIntBuffer<<<cellBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellEndIndices, numObjects + 10);
+  checkCUDAErrorWithLine("kernResetIntBuffer on dev_gridCellEndIndices failed!");
+
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices,
+    dev_gridCellStartIndices, dev_gridCellEndIndices);
+  checkCUDAErrorWithLine("kernIdentifyCellStartEnd failed!");
+
  // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
  //   the particle data in the simulation array.
  //   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
+  kernShuffleIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleArrayIndices, dev_pos, dev_vel1, dev_posSorted, dev_velSorted);
+  checkCUDAErrorWithLine("kernShuffleIndices failed!");
+
  // - Perform velocity updates using neighbor search
+
+  kernUpdateVelNeighborSearchCoherent<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum,
+    gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_posSorted, dev_velSorted, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdateVelNeighborSearchCoherent failed!");
+
  // - Update positions
+
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_posSorted, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdatePos failed!");
+
  // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+  // vel2 holds the new velocities in sorted order; they become vel1 for the
+  // next step, and the sorted (now updated) positions become dev_pos.
+  glm::vec3 *temp = dev_vel1;
+  dev_vel1 = dev_vel2;
+  dev_vel2 = temp;
+
+  glm::vec3 *buffer = dev_pos;
+  dev_pos = dev_posSorted;
+  dev_posSorted = buffer;
 }
 
 void Boids::endSimulation() {
@@ -389,7 +654,16 @@ void Boids::endSimulation() {
   cudaFree(dev_vel2);
   cudaFree(dev_pos);
 
-  // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  // TODO-2.1 - Free any additional buffers here.
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_particleGridIndices);
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_gridCellEndIndices);
+
+  // TODO-2.3 - Free any additional buffers here.
+  cudaFree(dev_posSorted);
+  cudaFree(dev_velSorted);
 }
 
 void Boids::unitTest() {
diff --git a/src/main.cpp b/src/main.cpp
index b82c8c6..52693aa 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -14,8 +14,10 @@
 
 // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
 #define VISUALIZE 1
-#define UNIFORM_GRID 0
-#define COHERENT_GRID 0
+#define UNIFORM_GRID 1
+#define COHERENT_GRID 1
+#define TIME 0
 
 // LOOK-1.2 - change this to adjust particle count in the simulation
 const int N_FOR_VIS = 5000;
@@ -195,6 +197,15 @@ void initShaders(GLuint * program) {
   cudaGLMapBufferObject((void**)&dptrVertPositions, boidVBO_positions);
   cudaGLMapBufferObject((void**)&dptrVertVelocities, boidVBO_velocities);
 
+  #if TIME
+  cudaEvent_t event1, event2;
+  cudaEventCreate(&event1);
+  cudaEventCreate(&event2);
+
+  // record events around the kernel launch
+  cudaEventRecord(event1, 0); // 0 is the default stream
+  #endif
+
   // execute the kernel
   #if UNIFORM_GRID && COHERENT_GRID
   Boids::stepSimulationCoherentGrid(DT);
@@ -203,6 +214,14 @@
   #else
   Boids::stepSimulationNaive(DT);
   #endif
+  #if TIME
+  cudaEventRecord(event2, 0);
+  // wait for the stop event before reading the elapsed time
+  cudaEventSynchronize(event2);
+  float dt_ms;
+  cudaEventElapsedTime(&dt_ms, event1, event2);
+
+  printf("Simulation step time: %f ms \n", dt_ms);
+  #endif
 
   #if VISUALIZE
   Boids::copyBoidsToVBO(dptrVertPositions, dptrVertVelocities);
@@ -217,7 +236,7 @@
   double timebase = 0;
   int frame = 0;
 
-  Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure
+  //Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure
   // your CUDA development setup is ready to go.
 
   while (!glfwWindowShouldClose(window)) {
diff --git a/src/main.hpp b/src/main.hpp
index 88e9df7..728bf7b 100644
--- a/src/main.hpp
+++ b/src/main.hpp
@@ -39,6 +39,9 @@ const float zFar = 10.0f;
 int width = 1280;
 int height = 720;
 int pointSize = 2;
+// Alternative high-resolution display settings:
+/*int width = 2560;
+int height = 1440;
+int pointSize = 3;*/
 
 // For camera controls
 bool leftMousePressed = false;