diff --git a/README.md b/README.md
index 41b91f0..15429f8 100644
--- a/README.md
+++ b/README.md
@@ -5,17 +5,84 @@ CUDA Rasterizer
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**
 
-* (TODO) YOUR NAME HERE
-	* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Yichen Shou
+  * [LinkedIn](https://www.linkedin.com/in/yichen-shou-68023455/), [personal website](http://www.yichenshou.com/)
+* Tested on: Windows 10, i7-2600KU @ 3.40GHz 16GB RAM, NVIDIA GeForce GTX 660Ti 8GB (Personal Desktop)
 
-### (TODO: Your README)
+## Project Overview
 
-*DO NOT* leave the README to the last minute! It is a crucial part of the
-project, and we will not be able to grade you without a good README.
+This project implements a rasterizer on the GPU using OpenGL and CUDA. Various steps of the graphics pipeline, such as Vertex Assembly, Primitive Assembly, Rasterization and Fragment shading, uses the GPU to speed up efficiency. 
 
+Diffuse Cow                |  Cow with Normals
+:-------------------------:|:-------------------------:
+![](renders/CowRegular.PNG)  |  ![](renders/CowNormal.PNG)
+
+Diffuse Cube               |  Cube with Normals
+:-------------------------:|:-------------------------:
+![](renders/CubeRegular.PNG)  |  ![](renders/CubeNormal.PNG)
+
+[Video Demo](https://youtu.be/kjadBVjjtRc)
+
+## Tile-based Rendering
+
+This project also implements tile-based rendering to further speed up computation time. Tile-based rendering is a rendering technique  that divides the view port into smaller "tiles" to be rendered individually and then merged back into the full image. Since each tile is relatively small, the GPU can fully take advantage of cache locality/on-chip memory to drastically reduce memory access time and thus speed up rendering. This technique is widely used in GPU everywhere, although it is important for mobile GPUs that doesn't have as much global memory/speed to throw around as desktop GPUs. 
+
+For example: here's what the underlying tile system using 16x16 pixel tiles looks like:
+![](renders/tiles16x16.PNG)
+
+There are many ways of implementing tile-based rendering on the GPU. I implemented 2. One way is to parallize kernel calls over tiles and loop through every triangle in the tile in the kernel. One kernel is called and each thread handles one tile. Another way is to iteratively call kernels on every tile, parallizing over the triangles in that tiles. The Kernel is called as many times as there are tiles and each thread handles one triangle in one tile. I implemented both of these methods. The second method uses shared memory on the GPU to speed up computation. 
+
+## Performance Test
+
+First I compared the FPS of all 3 methods (regular, tile-parallized, triangle-parallized) on a single Triangle, rendered at two different distances from the camera.
+
+### Single Triangle
+
+Far Triangle               |  Close triangle
+:-------------------------:|:-------------------------:
+![](renders/TriangleFar.PNG)  |  ![](renders/TriangleClose.PNG)
+
+![](renders/TriangleChart.PNG)
+
+While all 3 methods renders the far away triangle well, only the tile-parallized method did not suffer from a huge speed decrease when the triangle is close. This is likely due to the fact that method 1 and 3 parallizes over the triangle while method 2 parallizes over the tile. Even though the triangle is taking over a lot of screen space, method 1 still uses only 1 thread to process everything, while method 2 uses as many threads as the number of tiles overlapped by this triangle. I'm unclear why method 3 is just as slow though, since it's supposed to parallize over triangles per tile so it should be just as fast as method 2 if not faster.
+
+### Cube
+
+Far Cube                   |  Close Cube
+:-------------------------:|:-------------------------:
+![](renders/CubeFar.PNG)   |  ![](renders/CubeClose.PNG)
+
+![](renders/CubeChart.PNG)
+
+Next I performed the same test on a simple 6 sides cube (12 triangles). The results are pretty much the same as the last test. When individual triangles take up a lot of screen space, method 1 always triumphs.
+
+### Cow
+
+Far Cow                  |  Close Cow
+:-----------------------:|:-----------------------:
+![](renders/CowFar.PNG)  |  ![](renders/CowClose.PNG)
+
+![](renders/CowChart.PNG)
+
+The final test is done on the cow model, and here method 1 starts to show its weakness. When the model is far and triangles are squished up into a small amount of tiles, method 1 is significantly slower than method 1, which launches a thread for every triangle. The performance is a little better when the cow is closer, but still not enough.
+
+The tests reveal clearly that regular rendering is great for rendering scenes with a large amount of triangles taking up screen space. When models are up close and a small amount of triangles are taking up the whole screen, tile-based rendering is better. Perhaps a heuristic can be used at the beginning of every frame to determine which rendering method would be better.
+
+Method 2 is supposed to be the best of both worlds, seeing how it launches a thread per triangle per tile, but it's performing rather poorly in all cases. I think there might be something wrong with my implementation. 
+
+### Pixel Size 
+
+![](renders/TileSizeChart.PNG)
+
+Lastly I compared the render time per frame (averged over 100 frames) of different pixel sizes on the close cube render. Lower is better in the graph. Without a doubt, the smaller sized tiles won, because smaller tiles = more number of tiles = more threads. I do think that eventually smaller tile sizes would run into trouble when memory/overhead is more limited. But it wasn't a problem at all on my 16 Gigs of GPU RAM.  
 
 ### Credits
 
 * [tinygltfloader](https://github.com/syoyo/tinygltfloader) by [@soyoyo](https://github.com/syoyo)
 * [glTF Sample Models](https://github.com/KhronosGroup/glTF/blob/master/sampleModels/README.md)
+* CIS 565 class slides
+
+### Bloopers
+
+When you mess up near/far planes and accidentally summon the cow of nightmares
+![](renders/nightmareCow.gif)
\ No newline at end of file
diff --git a/renders/CowChart.PNG b/renders/CowChart.PNG
new file mode 100644
index 0000000..02fee78
Binary files /dev/null and b/renders/CowChart.PNG differ
diff --git a/renders/CowClose.PNG b/renders/CowClose.PNG
new file mode 100644
index 0000000..7a29a9f
Binary files /dev/null and b/renders/CowClose.PNG differ
diff --git a/renders/CowFar.PNG b/renders/CowFar.PNG
new file mode 100644
index 0000000..c919c65
Binary files /dev/null and b/renders/CowFar.PNG differ
diff --git a/renders/CowNormal.PNG b/renders/CowNormal.PNG
new file mode 100644
index 0000000..179f1a9
Binary files /dev/null and b/renders/CowNormal.PNG differ
diff --git a/renders/CowRegular.PNG b/renders/CowRegular.PNG
new file mode 100644
index 0000000..51d71b8
Binary files /dev/null and b/renders/CowRegular.PNG differ
diff --git a/renders/CubeChart.PNG b/renders/CubeChart.PNG
new file mode 100644
index 0000000..75c9e70
Binary files /dev/null and b/renders/CubeChart.PNG differ
diff --git a/renders/CubeClose.PNG b/renders/CubeClose.PNG
new file mode 100644
index 0000000..11d748c
Binary files /dev/null and b/renders/CubeClose.PNG differ
diff --git a/renders/CubeNormal.PNG b/renders/CubeNormal.PNG
new file mode 100644
index 0000000..20ba663
Binary files /dev/null and b/renders/CubeNormal.PNG differ
diff --git a/renders/CubeRegular.PNG b/renders/CubeRegular.PNG
new file mode 100644
index 0000000..a026cd3
Binary files /dev/null and b/renders/CubeRegular.PNG differ
diff --git a/renders/TileSizeChart.PNG b/renders/TileSizeChart.PNG
new file mode 100644
index 0000000..c36f671
Binary files /dev/null and b/renders/TileSizeChart.PNG differ
diff --git a/renders/TriangleChart.PNG b/renders/TriangleChart.PNG
new file mode 100644
index 0000000..c659093
Binary files /dev/null and b/renders/TriangleChart.PNG differ
diff --git a/renders/TriangleClose.PNG b/renders/TriangleClose.PNG
new file mode 100644
index 0000000..3f96ec9
Binary files /dev/null and b/renders/TriangleClose.PNG differ
diff --git a/renders/TriangleFar.PNG b/renders/TriangleFar.PNG
new file mode 100644
index 0000000..11e5fbb
Binary files /dev/null and b/renders/TriangleFar.PNG differ
diff --git a/renders/cubeFar.PNG b/renders/cubeFar.PNG
new file mode 100644
index 0000000..ba69baa
Binary files /dev/null and b/renders/cubeFar.PNG differ
diff --git a/renders/nightmareCow.gif b/renders/nightmareCow.gif
new file mode 100644
index 0000000..0220b4d
Binary files /dev/null and b/renders/nightmareCow.gif differ
diff --git a/renders/tiles.PNG b/renders/tiles.PNG
new file mode 100644
index 0000000..5699e59
Binary files /dev/null and b/renders/tiles.PNG differ
diff --git a/renders/tiles16x16.PNG b/renders/tiles16x16.PNG
new file mode 100644
index 0000000..31c448a
Binary files /dev/null and b/renders/tiles16x16.PNG differ
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index a57f69f..40c13cb 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -6,5 +6,5 @@ set(SOURCE_FILES
 
 cuda_add_library(src
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_30
     )
diff --git a/src/main.cpp b/src/main.cpp
index 7986959..acee160 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -9,6 +9,7 @@
 
 
 #include "main.hpp"
+#include <chrono>
 
 #define STB_IMAGE_IMPLEMENTATION
 #define TINYGLTF_LOADER_IMPLEMENTATION
@@ -50,10 +51,11 @@ int main(int argc, char **argv) {
 	}
 
 
-    frame = 0;
+    frame = 1;
     seconds = time(NULL);
     fpstracker = 0;
 
+
     // Launch CUDA/GL
     if (init(scene)) {
         // GLFW main loop
@@ -97,7 +99,7 @@ void mainLoop() {
 //---------RUNTIME STUFF---------
 //-------------------------------
 float scale = 1.0f;
-float x_trans = 0.0f, y_trans = 0.0f, z_trans = -10.0f;
+float x_trans = 0.0f, y_trans = 0.0f, z_trans = -1.0f;
 float x_angle = 0.0f, y_angle = 0.0f;
 void runCuda() {
     // Map OpenGL buffer object for writing from CUDA on a single GPU
@@ -120,8 +122,22 @@ void runCuda() {
 	glm::mat4 MVP = P * MV;
 
     cudaGLMapBufferObject((void **)&dptr, pbo);
-	rasterize(dptr, MVP, MV, MV_normal);
-    cudaGLUnmapBufferObject(pbo);
+
+	std::chrono::high_resolution_clock::time_point startTime = std::chrono::high_resolution_clock::now();
+
+	rasterize(dptr, MVP, MV, MV_normal, renderMode);
+	
+	std::chrono::high_resolution_clock::time_point endTime = std::chrono::high_resolution_clock::now();
+	std::chrono::duration<double, std::milli> duro = endTime - startTime;
+	float elapsedTime = static_cast<decltype(elapsedTime)>(duro.count());
+	
+	avgFrameTime += elapsedTime;
+	if (frame % 100 == 0) {
+		printf("100 frames took avg %f milliseconds\n", avgFrameTime / 100);
+		avgFrameTime = 0;
+	}	
+	
+	cudaGLUnmapBufferObject(pbo);
 
     frame++;
     fpstracker++;
@@ -183,6 +199,8 @@ bool init(const tinygltf::Scene & scene) {
 
 	rasterizeSetBuffers(scene);
 
+	rasterizeSetTileBuffers();
+
     GLuint passthroughProgram;
     passthroughProgram = initShader();
 
@@ -214,7 +232,7 @@ void initCuda() {
     // Use device with highest Gflops/s
     cudaGLSetGLDevice(0);
 
-    rasterizeInit(width, height);
+    rasterizeInit(width, height, tilePixelSize);
 
     // Clean up on program exit
     atexit(cleanupCuda);
@@ -395,6 +413,6 @@ void mouseMotionCallback(GLFWwindow* window, double xpos, double ypos)
 
 void mouseWheelCallback(GLFWwindow* window, double xoffset, double yoffset)
 {
-	const double s = 1.0;	// sensitivity
+	const double s = 0.1;	// sensitivity
 	z_trans += (float)(s * yoffset);
 }
diff --git a/src/main.hpp b/src/main.hpp
index 4816fa1..65e874b 100644
--- a/src/main.hpp
+++ b/src/main.hpp
@@ -50,6 +50,11 @@ GLFWwindow *window;
 int width = 800;
 int height = 800;
 
+int renderMode = 1; // 0 for regular, 1 for tile-based rendering parallizing on tiles, 2 for tile-based rendering parallizing primitives
+int tilePixelSize = 8;
+
+float avgFrameTime;
+
 //-------------------------------
 //-------------MAIN--------------
 //-------------------------------
diff --git a/src/rasterize.cu b/src/rasterize.cu
index 1262a09..92cf700 100644
--- a/src/rasterize.cu
+++ b/src/rasterize.cu
@@ -43,7 +43,7 @@ namespace {
 
 		 glm::vec3 eyePos;	// eye space position used for shading
 		 glm::vec3 eyeNor;	// eye space normal used for shading, cuz normal will go wrong after perspective transformation
-		// glm::vec3 col;
+		 glm::vec3 col;
 		 glm::vec2 texcoord0;
 		 TextureData* dev_diffuseTex = NULL;
 		// int texWidth, texHeight;
@@ -62,8 +62,9 @@ namespace {
 		// The attributes listed below might be useful, 
 		// but always feel free to modify on your own
 
-		// glm::vec3 eyePos;	// eye space position used for shading
-		// glm::vec3 eyeNor;
+		glm::vec3 eyePos;	// eye space position used for shading
+		glm::vec3 eyeNor;
+		glm::vec3 eyeLightDir;
 		// VertexAttributeTexcoord texcoord0;
 		// TextureData* dev_diffuseTex;
 		// ...
@@ -83,6 +84,7 @@ namespace {
 		VertexAttributeTexcoord* dev_texcoord0;
 
 		// Materials, add more attributes when needed
+		glm::vec4 dev_materialColor;
 		TextureData* dev_diffuseTex;
 		int diffuseTexWidth;
 		int diffuseTexHeight;
@@ -104,11 +106,20 @@ static std::map<std::string, std::vector<PrimitiveDevBufPointers>> mesh2Primitiv
 static int width = 0;
 static int height = 0;
 
+static int tilePixelSize = 16;
+static int tileWidth = 0;
+static int tileHeight = 0;
+
 static int totalNumPrimitives = 0;
 static Primitive *dev_primitives = NULL;
 static Fragment *dev_fragmentBuffer = NULL;
 static glm::vec3 *dev_framebuffer = NULL;
 
+static int *dev_tilePrimitives = NULL;
+static unsigned int *dev_primitiveIdxPerTile = NULL;
+static unsigned int *hst_primitiveIdxPerTile = NULL;
+static unsigned int maxPrimitivesPerTile = 512;
+
 static int * dev_depth = NULL;	// you might need this buffer when doing depth test
 
 /**
@@ -143,19 +154,28 @@ void render(int w, int h, Fragment *fragmentBuffer, glm::vec3 *framebuffer) {
     int index = x + (y * w);
 
     if (x < w && y < h) {
-        framebuffer[index] = fragmentBuffer[index].color;
+		float diffuseFactor = fmaxf(0.0f, glm::dot(fragmentBuffer[index].eyeNor, fragmentBuffer[index].eyeLightDir));
+		framebuffer[index] = fragmentBuffer[index].color * diffuseFactor;
+		//framebuffer[index] = fragmentBuffer[index].color;
+		//framebuffer[index] = glm::normalize(fragmentBuffer[index].eyeNor);
 
 		// TODO: add your fragment shader code here
+		// compute diffuse color using Blinn or BlinnPhong
+		// store into fragment
 
+		// maybe do depth/stencil/scissor tests here too
     }
 }
 
 /**
  * Called once at the beginning of the program to allocate memory.
  */
-void rasterizeInit(int w, int h) {
+void rasterizeInit(int w, int h, int tilePixels) {
     width = w;
     height = h;
+	tilePixelSize = tilePixels;
+	tileWidth = (int)ceilf((float)w / tilePixels);
+	tileHeight = (int)ceilf((float)h / tilePixels);
 	cudaFree(dev_fragmentBuffer);
 	cudaMalloc(&dev_fragmentBuffer, width * height * sizeof(Fragment));
 	cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
@@ -520,6 +540,7 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 
 					// You can only worry about this part once you started to 
 					// implement textures for your rasterizer
+					glm::vec4 dev_materialColor = glm::vec4(0, 0, 0, 1);
 					TextureData* dev_diffuseTex = NULL;
 					int diffuseTexWidth = 0;
 					int diffuseTexHeight = 0;
@@ -529,6 +550,10 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 
 						if (mat.values.find("diffuse") != mat.values.end()) {
 							std::string diffuseTexName = mat.values.at("diffuse").string_value;
+							dev_materialColor = glm::vec4(mat.values.at("diffuse").number_array[0],
+														  mat.values.at("diffuse").number_array[1],
+														  mat.values.at("diffuse").number_array[2],
+														  mat.values.at("diffuse").number_array[3]);
 							if (scene.textures.find(diffuseTexName) != scene.textures.end()) {
 								const tinygltf::Texture &tex = scene.textures.at(diffuseTexName);
 								if (scene.images.find(tex.source) != scene.images.end()) {
@@ -579,6 +604,7 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 						dev_normal,
 						dev_texcoord0,
 
+						dev_materialColor,
 						dev_diffuseTex,
 						diffuseTexWidth,
 						diffuseTexHeight,
@@ -622,6 +648,17 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 }
 
 
+/**
+	Allocates space for tilePrimitiveBuffers
+*/
+void rasterizeSetTileBuffers(){
+	maxPrimitivesPerTile = totalNumPrimitives; // TODO: make this smaller
+	cudaMalloc(&dev_tilePrimitives, tileWidth * tileHeight * maxPrimitivesPerTile * sizeof(int));
+	cudaMalloc(&dev_primitiveIdxPerTile, tileWidth * tileHeight * sizeof(unsigned int));
+	hst_primitiveIdxPerTile = (unsigned int*)malloc(tileWidth * tileHeight * sizeof(unsigned int));
+}
+
+
 
 __global__ 
 void _vertexTransformAndAssembly(
@@ -634,14 +671,35 @@ void _vertexTransformAndAssembly(
 	int vid = (blockIdx.x * blockDim.x) + threadIdx.x;
 	if (vid < numVertices) {
 
-		// TODO: Apply vertex transformation here
+		// Vertex transformations
 		// Multiply the MVP matrix for each vertex position, this will transform everything into clipping space
+		glm::vec4 vertexPos = glm::vec4(primitive.dev_position[vid], 1.0f);
+		glm::vec3 vertexNorm = primitive.dev_normal[vid];
+		glm::vec4 clipPos = MVP * vertexPos;
 		// Then divide the pos by its w element to transform into NDC space
+		clipPos /= clipPos.w;
 		// Finally transform x and y to viewport space
+		clipPos.x = 0.5f * (float)width * (clipPos.x / clipPos.w + 1.0f);
+		clipPos.y = 0.5f * (float)height * (1.0f - clipPos.y / clipPos.w);
 
-		// TODO: Apply vertex assembly here
+		// eye space
+		glm::vec3 eyeSpacePos = glm::vec3(MV * vertexPos);
+		glm::vec3 eyeSpaceNorm = glm::normalize(MV_normal * vertexNorm);
+
+		// Vertex assembly 
 		// Assemble all attribute arraies into the primitive array
-		
+		VertexOut& vout = primitive.dev_verticesOut[vid];
+		vout.pos = clipPos;
+		vout.eyePos = eyeSpacePos;
+		vout.eyeNor = eyeSpaceNorm;
+		vout.col = glm::vec3(primitive.dev_materialColor);
+		//vout.col = glm::vec3(abs(vertexNorm.x), abs(vertexNorm.y), abs(vertexNorm.z)); // debug view for original normals
+		//vout.col = glm::vec3(abs(eyeSpaceNorm.x), abs(eyeSpaceNorm.y), abs(eyeSpaceNorm.z)); // debug view for eyespace normals
+
+
+		// TODO: read texture coordinates into the vertex
+		//vout.texcoord0 = primitive.dev_texcoord0[vid];
+		//vout.dev_diffuseTex = primitive.dev_diffuseTex;
 	}
 }
 
@@ -660,12 +718,12 @@ void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_
 		// TODO: uncomment the following code for a start
 		// This is primitive assembly for triangles
 
-		//int pid;	// id for cur primitives vector
-		//if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) {
-		//	pid = iid / (int)primitive.primitiveType;
-		//	dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
-		//		= primitive.dev_verticesOut[primitive.dev_indices[iid]];
-		//}
+		int pid;	// id for cur primitives vector
+		if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) {
+			pid = iid / (int)primitive.primitiveType;
+			dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
+				= primitive.dev_verticesOut[primitive.dev_indices[iid]];
+		}
 
 
 		// TODO: other primitive types (point, line)
@@ -674,11 +732,222 @@ void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_
 }
 
 
+__global__
+void dividePrimToTiles(Primitive* primitives, int* tilePrimitives, unsigned int* primitivesIdxPerTile, int numPrimitives,
+	unsigned int maxPrimitivesPerTile, int tileWidth, int tileHeight, int tilePixelSize) {
+	int pidx = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+	if (pidx < numPrimitives) {
+		const Primitive& p = primitives[pidx];
+		glm::vec3 vertices[3];
+		vertices[0] = glm::vec3(p.v[0].pos);
+		vertices[1] = glm::vec3(p.v[1].pos);
+		vertices[2] = glm::vec3(p.v[2].pos);
+		AABB box = getAABBForTriangle(vertices);
+
+		// add the primitive to all tiles overlapped by the AABB
+		int tileMinX = fmaxf(0.0f, floorf(box.min.x / tilePixelSize));
+		int tileMaxX = fminf(tileHeight, ceilf(box.max.x / tilePixelSize));
+		int tileMinY = fmaxf(0.0f, floorf(box.min.y / tilePixelSize));
+		int tileMaxY = fminf(tileWidth, ceilf(box.max.y / tilePixelSize));
+		for (int i = tileMinX; i < tileMaxX; i++) {
+			for (int j = tileMinY; j < tileMaxY; j++) {
+				int tileIdx = tileWidth * i + j;
+				// get the next index to read in primitivesIdxPerTile using atomicInc
+				unsigned int nextWriteIdx = atomicInc(&primitivesIdxPerTile[tileIdx], maxPrimitivesPerTile);
+				// get write the pidx into tilePrimitives
+				tilePrimitives[tileIdx * maxPrimitivesPerTile + nextWriteIdx] = pidx;
+			}
+		}
+	}
+}
+
+
+__global__
+void rasterizePrimToFrag(Primitive* dev_primitives, Fragment* dev_fragmentBuffer, int* dev_depth, int numPrimitives, int width, int height) {
+	int pidx = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+	if (pidx < numPrimitives) {
+		const Primitive& p = dev_primitives[pidx];
+		glm::vec3 vertices[3];
+		vertices[0] = glm::vec3(p.v[0].pos);
+		vertices[1] = glm::vec3(p.v[1].pos);
+		vertices[2] = glm::vec3(p.v[2].pos);
+		AABB box = getAABBForTriangle(vertices);
+
+		// loop over all pixels of p's screen-clamped AABB to see if it's in p
+		for (int j = fmaxf(0, (int)box.min.y); j < fminf(height, (int)box.max.y + 1); j++) {  
+			for (int i = fmaxf(0, (int)box.min.x); i < fminf(width, (int)box.max.x + 1); i++) {
+				int fidx = j * width + i;
+				glm::vec2 fragmentPos = glm::vec2(i + 0.5, j + 0.5f);
+				glm::vec3 baryCoor = calculateBarycentricCoordinate(vertices, fragmentPos);
+
+				// if it is, store p's value into the pixel
+				if (isBarycentricCoordInBounds(baryCoor)) {
+					// check for depth
+					float depth = -getZAtCoordinate(baryCoor, vertices) / 10.0f; // let's say near clip is at 0 and far is at 10
+					int intDepth = depth * INT_MAX;
+					//float depth = getZAtCoordinate(baryCoor, vertices);
+					//int& intDepth = reinterpret_cast<int&>(depth);
+
+					if (intDepth < atomicMin(&dev_depth[fidx], intDepth)) {
+						Fragment& frag = dev_fragmentBuffer[fidx];
+						frag.color = p.v[0].col * baryCoor[0] + p.v[1].col * baryCoor[1] + p.v[2].col * baryCoor[2];
+						frag.eyePos = p.v[0].eyePos * baryCoor[0] + p.v[1].eyePos * baryCoor[1] + p.v[2].eyePos * baryCoor[2];
+						frag.eyeNor = p.v[0].eyeNor * baryCoor[0] + p.v[1].eyeNor * baryCoor[1] + p.v[2].eyeNor * baryCoor[2];
+						frag.eyeLightDir = glm::normalize(glm::vec3(0, 100, 100) - frag.eyePos);
+					}
+				}
+			}
+		}
+	}
+}
+
+
+__global__
+void rasterizeByTile(Primitive* primitives, int* tilePrimitives, unsigned int* primitiveIdxPerTile, Fragment* fragmentBuffer,
+	int width, int height, int tileWidth, int tileHeight, int tilePixelSize, unsigned int maxPrimitivesPerTile) {
+	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+	int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+
+	if (x < tileHeight && y < tileWidth){
+		int tileIdx = x + (y * tileWidth);
+		// if this tile has no primitives, just exit
+		int primitivesInThisTile = primitiveIdxPerTile[tileIdx];
+		if (primitivesInThisTile == 0) return;
+
+		int tileOriginX = x * tilePixelSize;
+		int tileOriginY = y * tilePixelSize;
+		int* depthBuffer = (int*)malloc(tilePixelSize * tilePixelSize * sizeof(int));
+		for (int i = 0; i < tilePixelSize * tilePixelSize; i++) {
+			depthBuffer[i] = INT_MAX;
+		}
+
+		/*for (int i = 0; i < tilePixelSize; i++) {
+			for (int j = 0; j < tilePixelSize; j++) {
+				int real_i = fminf(height - 1, tileOriginX + i);
+				int real_j = fminf(width - 1, tileOriginY + j);
+				int fidx = real_i * width + real_j;
+				glm::vec2 fragmentPos = glm::vec2(real_i + 0.5, real_j + 0.5f);
+
+				Fragment& frag = fragmentBuffer[fidx];
+				frag.color = glm::vec3(1, 1, 1);
+				frag.eyePos = glm::vec3(1, 1, 1);
+				frag.eyeNor = glm::vec3(1, 1, 1);
+				frag.eyeLightDir = glm::vec3(1, 1, 1);
+			}
+		}*/
+
+		// loop over all primitives
+		for (auto k = 0u; k < primitivesInThisTile; k++) {
+			int pid = tilePrimitives[tileIdx * maxPrimitivesPerTile + k];
+			const Primitive& p = primitives[pid];
+			glm::vec3 vertices[3];
+			vertices[0] = glm::vec3(p.v[0].pos);
+			vertices[1] = glm::vec3(p.v[1].pos);
+			vertices[2] = glm::vec3(p.v[2].pos);
+			AABB box = getAABBForTriangle(vertices);
+
+			// loop over all pixels of the tile to check if it's in the triangle
+			for (int j = 0; j < tilePixelSize; j++) {
+				for (int i = 0; i < tilePixelSize; i++) {
+					int fragment_j = fminf(height - 1, tileOriginX + j);
+					int fragment_i = fminf(width - 1, tileOriginY + i);
+					int fidx = fragment_j * width + fragment_i;
+					glm::vec2 fragmentPos = glm::vec2(fragment_i + 0.5, fragment_j + 0.5f);
+					glm::vec3 baryCoor = calculateBarycentricCoordinate(vertices, fragmentPos);
+
+					// if it is, store p's value into the pixel
+					if (isBarycentricCoordInBounds(baryCoor)) {
+						// check for depth
+						float depth = -getZAtCoordinate(baryCoor, vertices) / 10.0f; // let's say near clip is at 0 and far is at 10
+						int intDepth = depth * INT_MAX;
+
+						if (intDepth < depthBuffer[j * tilePixelSize + i]) {
+							depthBuffer[j * tilePixelSize + i] = intDepth;
+							Fragment& frag = fragmentBuffer[fidx];
+							frag.color = p.v[0].col * baryCoor[0] + p.v[1].col * baryCoor[1] + p.v[2].col * baryCoor[2];
+							frag.eyePos = p.v[0].eyePos * baryCoor[0] + p.v[1].eyePos * baryCoor[1] + p.v[2].eyePos * baryCoor[2];
+							frag.eyeNor = p.v[0].eyeNor * baryCoor[0] + p.v[1].eyeNor * baryCoor[1] + p.v[2].eyeNor * baryCoor[2];
+							frag.eyeLightDir = glm::normalize(glm::vec3(0, 100, 100) - frag.eyePos);
+						}
+					}
+				}
+			}
+		}
+
+		free(depthBuffer);
+	}
+}
+
+
+__global__
+void rasterizeByPrimitivesInTile(Primitive* primitives, int* tilePrimitives, int numPrimitivesThisTile, Fragment* fragmentBuffer,
+	int width, int height, int tilePixelSize, int tileIdx, int tileOriginX, int tileOriginY, unsigned int maxPrimitivesPerTile) {
+	
+	int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+	extern __shared__ int depthBuffer[];
+	int portion = tilePixelSize * tilePixelSize / numPrimitivesThisTile;
+	int limit = (int)fminf(tilePixelSize * tilePixelSize, (idx + 1) * portion);
+	for (int i = idx * portion; i < limit; i++) {
+		depthBuffer[i] = INT_MAX;
+	}
+
+	__syncthreads();
+
+	// parallelize over primitives
+	if (idx < numPrimitivesThisTile){
+		int pid = tilePrimitives[tileIdx * maxPrimitivesPerTile + idx];
+		const Primitive& p = primitives[pid];
+		glm::vec3 vertices[3];
+		vertices[0] = glm::vec3(p.v[0].pos);
+		vertices[1] = glm::vec3(p.v[1].pos);
+		vertices[2] = glm::vec3(p.v[2].pos);
+		AABB box = getAABBForTriangle(vertices);
+
+		// loop over all pixels of the tile to check if it's in the triangle
+		for (int j = 0; j < tilePixelSize; j++) {
+			for (int i = 0; i < tilePixelSize; i++) {
+				int fragment_j = fminf(height - 1, tileOriginX + j);
+				int fragment_i = fminf(width - 1, tileOriginY + i);
+				int fidx = fragment_j * width + fragment_i;
+				glm::vec2 fragmentPos = glm::vec2(fragment_i + 0.5, fragment_j + 0.5f);
+				glm::vec3 baryCoor = calculateBarycentricCoordinate(vertices, fragmentPos);
+				
+				/*Fragment& frag = fragmentBuffer[fidx];
+				frag.color = glm::vec3(1, 1, 1);
+				frag.eyePos = glm::vec3(1, 1, 1);
+				frag.eyeNor = glm::vec3(1, 1, 1);
+				frag.eyeLightDir = glm::vec3(1, 1, 1);*/
+
+				// if it is, store p's value into the pixel
+				if (isBarycentricCoordInBounds(baryCoor)) {
+					// check for depth
+					float depth = -getZAtCoordinate(baryCoor, vertices) / 10.0f; // let's say near clip is at 0 and far is at 10
+					int intDepth = depth * INT_MAX;
+					int depthIdx = j * tilePixelSize + i;
+
+					if (intDepth < atomicMin(&depthBuffer[depthIdx], intDepth)) {
+						Fragment& frag = fragmentBuffer[fidx];
+						frag.color = p.v[0].col * baryCoor[0] + p.v[1].col * baryCoor[1] + p.v[2].col * baryCoor[2];
+						frag.eyePos = p.v[0].eyePos * baryCoor[0] + p.v[1].eyePos * baryCoor[1] + p.v[2].eyePos * baryCoor[2];
+						frag.eyeNor = p.v[0].eyeNor * baryCoor[0] + p.v[1].eyeNor * baryCoor[1] + p.v[2].eyeNor * baryCoor[2];
+						frag.eyeLightDir = glm::normalize(glm::vec3(0, 100, 100) - frag.eyePos);
+					}
+				}
+			}
+		}
+	}
+}
+
+
+
 
 /**
  * Perform rasterization.
  */
-void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal) {
+void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal, int renderMode) {
     int sideLength2d = 8;
     dim3 blockSize2d(sideLength2d, sideLength2d);
     dim3 blockCount2d((width  - 1) / blockSize2d.x + 1,
@@ -718,13 +987,61 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g
 
 		checkCUDAError("Vertex Processing and Primitive Assembly");
 	}
-	
-	cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
-	initDepth << <blockCount2d, blockSize2d >> >(width, height, dev_depth);
-	
-	// TODO: rasterize
-
 
+	dim3 numThreadsPerBlock(128);
+	dim3 numBlocksForPrimitives((totalNumPrimitives + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
+
+	// Rasterize
+	if (renderMode > 0) {
+		// reset the tilePrimitive and fragment buffers
+		cudaMemset(dev_tilePrimitives, -1, tileWidth * tileHeight * maxPrimitivesPerTile * sizeof(int));
+		cudaMemset(dev_primitiveIdxPerTile, 0, tileWidth * tileHeight * sizeof(unsigned int));
+		cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
+		// divide primitives into tiles by their AABB
+		dividePrimToTiles << <numBlocksForPrimitives, numThreadsPerBlock >> > (dev_primitives, dev_tilePrimitives,
+			dev_primitiveIdxPerTile, totalNumPrimitives, maxPrimitivesPerTile, tileWidth, tileHeight, tilePixelSize);
+		checkCUDAError("Divide Primitives to Tiles error");
+
+		// Rasterize by tile
+		dim3 blockCountTiles2D((tileWidth - 1) / blockSize2d.x + 1,
+			(tileHeight - 1) / blockSize2d.y + 1);
+		
+		if (renderMode == 1) {
+			rasterizeByTile << <blockCountTiles2D, blockSize2d >> > (dev_primitives, dev_tilePrimitives, dev_primitiveIdxPerTile,
+				dev_fragmentBuffer, width, height, tileWidth, tileHeight, tilePixelSize, maxPrimitivesPerTile);
+			checkCUDAError("Rasterize by tile error");
+		}
+		else {
+			cudaMemcpy(hst_primitiveIdxPerTile, dev_primitiveIdxPerTile, tileWidth * tileHeight * sizeof(unsigned int), cudaMemcpyDeviceToHost);
+			for (int j = 0; j < tileHeight; j++) {
+				for (int i = 0; i < tileWidth; i++) {
+					int tileIdx = i + (j * tileWidth);
+					int tileOriginX = i * tilePixelSize;
+					int tileOriginY = j * tilePixelSize;
+
+					int numPrimitivesThisTile = hst_primitiveIdxPerTile[tileIdx];
+					if (numPrimitivesThisTile > 0) {
+						dim3 numBlocksForTilePrimitives((numPrimitivesThisTile + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
+						rasterizeByPrimitivesInTile << <numBlocksForTilePrimitives, numThreadsPerBlock, tilePixelSize * tilePixelSize * sizeof(int) >> > (
+							dev_primitives, dev_tilePrimitives, numPrimitivesThisTile, dev_fragmentBuffer,
+							width, height, tilePixelSize, tileIdx, tileOriginX, tileOriginY, maxPrimitivesPerTile
+							);
+						checkCUDAError("rasterize by primitives in tile error");
+					}
+				}
+			}
+		}
+	}
+	else {
+		// reset the fragment and depth buffer
+		cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
+		initDepth << <blockCount2d, blockSize2d >> >(width, height, dev_depth);
+		checkCUDAError("initialize depth buffer");
+
+		// Rasterize
+		rasterizePrimToFrag <<<numBlocksForPrimitives, numThreadsPerBlock>>>(dev_primitives, dev_fragmentBuffer, dev_depth, totalNumPrimitives, width, height);
+		checkCUDAError("rasterize primitives to fragment buffer");
+	}
 
     // Copy depthbuffer colors into framebuffer
 	render << <blockCount2d, blockSize2d >> >(width, height, dev_fragmentBuffer, dev_framebuffer);
@@ -772,5 +1089,14 @@ void rasterizeFree() {
 	cudaFree(dev_depth);
 	dev_depth = NULL;
 
+	cudaFree(dev_tilePrimitives);
+	dev_tilePrimitives = NULL;
+
+	cudaFree(dev_primitiveIdxPerTile);
+	dev_primitiveIdxPerTile = NULL;
+
+	free(hst_primitiveIdxPerTile);
+	hst_primitiveIdxPerTile = NULL;
+
     checkCUDAError("rasterize Free");
 }
diff --git a/src/rasterize.h b/src/rasterize.h
index 560aae9..e3e4845 100644
--- a/src/rasterize.h
+++ b/src/rasterize.h
@@ -16,9 +16,12 @@ namespace tinygltf{
 	class Scene;
 }
 
+#define USE_Tiles = true;
+#define TILE_SIZE = 16;
 
-void rasterizeInit(int width, int height);
+void rasterizeInit(int width, int height, int tilePixelSize);
 void rasterizeSetBuffers(const tinygltf::Scene & scene);
+void rasterizeSetTileBuffers();
 
-void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal);
+void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal, int renderMode);
 void rasterizeFree();