diff --git a/README.md b/README.md
index cad1abd..42a592b 100644
--- a/README.md
+++ b/README.md
@@ -1,20 +1,108 @@
 CUDA Rasterizer
 ===============
 
-[CLICK ME FOR INSTRUCTION OF THIS PROJECT](./INSTRUCTION.md)
-
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Name: Zhan Xiong Chin
+* Tested on: Windows 7 Professional, Intel(R) Xeon(R) CPU E5-1630 v4 @ 3.70 GHz 3.70 GHz, GTX 1070 8192MB (SIG Lab)
+
+![](renders/intro.gif)
+
+## Overview
+
+GPU-based rasterizer. Features:
+
+* Basic rendering of .gltf models
+* UV texture mapping, with bilinear filtering and perspective correct texture coordinates
+* Lambert shading
+* Backface culling
+* Toon shading
+
+## Overview of pipeline
+
+The rasterization pipeline consists of the following steps:
+
+* Vertex shading
+	* Transforms world coordinates into window space coordinates
+* Primitive assembly
+	* Creates triangles from vertices
+* Backface culling
+	* Culls backward-facing triangles
+* Rasterization
+	* For each triangle, calculates which pixels on screen it hits
+	* Uses bounding-box to speed up above calculation
+	* Do depth test to see if fragment is the closest to camera
+	* Calculate texture coordinates, etc. of point using barycentric coordinates
+	* Uses perspective correct texture coordinates 
+* Texture mapping
+	* Calculates color from texture coordinates
+	* Uses bilinear filtering to smooth out low-res textures
+* Edge detection (toon shading only)
+	* Uses Sobel filter to find edges
+	* Optimized using shared memory
+* Fragment shading
+	* Calculates color of each pixel using Lambert shading
+	* Quantizes colors (toon shading only)
+	* Uses results of Sobel filter to darken edges (toon shading only)
+
+## Performance overview
+
+Some render timings are given below, broken down by time spent in each stage of the pipeline. Rasterization tends to take the longest to complete, most likely because of the complex calculations as well as the locking required to do the depth test. There are some models where this is a serious problem, such as the Cesium milk truck render, where the rasterizer alone takes 200000 microseconds to complete, even when all other kernels have similar timings to the ones listed below. The models used for the below timings are shown in the introduction.
+
+![](renders/timings.png)
+
+## Features
+
+### Backface culling
+
+Backface culling was used (with thrust's remove_if) to reduce the number of triangles needed to render a model. For the Stanford dragon (shown below in toon shading), backface culling reduced the number of triangles from 100000 to 45704, speeding up the rendering as shown below. 
+
+![](renders/backface_culling_timing.png)
+
+### Texture mapping
+
+Texture mapping has two tweaks to the basic algorithm, bilinear interpolation and perspective correctness.
+
+For perspective correctness, the texture coordinates are adjusted according to the depths of the fragments. The importance of this correction can be seen in the below chessboard renders. The above has the correction applied, whereas the second does not.
+
+![](renders/chessboard_perspective_correct.png)
+
+![](renders/chessboard_perspective_incorrect.png)
+
+Bilinear filtering is also applied to smooth out textures, by interpolating between the 4 pixels from the texture a fragment borders. This results in a smoother texture, as can be seen below. Bilinear filtering is applied to the left texture, but not the right one.
+
+<img src="renders/cesium_bilinear.png" style="width: 275px; height: 275px" />
+<img src="renders/cesium_no_bilinear.png" style="width: 275px; height: 275px" />
+
+Both perspective correctness and bilinear filtering complicate the texture mapping code, slowing down the texture mapping kernel by about 50% when rendering the image shown in the introduction. However, this kernel is not a significant percentage of the code, so this slowdown is negligible overall.
+
+### Toon shading
+
+Toon shading consists of two steps: color quantization and edge detection. 
+
+![](renders/dragon_celshaded.png)
+
+_Stanford dragon, toon shading with borders_
+
+Color quantization creates a small number of discrete shades, causing a "flat" look rather than a smooth gradient of colors.
+
+![](renders/dragon_lambert_cel.png)
+
+_Stanford dragon, toon shading (no borders)_
+
+![](renders/dragon_lambert.png)
+
+_Stanford dragon, Lambert shading. Note the smooth coloring on the scales compared to the "blotchy" toon shading_
+
+Then, edge detection is performed using a [Sobel filter](https://en.wikipedia.org/wiki/Sobel_operator), and all edges are rendered black.
+
+![](renders/dragon_outline.png)
 
-### (TODO: Your README)
+_Stanford dragon, Sobel filter outline_
 
-*DO NOT* leave the README to the last minute! It is a crucial part of the
-project, and we will not be able to grade you without a good README.
+The two are combined at the end to achieve the "toon" effect.
 
 
-### Credits
+The toon shader does decrease performance, especially since an additional kernel needs to be executed for the Sobel filter, which comprises approximately 25% of the rendering time. To optimize this, a version of the Sobel filter was written that calculated the convolution in a grid pattern using shared memory. This halved the time required for the Sobel filter step from about 1.8 milliseconds to 0.9 milliseconds. 
 
-* [tinygltfloader](https://github.com/syoyo/tinygltfloader) by [@soyoyo](https://github.com/syoyo)
-* [glTF Sample Models](https://github.com/KhronosGroup/glTF/blob/master/sampleModels/README.md)
+![](renders/toon_shading_timing.png)
\ No newline at end of file
diff --git a/gltfs/dragon/dragon.glb b/gltfs/dragon/dragon.glb
new file mode 100644
index 0000000..8c323c4
Binary files /dev/null and b/gltfs/dragon/dragon.glb differ
diff --git a/renders/backface_culling_timing.png b/renders/backface_culling_timing.png
new file mode 100644
index 0000000..571ae00
Binary files /dev/null and b/renders/backface_culling_timing.png differ
diff --git a/renders/cesium_bilinear.png b/renders/cesium_bilinear.png
new file mode 100644
index 0000000..5a4b84d
Binary files /dev/null and b/renders/cesium_bilinear.png differ
diff --git a/renders/cesium_no_bilinear.png b/renders/cesium_no_bilinear.png
new file mode 100644
index 0000000..ddb5ea8
Binary files /dev/null and b/renders/cesium_no_bilinear.png differ
diff --git a/renders/chessboard_perspective_correct.png b/renders/chessboard_perspective_correct.png
new file mode 100644
index 0000000..718823d
Binary files /dev/null and b/renders/chessboard_perspective_correct.png differ
diff --git a/renders/chessboard_perspective_incorrect.png b/renders/chessboard_perspective_incorrect.png
new file mode 100644
index 0000000..d9b7f75
Binary files /dev/null and b/renders/chessboard_perspective_incorrect.png differ
diff --git a/renders/dragon_celshaded.png b/renders/dragon_celshaded.png
new file mode 100644
index 0000000..288533b
Binary files /dev/null and b/renders/dragon_celshaded.png differ
diff --git a/renders/dragon_lambert.png b/renders/dragon_lambert.png
new file mode 100644
index 0000000..faa8625
Binary files /dev/null and b/renders/dragon_lambert.png differ
diff --git a/renders/dragon_lambert_cel.png b/renders/dragon_lambert_cel.png
new file mode 100644
index 0000000..5bde7eb
Binary files /dev/null and b/renders/dragon_lambert_cel.png differ
diff --git a/renders/dragon_outline.png b/renders/dragon_outline.png
new file mode 100644
index 0000000..e4ac7b7
Binary files /dev/null and b/renders/dragon_outline.png differ
diff --git a/renders/intro.gif b/renders/intro.gif
new file mode 100644
index 0000000..823492b
Binary files /dev/null and b/renders/intro.gif differ
diff --git a/renders/milk_truck.png b/renders/milk_truck.png
new file mode 100644
index 0000000..e789bdb
Binary files /dev/null and b/renders/milk_truck.png differ
diff --git a/renders/timings.png b/renders/timings.png
new file mode 100644
index 0000000..6933ee9
Binary files /dev/null and b/renders/timings.png differ
diff --git a/renders/toon_shading_timing.png b/renders/toon_shading_timing.png
new file mode 100644
index 0000000..2956405
Binary files /dev/null and b/renders/toon_shading_timing.png differ
diff --git a/src/main.cpp b/src/main.cpp
index a36b955..91cd3e5 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -19,111 +19,112 @@
 //-------------------------------
 
 int main(int argc, char **argv) {
-    if (argc != 2) {
-        cout << "Usage: [gltf file]" << endl;
-        return 0;
-    }
-
-	tinygltf::Scene scene;
-	tinygltf::TinyGLTFLoader loader;
-	std::string err;
-	std::string input_filename(argv[1]);
-	std::string ext = getFilePathExtension(input_filename);
-
-	bool ret = false;
-	if (ext.compare("glb") == 0) {
-		// assume binary glTF.
-		ret = loader.LoadBinaryFromFile(&scene, &err, input_filename.c_str());
-	} else {
-		// assume ascii glTF.
-		ret = loader.LoadASCIIFromFile(&scene, &err, input_filename.c_str());
-	}
-
-	if (!err.empty()) {
-		printf("Err: %s\n", err.c_str());
-	}
-
-	if (!ret) {
-		printf("Failed to parse glTF\n");
-		return -1;
-	}
-
-
-    frame = 0;
-    seconds = time(NULL);
-    fpstracker = 0;
-
-    // Launch CUDA/GL
-    if (init(scene)) {
-        // GLFW main loop
-        mainLoop();
-    }
-
+  if (argc != 2) {
+    cout << "Usage: [gltf file]" << endl;
     return 0;
+  }
+
+  tinygltf::Scene scene;
+  tinygltf::TinyGLTFLoader loader;
+  std::string err;
+  std::string input_filename(argv[1]);
+  std::string ext = getFilePathExtension(input_filename);
+
+  bool ret = false;
+  if (ext.compare("glb") == 0) {
+    // assume binary glTF.
+    ret = loader.LoadBinaryFromFile(&scene, &err, input_filename.c_str());
+  }
+  else {
+    // assume ascii glTF.
+    ret = loader.LoadASCIIFromFile(&scene, &err, input_filename.c_str());
+  }
+
+  if (!err.empty()) {
+    printf("Err: %s\n", err.c_str());
+  }
+
+  if (!ret) {
+    printf("Failed to parse glTF\n");
+    return -1;
+  }
+
+
+  frame = 0;
+  seconds = time(NULL);
+  fpstracker = 0;
+
+  // Launch CUDA/GL
+  if (init(scene)) {
+    // GLFW main loop
+    mainLoop();
+  }
+
+  return 0;
 }
 
 void mainLoop() {
-    while (!glfwWindowShouldClose(window)) {
-        glfwPollEvents();
-        runCuda();
+  while (!glfwWindowShouldClose(window)) {
+    glfwPollEvents();
+    runCuda();
 
-        time_t seconds2 = time (NULL);
+    time_t seconds2 = time(NULL);
 
-        if (seconds2 - seconds >= 1) {
+    if (seconds2 - seconds >= 1) {
 
-            fps = fpstracker / (seconds2 - seconds);
-            fpstracker = 0;
-            seconds = seconds2;
-        }
-
-        string title = "CIS565 Rasterizer | " + utilityCore::convertIntToString((int)fps) + " FPS";
-        glfwSetWindowTitle(window, title.c_str());
+      fps = fpstracker / (seconds2 - seconds);
+      fpstracker = 0;
+      seconds = seconds2;
+    }
 
-        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
-        glBindTexture(GL_TEXTURE_2D, displayImage);
-        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
-        glClear(GL_COLOR_BUFFER_BIT);
+    string title = "CIS565 Rasterizer | " + utilityCore::convertIntToString((int)fps) + " FPS";
+    glfwSetWindowTitle(window, title.c_str());
 
-        // VAO, shader program, and texture already bound
-        glDrawElements(GL_TRIANGLES, 6,  GL_UNSIGNED_SHORT, 0);
-        glfwSwapBuffers(window);
-    }
-    glfwDestroyWindow(window);
-    glfwTerminate();
+    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
+    glBindTexture(GL_TEXTURE_2D, displayImage);
+    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
+    glClear(GL_COLOR_BUFFER_BIT);
+
+    // VAO, shader program, and texture already bound
+    glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_SHORT, 0);
+    glfwSwapBuffers(window);
+  }
+  glfwDestroyWindow(window);
+  glfwTerminate();
 }
 
 //-------------------------------
 //---------RUNTIME STUFF---------
 //-------------------------------
 float scale = 1.0f;
-float x_trans = 0.0f, y_trans = 0.0f, z_trans = -10.0f;
+float x_trans = 0.0f, y_trans = 0.0f, z_trans = -5.0f;
 float x_angle = 0.0f, y_angle = 0.0f;
 void runCuda() {
-    // Map OpenGL buffer object for writing from CUDA on a single GPU
-    // No data is moved (Win & Linux). When mapped to CUDA, OpenGL should not use this buffer
-    dptr = NULL;
+  // Map OpenGL buffer object for writing from CUDA on a single GPU
+  // No data is moved (Win & Linux). When mapped to CUDA, OpenGL should not use this buffer
+  dptr = NULL;
 
-	glm::mat4 P = glm::frustum<float>(-scale * ((float)width) / ((float)height),
-		scale * ((float)width / (float)height),
-		-scale, scale, 1.0, 1000.0);
+  glm::mat4 P = glm::frustum<float>(-scale * ((float)width) / ((float)height),
+    scale * ((float)width / (float)height),
+    -scale, scale, 1.0, 1000.0);
 
-	glm::mat4 V = glm::mat4(1.0f);
+  glm::mat4 V = glm::mat4(1.0f);
 
-	glm::mat4 M =
-		glm::translate(glm::vec3(x_trans, y_trans, z_trans))
-		* glm::rotate(x_angle, glm::vec3(1.0f, 0.0f, 0.0f))
-		* glm::rotate(y_angle, glm::vec3(0.0f, 1.0f, 0.0f));
+  glm::mat4 M =
+    glm::translate(glm::vec3(x_trans, y_trans, z_trans))
+    * glm::rotate(x_angle, glm::vec3(1.0f, 0.0f, 0.0f))
+    * glm::rotate(y_angle, glm::vec3(0.0f, 1.0f, 0.0f));
 
-	glm::mat3 MV_normal = glm::transpose(glm::inverse(glm::mat3(V) * glm::mat3(M)));
-	glm::mat4 MV = V * M;
-	glm::mat4 MVP = P * MV;
+  glm::mat3 MV_normal = glm::transpose(glm::inverse(glm::mat3(V) * glm::mat3(M)));
+  glm::mat4 MV = V * M;
+  glm::mat4 MVP = P * MV;
 
-    cudaGLMapBufferObject((void **)&dptr, pbo);
-	rasterize(dptr, MVP, MV, MV_normal);
-    cudaGLUnmapBufferObject(pbo);
+  cudaGLMapBufferObject((void **)&dptr, pbo);
+  rasterize(dptr, MVP, MV, MV_normal);
+  cudaGLUnmapBufferObject(pbo);
 
-    frame++;
-    fpstracker++;
+  frame++;
+  fpstracker++;
 }
 
 //-------------------------------
@@ -131,149 +132,149 @@ void runCuda() {
 //-------------------------------
 
 bool init(const tinygltf::Scene & scene) {
-    glfwSetErrorCallback(errorCallback);
+  glfwSetErrorCallback(errorCallback);
 
-    if (!glfwInit()) {
-        return false;
-    }
+  if (!glfwInit()) {
+    return false;
+  }
 
-    width = 800;
-    height = 800;
-    window = glfwCreateWindow(width, height, "CIS 565 Pathtracer", NULL, NULL);
-    if (!window) {
-        glfwTerminate();
-        return false;
-    }
-    glfwMakeContextCurrent(window);
-    glfwSetKeyCallback(window, keyCallback);
-
-    // Set up GL context
-    glewExperimental = GL_TRUE;
-    if (glewInit() != GLEW_OK) {
-        return false;
+  width = 800;
+  height = 800;
+  window = glfwCreateWindow(width, height, "CIS 565 Pathtracer", NULL, NULL);
+  if (!window) {
+    glfwTerminate();
+    return false;
+  }
+  glfwMakeContextCurrent(window);
+  glfwSetKeyCallback(window, keyCallback);
+
+  // Set up GL context
+  glewExperimental = GL_TRUE;
+  if (glewInit() != GLEW_OK) {
+    return false;
+  }
+
+  // Initialize other stuff
+  initVAO();
+  initTextures();
+  initCuda();
+  initPBO();
+
+  // Mouse Control Callbacks
+  glfwSetMouseButtonCallback(window, mouseButtonCallback);
+  glfwSetCursorPosCallback(window, mouseMotionCallback);
+  glfwSetScrollCallback(window, mouseWheelCallback);
+
+  {
+    std::map<std::string, std::vector<std::string> >::const_iterator it(
+      scene.scenes.begin());
+    std::map<std::string, std::vector<std::string> >::const_iterator itEnd(
+      scene.scenes.end());
+
+    for (; it != itEnd; it++) {
+      for (size_t i = 0; i < it->second.size(); i++) {
+        std::cout << it->second[i]
+          << ((i != (it->second.size() - 1)) ? ", " : "");
+      }
+      std::cout << " ] " << std::endl;
     }
+  }
 
-    // Initialize other stuff
-    initVAO();
-    initTextures();
-    initCuda();
-	initPBO();
-
-	// Mouse Control Callbacks
-	glfwSetMouseButtonCallback(window, mouseButtonCallback);
-	glfwSetCursorPosCallback(window, mouseMotionCallback);
-	glfwSetScrollCallback(window, mouseWheelCallback);
 
-	{
-		std::map<std::string, std::vector<std::string> >::const_iterator it(
-			scene.scenes.begin());
-		std::map<std::string, std::vector<std::string> >::const_iterator itEnd(
-			scene.scenes.end());
+  rasterizeSetBuffers(scene);
 
-		for (; it != itEnd; it++) {
-			for (size_t i = 0; i < it->second.size(); i++) {
-				std::cout << it->second[i]
-					<< ((i != (it->second.size() - 1)) ? ", " : "");
-			}
-			std::cout << " ] " << std::endl;
-		}
-	}
+  GLuint passthroughProgram;
+  passthroughProgram = initShader();
 
+  glUseProgram(passthroughProgram);
+  glActiveTexture(GL_TEXTURE0);
 
-	rasterizeSetBuffers(scene);
-
-    GLuint passthroughProgram;
-    passthroughProgram = initShader();
-
-    glUseProgram(passthroughProgram);
-    glActiveTexture(GL_TEXTURE0);
-
-    return true;
+  return true;
 }
 
 void initPBO() {
-    // set up vertex data parameter
-    int num_texels = width * height;
-    int num_values = num_texels * 4;
-    int size_tex_data = sizeof(GLubyte) * num_values;
+  // set up vertex data parameter
+  int num_texels = width * height;
+  int num_values = num_texels * 4;
+  int size_tex_data = sizeof(GLubyte) * num_values;
 
-    // Generate a buffer ID called a PBO (Pixel Buffer Object)
-    glGenBuffers(1, &pbo);
+  // Generate a buffer ID called a PBO (Pixel Buffer Object)
+  glGenBuffers(1, &pbo);
 
-    // Make this the current UNPACK buffer (OpenGL is state-based)
-    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
+  // Make this the current UNPACK buffer (OpenGL is state-based)
+  glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
 
-    // Allocate data for the buffer. 4-channel 8-bit image
-    glBufferData(GL_PIXEL_UNPACK_BUFFER, size_tex_data, NULL, GL_DYNAMIC_COPY);
-    cudaGLRegisterBufferObject(pbo);
+  // Allocate data for the buffer. 4-channel 8-bit image
+  glBufferData(GL_PIXEL_UNPACK_BUFFER, size_tex_data, NULL, GL_DYNAMIC_COPY);
+  cudaGLRegisterBufferObject(pbo);
 
 }
 
 void initCuda() {
-    // Use device with highest Gflops/s
-    cudaGLSetGLDevice(0);
+  // Use device with highest Gflops/s
+  cudaGLSetGLDevice(0);
 
-    rasterizeInit(width, height);
+  rasterizeInit(width, height);
 
-    // Clean up on program exit
-    atexit(cleanupCuda);
+  // Clean up on program exit
+  atexit(cleanupCuda);
 }
 
 void initTextures() {
-    glGenTextures(1, &displayImage);
-    glBindTexture(GL_TEXTURE_2D, displayImage);
-    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
-    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
-    glTexImage2D( GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0, GL_BGRA,
-                  GL_UNSIGNED_BYTE, NULL);
+  glGenTextures(1, &displayImage);
+  glBindTexture(GL_TEXTURE_2D, displayImage);
+  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
+  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
+  glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0, GL_BGRA,
+    GL_UNSIGNED_BYTE, NULL);
 }
 
 void initVAO(void) {
-    GLfloat vertices[] = {
-        -1.0f, -1.0f,
-        1.0f, -1.0f,
-        1.0f,  1.0f,
-        -1.0f,  1.0f,
-    };
-
-    GLfloat texcoords[] = {
-        1.0f, 1.0f,
-        0.0f, 1.0f,
-        0.0f, 0.0f,
-        1.0f, 0.0f
-    };
-
-    GLushort indices[] = { 0, 1, 3, 3, 1, 2 };
-
-    GLuint vertexBufferObjID[3];
-    glGenBuffers(3, vertexBufferObjID);
-
-    glBindBuffer(GL_ARRAY_BUFFER, vertexBufferObjID[0]);
-    glBufferData(GL_ARRAY_BUFFER, sizeof(vertices), vertices, GL_STATIC_DRAW);
-    glVertexAttribPointer((GLuint)positionLocation, 2, GL_FLOAT, GL_FALSE, 0, 0);
-    glEnableVertexAttribArray(positionLocation);
-
-    glBindBuffer(GL_ARRAY_BUFFER, vertexBufferObjID[1]);
-    glBufferData(GL_ARRAY_BUFFER, sizeof(texcoords), texcoords, GL_STATIC_DRAW);
-    glVertexAttribPointer((GLuint)texcoordsLocation, 2, GL_FLOAT, GL_FALSE, 0, 0);
-    glEnableVertexAttribArray(texcoordsLocation);
-
-    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, vertexBufferObjID[2]);
-    glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(indices), indices, GL_STATIC_DRAW);
+  GLfloat vertices[] = {
+    -1.0f, -1.0f,
+    1.0f, -1.0f,
+    1.0f, 1.0f,
+    -1.0f, 1.0f,
+  };
+
+  GLfloat texcoords[] = {
+    1.0f, 1.0f,
+    0.0f, 1.0f,
+    0.0f, 0.0f,
+    1.0f, 0.0f
+  };
+
+  GLushort indices[] = { 0, 1, 3, 3, 1, 2 };
+
+  GLuint vertexBufferObjID[3];
+  glGenBuffers(3, vertexBufferObjID);
+
+  glBindBuffer(GL_ARRAY_BUFFER, vertexBufferObjID[0]);
+  glBufferData(GL_ARRAY_BUFFER, sizeof(vertices), vertices, GL_STATIC_DRAW);
+  glVertexAttribPointer((GLuint)positionLocation, 2, GL_FLOAT, GL_FALSE, 0, 0);
+  glEnableVertexAttribArray(positionLocation);
+
+  glBindBuffer(GL_ARRAY_BUFFER, vertexBufferObjID[1]);
+  glBufferData(GL_ARRAY_BUFFER, sizeof(texcoords), texcoords, GL_STATIC_DRAW);
+  glVertexAttribPointer((GLuint)texcoordsLocation, 2, GL_FLOAT, GL_FALSE, 0, 0);
+  glEnableVertexAttribArray(texcoordsLocation);
+
+  glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, vertexBufferObjID[2]);
+  glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(indices), indices, GL_STATIC_DRAW);
 }
 
 
 GLuint initShader() {
-    const char *attribLocations[] = { "Position", "Tex" };
-    GLuint program = glslUtility::createDefaultProgram(attribLocations, 2);
-    GLint location;
+  const char *attribLocations[] = { "Position", "Tex" };
+  GLuint program = glslUtility::createDefaultProgram(attribLocations, 2);
+  GLint location;
 
-    glUseProgram(program);
-    if ((location = glGetUniformLocation(program, "u_image")) != -1) {
-        glUniform1i(location, 0);
-    }
+  glUseProgram(program);
+  if ((location = glGetUniformLocation(program, "u_image")) != -1) {
+    glUniform1i(location, 0);
+  }
 
-    return program;
+  return program;
 }
 
 //-------------------------------
@@ -281,38 +282,38 @@ GLuint initShader() {
 //-------------------------------
 
 void cleanupCuda() {
-    if (pbo) {
-        deletePBO(&pbo);
-    }
-    if (displayImage) {
-        deleteTexture(&displayImage);
-    }
+  if (pbo) {
+    deletePBO(&pbo);
+  }
+  if (displayImage) {
+    deleteTexture(&displayImage);
+  }
 }
 
 void deletePBO(GLuint *pbo) {
-    if (pbo) {
-        // unregister this buffer object with CUDA
-        cudaGLUnregisterBufferObject(*pbo);
+  if (pbo) {
+    // unregister this buffer object with CUDA
+    cudaGLUnregisterBufferObject(*pbo);
 
-        glBindBuffer(GL_ARRAY_BUFFER, *pbo);
-        glDeleteBuffers(1, pbo);
+    glBindBuffer(GL_ARRAY_BUFFER, *pbo);
+    glDeleteBuffers(1, pbo);
 
-        *pbo = (GLuint)NULL;
-    }
+    *pbo = (GLuint)NULL;
+  }
 }
 
 void deleteTexture(GLuint *tex) {
-    glDeleteTextures(1, tex);
-    *tex = (GLuint)NULL;
+  glDeleteTextures(1, tex);
+  *tex = (GLuint)NULL;
 }
 
 void shut_down(int return_code) {
-    rasterizeFree();
-    cudaDeviceReset();
+  rasterizeFree();
+  cudaDeviceReset();
 #ifdef __APPLE__
-    glfwTerminate();
+  glfwTerminate();
 #endif
-    exit(return_code);
+  exit(return_code);
 }
 
 //------------------------------
@@ -320,22 +321,22 @@ void shut_down(int return_code) {
 //------------------------------
 
 void errorCallback(int error, const char *description) {
-    fputs(description, stderr);
+  fputs(description, stderr);
 }
 
 void keyCallback(GLFWwindow *window, int key, int scancode, int action, int mods) {
-    if (key == GLFW_KEY_ESCAPE && action == GLFW_PRESS) {
-        glfwSetWindowShouldClose(window, GL_TRUE);
-    }
+  if (key == GLFW_KEY_ESCAPE && action == GLFW_PRESS) {
+    glfwSetWindowShouldClose(window, GL_TRUE);
+  }
 }
 
 //----------------------------
 //----- util -----------------
 //----------------------------
 static std::string getFilePathExtension(const std::string &FileName) {
-	if (FileName.find_last_of(".") != std::string::npos)
-		return FileName.substr(FileName.find_last_of(".") + 1);
-	return "";
+  if (FileName.find_last_of(".") != std::string::npos)
+    return FileName.substr(FileName.find_last_of(".") + 1);
+  return "";
 }
 
 
@@ -348,52 +349,52 @@ enum ControlState { NONE = 0, ROTATE, TRANSLATE };
 ControlState mouseState = NONE;
 void mouseButtonCallback(GLFWwindow* window, int button, int action, int mods)
 {
-	if (action == GLFW_PRESS)
-	{
-		if (button == GLFW_MOUSE_BUTTON_LEFT)
-		{
-			mouseState = ROTATE;
-		}
-		else if (button == GLFW_MOUSE_BUTTON_RIGHT)
-		{
-			mouseState = TRANSLATE;
-		}
-
-	}
-	else if (action == GLFW_RELEASE)
-	{
-		mouseState = NONE;
-	}
+  if (action == GLFW_PRESS)
+  {
+    if (button == GLFW_MOUSE_BUTTON_LEFT)
+    {
+      mouseState = ROTATE;
+    }
+    else if (button == GLFW_MOUSE_BUTTON_RIGHT)
+    {
+      mouseState = TRANSLATE;
+    }
+
+  }
+  else if (action == GLFW_RELEASE)
+  {
+    mouseState = NONE;
+  }
 }
 
 double lastx = (double)width / 2;
 double lasty = (double)height / 2;
 void mouseMotionCallback(GLFWwindow* window, double xpos, double ypos)
 {
-	const double s_r = 0.01;
-	const double s_t = 0.01;
-
-	double diffx = xpos - lastx;
-	double diffy = ypos - lasty;
-	lastx = xpos;
-	lasty = ypos;
-
-	if (mouseState == ROTATE)
-	{
-		//rotate
-		x_angle += (float)s_r * diffy;
-		y_angle += (float)s_r * diffx;
-	}
-	else if (mouseState == TRANSLATE)
-	{
-		//translate
-		x_trans += (float)(s_t * diffx);
-		y_trans += (float)(-s_t * diffy);
-	}
+  const double s_r = 0.01;
+  const double s_t = 0.01;
+
+  double diffx = xpos - lastx;
+  double diffy = ypos - lasty;
+  lastx = xpos;
+  lasty = ypos;
+
+  if (mouseState == ROTATE)
+  {
+    //rotate
+    x_angle += (float)s_r * diffy;
+    y_angle += (float)s_r * diffx;
+  }
+  else if (mouseState == TRANSLATE)
+  {
+    //translate
+    x_trans += (float)(s_t * diffx);
+    y_trans += (float)(-s_t * diffy);
+  }
 }
 
 void mouseWheelCallback(GLFWwindow* window, double xoffset, double yoffset)
 {
-	const double s = 1.0;	// sensitivity
-	z_trans += (float)(s * yoffset);
+  const double s = 1.0;	// sensitivity
+  z_trans += (float)(s * yoffset);
 }
diff --git a/src/rasterize.cu b/src/rasterize.cu
index 1262a09..15da344 100644
--- a/src/rasterize.cu
+++ b/src/rasterize.cu
@@ -10,93 +10,136 @@
 #include <cstdio>
 #include <cuda.h>
 #include <cuda_runtime.h>
+#include <thrust/device_ptr.h>
 #include <thrust/random.h>
+#include <thrust/remove.h>
 #include <util/checkCUDAError.h>
 #include <util/tiny_gltf_loader.h>
 #include "rasterizeTools.h"
 #include "rasterize.h"
 #include <glm/gtc/quaternion.hpp>
 #include <glm/gtc/matrix_transform.hpp>
-
-namespace {
-
-	typedef unsigned short VertexIndex;
-	typedef glm::vec3 VertexAttributePosition;
-	typedef glm::vec3 VertexAttributeNormal;
-	typedef glm::vec2 VertexAttributeTexcoord;
-	typedef unsigned char TextureData;
-
-	typedef unsigned char BufferByte;
-
-	enum PrimitiveType{
-		Point = 1,
-		Line = 2,
-		Triangle = 3
-	};
-
-	struct VertexOut {
-		glm::vec4 pos;
-
-		// TODO: add new attributes to your VertexOut
-		// The attributes listed below might be useful, 
-		// but always feel free to modify on your own
-
-		 glm::vec3 eyePos;	// eye space position used for shading
-		 glm::vec3 eyeNor;	// eye space normal used for shading, cuz normal will go wrong after perspective transformation
-		// glm::vec3 col;
-		 glm::vec2 texcoord0;
-		 TextureData* dev_diffuseTex = NULL;
-		// int texWidth, texHeight;
-		// ...
-	};
-
-	struct Primitive {
-		PrimitiveType primitiveType = Triangle;	// C++ 11 init
-		VertexOut v[3];
-	};
-
-	struct Fragment {
-		glm::vec3 color;
-
-		// TODO: add new attributes to your Fragment
-		// The attributes listed below might be useful, 
-		// but always feel free to modify on your own
-
-		// glm::vec3 eyePos;	// eye space position used for shading
-		// glm::vec3 eyeNor;
-		// VertexAttributeTexcoord texcoord0;
-		// TextureData* dev_diffuseTex;
-		// ...
-	};
-
-	struct PrimitiveDevBufPointers {
-		int primitiveMode;	//from tinygltfloader macro
-		PrimitiveType primitiveType;
-		int numPrimitives;
-		int numIndices;
-		int numVertices;
-
-		// Vertex In, const after loaded
-		VertexIndex* dev_indices;
-		VertexAttributePosition* dev_position;
-		VertexAttributeNormal* dev_normal;
-		VertexAttributeTexcoord* dev_texcoord0;
-
-		// Materials, add more attributes when needed
-		TextureData* dev_diffuseTex;
-		int diffuseTexWidth;
-		int diffuseTexHeight;
-		// TextureData* dev_specularTex;
-		// TextureData* dev_normalTex;
-		// ...
-
-		// Vertex Out, vertex used for rasterization, this is changing every frame
-		VertexOut* dev_verticesOut;
-
-		// TODO: add more attributes when needed
-	};
+#include <iostream>
+#include <vector>
+
+#define TEXTURE_MAP 1
+#define PERSPECTIVE_CORRECT 1
+#define BILINEAR_INTERPOLATION 1
+#define BACKFACE_CULL 1
+#define NORMAL_INTERPOLATE 1
+
+#define CEL_SHADE 4
+#define SOBEL_GRID 8
+#define USE_SHARED_SOBEL 1
+
+namespace rasterizer {
+
+  typedef unsigned short VertexIndex;
+  typedef glm::vec3 VertexAttributePosition;
+  typedef glm::vec3 VertexAttributeNormal;
+  typedef glm::vec2 VertexAttributeTexcoord;
+  typedef unsigned char TextureData;
+
+  typedef unsigned char BufferByte;
+
+  enum PrimitiveType{
+    Point = 1,
+    Line = 2,
+    Triangle = 3
+  };
+
+  struct VertexOut {
+    glm::vec4 pos;
+
+    // TODO: add new attributes to your VertexOut
+    // The attributes listed below might be useful, 
+    // but always feel free to modify on your own
+
+    glm::vec3 eyePos;	// eye space position used for shading
+    glm::vec3 eyeNor;	// eye space normal used for shading, cuz normal will go wrong after perspective transformation
+    // glm::vec3 col;
+#if TEXTURE_MAP == 1
+    glm::vec2 texcoord0;
+    TextureData* dev_diffuseTex = NULL;
+    int texWidth, texHeight, texComp;
+#endif
+    // ...
+  };
+
+  struct Primitive {
+    PrimitiveType primitiveType = Triangle;	// C++ 11 init
+    VertexOut v[3];
+  };
+
+
+
+  struct Fragment {
+    glm::vec3 color;
+
+    // TODO: add new attributes to your Fragment
+    // The attributes listed below might be useful, 
+    // but always feel free to modify on your own
+
+    glm::vec3 eyePos;	// eye space position used for shading
+    glm::vec3 eyeNor;
+
+    float z;
+    float sobelx;
+    float sobely;
+
+    TextureData * diffuseTex;
+    int texWidth;
+    int texHeight;
+    int texComp;
+    glm::vec2 texcoord0;
+  };
+
+  struct FragmentMutex {
+    int mutex;
+  };
+
+  struct PrimitiveDevBufPointers {
+    int primitiveMode;	//from tinygltfloader macro
+    PrimitiveType primitiveType;
+    int numPrimitives;
+    int numIndices;
+    int numVertices;
+
+    // Vertex In, const after loaded
+    VertexIndex* dev_indices;
+    VertexAttributePosition* dev_position;
+    VertexAttributeNormal* dev_normal;
+    VertexAttributeTexcoord* dev_texcoord0;
+
+    // Materials, add more attributes when needed
+#if TEXTURE_MAP == 1
+    TextureData* dev_diffuseTex;
+    int texWidth;
+    int texHeight;
+    int texComp;
+#endif
+    // TextureData* dev_specularTex;
+    // TextureData* dev_normalTex;
+    // ...
+
+    // Vertex Out, vertex used for rasterization, this is changing every frame
+    VertexOut* dev_verticesOut;
+
+    // TODO: add more attributes when needed
+  };
 
 }
+using namespace rasterizer;
+
+struct Light {
+  glm::vec4 worldPos;
+  glm::vec3 eyePos;
+  float emittance;
+  Light(glm::vec4 worldPos, float emittance) {
+    this->worldPos = worldPos;
+    this->emittance = emittance;
+  }
+};
 
 static std::map<std::string, std::vector<PrimitiveDevBufPointers>> mesh2PrimitivesMap;
 
@@ -104,82 +147,73 @@ static std::map<std::string, std::vector<PrimitiveDevBufPointers>> mesh2Primitiv
 static int width = 0;
 static int height = 0;
 
+#define AMBIENT_LIGHT 0.2f
+std::vector<Light> lights = { Light(glm::vec4(0.0f, 10.0f, 4.0f, 1.0f), 1.0f) };
+
 static int totalNumPrimitives = 0;
 static Primitive *dev_primitives = NULL;
 static Fragment *dev_fragmentBuffer = NULL;
 static glm::vec3 *dev_framebuffer = NULL;
-
-static int * dev_depth = NULL;	// you might need this buffer when doing depth test
+static Light *dev_lights = NULL;
+static FragmentMutex *dev_fragmentMutexes = NULL;
 
 /**
  * Kernel that writes the image to the OpenGL PBO directly.
  */
-__global__ 
+__global__
 void sendImageToPBO(uchar4 *pbo, int w, int h, glm::vec3 *image) {
-    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
-    int y = (blockIdx.y * blockDim.y) + threadIdx.y;
-    int index = x + (y * w);
-
-    if (x < w && y < h) {
-        glm::vec3 color;
-        color.x = glm::clamp(image[index].x, 0.0f, 1.0f) * 255.0;
-        color.y = glm::clamp(image[index].y, 0.0f, 1.0f) * 255.0;
-        color.z = glm::clamp(image[index].z, 0.0f, 1.0f) * 255.0;
-        // Each thread writes one pixel location in the texture (textel)
-        pbo[index].w = 0;
-        pbo[index].x = color.x;
-        pbo[index].y = color.y;
-        pbo[index].z = color.z;
-    }
+  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+  int index = x + (y * w);
+
+  if (x < w && y < h) {
+    glm::vec3 color;
+    color.x = glm::clamp(image[index].x, 0.0f, 1.0f) * 255.0;
+    color.y = glm::clamp(image[index].y, 0.0f, 1.0f) * 255.0;
+    color.z = glm::clamp(image[index].z, 0.0f, 1.0f) * 255.0;
+    // Each thread writes one pixel location in the texture (textel)
+    pbo[index].w = 0;
+    pbo[index].x = color.x;
+    pbo[index].y = color.y;
+    pbo[index].z = color.z;
+  }
 }
 
-/** 
-* Writes fragment colors to the framebuffer
-*/
-__global__
-void render(int w, int h, Fragment *fragmentBuffer, glm::vec3 *framebuffer) {
-    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
-    int y = (blockIdx.y * blockDim.y) + threadIdx.y;
-    int index = x + (y * w);
-
-    if (x < w && y < h) {
-        framebuffer[index] = fragmentBuffer[index].color;
-
-		// TODO: add your fragment shader code here
 
-    }
-}
 
 /**
  * Called once at the beginning of the program to allocate memory.
  */
 void rasterizeInit(int w, int h) {
-    width = w;
-    height = h;
-	cudaFree(dev_fragmentBuffer);
-	cudaMalloc(&dev_fragmentBuffer, width * height * sizeof(Fragment));
-	cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
-    cudaFree(dev_framebuffer);
-    cudaMalloc(&dev_framebuffer,   width * height * sizeof(glm::vec3));
-    cudaMemset(dev_framebuffer, 0, width * height * sizeof(glm::vec3));
-    
-	cudaFree(dev_depth);
-	cudaMalloc(&dev_depth, width * height * sizeof(int));
-
-	checkCUDAError("rasterizeInit");
+  width = w;
+  height = h;
+  cudaFree(dev_fragmentBuffer);
+  cudaMalloc(&dev_fragmentBuffer, width * height * sizeof(Fragment));
+  cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
+  cudaFree(dev_framebuffer);
+  cudaMalloc(&dev_framebuffer, width * height * sizeof(glm::vec3));
+  cudaMemset(dev_framebuffer, 0, width * height * sizeof(glm::vec3));
+
+  cudaFree(dev_lights);
+  cudaMalloc(&dev_lights, lights.size() * sizeof(Light));
+
+  cudaFree(dev_fragmentMutexes);
+  cudaMalloc(&dev_fragmentMutexes, width * height * sizeof(FragmentMutex));
+
+  checkCUDAError("rasterizeInit");
 }
 
 __global__
-void initDepth(int w, int h, int * depth)
-{
-	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
-	int y = (blockIdx.y * blockDim.y) + threadIdx.y;
-
-	if (x < w && y < h)
-	{
-		int index = x + (y * w);
-		depth[index] = INT_MAX;
-	}
+void initMutexes(int w, int h, FragmentMutex * mutexes, Fragment * fragments) {
+  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+
+  if (x < w && y < h)
+  {
+    int index = x + (y * w);
+    mutexes[index].mutex = 0;
+    fragments[index].z = FLT_MAX;
+  }
 }
 
 
@@ -187,551 +221,832 @@ void initDepth(int w, int h, int * depth)
 * kern function with support for stride to sometimes replace cudaMemcpy
 * One thread is responsible for copying one component
 */
-__global__ 
+__global__
 void _deviceBufferCopy(int N, BufferByte* dev_dst, const BufferByte* dev_src, int n, int byteStride, int byteOffset, int componentTypeByteSize) {
-	
-	// Attribute (vec3 position)
-	// component (3 * float)
-	// byte (4 * byte)
-
-	// id of component
-	int i = (blockIdx.x * blockDim.x) + threadIdx.x;
-
-	if (i < N) {
-		int count = i / n;
-		int offset = i - count * n;	// which component of the attribute
-
-		for (int j = 0; j < componentTypeByteSize; j++) {
-			
-			dev_dst[count * componentTypeByteSize * n 
-				+ offset * componentTypeByteSize 
-				+ j]
-
-				= 
-
-			dev_src[byteOffset 
-				+ count * (byteStride == 0 ? componentTypeByteSize * n : byteStride) 
-				+ offset * componentTypeByteSize 
-				+ j];
-		}
-	}
-	
+
+  // Attribute (vec3 position)
+  // component (3 * float)
+  // byte (4 * byte)
+
+  // id of component
+  int i = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+  if (i < N) {
+    int count = i / n;
+    int offset = i - count * n;	// which component of the attribute
+
+    for (int j = 0; j < componentTypeByteSize; j++) {
+
+      dev_dst[count * componentTypeByteSize * n
+        + offset * componentTypeByteSize
+        + j]
+
+        =
+
+        dev_src[byteOffset
+        + count * (byteStride == 0 ? componentTypeByteSize * n : byteStride)
+        + offset * componentTypeByteSize
+        + j];
+    }
+  }
+
 
 }
 
 __global__
 void _nodeMatrixTransform(
-	int numVertices,
-	VertexAttributePosition* position,
-	VertexAttributeNormal* normal,
-	glm::mat4 MV, glm::mat3 MV_normal) {
-
-	// vertex id
-	int vid = (blockIdx.x * blockDim.x) + threadIdx.x;
-	if (vid < numVertices) {
-		position[vid] = glm::vec3(MV * glm::vec4(position[vid], 1.0f));
-		normal[vid] = glm::normalize(MV_normal * normal[vid]);
-	}
+int numVertices,
+VertexAttributePosition* position,
+VertexAttributeNormal* normal,
+glm::mat4 MV, glm::mat3 MV_normal) {
+
+  // vertex id
+  int vid = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (vid < numVertices) {
+    position[vid] = glm::vec3(MV * glm::vec4(position[vid], 1.0f));
+    normal[vid] = glm::normalize(MV_normal * normal[vid]);
+  }
 }
 
 glm::mat4 getMatrixFromNodeMatrixVector(const tinygltf::Node & n) {
-	
-	glm::mat4 curMatrix(1.0);
-
-	const std::vector<double> &m = n.matrix;
-	if (m.size() > 0) {
-		// matrix, copy it
-
-		for (int i = 0; i < 4; i++) {
-			for (int j = 0; j < 4; j++) {
-				curMatrix[i][j] = (float)m.at(4 * i + j);
-			}
-		}
-	} else {
-		// no matrix, use rotation, scale, translation
-
-		if (n.translation.size() > 0) {
-			curMatrix[3][0] = n.translation[0];
-			curMatrix[3][1] = n.translation[1];
-			curMatrix[3][2] = n.translation[2];
-		}
-
-		if (n.rotation.size() > 0) {
-			glm::mat4 R;
-			glm::quat q;
-			q[0] = n.rotation[0];
-			q[1] = n.rotation[1];
-			q[2] = n.rotation[2];
-
-			R = glm::mat4_cast(q);
-			curMatrix = curMatrix * R;
-		}
-
-		if (n.scale.size() > 0) {
-			curMatrix = curMatrix * glm::scale(glm::vec3(n.scale[0], n.scale[1], n.scale[2]));
-		}
-	}
-
-	return curMatrix;
+
+  glm::mat4 curMatrix(1.0);
+
+  const std::vector<double> &m = n.matrix;
+  if (m.size() > 0) {
+    // matrix, copy it
+
+    for (int i = 0; i < 4; i++) {
+      for (int j = 0; j < 4; j++) {
+        curMatrix[i][j] = (float)m.at(4 * i + j);
+      }
+    }
+  }
+  else {
+    // no matrix, use rotation, scale, translation
+
+    if (n.translation.size() > 0) {
+      curMatrix[3][0] = n.translation[0];
+      curMatrix[3][1] = n.translation[1];
+      curMatrix[3][2] = n.translation[2];
+    }
+
+    if (n.rotation.size() > 0) {
+      glm::mat4 R;
+      glm::quat q;
+      q[0] = n.rotation[0];
+      q[1] = n.rotation[1];
+      q[2] = n.rotation[2];
+
+      R = glm::mat4_cast(q);
+      curMatrix = curMatrix * R;
+    }
+
+    if (n.scale.size() > 0) {
+      curMatrix = curMatrix * glm::scale(glm::vec3(n.scale[0], n.scale[1], n.scale[2]));
+    }
+  }
+
+  return curMatrix;
 }
 
-void traverseNode (
-	std::map<std::string, glm::mat4> & n2m,
-	const tinygltf::Scene & scene,
-	const std::string & nodeString,
-	const glm::mat4 & parentMatrix
-	) 
+void traverseNode(
+  std::map<std::string, glm::mat4> & n2m,
+  const tinygltf::Scene & scene,
+  const std::string & nodeString,
+  const glm::mat4 & parentMatrix
+  )
 {
-	const tinygltf::Node & n = scene.nodes.at(nodeString);
-	glm::mat4 M = parentMatrix * getMatrixFromNodeMatrixVector(n);
-	n2m.insert(std::pair<std::string, glm::mat4>(nodeString, M));
+  const tinygltf::Node & n = scene.nodes.at(nodeString);
+  glm::mat4 M = parentMatrix * getMatrixFromNodeMatrixVector(n);
+  n2m.insert(std::pair<std::string, glm::mat4>(nodeString, M));
 
-	auto it = n.children.begin();
-	auto itEnd = n.children.end();
+  auto it = n.children.begin();
+  auto itEnd = n.children.end();
 
-	for (; it != itEnd; ++it) {
-		traverseNode(n2m, scene, *it, M);
-	}
+  for (; it != itEnd; ++it) {
+    traverseNode(n2m, scene, *it, M);
+  }
 }
 
 void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 
-	totalNumPrimitives = 0;
+  totalNumPrimitives = 0;
 
-	std::map<std::string, BufferByte*> bufferViewDevPointers;
+  std::map<std::string, BufferByte*> bufferViewDevPointers;
 
-	// 1. copy all `bufferViews` to device memory
-	{
-		std::map<std::string, tinygltf::BufferView>::const_iterator it(
-			scene.bufferViews.begin());
-		std::map<std::string, tinygltf::BufferView>::const_iterator itEnd(
-			scene.bufferViews.end());
+  // 1. copy all `bufferViews` to device memory
+  {
+    std::map<std::string, tinygltf::BufferView>::const_iterator it(
+      scene.bufferViews.begin());
+    std::map<std::string, tinygltf::BufferView>::const_iterator itEnd(
+      scene.bufferViews.end());
 
-		for (; it != itEnd; it++) {
-			const std::string key = it->first;
-			const tinygltf::BufferView &bufferView = it->second;
-			if (bufferView.target == 0) {
-				continue; // Unsupported bufferView.
-			}
+    for (; it != itEnd; it++) {
+      const std::string key = it->first;
+      const tinygltf::BufferView &bufferView = it->second;
+      if (bufferView.target == 0) {
+        continue; // Unsupported bufferView.
+      }
 
-			const tinygltf::Buffer &buffer = scene.buffers.at(bufferView.buffer);
+      const tinygltf::Buffer &buffer = scene.buffers.at(bufferView.buffer);
 
-			BufferByte* dev_bufferView;
-			cudaMalloc(&dev_bufferView, bufferView.byteLength);
-			cudaMemcpy(dev_bufferView, &buffer.data.front() + bufferView.byteOffset, bufferView.byteLength, cudaMemcpyHostToDevice);
+      BufferByte* dev_bufferView;
+      cudaMalloc(&dev_bufferView, bufferView.byteLength);
+      cudaMemcpy(dev_bufferView, &buffer.data.front() + bufferView.byteOffset, bufferView.byteLength, cudaMemcpyHostToDevice);
 
-			checkCUDAError("Set BufferView Device Mem");
+      checkCUDAError("Set BufferView Device Mem");
 
-			bufferViewDevPointers.insert(std::make_pair(key, dev_bufferView));
+      bufferViewDevPointers.insert(std::make_pair(key, dev_bufferView));
 
-		}
-	}
+    }
+  }
 
 
 
-	// 2. for each mesh: 
-	//		for each primitive: 
-	//			build device buffer of indices, materail, and each attributes
-	//			and store these pointers in a map
-	{
-
-		std::map<std::string, glm::mat4> nodeString2Matrix;
-		auto rootNodeNamesList = scene.scenes.at(scene.defaultScene);
-
-		{
-			auto it = rootNodeNamesList.begin();
-			auto itEnd = rootNodeNamesList.end();
-			for (; it != itEnd; ++it) {
-				traverseNode(nodeString2Matrix, scene, *it, glm::mat4(1.0f));
-			}
-		}
+  // 2. for each mesh: 
+  //		for each primitive: 
+  //			build device buffer of indices, materail, and each attributes
+  //			and store these pointers in a map
+  {
 
+    std::map<std::string, glm::mat4> nodeString2Matrix;
+    auto rootNodeNamesList = scene.scenes.at(scene.defaultScene);
+
+    {
+      auto it = rootNodeNamesList.begin();
+      auto itEnd = rootNodeNamesList.end();
+      for (; it != itEnd; ++it) {
+        traverseNode(nodeString2Matrix, scene, *it, glm::mat4(1.0f));
+      }
+    }
 
-		// parse through node to access mesh
-
-		auto itNode = nodeString2Matrix.begin();
-		auto itEndNode = nodeString2Matrix.end();
-		for (; itNode != itEndNode; ++itNode) {
-
-			const tinygltf::Node & N = scene.nodes.at(itNode->first);
-			const glm::mat4 & matrix = itNode->second;
-			const glm::mat3 & matrixNormal = glm::transpose(glm::inverse(glm::mat3(matrix)));
-
-			auto itMeshName = N.meshes.begin();
-			auto itEndMeshName = N.meshes.end();
-
-			for (; itMeshName != itEndMeshName; ++itMeshName) {
-
-				const tinygltf::Mesh & mesh = scene.meshes.at(*itMeshName);
-
-				auto res = mesh2PrimitivesMap.insert(std::pair<std::string, std::vector<PrimitiveDevBufPointers>>(mesh.name, std::vector<PrimitiveDevBufPointers>()));
-				std::vector<PrimitiveDevBufPointers> & primitiveVector = (res.first)->second;
-
-				// for each primitive
-				for (size_t i = 0; i < mesh.primitives.size(); i++) {
-					const tinygltf::Primitive &primitive = mesh.primitives[i];
-
-					if (primitive.indices.empty())
-						return;
-
-					// TODO: add new attributes for your PrimitiveDevBufPointers when you add new attributes
-					VertexIndex* dev_indices = NULL;
-					VertexAttributePosition* dev_position = NULL;
-					VertexAttributeNormal* dev_normal = NULL;
-					VertexAttributeTexcoord* dev_texcoord0 = NULL;
-
-					// ----------Indices-------------
-
-					const tinygltf::Accessor &indexAccessor = scene.accessors.at(primitive.indices);
-					const tinygltf::BufferView &bufferView = scene.bufferViews.at(indexAccessor.bufferView);
-					BufferByte* dev_bufferView = bufferViewDevPointers.at(indexAccessor.bufferView);
-
-					// assume type is SCALAR for indices
-					int n = 1;
-					int numIndices = indexAccessor.count;
-					int componentTypeByteSize = sizeof(VertexIndex);
-					int byteLength = numIndices * n * componentTypeByteSize;
-
-					dim3 numThreadsPerBlock(128);
-					dim3 numBlocks((numIndices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
-					cudaMalloc(&dev_indices, byteLength);
-					_deviceBufferCopy << <numBlocks, numThreadsPerBlock >> > (
-						numIndices,
-						(BufferByte*)dev_indices,
-						dev_bufferView,
-						n,
-						indexAccessor.byteStride,
-						indexAccessor.byteOffset,
-						componentTypeByteSize);
-
-
-					checkCUDAError("Set Index Buffer");
-
-
-					// ---------Primitive Info-------
-
-					// Warning: LINE_STRIP is not supported in tinygltfloader
-					int numPrimitives;
-					PrimitiveType primitiveType;
-					switch (primitive.mode) {
-					case TINYGLTF_MODE_TRIANGLES:
-						primitiveType = PrimitiveType::Triangle;
-						numPrimitives = numIndices / 3;
-						break;
-					case TINYGLTF_MODE_TRIANGLE_STRIP:
-						primitiveType = PrimitiveType::Triangle;
-						numPrimitives = numIndices - 2;
-						break;
-					case TINYGLTF_MODE_TRIANGLE_FAN:
-						primitiveType = PrimitiveType::Triangle;
-						numPrimitives = numIndices - 2;
-						break;
-					case TINYGLTF_MODE_LINE:
-						primitiveType = PrimitiveType::Line;
-						numPrimitives = numIndices / 2;
-						break;
-					case TINYGLTF_MODE_LINE_LOOP:
-						primitiveType = PrimitiveType::Line;
-						numPrimitives = numIndices + 1;
-						break;
-					case TINYGLTF_MODE_POINTS:
-						primitiveType = PrimitiveType::Point;
-						numPrimitives = numIndices;
-						break;
-					default:
-						// output error
-						break;
-					};
-
-
-					// ----------Attributes-------------
-
-					auto it(primitive.attributes.begin());
-					auto itEnd(primitive.attributes.end());
-
-					int numVertices = 0;
-					// for each attribute
-					for (; it != itEnd; it++) {
-						const tinygltf::Accessor &accessor = scene.accessors.at(it->second);
-						const tinygltf::BufferView &bufferView = scene.bufferViews.at(accessor.bufferView);
-
-						int n = 1;
-						if (accessor.type == TINYGLTF_TYPE_SCALAR) {
-							n = 1;
-						}
-						else if (accessor.type == TINYGLTF_TYPE_VEC2) {
-							n = 2;
-						}
-						else if (accessor.type == TINYGLTF_TYPE_VEC3) {
-							n = 3;
-						}
-						else if (accessor.type == TINYGLTF_TYPE_VEC4) {
-							n = 4;
-						}
-
-						BufferByte * dev_bufferView = bufferViewDevPointers.at(accessor.bufferView);
-						BufferByte ** dev_attribute = NULL;
-
-						numVertices = accessor.count;
-						int componentTypeByteSize;
-
-						// Note: since the type of our attribute array (dev_position) is static (float32)
-						// We assume the glTF model attribute type are 5126(FLOAT) here
-
-						if (it->first.compare("POSITION") == 0) {
-							componentTypeByteSize = sizeof(VertexAttributePosition) / n;
-							dev_attribute = (BufferByte**)&dev_position;
-						}
-						else if (it->first.compare("NORMAL") == 0) {
-							componentTypeByteSize = sizeof(VertexAttributeNormal) / n;
-							dev_attribute = (BufferByte**)&dev_normal;
-						}
-						else if (it->first.compare("TEXCOORD_0") == 0) {
-							componentTypeByteSize = sizeof(VertexAttributeTexcoord) / n;
-							dev_attribute = (BufferByte**)&dev_texcoord0;
-						}
-
-						std::cout << accessor.bufferView << "  -  " << it->second << "  -  " << it->first << '\n';
-
-						dim3 numThreadsPerBlock(128);
-						dim3 numBlocks((n * numVertices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
-						int byteLength = numVertices * n * componentTypeByteSize;
-						cudaMalloc(dev_attribute, byteLength);
-
-						_deviceBufferCopy << <numBlocks, numThreadsPerBlock >> > (
-							n * numVertices,
-							*dev_attribute,
-							dev_bufferView,
-							n,
-							accessor.byteStride,
-							accessor.byteOffset,
-							componentTypeByteSize);
-
-						std::string msg = "Set Attribute Buffer: " + it->first;
-						checkCUDAError(msg.c_str());
-					}
-
-					// malloc for VertexOut
-					VertexOut* dev_vertexOut;
-					cudaMalloc(&dev_vertexOut, numVertices * sizeof(VertexOut));
-					checkCUDAError("Malloc VertexOut Buffer");
-
-					// ----------Materials-------------
-
-					// You can only worry about this part once you started to 
-					// implement textures for your rasterizer
-					TextureData* dev_diffuseTex = NULL;
-					int diffuseTexWidth = 0;
-					int diffuseTexHeight = 0;
-					if (!primitive.material.empty()) {
-						const tinygltf::Material &mat = scene.materials.at(primitive.material);
-						printf("material.name = %s\n", mat.name.c_str());
-
-						if (mat.values.find("diffuse") != mat.values.end()) {
-							std::string diffuseTexName = mat.values.at("diffuse").string_value;
-							if (scene.textures.find(diffuseTexName) != scene.textures.end()) {
-								const tinygltf::Texture &tex = scene.textures.at(diffuseTexName);
-								if (scene.images.find(tex.source) != scene.images.end()) {
-									const tinygltf::Image &image = scene.images.at(tex.source);
-
-									size_t s = image.image.size() * sizeof(TextureData);
-									cudaMalloc(&dev_diffuseTex, s);
-									cudaMemcpy(dev_diffuseTex, &image.image.at(0), s, cudaMemcpyHostToDevice);
-									
-									diffuseTexWidth = image.width;
-									diffuseTexHeight = image.height;
-
-									checkCUDAError("Set Texture Image data");
-								}
-							}
-						}
-
-						// TODO: write your code for other materails
-						// You may have to take a look at tinygltfloader
-						// You can also use the above code loading diffuse material as a start point 
-					}
-
-
-					// ---------Node hierarchy transform--------
-					cudaDeviceSynchronize();
-					
-					dim3 numBlocksNodeTransform((numVertices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
-					_nodeMatrixTransform << <numBlocksNodeTransform, numThreadsPerBlock >> > (
-						numVertices,
-						dev_position,
-						dev_normal,
-						matrix,
-						matrixNormal);
-
-					checkCUDAError("Node hierarchy transformation");
-
-					// at the end of the for loop of primitive
-					// push dev pointers to map
-					primitiveVector.push_back(PrimitiveDevBufPointers{
-						primitive.mode,
-						primitiveType,
-						numPrimitives,
-						numIndices,
-						numVertices,
-
-						dev_indices,
-						dev_position,
-						dev_normal,
-						dev_texcoord0,
-
-						dev_diffuseTex,
-						diffuseTexWidth,
-						diffuseTexHeight,
-
-						dev_vertexOut	//VertexOut
-					});
-
-					totalNumPrimitives += numPrimitives;
-
-				} // for each primitive
-
-			} // for each mesh
-
-		} // for each node
-
-	}
-	
-
-	// 3. Malloc for dev_primitives
-	{
-		cudaMalloc(&dev_primitives, totalNumPrimitives * sizeof(Primitive));
-	}
-	
-
-	// Finally, cudaFree raw dev_bufferViews
-	{
-
-		std::map<std::string, BufferByte*>::const_iterator it(bufferViewDevPointers.begin());
-		std::map<std::string, BufferByte*>::const_iterator itEnd(bufferViewDevPointers.end());
-			
-			//bufferViewDevPointers
-
-		for (; it != itEnd; it++) {
-			cudaFree(it->second);
-		}
-
-		checkCUDAError("Free BufferView Device Mem");
-	}
+
+    // parse through node to access mesh
+
+    auto itNode = nodeString2Matrix.begin();
+    auto itEndNode = nodeString2Matrix.end();
+    for (; itNode != itEndNode; ++itNode) {
+
+      const tinygltf::Node & N = scene.nodes.at(itNode->first);
+      const glm::mat4 & matrix = itNode->second;
+      const glm::mat3 & matrixNormal = glm::transpose(glm::inverse(glm::mat3(matrix)));
+
+      auto itMeshName = N.meshes.begin();
+      auto itEndMeshName = N.meshes.end();
+
+      for (; itMeshName != itEndMeshName; ++itMeshName) {
+
+        const tinygltf::Mesh & mesh = scene.meshes.at(*itMeshName);
+
+        auto res = mesh2PrimitivesMap.insert(std::pair<std::string, std::vector<PrimitiveDevBufPointers>>(mesh.name, std::vector<PrimitiveDevBufPointers>()));
+        std::vector<PrimitiveDevBufPointers> & primitiveVector = (res.first)->second;
+
+        // for each primitive
+        for (size_t i = 0; i < mesh.primitives.size(); i++) {
+          const tinygltf::Primitive &primitive = mesh.primitives[i];
+
+          if (primitive.indices.empty())
+            return;
+
+          VertexIndex* dev_indices;
+          VertexAttributePosition* dev_position;
+          VertexAttributeNormal* dev_normal;
+          VertexAttributeTexcoord* dev_texcoord0;
+
+
+          // ----------Indices-------------
+
+          const tinygltf::Accessor &indexAccessor = scene.accessors.at(primitive.indices);
+          const tinygltf::BufferView &bufferView = scene.bufferViews.at(indexAccessor.bufferView);
+          BufferByte* dev_bufferView = bufferViewDevPointers.at(indexAccessor.bufferView);
+
+          // assume type is SCALAR for indices
+          int n = 1;
+          int numIndices = indexAccessor.count;
+          int componentTypeByteSize = sizeof(VertexIndex);
+          int byteLength = numIndices * n * componentTypeByteSize;
+
+          dim3 numThreadsPerBlock(128);
+          dim3 numBlocks((numIndices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
+          cudaMalloc(&dev_indices, byteLength);
+          _deviceBufferCopy << <numBlocks, numThreadsPerBlock >> > (
+            numIndices,
+            (BufferByte*)dev_indices,
+            dev_bufferView,
+            n,
+            indexAccessor.byteStride,
+            indexAccessor.byteOffset,
+            componentTypeByteSize);
+
+
+          checkCUDAError("Set Index Buffer");
+
+
+          // ---------Primitive Info-------
+
+          // Warning: LINE_STRIP is not supported in tinygltfloader
+          int numPrimitives;
+          PrimitiveType primitiveType;
+          switch (primitive.mode) {
+          case TINYGLTF_MODE_TRIANGLES:
+            primitiveType = PrimitiveType::Triangle;
+            numPrimitives = numIndices / 3;
+            break;
+          case TINYGLTF_MODE_TRIANGLE_STRIP:
+            primitiveType = PrimitiveType::Triangle;
+            numPrimitives = numIndices - 2;
+            break;
+          case TINYGLTF_MODE_TRIANGLE_FAN:
+            primitiveType = PrimitiveType::Triangle;
+            numPrimitives = numIndices - 2;
+            break;
+          case TINYGLTF_MODE_LINE:
+            primitiveType = PrimitiveType::Line;
+            numPrimitives = numIndices / 2;
+            break;
+          case TINYGLTF_MODE_LINE_LOOP:
+            primitiveType = PrimitiveType::Line;
+            numPrimitives = numIndices + 1;
+            break;
+          case TINYGLTF_MODE_POINTS:
+            primitiveType = PrimitiveType::Point;
+            numPrimitives = numIndices;
+            break;
+          default:
+            // output error
+            break;
+          };
+
+
+          // ----------Attributes-------------
+
+          auto it(primitive.attributes.begin());
+          auto itEnd(primitive.attributes.end());
+
+          int numVertices = 0;
+          // for each attribute
+          for (; it != itEnd; it++) {
+            const tinygltf::Accessor &accessor = scene.accessors.at(it->second);
+            const tinygltf::BufferView &bufferView = scene.bufferViews.at(accessor.bufferView);
+
+            int n = 1;
+            if (accessor.type == TINYGLTF_TYPE_SCALAR) {
+              n = 1;
+            }
+            else if (accessor.type == TINYGLTF_TYPE_VEC2) {
+              n = 2;
+            }
+            else if (accessor.type == TINYGLTF_TYPE_VEC3) {
+              n = 3;
+            }
+            else if (accessor.type == TINYGLTF_TYPE_VEC4) {
+              n = 4;
+            }
+
+            BufferByte * dev_bufferView = bufferViewDevPointers.at(accessor.bufferView);
+            BufferByte ** dev_attribute = NULL;
+
+            numVertices = accessor.count;
+            int componentTypeByteSize;
+
+            // Note: since the type of our attribute array (dev_position) is static (float32)
+            // We assume the glTF model attribute type are 5126(FLOAT) here
+
+            if (it->first.compare("POSITION") == 0) {
+              componentTypeByteSize = sizeof(VertexAttributePosition) / n;
+              dev_attribute = (BufferByte**)&dev_position;
+            }
+            else if (it->first.compare("NORMAL") == 0) {
+              componentTypeByteSize = sizeof(VertexAttributeNormal) / n;
+              dev_attribute = (BufferByte**)&dev_normal;
+            }
+            else if (it->first.compare("TEXCOORD_0") == 0) {
+              componentTypeByteSize = sizeof(VertexAttributeTexcoord) / n;
+              dev_attribute = (BufferByte**)&dev_texcoord0;
+            }
+
+            std::cout << accessor.bufferView << "  -  " << it->second << "  -  " << it->first << '\n';
+
+            dim3 numThreadsPerBlock(128);
+            dim3 numBlocks((n * numVertices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
+            int byteLength = numVertices * n * componentTypeByteSize;
+            cudaMalloc(dev_attribute, byteLength);
+
+            _deviceBufferCopy << <numBlocks, numThreadsPerBlock >> > (
+              n * numVertices,
+              *dev_attribute,
+              dev_bufferView,
+              n,
+              accessor.byteStride,
+              accessor.byteOffset,
+              componentTypeByteSize);
+
+            std::string msg = "Set Attribute Buffer: " + it->first;
+            checkCUDAError(msg.c_str());
+          }
+
+          // malloc for VertexOut
+          VertexOut* dev_vertexOut;
+          cudaMalloc(&dev_vertexOut, numVertices * sizeof(VertexOut));
+          checkCUDAError("Malloc VertexOut Buffer");
+
+          // ----------Materials-------------
+
+          // You can only worry about this part once you started to 
+          // implement textures for your rasterizer
+          TextureData* dev_diffuseTex = NULL;
+#if TEXTURE_MAP == 1
+          int texWidth = 0;
+          int texHeight = 0;
+          int texComp = 0;
+#endif
+          if (!primitive.material.empty()) {
+            const tinygltf::Material &mat = scene.materials.at(primitive.material);
+            printf("material.name = %s\n", mat.name.c_str());
+
+            if (mat.values.find("diffuse") != mat.values.end()) {
+              std::string diffuseTexName = mat.values.at("diffuse").string_value;
+              if (scene.textures.find(diffuseTexName) != scene.textures.end()) {
+                const tinygltf::Texture &tex = scene.textures.at(diffuseTexName);
+                if (scene.images.find(tex.source) != scene.images.end()) {
+                  const tinygltf::Image &image = scene.images.at(tex.source);
+
+                  size_t s = image.image.size() * sizeof(TextureData);
+                  cudaMalloc(&dev_diffuseTex, s);
+                  cudaMemcpy(dev_diffuseTex, &image.image.at(0), s, cudaMemcpyHostToDevice);
+
+#if TEXTURE_MAP == 1
+                  texWidth = image.width;
+                  texHeight = image.height;
+                  texComp = image.component;
+#endif
+
+                  checkCUDAError("Set Texture Image data");
+                }
+              }
+            }
+
+            // TODO: write your code for other materails
+            // You may have to take a look at tinygltfloader
+            // You can also use the above code loading diffuse material as a start point 
+          }
+
+
+          // ---------Node hierarchy transform--------
+          cudaDeviceSynchronize();
+
+          dim3 numBlocksNodeTransform((numVertices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
+          _nodeMatrixTransform << <numBlocksNodeTransform, numThreadsPerBlock >> > (
+            numVertices,
+            dev_position,
+            dev_normal,
+            matrix,
+            matrixNormal);
+
+          checkCUDAError("Node hierarchy transformation");
+
+          // at the end of the for loop of primitive
+          // push dev pointers to map
+          primitiveVector.push_back(PrimitiveDevBufPointers{
+            primitive.mode,
+            primitiveType,
+            numPrimitives,
+            numIndices,
+            numVertices,
+
+            dev_indices,
+            dev_position,
+            dev_normal,
+            dev_texcoord0,
+#if TEXTURE_MAP == 1
+            dev_diffuseTex,
+            texWidth,
+            texHeight,
+            texComp,
+#endif
+
+            dev_vertexOut	//VertexOut
+          });
+
+          totalNumPrimitives += numPrimitives;
+
+        } // for each primitive
+
+      } // for each mesh
+
+    } // for each node
+
+  }
+
+
+  // 3. Malloc for dev_primitives
+  {
+    cudaMalloc(&dev_primitives, totalNumPrimitives * sizeof(Primitive));
+  }
 
 
+  // Finally, cudaFree raw dev_bufferViews
+  {
+
+    std::map<std::string, BufferByte*>::const_iterator it(bufferViewDevPointers.begin());
+    std::map<std::string, BufferByte*>::const_iterator itEnd(bufferViewDevPointers.end());
+
+    //bufferViewDevPointers
+
+    for (; it != itEnd; it++) {
+      cudaFree(it->second);
+    }
+    checkCUDAError("Free BufferView Device Mem");
+  }
+
 }
 
 
 
-__global__ 
+__global__
 void _vertexTransformAndAssembly(
-	int numVertices, 
-	PrimitiveDevBufPointers primitive, 
-	glm::mat4 MVP, glm::mat4 MV, glm::mat3 MV_normal, 
-	int width, int height) {
-
-	// vertex id
-	int vid = (blockIdx.x * blockDim.x) + threadIdx.x;
-	if (vid < numVertices) {
-
-		// TODO: Apply vertex transformation here
-		// Multiply the MVP matrix for each vertex position, this will transform everything into clipping space
-		// Then divide the pos by its w element to transform into NDC space
-		// Finally transform x and y to viewport space
-
-		// TODO: Apply vertex assembly here
-		// Assemble all attribute arraies into the primitive array
-		
-	}
+int numVertices,
+PrimitiveDevBufPointers primitive,
+glm::mat4 MVP, glm::mat4 MV, glm::mat3 MV_normal,
+int width, int height) {
+
+  // vertex id
+  int vid = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (vid < numVertices) {
+    VertexOut & vout = primitive.dev_verticesOut[vid];
+    VertexAttributePosition & vpos = primitive.dev_position[vid];
+    // Multiply the MVP matrix for each vertex position, this will transform everything into clipping space
+    // Then divide the pos by its w element to transform into NDC space
+    // Finally transform x and y to viewport space
+    vout.pos = MVP * glm::vec4(vpos, 1.0f);
+    if (fabs(vout.pos.w) > EPSILON) vout.pos /= vout.pos.w;
+    vout.pos.x = 0.5f * (float)width * (vout.pos.x + 1.0f);
+    vout.pos.y = 0.5f * (float)height * (vout.pos.y + 1.0f);
+
+    // Assemble all attribute arraies into the primitive array
+    VertexAttributeNormal & vnorm = primitive.dev_normal[vid];
+    glm::vec4 eyePos = MV * glm::vec4(vpos, 1.0f);
+    if (fabs(eyePos.w) > EPSILON) vout.eyePos = glm::vec3(eyePos / eyePos.w);
+    vout.eyeNor = glm::normalize(MV_normal * vnorm);
+
+#if TEXTURE_MAP == 1
+    //Textures
+    if (primitive.dev_diffuseTex != NULL) {
+      vout.texcoord0 = primitive.dev_texcoord0[vid];
+    }
+    vout.dev_diffuseTex = primitive.dev_diffuseTex;
+    vout.texWidth = primitive.texWidth;
+    vout.texHeight = primitive.texHeight;
+    vout.texComp = primitive.texComp;
+#endif
+  }
 }
 
 
 
 static int curPrimitiveBeginId = 0;
 
-__global__ 
+__global__
 void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_primitives, PrimitiveDevBufPointers primitive) {
 
-	// index id
-	int iid = (blockIdx.x * blockDim.x) + threadIdx.x;
+  // index id
+  int iid = (blockIdx.x * blockDim.x) + threadIdx.x;
 
-	if (iid < numIndices) {
+  if (iid < numIndices) {
+    // This is primitive assembly for triangles
+    int pid;	// id for cur primitives vector
+    if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) {
+      pid = iid / (int)primitive.primitiveType;
+      dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
+        = primitive.dev_verticesOut[primitive.dev_indices[iid]];
+    }
 
-		// TODO: uncomment the following code for a start
-		// This is primitive assembly for triangles
 
-		//int pid;	// id for cur primitives vector
-		//if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) {
-		//	pid = iid / (int)primitive.primitiveType;
-		//	dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
-		//		= primitive.dev_verticesOut[primitive.dev_indices[iid]];
-		//}
+    // TODO: other primitive types (point, line)
+  }
+
+}
 
+__device__ __host__
+int clamp_int(int mn, int x, int mx) {
+  if (x > mx) return mx;
+  if (x < mn) return mn;
+  return x;
+}
 
-		// TODO: other primitive types (point, line)
-	}
-	
+__device__ __host__
+glm::vec3 getPixel(int x, int y, int width, int height, int components, TextureData * tex) {
+  if (x >= width || y >= height || x < 0 || y < 0) {
+    return glm::vec3(0, 0, 0);
+  }
+  int texIdx = y * width + x;
+  return (1.0f / 255.0f) * glm::vec3(tex[components * texIdx], tex[components * texIdx + 1], tex[components * texIdx + 2]);
 }
 
+__global__
+void kernRasterize(int numPrimitives, Primitive* dev_primitives,
+int width, int height, Fragment* fragmentBuffer, FragmentMutex* mutexes) {
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index < numPrimitives) {
+    Primitive & p = dev_primitives[index];
+    VertexOut & firstVertex = p.v[0];
+    glm::vec3 triangle[3] = { glm::vec3(p.v[0].pos), glm::vec3(p.v[1].pos), glm::vec3(p.v[2].pos) };
+    AABB boundingBox = getAABBForTriangle(triangle);
+    int minxpix = clamp_int(0, boundingBox.min.x, width - 1);
+    int minypix = clamp_int(0, boundingBox.min.y, height - 1);
+    int maxxpix = clamp_int(0, boundingBox.max.x, width - 1);
+    int maxypix = clamp_int(0, boundingBox.max.y, height - 1);
+    for (int y = minypix; y <= maxypix; y++) {
+      for (int x = minxpix; x <= maxxpix; x++) {
+        int fragIdx = (height - 1 - y) * width + (width - 1 - x);
+        Fragment & fragment = fragmentBuffer[fragIdx];
+
+        glm::vec3 baryCoords = calculateBarycentricCoordinate(triangle, glm::vec2(x, y));
+        if (isBarycentricCoordInBounds(baryCoords)) {
+          float pos = glm::dot(baryCoords, glm::vec3(p.v[0].pos.z, p.v[1].pos.z, p.v[2].pos.z));
+          bool isSet;
+          do {
+            isSet = atomicCAS(&mutexes[fragIdx].mutex, 0, 1) == 0;
+            if (isSet) {
+              if (pos < fragment.z) {
+                fragment.z = pos;
+#if TEXTURE_MAP == 1
+                if (p.v[0].dev_diffuseTex == NULL) {
+                  fragment.color = glm::vec3(1.0f, 1.0f, 1.0f); // white
+                  fragment.diffuseTex = NULL;
+                }
+                else {
+#if PERSPECTIVE_CORRECT == 1
+                  glm::vec3 perspectiveBaryCoords = glm::vec3(baryCoords.x / p.v[0].eyePos.z, baryCoords.y / p.v[1].eyePos.z, baryCoords.z / p.v[2].eyePos.z);
+                  float scaleFactor = (1.0f / (perspectiveBaryCoords.x + perspectiveBaryCoords.y + perspectiveBaryCoords.z));
+                  fragment.texcoord0 = glm::mat3x2(p.v[0].texcoord0, p.v[1].texcoord0, p.v[2].texcoord0)
+                    * perspectiveBaryCoords * scaleFactor;
+#else
+                  fragment.texcoord0 = glm::mat3x2(p.v[0].texcoord0, p.v[1].texcoord0, p.v[2].texcoord0) * baryCoords;
+#endif
+                  fragment.texWidth = firstVertex.texWidth;
+                  fragment.texHeight = firstVertex.texHeight;
+                  fragment.texComp = firstVertex.texComp;
+                  fragment.diffuseTex = firstVertex.dev_diffuseTex;
+                }
+#else
+                fragment.color = glm::vec3(1.0f, 1.0f, 1.0f); // white
+#endif
+                fragment.eyePos = glm::mat3(p.v[0].eyePos, p.v[1].eyePos, p.v[2].eyePos) * baryCoords;
+#if NORMAL_INTERPOLATE == 1
+                fragment.eyeNor = glm::mat3(p.v[0].eyeNor, p.v[1].eyeNor, p.v[2].eyeNor) * baryCoords;
+#else
+                fragment.eyeNor = glm::normalize(glm::cross(
+                  glm::vec3(p.v[1].eyeNor - p.v[0].eyeNor),
+                  glm::vec3(p.v[2].eyeNor - p.v[0].eyeNor)
+                ));
+#endif
+                }
+              }
+            if (isSet) {
+              mutexes[fragIdx].mutex = 0;
+            }
+            } while (pos < fragment.z && !isSet);
+          }
+        }
+      }
+    }
+  }
+
+__global__
+void kernTextureShader(int width, int height, Fragment* fragmentBuffer) {
+  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+  int index = x + (y * width);
+
+  if (x < width && y < height) {
+    Fragment & fragment = fragmentBuffer[index];
+    if (fragment.diffuseTex != NULL) {
+      float texx = 0.5f + fragment.texcoord0.x * (fragment.texWidth - 1);
+      float texy = 0.5f + fragment.texcoord0.y * (fragment.texHeight - 1);
+#if BILINEAR_INTERPOLATION == 1
+      float x1 = glm::floor(texx);
+      float y1 = glm::floor(texy);
+      glm::vec3 c11 = getPixel(x1, y1, fragment.texWidth, fragment.texHeight, fragment.texComp, fragment.diffuseTex);
+      glm::vec3 c12 = getPixel(x1, y1 + 1, fragment.texWidth, fragment.texHeight, fragment.texComp, fragment.diffuseTex);
+      glm::vec3 c21 = getPixel(x1 + 1, y1, fragment.texWidth, fragment.texHeight, fragment.texComp, fragment.diffuseTex);
+      glm::vec3 c22 = getPixel(x1 + 1, y1 + 1, fragment.texWidth, fragment.texHeight, fragment.texComp, fragment.diffuseTex);
+      glm::vec3 r1 = (texx - x1) * c21 + (1.0f + x1 - texx) * c11;
+      glm::vec3 r2 = (texx - x1) * c22 + (1.0f + x1 - texx) * c12;
+      fragment.color = (texy - y1) * r2 + (1.0f + y1 - texy) * r1;
+#else
+      fragment.color = getPixel(texx, texy, fragment.texWidth, fragment.texHeight, fragment.texComp, fragment.diffuseTex);
+#endif
+    }
+  }
+}
 
+struct IsBackfacing {
+  __host__ __device__ bool operator () (const Primitive & p) {
+    glm::vec3 normal = glm::normalize(glm::cross(
+      glm::vec3(p.v[1].pos - p.v[0].pos),
+      glm::vec3(p.v[2].pos - p.v[0].pos)));
+    return normal.z < -0;
+  }
+};
+
+__global__
+void calculateSobel(int w, int h, Fragment * fragmentBuffer) {
+  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+  int index = x + (y * w);
+  float sobelKernel[3][3] = { { -1, 0, 1 }, { -2, 0, 2 }, { -1, 0, 1 } };
+  if (x < w && y < h) {
+    Fragment & fragment = fragmentBuffer[index];
+    for (int i = -1; i <= 1; i++) {
+      for (int j = -1; j <= 1; j++) {
+        if (x + i < w && x + i >= 0 && y + j < h && y + j >= 0) {
+          int sobelIdx = x + i + ((y + j) * w);
+          float dist = (fragmentBuffer[sobelIdx].z > 1e12) ? 1e12 : glm::length(fragmentBuffer[sobelIdx].eyePos);
+          fragment.sobelx += sobelKernel[i + 1][j + 1] * dist;
+          fragment.sobely += sobelKernel[j + 1][i + 1] * dist;
+        }
+      }
+    }
+  }
+}
+
+__global__
+void calculateSobelWithShared(int w, int h, Fragment * fragmentBuffer) {
+  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+  int index = x + (y * w);
+  __shared__ float tile[SOBEL_GRID][SOBEL_GRID];
+  __shared__ float sobelx[SOBEL_GRID][SOBEL_GRID];
+  __shared__ float sobely[SOBEL_GRID][SOBEL_GRID];
+  float sobelKernel[3][3] = { { 3, 0, -3 }, { 10, 0, -10 }, { 3, 0, -3 } };
+  if (x < w && y < h) {
+    int bx = threadIdx.x;
+    int by = threadIdx.y;
+    Fragment & fragment = fragmentBuffer[index];
+    tile[bx][by] = (fragment.z > 1e12) ? 1e12 : glm::length(fragment.eyePos);
+    sobelx[bx][by] = 0;
+    sobely[bx][by] = 0;
+    __syncthreads();
+
+    for (int i = -1; i <= 1; i++) {
+      for (int j = -1; j <= 1; j++) {
+        if (bx + i < SOBEL_GRID && bx + i >= 0 && by + j < SOBEL_GRID && by + j >= 0) {
+          sobelx[bx][by] += sobelKernel[i + 1][j + 1] * tile[bx + i][by + j];
+          sobely[bx][by] += sobelKernel[j + 1][i + 1] * tile[bx + i][by + j];
+        }
+        else {
+          if (x + i < w && x + i >= 0 && y + j < h && y + j >= 0) {
+            int sobelIdx = x + i + ((y + j) * w);
+            float dist = (fragmentBuffer[sobelIdx].z > 1e12) ? 1e12 : glm::length(fragmentBuffer[sobelIdx].eyePos);
+            sobelx[bx][by] += sobelKernel[i + 1][j + 1] * dist;
+            sobely[bx][by] += sobelKernel[j + 1][i + 1] * dist;
+          }
+        }
+      }
+    }
+    fragment.sobelx = sobelx[bx][by];
+    fragment.sobely = sobely[bx][by];
+  }
+}
+
+/**
+* Writes fragment colors to the framebuffer
+*/
+__global__
+void render(int w, int h, Fragment *fragmentBuffer, glm::vec3 *framebuffer, int numLights, Light *lights) {
+  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+  int index = x + (y * w);
+
+  if (x < w && y < h) {
+    Fragment & fragment = fragmentBuffer[index];
+    if (fragment.z < 1e12) {
+      float totalLight = AMBIENT_LIGHT;
+
+      // Lambert shading
+      for (int i = 0; i < numLights; i++) {
+        Light & light = lights[i];
+        totalLight += light.emittance * glm::max(0.0f,
+          glm::dot(fragment.eyeNor, glm::normalize(light.eyePos - fragment.eyePos)));
+      }
+      framebuffer[index] = totalLight * fragment.color;
+#if CEL_SHADE > 0
+      framebuffer[index] = glm::ceil(framebuffer[index] * (float)CEL_SHADE) / (float)CEL_SHADE;
+      float sobel = glm::sqrt(fragment.sobelx * fragment.sobelx + fragment.sobely * fragment.sobely);
+      if (sobel > 15.0f) framebuffer[index] = glm::vec3(0.0f, 0.0f, 0.0f);
+#endif
+    }
+    else {
+      framebuffer[index] = glm::vec3(0.5f, 0.8f, 1.0f);
+    }
+  }
+}
 
 /**
  * Perform rasterization.
  */
 void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal) {
-    int sideLength2d = 8;
-    dim3 blockSize2d(sideLength2d, sideLength2d);
-    dim3 blockCount2d((width  - 1) / blockSize2d.x + 1,
-		(height - 1) / blockSize2d.y + 1);
-
-	// Execute your rasterization pipeline here
-	// (See README for rasterization pipeline outline.)
-
-	// Vertex Process & primitive assembly
-	{
-		curPrimitiveBeginId = 0;
-		dim3 numThreadsPerBlock(128);
-
-		auto it = mesh2PrimitivesMap.begin();
-		auto itEnd = mesh2PrimitivesMap.end();
-
-		for (; it != itEnd; ++it) {
-			auto p = (it->second).begin();	// each primitive
-			auto pEnd = (it->second).end();
-			for (; p != pEnd; ++p) {
-				dim3 numBlocksForVertices((p->numVertices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
-				dim3 numBlocksForIndices((p->numIndices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
-
-				_vertexTransformAndAssembly << < numBlocksForVertices, numThreadsPerBlock >> >(p->numVertices, *p, MVP, MV, MV_normal, width, height);
-				checkCUDAError("Vertex Processing");
-				cudaDeviceSynchronize();
-				_primitiveAssembly << < numBlocksForIndices, numThreadsPerBlock >> >
-					(p->numIndices, 
-					curPrimitiveBeginId, 
-					dev_primitives, 
-					*p);
-				checkCUDAError("Primitive Assembly");
-
-				curPrimitiveBeginId += p->numPrimitives;
-			}
-		}
-
-		checkCUDAError("Vertex Processing and Primitive Assembly");
-	}
-	
-	cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
-	initDepth << <blockCount2d, blockSize2d >> >(width, height, dev_depth);
-	
-	// TODO: rasterize
-
-
-
-    // Copy depthbuffer colors into framebuffer
-	render << <blockCount2d, blockSize2d >> >(width, height, dev_fragmentBuffer, dev_framebuffer);
-	checkCUDAError("fragment shader");
-    // Copy framebuffer into OpenGL buffer for OpenGL previewing
-    sendImageToPBO<<<blockCount2d, blockSize2d>>>(pbo, width, height, dev_framebuffer);
-    checkCUDAError("copy render result to pbo");
+  int sideLength2d = 8;
+  dim3 blockSize2d(sideLength2d, sideLength2d);
+  dim3 blockCount2d((width - 1) / blockSize2d.x + 1,
+    (height - 1) / blockSize2d.y + 1);
+
+  // Execute your rasterization pipeline here
+  // (See README for rasterization pipeline outline.)
+
+  // Vertex Process & primitive assembly
+  {
+    curPrimitiveBeginId = 0;
+    dim3 numThreadsPerBlock(128);
+
+    auto it = mesh2PrimitivesMap.begin();
+    auto itEnd = mesh2PrimitivesMap.end();
+
+    for (; it != itEnd; ++it) {
+      auto p = (it->second).begin();	// each primitive
+      auto pEnd = (it->second).end();
+      for (; p != pEnd; ++p) {
+        dim3 numBlocksForVertices((p->numVertices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
+        dim3 numBlocksForIndices((p->numIndices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
+
+        _vertexTransformAndAssembly << < numBlocksForVertices, numThreadsPerBlock >> >(p->numVertices, *p, MVP, MV, MV_normal, width, height);
+        checkCUDAError("Vertex Processing");
+        cudaDeviceSynchronize();
+        _primitiveAssembly << < numBlocksForIndices, numThreadsPerBlock >> >
+          (p->numIndices,
+          curPrimitiveBeginId,
+          dev_primitives,
+          *p);
+        checkCUDAError("Primitive Assembly");
+
+        curPrimitiveBeginId += p->numPrimitives;
+      }
+    }
+
+    checkCUDAError("Vertex Processing and Primitive Assembly");
+  }
+
+  cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
+  initMutexes << <blockCount2d, blockSize2d >> >(width, height, dev_fragmentMutexes, dev_fragmentBuffer);
+  checkCUDAError("init mutexes");
+
+  int numPrimitives = totalNumPrimitives;
+
+  // Backface culling
+#if BACKFACE_CULL == 1
+  thrust::device_ptr<Primitive> dev_thrust_primitives(dev_primitives);
+  thrust::device_ptr<Primitive> dev_thrust_primitivesEnd =
+    thrust::remove_if(dev_thrust_primitives, dev_thrust_primitives + numPrimitives, IsBackfacing());
+  numPrimitives = dev_thrust_primitivesEnd - dev_thrust_primitives;
+  printf("%d triangles\n", numPrimitives);
+  checkCUDAError("backface culling");
+#endif
+
+
+  // Rasterization
+  dim3 numThreadsPerBlock(64);
+  dim3 numBlocksForPrimitives((numPrimitives + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
+  kernRasterize << < numBlocksForPrimitives, numThreadsPerBlock >> >(
+    numPrimitives, dev_primitives,
+    width, height, dev_fragmentBuffer, dev_fragmentMutexes);
+  checkCUDAError("rasterizer");
+
+  // Filling texture colors
+#if TEXTURE_MAP == 1
+  kernTextureShader << <blockCount2d, blockSize2d >> >(width, height, dev_fragmentBuffer);
+  checkCUDAError("textureShader");
+#endif
+
+  // Offline light transformation, since there aren't many lights
+  for (Light & light : lights) {
+    glm::vec4 eyePos = MV * light.worldPos;
+    light.eyePos = glm::vec3(eyePos / eyePos.w);
+  }
+  cudaMemcpy(dev_lights, lights.data(), lights.size() * sizeof(Light), cudaMemcpyHostToDevice);
+
+#if CEL_SHADE > 0
+  dim3 sobelBlockSize2d(SOBEL_GRID, SOBEL_GRID);
+  dim3 sobelBlockCount2d((width - 1) / sobelBlockSize2d.x + 1,
+    (height - 1) / sobelBlockSize2d.y + 1);
+#if USE_SHARED_SOBEL == 1
+  calculateSobelWithShared<< <sobelBlockCount2d, sobelBlockSize2d >> >(width, height, dev_fragmentBuffer);
+#else
+  calculateSobel<< <sobelBlockCount2d, sobelBlockSize2d >> >(width, height, dev_fragmentBuffer);
+#endif
+  checkCUDAError("Sobel");
+#endif
+
+  // Copy depthbuffer colors into framebuffer
+  render << <blockCount2d, blockSize2d >> >(width, height, dev_fragmentBuffer, dev_framebuffer,
+    lights.size(), dev_lights);
+  checkCUDAError("fragment shader");
+
+
+  // Copy framebuffer into OpenGL buffer for OpenGL previewing
+  sendImageToPBO << <blockCount2d, blockSize2d >> >(pbo, width, height, dev_framebuffer);
+  checkCUDAError("copy render result to pbo");
 }
 
 /**
@@ -739,38 +1054,37 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g
  */
 void rasterizeFree() {
 
-    // deconstruct primitives attribute/indices device buffer
-
-	auto it(mesh2PrimitivesMap.begin());
-	auto itEnd(mesh2PrimitivesMap.end());
-	for (; it != itEnd; ++it) {
-		for (auto p = it->second.begin(); p != it->second.end(); ++p) {
-			cudaFree(p->dev_indices);
-			cudaFree(p->dev_position);
-			cudaFree(p->dev_normal);
-			cudaFree(p->dev_texcoord0);
-			cudaFree(p->dev_diffuseTex);
-
-			cudaFree(p->dev_verticesOut);
-
-			
-			//TODO: release other attributes and materials
-		}
-	}
+  // deconstruct primitives attribute/indices device buffer
+
+  auto it(mesh2PrimitivesMap.begin());
+  auto itEnd(mesh2PrimitivesMap.end());
+  for (; it != itEnd; ++it) {
+    for (auto p = it->second.begin(); p != it->second.end(); ++p) {
+      cudaFree(p->dev_indices);
+      cudaFree(p->dev_position);
+      cudaFree(p->dev_normal);
+      cudaFree(p->dev_texcoord0);
+#if TEXTURE_MAP == 1
+      cudaFree(p->dev_diffuseTex);
+#endif
+
+      cudaFree(p->dev_verticesOut);
+    }
+  }
 
-	////////////
+  ////////////
 
-    cudaFree(dev_primitives);
-    dev_primitives = NULL;
+  cudaFree(dev_primitives);
+  dev_primitives = NULL;
 
-	cudaFree(dev_fragmentBuffer);
-	dev_fragmentBuffer = NULL;
+  cudaFree(dev_fragmentBuffer);
+  dev_fragmentBuffer = NULL;
 
-    cudaFree(dev_framebuffer);
-    dev_framebuffer = NULL;
+  cudaFree(dev_framebuffer);
+  dev_framebuffer = NULL;
 
-	cudaFree(dev_depth);
-	dev_depth = NULL;
+  cudaFree(dev_lights);
+  dev_lights = NULL;
 
-    checkCUDAError("rasterize Free");
+  checkCUDAError("rasterize Free");
 }
diff --git a/src/rasterizeTools.h b/src/rasterizeTools.h
index 46c701e..6075d10 100644
--- a/src/rasterizeTools.h
+++ b/src/rasterizeTools.h
@@ -62,7 +62,9 @@ float calculateBarycentricCoordinateValue(glm::vec2 a, glm::vec2 b, glm::vec2 c,
     baryTri[0] = glm::vec3(a, 0);
     baryTri[1] = glm::vec3(b, 0);
     baryTri[2] = glm::vec3(c, 0);
-    return calculateSignedArea(baryTri) / calculateSignedArea(tri);
+    float signedArea = calculateSignedArea(tri);
+    if (fabs(signedArea) < EPSILON) return -1.0f;
+    return calculateSignedArea(baryTri) / signedArea;
 }
 
 // CHECKITOUT