diff --git a/README.md b/README.md
index 41b91f0..ecd62fd 100644
--- a/README.md
+++ b/README.md
@@ -1,19 +1,159 @@
 CUDA Rasterizer
+
 ===============
+    
+**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**
 
-[CLICK ME FOR INSTRUCTION OF THIS PROJECT](./INSTRUCTION.md)
+* Henry Zhu
+  * [Github](https://github.com/Maknee), [LinkedIn](https://www.linkedin.com/in/henry-zhu-347233121/), [personal website](https://maknee.github.io/), [twitter](https://twitter.com/maknees1), etc.
+* Tested on: Windows 10 Home, Intel i7-4710HQ @ 2.50GHz 22GB, GTX 870M (Own computer)
 
-**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**
+## Cuda Strike
+
+![](cuda_strike.gif)
+
+- * Reference to the First Person shooter [Counter Strike](https://en.wikipedia.org/wiki/Counter-Strike)
+- * Not very accurate of a first person shooter, but the enemies are ducks and pressing the left button deletes them :)
+
+![](truck1.png)
+![](truck2.png)
+![](truck3.png)
+![](truck4.png)
+![](truck5.png)
+
+## What is a rasterizer?
+
+A rasterizer is a method of rendering images by forming triangles and converting these triangles to pixels on the screen
+
+In more detail, a rasterizer has a so called graphics pipeline where points from an object's data file are converted to points in the screen space and triangles are formed from three points. From these triangles, an AABB check if performed around the triangles to see what approximate rectangular area the triangle takes up. Then, a scanline method is performed where the rasterizer iterates through all pixels in the AABB A barycentric calculation is performed to check if the pixel is actually in the triangle. If the pixel is, a depth test is checked and if that passes, the pixel(fragment)'s color can be modified from all cool effects such as colors, lights, etc.
+
+## Building / Usage
+
+### Building
+- Build using cmake or cmake-gui
+- Open sln project in visual studio
+
+### Running
+
+- ./cis565_rasterizer.exe ..\gltfs\duck\duck.gltf
+
+## Basic features
+- Entire graphics pipeline
+- Lambert shading and blinn-phong shading
+- Depth testing
+
+## What advanced features of my rasterizer implement?
+
+- The rasterizer can be actually used to play a game and loads Multiple objects :)
+- UV texture mapping with bilinear texture filtering and perspective correct texture coordinates
+- Correct color interpolation between points on a primitive
+- Backface culling, optimized using stream compaction (with thrust)
+
+## UV texture mapping
+
+### UV texture mapping
+
+![](duck_map.png)
+
+### UV texture mapping evaluation
+
+This feature consists of mapping textures onto an object with bilinear filtering and perspective correct texture coordinates. 
+
+Bilinear filtering is sampiling four (u,v)s on a texture and mixing the two based on where the u,v is mapped to on the pixel.
+
+Perspective correct texture coordinates can be done by using a combination of barycentric coordinates and eye coordinates to change the texture color to map to the correct values based on eye position.
+
+#### Performance
+
+The performance did not take too much of a hit. The fps decreased by about a fifth compared to just using colors, but fps is measured as the inverse of the time per frame. The bilinear filtering usage vs not using bilinear filtering causes a slight performance hit. 
+
+#### Future optimizations
+
+An optimization is to precompute the texture values color mapping to the object's triangle points, so that when the camera is moved, all that is needed to be done to color perspective correwct the texture color values. This takes up space, however.
+
+Another way to optimize is not move the camera and save all the texture colors in a buffer that changes when the camera is moved. When the camera is not moving, there is no work to be done.  
+
+## Correct color interpolation between points on a primitive
+
+![](triangle_color.png)
+![](checkerboard_color.png)
+
+### Color interpolation evaluation 
+
+Color interpolation is done by correct the color based on baycentric coordinates and the eye's position. This is like mixing the colors between different points as shown in the triangle image.
+
+#### Performance
+
+The performance is not noticable at all. All that was done was add a couple more multiplications and divisions to correctly interpolate color between points
+
+#### Future optimizations
+
+There are not many optimizations as this is multiplying baycentric coordinates to the pixel value and camera values, but one way to speed things up would be to add a cache that maps the baycentric value to different points on a triangle and to compute the pixel value quickly by fetching the baycentric value quickly.
+
+- Backface culling, optimized using stream compaction (with thrust)
+
+![](backface-culling-graph.png)
+
+### Backface culling evaluation
+
+Backface culling is the technique of removing faces that are not facing towards the camera. This can be done by getting the triangles' normals and dotting that with the camera forward vector. With stream compaction, the primitives can be removed from being calcuated for the screen since the triangles are facing away from the user.
+
+#### Performance
+
+This was a performance hit since triangles facing away from the user are not used in computation and in the end, thrown away. This resulted in a 2x speedup for the duck since half of its triangles were facing away from the camera.
+
+#### Future optimizations
+
+This is an optimization already since this technique removed triangles not facing the same direction as the user. 
+
+But, a way to speed things up is to keep a track of triangles not facing the user and cache that, so stream compaction is used on a smaller part of the total number of primitives
+
+- Performance Analysis 
+
+Chart of each pipeline stage
+
+![](performance.png)
 
-* (TODO) YOUR NAME HERE
-	* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+Data in nano seconds
 
-### (TODO: Your README)
+```
+    Triangle
+Vertex Assembly and Shader  236885
+Primitive Assembly  80467
+Rasterizer  1826779
+Rasterizer with texture mapping 0
+Rasterizer with backface culling    1638904
+Fragment shader 1132289
+    
+    Duck
+Vertex Assembly and Shader  289846
+Primitive Assembly  160934
+Rasterizer  0
+Rasterizer with texture mapping 3824242
+Rasterizer with backface culling    3632737
+Fragment shader 1105604
+    
+    Cow
+Vertex Assembly and Shader  534122
+Primitive Assembly  163808
+Rasterizer  6801853
+Rasterizer with texture mapping 0
+Rasterizer with backface culling    6472184
+Fragment shader 1110119
+    
+    Flower
+Vertex Assembly and Shader  468434
+Primitive Assembly  149029
+Rasterizer  4718007
+Rasterizer with texture mapping 0
+Rasterizer with backface culling    4325114
+Fragment shader 1165544
+```
 
-*DO NOT* leave the README to the last minute! It is a crucial part of the
-project, and we will not be able to grade you without a good README.
+## Third party usage
 
+### Book
+- [PBRT](https://www.pbrt.org/)
 
 ### Credits
 
diff --git a/backface-culling-graph.png b/backface-culling-graph.png
new file mode 100644
index 0000000..cb154bf
Binary files /dev/null and b/backface-culling-graph.png differ
diff --git a/checkerboard_color.png b/checkerboard_color.png
new file mode 100644
index 0000000..9acba4d
Binary files /dev/null and b/checkerboard_color.png differ
diff --git a/cuda_strike.gif b/cuda_strike.gif
new file mode 100644
index 0000000..53f01af
Binary files /dev/null and b/cuda_strike.gif differ
diff --git a/duck_map.png b/duck_map.png
new file mode 100644
index 0000000..494f916
Binary files /dev/null and b/duck_map.png differ
diff --git a/performance.png b/performance.png
new file mode 100644
index 0000000..11c787e
Binary files /dev/null and b/performance.png differ
diff --git a/src/main.cpp b/src/main.cpp
index 7986959..9fa5cdb 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -6,20 +6,28 @@
  * @copyright University of Pennsylvania
  */
 
-
+//print clock time
+//#define PRINT_CLOCK 1
 
 #include "main.hpp"
 
 #define STB_IMAGE_IMPLEMENTATION
 #define TINYGLTF_LOADER_IMPLEMENTATION
 #include <util/tiny_gltf_loader.h>
+#include <random>
+
+#define CUDA_STRIKE 1
+const int object_copies = 10;
 
 //-------------------------------
 //-------------MAIN--------------
 //-------------------------------
 
+//scenes cuz i'm lazy (instead of parsing each primitive)
+std::vector<tinygltf::Scene> scenes;
+
 int main(int argc, char **argv) {
-    if (argc != 2) {
+    if (argc == 1) {
         cout << "Usage: [gltf file]. Press Enter to exit" << endl;
 		getchar();
         return 0;
@@ -28,26 +36,31 @@ int main(int argc, char **argv) {
 	tinygltf::Scene scene;
 	tinygltf::TinyGLTFLoader loader;
 	std::string err;
-	std::string input_filename(argv[1]);
-	std::string ext = getFilePathExtension(input_filename);
-
-	bool ret = false;
-	if (ext.compare("glb") == 0) {
-		// assume binary glTF.
-		ret = loader.LoadBinaryFromFile(&scene, &err, input_filename.c_str());
-	} else {
-		// assume ascii glTF.
-		ret = loader.LoadASCIIFromFile(&scene, &err, input_filename.c_str());
+	std::vector<std::string> filenames(argv + 1, argv + argc);
+	for(auto& input_filename : filenames)
+	{
+	  std::string ext = getFilePathExtension(input_filename);
+
+	  bool ret = false;
+	  if (ext.compare("glb") == 0) {
+		  // assume binary glTF.
+		  ret = loader.LoadBinaryFromFile(&scene, &err, input_filename.c_str());
+	  } else {
+		  // assume ascii glTF.
+		  ret = loader.LoadASCIIFromFile(&scene, &err, input_filename.c_str());
+	  }
+	  if (!ret) {
+		  printf("Failed to parse glTF: %s\n", input_filename.c_str());
+		  return -1;
+	  }
+	  //push our scene back
+	  scenes.emplace_back(scene);
 	}
 
 	if (!err.empty()) {
 		printf("Err: %s\n", err.c_str());
 	}
 
-	if (!ret) {
-		printf("Failed to parse glTF\n");
-		return -1;
-	}
 
 
     frame = 0;
@@ -99,31 +112,140 @@ void mainLoop() {
 float scale = 1.0f;
 float x_trans = 0.0f, y_trans = 0.0f, z_trans = -10.0f;
 float x_angle = 0.0f, y_angle = 0.0f;
-void runCuda() {
+
+//camera stuff
+glm::vec3 camera_pos = glm::vec3(0.0f, 0.0f, 5.0f);
+glm::vec3 camera_front = glm::vec3(0.0f, 0.0f, -1.0f);
+glm::vec3 camera_up = glm::vec3(0.0f, 1.0f, 0.0f);
+glm::vec3 camera_right = glm::normalize(glm::cross(camera_front, camera_up));
+float camera_speed = 0.10f;
+
+void runCuda()
+{
     // Map OpenGL buffer object for writing from CUDA on a single GPU
     // No data is moved (Win & Linux). When mapped to CUDA, OpenGL should not use this buffer
     dptr = NULL;
 
-	glm::mat4 P = glm::frustum<float>(-scale * ((float)width) / ((float)height),
-		scale * ((float)width / (float)height),
-		-scale, scale, 1.0, 1000.0);
+    //movement
+    if (glfwGetKey(window, GLFW_KEY_W)) 
+    {
+        camera_pos += camera_speed * camera_front;
+    }
+    if (glfwGetKey(window, GLFW_KEY_S)) 
+    {
+    	camera_pos -= camera_speed * camera_front;
+    }
+    if (glfwGetKey(window, GLFW_KEY_D)) 
+    {
+    	camera_pos -= camera_right * camera_speed;
+    }
+    if (glfwGetKey(window, GLFW_KEY_A)) 
+    {   
+      	camera_pos += camera_right * camera_speed;
+    }
 
-	glm::mat4 V = glm::mat4(1.0f);
+    //don't move up or down
+    camera_pos.y = 0.0f;
 
-	glm::mat4 M =
-		glm::translate(glm::vec3(x_trans, y_trans, z_trans))
-		* glm::rotate(x_angle, glm::vec3(1.0f, 0.0f, 0.0f))
-		* glm::rotate(y_angle, glm::vec3(0.0f, 1.0f, 0.0f));
+#ifndef CUDA_STRIKE
 
+	//zero out frame buffer
+    zero_frame_buffer();
+	glm::mat4 P = glm::frustum<float>(-scale * ((float)width) / ((float)height),
+	 	scale * ((float)width / (float)height),
+	 	-scale, scale, 1.0, 1000.0);
+ 
+	 glm::mat4 V = glm::mat4(1.0f);
+ 
+	 glm::mat4 M =
+	 	glm::translate(glm::vec3(x_trans, y_trans, z_trans))
+	 	* glm::rotate(x_angle, glm::vec3(1.0f, 0.0f, 0.0f))
+	 	* glm::rotate(y_angle, glm::vec3(0.0f, 1.0f, 0.0f));
 	glm::mat3 MV_normal = glm::transpose(glm::inverse(glm::mat3(V) * glm::mat3(M)));
 	glm::mat4 MV = V * M;
 	glm::mat4 MVP = P * MV;
+    cudaGLMapBufferObject((void **)&dptr, pbo);
+
+	rasterize(dptr, MVP, MV, MV_normal, camera_pos);
+    
+    write_to_pbo(dptr);
+    cudaGLUnmapBufferObject(pbo);  
+#endif
 
+#ifdef CUDA_STRIKE
     cudaGLMapBufferObject((void **)&dptr, pbo);
-	rasterize(dptr, MVP, MV, MV_normal);
+
+	//zero out frame buffer
+    zero_frame_buffer();
+
+    for(int i = 0; i < objects.size(); i++)
+    {
+	//grab current object and set the scene
+	ObjectData& object_data = objects[i];
+    	set_scene(i);
+
+	if(object_data.is_deleted)
+	{
+		//spawn
+		object_data.is_deleted = false;
+		std::mt19937 rng;
+		std::random_device rd{};
+		rng.seed(rd());
+		std::uniform_real_distribution<> dist(20.0f, 50.0f);
+		object_data.transformation = glm::vec3(dist(rng), 0.0f, -dist(rng));
+		//continue;
+	}
+
+    	glm::mat4 P = glm::perspective(glm::radians(45.0f), static_cast<float>(width) / static_cast<float>(height), 1.0f,
+	                               1000.0f);
+	glm::mat4 V = glm::lookAt(camera_pos, camera_pos + camera_front, camera_up);
+
+	glm::vec3& object_transform = object_data.transformation;
+	glm::mat4 M;
+	glm::vec3 camera_to_object = camera_pos - object_transform;
+    	//M = glm::rotate(M, glm::atan(camera_to_object.y, glm::sqrt(camera_to_object.x * camera_to_object.x + camera_to_object.z * camera_to_object.z)), glm::vec3(0.0f, 1.0f, 0.0f));
+	//M = glm::scale(M, glm::vec3(1.0f));
+	//M = glm::translate(M, object_transform);
+
+	//move towards camera
+	if(glm::length(camera_to_object) > 15.0f)
+	{
+		object_transform += camera_to_object * 0.01f;
+		M = glm::translate(M, object_transform);
+	}
+
+    	//object look at camera
+    	M = glm::inverse(glm::lookAt(object_transform, camera_pos, camera_up));
+    	M = glm::rotate(M, glm::radians(90.0f), glm::vec3(0.0f, 1.0f, 0.0f));
+
+	//check if hit (not accurate)
+	if(glfwGetMouseButton(window, GLFW_MOUSE_BUTTON_LEFT))
+	{
+		float angle = glm::dot(camera_front, -camera_to_object);
+		//std::cout << angle << "\n";
+		float threshold = 0.2f;
+		float middle = 15.0f;
+		if(angle < threshold + middle  && angle > -threshold + middle)
+		{
+			//destroy object
+			object_data.is_deleted = true;
+		}
+	}
+
+	glm::mat3 MV_normal = glm::transpose(glm::inverse(glm::mat3(V) * glm::mat3(M)));
+	glm::mat4 MV = V * M;
+	glm::mat4 MVP = P * MV;
+
+	rasterize(dptr, MVP, MV, MV_normal, camera_pos);
+    }
+    write_to_pbo(dptr);
     cudaGLUnmapBufferObject(pbo);
+#endif
 
-    frame++;
+#ifdef PRINT_CLOCK
+     system("pause");
+#endif
+	frame++;
     fpstracker++;
 }
 
@@ -146,7 +268,11 @@ bool init(const tinygltf::Scene & scene) {
         return false;
     }
     glfwMakeContextCurrent(window);
-    glfwSetKeyCallback(window, keyCallback);
+#ifdef CUDA_STRIKE
+	glfwSetKeyCallback(window, keyCallback);
+#endif
+	//disable mouse
+     glfwSetInputMode(window, GLFW_CURSOR, GLFW_CURSOR_DISABLED);
 
     // Set up GL context
     glewExperimental = GL_TRUE;
@@ -180,8 +306,25 @@ bool init(const tinygltf::Scene & scene) {
 		}
 	}
 
+	//set scenes here
+    for(auto& s : scenes)
+    {
+	rasterizeSetBuffers(s); 
+    }
+
+#ifdef CUDA_STRIKE
+    for(int i = 0; i < object_copies; i++)
+    {
+      copy_object(0);	    
+    }
+#endif
 
-	rasterizeSetBuffers(scene);
+    float i = 0.0f;
+    for(auto& object : objects)
+    {
+	    object.transformation += glm::vec3(i, 0.0f, 0.0f);
+	    i += 10.0f;
+    }
 
     GLuint passthroughProgram;
     passthroughProgram = initShader();
@@ -369,6 +512,10 @@ void mouseButtonCallback(GLFWwindow* window, int button, int action, int mods)
 
 double lastx = (double)width / 2;
 double lasty = (double)height / 2;
+float yaw = -90.0f;
+float pitch = 0.0f;
+float sensitivity = 0.05;
+
 void mouseMotionCallback(GLFWwindow* window, double xpos, double ypos)
 {
 	const double s_r = 0.01;
@@ -379,6 +526,24 @@ void mouseMotionCallback(GLFWwindow* window, double xpos, double ypos)
 	lastx = xpos;
 	lasty = ypos;
 
+	//move in camera
+	diffx *= sensitivity;
+	diffy *= sensitivity;
+
+	yaw -= diffx;
+	pitch -= diffy;
+
+	pitch = glm::clamp<float>(pitch, -90.0f, 90.0f);
+	camera_front = 
+	  {
+	  	cos(glm::radians(yaw)) * cos(glm::radians(pitch)),
+		sin(glm::radians(pitch)),
+		sin(glm::radians(yaw)) * cos(glm::radians(pitch))
+	  };
+	camera_front = glm::normalize(camera_front);
+        camera_right = glm::normalize(glm::cross(camera_front, camera_up));
+        //camera_up = glm::normalize(glm::cross(camera_right, camera_front));
+
 	if (mouseState == ROTATE)
 	{
 		//rotate
diff --git a/src/rasterize.cu b/src/rasterize.cu
index 1262a09..ae07f91 100644
--- a/src/rasterize.cu
+++ b/src/rasterize.cu
@@ -11,6 +11,8 @@
 #include <cuda.h>
 #include <cuda_runtime.h>
 #include <thrust/random.h>
+#include <thrust/device_ptr.h>
+#include <thrust/partition.h>
 #include <util/checkCUDAError.h>
 #include <util/tiny_gltf_loader.h>
 #include "rasterizeTools.h"
@@ -18,168 +20,141 @@
 #include <glm/gtc/quaternion.hpp>
 #include <glm/gtc/matrix_transform.hpp>
 
-namespace {
-
-	typedef unsigned short VertexIndex;
-	typedef glm::vec3 VertexAttributePosition;
-	typedef glm::vec3 VertexAttributeNormal;
-	typedef glm::vec2 VertexAttributeTexcoord;
-	typedef unsigned char TextureData;
-
-	typedef unsigned char BufferByte;
-
-	enum PrimitiveType{
-		Point = 1,
-		Line = 2,
-		Triangle = 3
-	};
-
-	struct VertexOut {
-		glm::vec4 pos;
-
-		// TODO: add new attributes to your VertexOut
-		// The attributes listed below might be useful, 
-		// but always feel free to modify on your own
-
-		 glm::vec3 eyePos;	// eye space position used for shading
-		 glm::vec3 eyeNor;	// eye space normal used for shading, cuz normal will go wrong after perspective transformation
-		// glm::vec3 col;
-		 glm::vec2 texcoord0;
-		 TextureData* dev_diffuseTex = NULL;
-		// int texWidth, texHeight;
-		// ...
-	};
-
-	struct Primitive {
-		PrimitiveType primitiveType = Triangle;	// C++ 11 init
-		VertexOut v[3];
-	};
-
-	struct Fragment {
-		glm::vec3 color;
-
-		// TODO: add new attributes to your Fragment
-		// The attributes listed below might be useful, 
-		// but always feel free to modify on your own
-
-		// glm::vec3 eyePos;	// eye space position used for shading
-		// glm::vec3 eyeNor;
-		// VertexAttributeTexcoord texcoord0;
-		// TextureData* dev_diffuseTex;
-		// ...
-	};
-
-	struct PrimitiveDevBufPointers {
-		int primitiveMode;	//from tinygltfloader macro
-		PrimitiveType primitiveType;
-		int numPrimitives;
-		int numIndices;
-		int numVertices;
-
-		// Vertex In, const after loaded
-		VertexIndex* dev_indices;
-		VertexAttributePosition* dev_position;
-		VertexAttributeNormal* dev_normal;
-		VertexAttributeTexcoord* dev_texcoord0;
-
-		// Materials, add more attributes when needed
-		TextureData* dev_diffuseTex;
-		int diffuseTexWidth;
-		int diffuseTexHeight;
-		// TextureData* dev_specularTex;
-		// TextureData* dev_normalTex;
-		// ...
-
-		// Vertex Out, vertex used for rasterization, this is changing every frame
-		VertexOut* dev_verticesOut;
-
-		// TODO: add more attributes when needed
-	};
+#define LAMBERT_SHADING 1
+#define BLINN_PHONG_SHADING 1
+#define BACKFACE_CULLING 1
+#define BILINEAR_FILTERING 1
 
-}
+//happens by default now since added check
+//#define COLOR_TRIANGLE_INTERPOLATION 1
+std::chrono::time_point<std::chrono::high_resolution_clock> clock_now;
 
 static std::map<std::string, std::vector<PrimitiveDevBufPointers>> mesh2PrimitivesMap;
 
-
 static int width = 0;
 static int height = 0;
 
 static int totalNumPrimitives = 0;
-static Primitive *dev_primitives = NULL;
-static Fragment *dev_fragmentBuffer = NULL;
-static glm::vec3 *dev_framebuffer = NULL;
+static Primitive* dev_primitives = NULL;
+static Fragment* dev_fragmentBuffer = NULL;
+static glm::vec3* dev_framebuffer = NULL;
+
+static int* dev_depth = NULL; // you might need this buffer when doing depth test
 
-static int * dev_depth = NULL;	// you might need this buffer when doing depth test
+//lights in scene
+static glm::vec3* dev_lights = NULL; 
+const int num_lights = 2;
+
+//array of objects
+std::vector<ObjectData> objects;
 
 /**
  * Kernel that writes the image to the OpenGL PBO directly.
  */
-__global__ 
-void sendImageToPBO(uchar4 *pbo, int w, int h, glm::vec3 *image) {
-    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
-    int y = (blockIdx.y * blockDim.y) + threadIdx.y;
-    int index = x + (y * w);
+__global__
 
-    if (x < w && y < h) {
-        glm::vec3 color;
-        color.x = glm::clamp(image[index].x, 0.0f, 1.0f) * 255.0;
-        color.y = glm::clamp(image[index].y, 0.0f, 1.0f) * 255.0;
-        color.z = glm::clamp(image[index].z, 0.0f, 1.0f) * 255.0;
-        // Each thread writes one pixel location in the texture (textel)
-        pbo[index].w = 0;
-        pbo[index].x = color.x;
-        pbo[index].y = color.y;
-        pbo[index].z = color.z;
-    }
+void sendImageToPBO(uchar4* pbo, int w, int h, glm::vec3* image)
+{
+  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+  int index = x + (y * w);
+  if (x < w && y < h)
+  {
+    glm::vec3 color;
+    color.x = glm::clamp(image[index].x, 0.0f, 1.0f) * 255.0;
+    color.y = glm::clamp(image[index].y, 0.0f, 1.0f) * 255.0;
+    color.z = glm::clamp(image[index].z, 0.0f, 1.0f) * 255.0;
+    // Each thread writes one pixel location in the texture (textel)
+    pbo[index].w = 0;
+    pbo[index].x = color.x;
+    pbo[index].y = color.y;
+    pbo[index].z = color.z;
+  }
 }
 
 /** 
 * Writes fragment colors to the framebuffer
 */
 __global__
-void render(int w, int h, Fragment *fragmentBuffer, glm::vec3 *framebuffer) {
-    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
-    int y = (blockIdx.y * blockDim.y) + threadIdx.y;
-    int index = x + (y * w);
 
-    if (x < w && y < h) {
-        framebuffer[index] = fragmentBuffer[index].color;
-
-		// TODO: add your fragment shader code here
+void render(int w, int h, Fragment* fragmentBuffer, glm::vec3* framebuffer, glm::vec3* lights, int num_lights, glm::vec3 camera_pos)
+{
+  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+  int index = x + (y * w);
+  if (x < w && y < h)
+  {
+    glm::vec3 eye_pos = fragmentBuffer[index].eyePos;
+    glm::vec3 eye_normal = fragmentBuffer[index].eyeNor;
+
+    glm::vec3 pixel_color = fragmentBuffer[index].color;
+
+    // TODO: adsd your fragment shader code here
+#ifdef LAMBERT_SHADING
+    for(int i = 0; i < num_lights; i++)
+    {
+      glm::vec3& light_source = lights[i];
+      glm::vec3 light_direction = glm::normalize(light_source - eye_pos);
+      float amount_of_light = glm::max(glm::dot(light_direction, eye_normal), 0.0f);
+#ifdef BLINN_PHONG_SHADING
+      glm::vec3 eye_direction = glm::normalize(camera_pos - eye_pos);
+      glm::vec3 half_direction = glm::normalize(light_direction + eye_direction);
+      amount_of_light = glm::pow(glm::max(glm::dot(light_direction, half_direction), 0.0f), 8.0f);
+#endif
+      pixel_color += fragmentBuffer[index].color * amount_of_light;
+    }
+#endif
 
+    //hack to get multiple objects to work (check if don't overwrite if not black) DOESN"T CHECK FOR DEPTH BUFFER...
+    if(framebuffer[index] == glm::vec3(0.0f))
+    {
+      framebuffer[index] = pixel_color;
     }
+  }
 }
 
 /**
  * Called once at the beginning of the program to allocate memory.
  */
-void rasterizeInit(int w, int h) {
-    width = w;
-    height = h;
-	cudaFree(dev_fragmentBuffer);
-	cudaMalloc(&dev_fragmentBuffer, width * height * sizeof(Fragment));
-	cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
-    cudaFree(dev_framebuffer);
-    cudaMalloc(&dev_framebuffer,   width * height * sizeof(glm::vec3));
-    cudaMemset(dev_framebuffer, 0, width * height * sizeof(glm::vec3));
-    
-	cudaFree(dev_depth);
-	cudaMalloc(&dev_depth, width * height * sizeof(int));
-
-	checkCUDAError("rasterizeInit");
+void rasterizeInit(int w, int h)
+{
+  width = w;
+  height = h;
+  cudaFree(dev_fragmentBuffer);
+  cudaMalloc(&dev_fragmentBuffer, width * height * sizeof(Fragment));
+  cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
+  cudaFree(dev_framebuffer);
+  cudaMalloc(&dev_framebuffer, width * height * sizeof(glm::vec3));
+  cudaMemset(dev_framebuffer, 0, width * height * sizeof(glm::vec3));
+  cudaFree(dev_depth);
+  cudaMalloc(&dev_depth, width * height * sizeof(int));
+  cudaFree(dev_lights);
+  cudaMalloc(&dev_lights, num_lights * sizeof(glm::vec3));
+  cudaMemset(dev_lights, 0, num_lights * sizeof(glm::vec3));
+
+  //init lights here
+  glm::vec3 cpu_lights[num_lights] =
+  {
+    { 2.0f, 2.0f, 2.0f },
+    { -2.0f, 2.0f, 2.0f },
+  };
+
+  cudaMemcpy(dev_lights, cpu_lights, num_lights * sizeof(glm::vec3), cudaMemcpyHostToDevice);
+
+  checkCUDAError("rasterizeInit");
 }
 
 __global__
-void initDepth(int w, int h, int * depth)
+
+void initDepth(int w, int h, int* depth)
 {
-	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
-	int y = (blockIdx.y * blockDim.y) + threadIdx.y;
-
-	if (x < w && y < h)
-	{
-		int index = x + (y * w);
-		depth[index] = INT_MAX;
-	}
+  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+  if (x < w && y < h)
+  {
+    int index = x + (y * w);
+    depth[index] = INT_MAX;
+  }
 }
 
 
@@ -187,590 +162,873 @@ void initDepth(int w, int h, int * depth)
 * kern function with support for stride to sometimes replace cudaMemcpy
 * One thread is responsible for copying one component
 */
-__global__ 
-void _deviceBufferCopy(int N, BufferByte* dev_dst, const BufferByte* dev_src, int n, int byteStride, int byteOffset, int componentTypeByteSize) {
-	
-	// Attribute (vec3 position)
-	// component (3 * float)
-	// byte (4 * byte)
-
-	// id of component
-	int i = (blockIdx.x * blockDim.x) + threadIdx.x;
-
-	if (i < N) {
-		int count = i / n;
-		int offset = i - count * n;	// which component of the attribute
-
-		for (int j = 0; j < componentTypeByteSize; j++) {
-			
-			dev_dst[count * componentTypeByteSize * n 
-				+ offset * componentTypeByteSize 
-				+ j]
-
-				= 
-
-			dev_src[byteOffset 
-				+ count * (byteStride == 0 ? componentTypeByteSize * n : byteStride) 
-				+ offset * componentTypeByteSize 
-				+ j];
-		}
-	}
-	
+__global__
 
+void _deviceBufferCopy(int N, BufferByte* dev_dst, const BufferByte* dev_src, int n, int byteStride, int byteOffset,
+                       int componentTypeByteSize)
+{
+  // Attribute (vec3 position)
+  // component (3 * float)
+  // byte (4 * byte)
+  // id of component
+  int i = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (i < N)
+  {
+    int count = i / n;
+    int offset = i - count * n; // which component of the attribute
+    for (int j = 0; j < componentTypeByteSize; j++)
+    {
+      dev_dst[count * componentTypeByteSize * n
+          + offset * componentTypeByteSize
+          + j]
+        =
+        dev_src[byteOffset
+          + count * (byteStride == 0 ? componentTypeByteSize * n : byteStride)
+          + offset * componentTypeByteSize
+          + j];
+    }
+  }
 }
 
 __global__
+
 void _nodeMatrixTransform(
-	int numVertices,
-	VertexAttributePosition* position,
-	VertexAttributeNormal* normal,
-	glm::mat4 MV, glm::mat3 MV_normal) {
-
-	// vertex id
-	int vid = (blockIdx.x * blockDim.x) + threadIdx.x;
-	if (vid < numVertices) {
-		position[vid] = glm::vec3(MV * glm::vec4(position[vid], 1.0f));
-		normal[vid] = glm::normalize(MV_normal * normal[vid]);
-	}
+  int numVertices,
+  VertexAttributePosition* position,
+  VertexAttributeNormal* normal,
+  glm::mat4 MV, glm::mat3 MV_normal)
+{
+  // vertex id
+  int vid = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (vid < numVertices)
+  {
+    position[vid] = glm::vec3(MV * glm::vec4(position[vid], 1.0f));
+    normal[vid] = glm::normalize(MV_normal * normal[vid]);
+  }
 }
 
-glm::mat4 getMatrixFromNodeMatrixVector(const tinygltf::Node & n) {
-	
-	glm::mat4 curMatrix(1.0);
-
-	const std::vector<double> &m = n.matrix;
-	if (m.size() > 0) {
-		// matrix, copy it
-
-		for (int i = 0; i < 4; i++) {
-			for (int j = 0; j < 4; j++) {
-				curMatrix[i][j] = (float)m.at(4 * i + j);
-			}
-		}
-	} else {
-		// no matrix, use rotation, scale, translation
-
-		if (n.translation.size() > 0) {
-			curMatrix[3][0] = n.translation[0];
-			curMatrix[3][1] = n.translation[1];
-			curMatrix[3][2] = n.translation[2];
-		}
-
-		if (n.rotation.size() > 0) {
-			glm::mat4 R;
-			glm::quat q;
-			q[0] = n.rotation[0];
-			q[1] = n.rotation[1];
-			q[2] = n.rotation[2];
-
-			R = glm::mat4_cast(q);
-			curMatrix = curMatrix * R;
-		}
-
-		if (n.scale.size() > 0) {
-			curMatrix = curMatrix * glm::scale(glm::vec3(n.scale[0], n.scale[1], n.scale[2]));
-		}
-	}
-
-	return curMatrix;
+glm::mat4 getMatrixFromNodeMatrixVector(const tinygltf::Node& n)
+{
+  glm::mat4 curMatrix(1.0);
+  const std::vector<double>& m = n.matrix;
+  if (m.size() > 0)
+  {
+    // matrix, copy it
+    for (int i = 0; i < 4; i++)
+    {
+      for (int j = 0; j < 4; j++)
+      {
+        curMatrix[i][j] = (float)m.at(4 * i + j);
+      }
+    }
+  }
+  else
+  {
+    // no matrix, use rotation, scale, translation
+    if (n.translation.size() > 0)
+    {
+      curMatrix[3][0] = n.translation[0];
+      curMatrix[3][1] = n.translation[1];
+      curMatrix[3][2] = n.translation[2];
+    }
+    if (n.rotation.size() > 0)
+    {
+      glm::mat4 R;
+      glm::quat q;
+      q[0] = n.rotation[0];
+      q[1] = n.rotation[1];
+      q[2] = n.rotation[2];
+      R = glm::mat4_cast(q);
+      curMatrix = curMatrix * R;
+    }
+    if (n.scale.size() > 0)
+    {
+      curMatrix = curMatrix * glm::scale(glm::vec3(n.scale[0], n.scale[1], n.scale[2]));
+    }
+  }
+  return curMatrix;
 }
 
-void traverseNode (
-	std::map<std::string, glm::mat4> & n2m,
-	const tinygltf::Scene & scene,
-	const std::string & nodeString,
-	const glm::mat4 & parentMatrix
-	) 
+void traverseNode(
+  std::map<std::string, glm::mat4>& n2m,
+  const tinygltf::Scene& scene,
+  const std::string& nodeString,
+  const glm::mat4& parentMatrix
+)
 {
-	const tinygltf::Node & n = scene.nodes.at(nodeString);
-	glm::mat4 M = parentMatrix * getMatrixFromNodeMatrixVector(n);
-	n2m.insert(std::pair<std::string, glm::mat4>(nodeString, M));
-
-	auto it = n.children.begin();
-	auto itEnd = n.children.end();
-
-	for (; it != itEnd; ++it) {
-		traverseNode(n2m, scene, *it, M);
-	}
+  const tinygltf::Node& n = scene.nodes.at(nodeString);
+  glm::mat4 M = parentMatrix * getMatrixFromNodeMatrixVector(n);
+  n2m.insert(std::pair<std::string, glm::mat4>(nodeString, M));
+  auto it = n.children.begin();
+  auto itEnd = n.children.end();
+  for (; it != itEnd; ++it)
+  {
+    traverseNode(n2m, scene, *it, M);
+  }
 }
 
-void rasterizeSetBuffers(const tinygltf::Scene & scene) {
-
-	totalNumPrimitives = 0;
-
-	std::map<std::string, BufferByte*> bufferViewDevPointers;
-
-	// 1. copy all `bufferViews` to device memory
-	{
-		std::map<std::string, tinygltf::BufferView>::const_iterator it(
-			scene.bufferViews.begin());
-		std::map<std::string, tinygltf::BufferView>::const_iterator itEnd(
-			scene.bufferViews.end());
-
-		for (; it != itEnd; it++) {
-			const std::string key = it->first;
-			const tinygltf::BufferView &bufferView = it->second;
-			if (bufferView.target == 0) {
-				continue; // Unsupported bufferView.
-			}
-
-			const tinygltf::Buffer &buffer = scene.buffers.at(bufferView.buffer);
-
-			BufferByte* dev_bufferView;
-			cudaMalloc(&dev_bufferView, bufferView.byteLength);
-			cudaMemcpy(dev_bufferView, &buffer.data.front() + bufferView.byteOffset, bufferView.byteLength, cudaMemcpyHostToDevice);
-
-			checkCUDAError("Set BufferView Device Mem");
-
-			bufferViewDevPointers.insert(std::make_pair(key, dev_bufferView));
-
-		}
-	}
-
-
-
-	// 2. for each mesh: 
-	//		for each primitive: 
-	//			build device buffer of indices, materail, and each attributes
-	//			and store these pointers in a map
-	{
-
-		std::map<std::string, glm::mat4> nodeString2Matrix;
-		auto rootNodeNamesList = scene.scenes.at(scene.defaultScene);
-
-		{
-			auto it = rootNodeNamesList.begin();
-			auto itEnd = rootNodeNamesList.end();
-			for (; it != itEnd; ++it) {
-				traverseNode(nodeString2Matrix, scene, *it, glm::mat4(1.0f));
-			}
-		}
-
-
-		// parse through node to access mesh
-
-		auto itNode = nodeString2Matrix.begin();
-		auto itEndNode = nodeString2Matrix.end();
-		for (; itNode != itEndNode; ++itNode) {
-
-			const tinygltf::Node & N = scene.nodes.at(itNode->first);
-			const glm::mat4 & matrix = itNode->second;
-			const glm::mat3 & matrixNormal = glm::transpose(glm::inverse(glm::mat3(matrix)));
-
-			auto itMeshName = N.meshes.begin();
-			auto itEndMeshName = N.meshes.end();
-
-			for (; itMeshName != itEndMeshName; ++itMeshName) {
-
-				const tinygltf::Mesh & mesh = scene.meshes.at(*itMeshName);
-
-				auto res = mesh2PrimitivesMap.insert(std::pair<std::string, std::vector<PrimitiveDevBufPointers>>(mesh.name, std::vector<PrimitiveDevBufPointers>()));
-				std::vector<PrimitiveDevBufPointers> & primitiveVector = (res.first)->second;
-
-				// for each primitive
-				for (size_t i = 0; i < mesh.primitives.size(); i++) {
-					const tinygltf::Primitive &primitive = mesh.primitives[i];
-
-					if (primitive.indices.empty())
-						return;
-
-					// TODO: add new attributes for your PrimitiveDevBufPointers when you add new attributes
-					VertexIndex* dev_indices = NULL;
-					VertexAttributePosition* dev_position = NULL;
-					VertexAttributeNormal* dev_normal = NULL;
-					VertexAttributeTexcoord* dev_texcoord0 = NULL;
-
-					// ----------Indices-------------
-
-					const tinygltf::Accessor &indexAccessor = scene.accessors.at(primitive.indices);
-					const tinygltf::BufferView &bufferView = scene.bufferViews.at(indexAccessor.bufferView);
-					BufferByte* dev_bufferView = bufferViewDevPointers.at(indexAccessor.bufferView);
-
-					// assume type is SCALAR for indices
-					int n = 1;
-					int numIndices = indexAccessor.count;
-					int componentTypeByteSize = sizeof(VertexIndex);
-					int byteLength = numIndices * n * componentTypeByteSize;
-
-					dim3 numThreadsPerBlock(128);
-					dim3 numBlocks((numIndices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
-					cudaMalloc(&dev_indices, byteLength);
-					_deviceBufferCopy << <numBlocks, numThreadsPerBlock >> > (
-						numIndices,
-						(BufferByte*)dev_indices,
-						dev_bufferView,
-						n,
-						indexAccessor.byteStride,
-						indexAccessor.byteOffset,
-						componentTypeByteSize);
-
-
-					checkCUDAError("Set Index Buffer");
-
-
-					// ---------Primitive Info-------
-
-					// Warning: LINE_STRIP is not supported in tinygltfloader
-					int numPrimitives;
-					PrimitiveType primitiveType;
-					switch (primitive.mode) {
-					case TINYGLTF_MODE_TRIANGLES:
-						primitiveType = PrimitiveType::Triangle;
-						numPrimitives = numIndices / 3;
-						break;
-					case TINYGLTF_MODE_TRIANGLE_STRIP:
-						primitiveType = PrimitiveType::Triangle;
-						numPrimitives = numIndices - 2;
-						break;
-					case TINYGLTF_MODE_TRIANGLE_FAN:
-						primitiveType = PrimitiveType::Triangle;
-						numPrimitives = numIndices - 2;
-						break;
-					case TINYGLTF_MODE_LINE:
-						primitiveType = PrimitiveType::Line;
-						numPrimitives = numIndices / 2;
-						break;
-					case TINYGLTF_MODE_LINE_LOOP:
-						primitiveType = PrimitiveType::Line;
-						numPrimitives = numIndices + 1;
-						break;
-					case TINYGLTF_MODE_POINTS:
-						primitiveType = PrimitiveType::Point;
-						numPrimitives = numIndices;
-						break;
-					default:
-						// output error
-						break;
-					};
-
-
-					// ----------Attributes-------------
-
-					auto it(primitive.attributes.begin());
-					auto itEnd(primitive.attributes.end());
-
-					int numVertices = 0;
-					// for each attribute
-					for (; it != itEnd; it++) {
-						const tinygltf::Accessor &accessor = scene.accessors.at(it->second);
-						const tinygltf::BufferView &bufferView = scene.bufferViews.at(accessor.bufferView);
-
-						int n = 1;
-						if (accessor.type == TINYGLTF_TYPE_SCALAR) {
-							n = 1;
-						}
-						else if (accessor.type == TINYGLTF_TYPE_VEC2) {
-							n = 2;
-						}
-						else if (accessor.type == TINYGLTF_TYPE_VEC3) {
-							n = 3;
-						}
-						else if (accessor.type == TINYGLTF_TYPE_VEC4) {
-							n = 4;
-						}
-
-						BufferByte * dev_bufferView = bufferViewDevPointers.at(accessor.bufferView);
-						BufferByte ** dev_attribute = NULL;
-
-						numVertices = accessor.count;
-						int componentTypeByteSize;
-
-						// Note: since the type of our attribute array (dev_position) is static (float32)
-						// We assume the glTF model attribute type are 5126(FLOAT) here
-
-						if (it->first.compare("POSITION") == 0) {
-							componentTypeByteSize = sizeof(VertexAttributePosition) / n;
-							dev_attribute = (BufferByte**)&dev_position;
-						}
-						else if (it->first.compare("NORMAL") == 0) {
-							componentTypeByteSize = sizeof(VertexAttributeNormal) / n;
-							dev_attribute = (BufferByte**)&dev_normal;
-						}
-						else if (it->first.compare("TEXCOORD_0") == 0) {
-							componentTypeByteSize = sizeof(VertexAttributeTexcoord) / n;
-							dev_attribute = (BufferByte**)&dev_texcoord0;
-						}
-
-						std::cout << accessor.bufferView << "  -  " << it->second << "  -  " << it->first << '\n';
-
-						dim3 numThreadsPerBlock(128);
-						dim3 numBlocks((n * numVertices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
-						int byteLength = numVertices * n * componentTypeByteSize;
-						cudaMalloc(dev_attribute, byteLength);
-
-						_deviceBufferCopy << <numBlocks, numThreadsPerBlock >> > (
-							n * numVertices,
-							*dev_attribute,
-							dev_bufferView,
-							n,
-							accessor.byteStride,
-							accessor.byteOffset,
-							componentTypeByteSize);
-
-						std::string msg = "Set Attribute Buffer: " + it->first;
-						checkCUDAError(msg.c_str());
-					}
-
-					// malloc for VertexOut
-					VertexOut* dev_vertexOut;
-					cudaMalloc(&dev_vertexOut, numVertices * sizeof(VertexOut));
-					checkCUDAError("Malloc VertexOut Buffer");
-
-					// ----------Materials-------------
-
-					// You can only worry about this part once you started to 
-					// implement textures for your rasterizer
-					TextureData* dev_diffuseTex = NULL;
-					int diffuseTexWidth = 0;
-					int diffuseTexHeight = 0;
-					if (!primitive.material.empty()) {
-						const tinygltf::Material &mat = scene.materials.at(primitive.material);
-						printf("material.name = %s\n", mat.name.c_str());
-
-						if (mat.values.find("diffuse") != mat.values.end()) {
-							std::string diffuseTexName = mat.values.at("diffuse").string_value;
-							if (scene.textures.find(diffuseTexName) != scene.textures.end()) {
-								const tinygltf::Texture &tex = scene.textures.at(diffuseTexName);
-								if (scene.images.find(tex.source) != scene.images.end()) {
-									const tinygltf::Image &image = scene.images.at(tex.source);
-
-									size_t s = image.image.size() * sizeof(TextureData);
-									cudaMalloc(&dev_diffuseTex, s);
-									cudaMemcpy(dev_diffuseTex, &image.image.at(0), s, cudaMemcpyHostToDevice);
-									
-									diffuseTexWidth = image.width;
-									diffuseTexHeight = image.height;
-
-									checkCUDAError("Set Texture Image data");
-								}
-							}
-						}
-
-						// TODO: write your code for other materails
-						// You may have to take a look at tinygltfloader
-						// You can also use the above code loading diffuse material as a start point 
-					}
-
-
-					// ---------Node hierarchy transform--------
-					cudaDeviceSynchronize();
-					
-					dim3 numBlocksNodeTransform((numVertices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
-					_nodeMatrixTransform << <numBlocksNodeTransform, numThreadsPerBlock >> > (
-						numVertices,
-						dev_position,
-						dev_normal,
-						matrix,
-						matrixNormal);
-
-					checkCUDAError("Node hierarchy transformation");
-
-					// at the end of the for loop of primitive
-					// push dev pointers to map
-					primitiveVector.push_back(PrimitiveDevBufPointers{
-						primitive.mode,
-						primitiveType,
-						numPrimitives,
-						numIndices,
-						numVertices,
-
-						dev_indices,
-						dev_position,
-						dev_normal,
-						dev_texcoord0,
-
-						dev_diffuseTex,
-						diffuseTexWidth,
-						diffuseTexHeight,
-
-						dev_vertexOut	//VertexOut
-					});
-
-					totalNumPrimitives += numPrimitives;
-
-				} // for each primitive
-
-			} // for each mesh
-
-		} // for each node
-
-	}
-	
-
-	// 3. Malloc for dev_primitives
-	{
-		cudaMalloc(&dev_primitives, totalNumPrimitives * sizeof(Primitive));
-	}
-	
-
-	// Finally, cudaFree raw dev_bufferViews
-	{
-
-		std::map<std::string, BufferByte*>::const_iterator it(bufferViewDevPointers.begin());
-		std::map<std::string, BufferByte*>::const_iterator itEnd(bufferViewDevPointers.end());
-			
-			//bufferViewDevPointers
-
-		for (; it != itEnd; it++) {
-			cudaFree(it->second);
-		}
-
-		checkCUDAError("Free BufferView Device Mem");
-	}
+void set_scene(int index)
+{
+  if(index >= 0 && index < objects.size())
+  {
+    dev_primitives = objects[index].dev_primitives;
+    totalNumPrimitives = objects[index].totalNumPrimitives;
+  }
+}
 
+void copy_object(int index)
+{
+  if (index >= 0 && index < objects.size())
+  {
+    //copy over pointer and primitives
+    ObjectData object_data;
+    object_data.dev_primitives = dev_primitives;
+    object_data.totalNumPrimitives = totalNumPrimitives;
+    object_data.is_copy = true;
+    objects.push_back(object_data);
+  }
+}
 
+void rasterizeSetBuffers(const tinygltf::Scene& scene)
+{
+  
+  totalNumPrimitives = 0;
+  std::map<std::string, BufferByte*> bufferViewDevPointers;
+  // 1. copy all `bufferViews` to device memory
+  {
+    std::map<std::string, tinygltf::BufferView>::const_iterator it(
+      scene.bufferViews.begin());
+    std::map<std::string, tinygltf::BufferView>::const_iterator itEnd(
+      scene.bufferViews.end());
+    for (; it != itEnd; it++)
+    {
+      const std::string key = it->first;
+      const tinygltf::BufferView& bufferView = it->second;
+      if (bufferView.target == 0)
+      {
+        continue; // Unsupported bufferView.
+      }
+      const tinygltf::Buffer& buffer = scene.buffers.at(bufferView.buffer);
+      BufferByte* dev_bufferView;
+      cudaMalloc(&dev_bufferView, bufferView.byteLength);
+      cudaMemcpy(dev_bufferView, &buffer.data.front() + bufferView.byteOffset, bufferView.byteLength,
+                 cudaMemcpyHostToDevice);
+      checkCUDAError("Set BufferView Device Mem");
+      bufferViewDevPointers.insert(std::make_pair(key, dev_bufferView));
+    }
+  }
+  // 2. for each mesh: 
+  //    for each primitive: 
+  //      build device buffer of indices, materail, and each attributes
+  //      and store these pointers in a map
+  {
+    std::map<std::string, glm::mat4> nodeString2Matrix;
+    auto rootNodeNamesList = scene.scenes.at(scene.defaultScene);
+    {
+      auto it = rootNodeNamesList.begin();
+      auto itEnd = rootNodeNamesList.end();
+      for (; it != itEnd; ++it)
+      {
+        traverseNode(nodeString2Matrix, scene, *it, glm::mat4(1.0f));
+      }
+    }
+    // parse through node to access mesh
+    auto itNode = nodeString2Matrix.begin();
+    auto itEndNode = nodeString2Matrix.end();
+    for (; itNode != itEndNode; ++itNode)
+    {
+      const tinygltf::Node& N = scene.nodes.at(itNode->first);
+      const glm::mat4& matrix = itNode->second;
+      const glm::mat3& matrixNormal = glm::transpose(glm::inverse(glm::mat3(matrix)));
+      auto itMeshName = N.meshes.begin();
+      auto itEndMeshName = N.meshes.end();
+      for (; itMeshName != itEndMeshName; ++itMeshName)
+      {
+        const tinygltf::Mesh& mesh = scene.meshes.at(*itMeshName);
+        auto res = mesh2PrimitivesMap.insert(
+          std::pair<std::string, std::vector<PrimitiveDevBufPointers>>(mesh.name, std::vector<PrimitiveDevBufPointers>()));
+        std::vector<PrimitiveDevBufPointers>& primitiveVector = (res.first)->second;
+        // for each primitive
+        for (size_t i = 0; i < mesh.primitives.size(); i++)
+        {
+          const tinygltf::Primitive& primitive = mesh.primitives[i];
+          if (primitive.indices.empty())
+            return;
+          // TODO: add new attributes for your PrimitiveDevBufPointers when you add new attributes
+          VertexIndex* dev_indices = NULL;
+          VertexAttributePosition* dev_position = NULL;
+          VertexAttributeNormal* dev_normal = NULL;
+          VertexAttributeTexcoord* dev_texcoord0 = NULL;
+          // ----------Indices-------------
+          const tinygltf::Accessor& indexAccessor = scene.accessors.at(primitive.indices);
+          const tinygltf::BufferView& bufferView = scene.bufferViews.at(indexAccessor.bufferView);
+          BufferByte* dev_bufferView = bufferViewDevPointers.at(indexAccessor.bufferView);
+          // assume type is SCALAR for indices
+          int n = 1;
+          int numIndices = indexAccessor.count;
+          int componentTypeByteSize = sizeof(VertexIndex);
+          int byteLength = numIndices * n * componentTypeByteSize;
+          dim3 numThreadsPerBlock(128);
+          dim3 numBlocks((numIndices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
+          cudaMalloc(&dev_indices, byteLength);
+          _deviceBufferCopy << <numBlocks, numThreadsPerBlock >> >(
+            numIndices,
+            (BufferByte*)dev_indices,
+            dev_bufferView,
+            n,
+            indexAccessor.byteStride,
+            indexAccessor.byteOffset,
+            componentTypeByteSize);
+          checkCUDAError("Set Index Buffer");
+          // ---------Primitive Info-------
+          // Warning: LINE_STRIP is not supported in tinygltfloader
+          int numPrimitives;
+          PrimitiveType primitiveType;
+          switch (primitive.mode)
+          {
+          case TINYGLTF_MODE_TRIANGLES:
+            primitiveType = PrimitiveType::Triangle;
+            numPrimitives = numIndices / 3;
+            break;
+          case TINYGLTF_MODE_TRIANGLE_STRIP:
+            primitiveType = PrimitiveType::Triangle;
+            numPrimitives = numIndices - 2;
+            break;
+          case TINYGLTF_MODE_TRIANGLE_FAN:
+            primitiveType = PrimitiveType::Triangle;
+            numPrimitives = numIndices - 2;
+            break;
+          case TINYGLTF_MODE_LINE:
+            primitiveType = PrimitiveType::Line;
+            numPrimitives = numIndices / 2;
+            break;
+          case TINYGLTF_MODE_LINE_LOOP:
+            primitiveType = PrimitiveType::Line;
+            numPrimitives = numIndices + 1;
+            break;
+          case TINYGLTF_MODE_POINTS:
+            primitiveType = PrimitiveType::Point;
+            numPrimitives = numIndices;
+            break;
+          default:
+            // output error
+            break;
+          };
+          // ----------Attributes-------------
+          auto it(primitive.attributes.begin());
+          auto itEnd(primitive.attributes.end());
+          int numVertices = 0;
+          // for each attribute
+          for (; it != itEnd; it++)
+          {
+            const tinygltf::Accessor& accessor = scene.accessors.at(it->second);
+            const tinygltf::BufferView& bufferView = scene.bufferViews.at(accessor.bufferView);
+            int n = 1;
+            if (accessor.type == TINYGLTF_TYPE_SCALAR)
+            {
+              n = 1;
+            }
+            else if (accessor.type == TINYGLTF_TYPE_VEC2)
+            {
+              n = 2;
+            }
+            else if (accessor.type == TINYGLTF_TYPE_VEC3)
+            {
+              n = 3;
+            }
+            else if (accessor.type == TINYGLTF_TYPE_VEC4)
+            {
+              n = 4;
+            }
+            BufferByte* dev_bufferView = bufferViewDevPointers.at(accessor.bufferView);
+            BufferByte** dev_attribute = NULL;
+            numVertices = accessor.count;
+            int componentTypeByteSize;
+            // Note: since the type of our attribute array (dev_position) is static (float32)
+            // We assume the glTF model attribute type are 5126(FLOAT) here
+            if (it->first.compare("POSITION") == 0)
+            {
+              componentTypeByteSize = sizeof(VertexAttributePosition) / n;
+              dev_attribute = (BufferByte**)&dev_position;
+            }
+            else if (it->first.compare("NORMAL") == 0)
+            {
+              componentTypeByteSize = sizeof(VertexAttributeNormal) / n;
+              dev_attribute = (BufferByte**)&dev_normal;
+            }
+            else if (it->first.compare("TEXCOORD_0") == 0)
+            {
+              componentTypeByteSize = sizeof(VertexAttributeTexcoord) / n;
+              dev_attribute = (BufferByte**)&dev_texcoord0;
+            }
+            std::cout << accessor.bufferView << "  -  " << it->second << "  -  " << it->first << '\n';
+            dim3 numThreadsPerBlock(128);
+            dim3 numBlocks((n * numVertices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
+            int byteLength = numVertices * n * componentTypeByteSize;
+
+            cudaMalloc(dev_attribute, byteLength);
+            _deviceBufferCopy << <numBlocks, numThreadsPerBlock >> >(
+              n * numVertices,
+              *dev_attribute,
+              dev_bufferView,
+              n,
+              accessor.byteStride,
+              accessor.byteOffset,
+              componentTypeByteSize);
+            std::string msg = "Set Attribute Buffer: " + it->first;
+            checkCUDAError(msg.c_str());
+          }
+          // malloc for VertexOut
+          VertexOut* dev_vertexOut;
+          cudaMalloc(&dev_vertexOut, numVertices * sizeof(VertexOut));
+          checkCUDAError("Malloc VertexOut Buffer");
+          // ----------Materials-------------
+          // You can only worry about this part once you started to 
+          // implement textures for your rasterizer
+          TextureData* dev_diffuseTex = NULL;
+          int diffuseTexWidth = 0;
+          int diffuseTexHeight = 0;
+          if (!primitive.material.empty())
+          {
+            const tinygltf::Material& mat = scene.materials.at(primitive.material);
+            printf("material.name = %s\n", mat.name.c_str());
+            if (mat.values.find("diffuse") != mat.values.end())
+            {
+              std::string diffuseTexName = mat.values.at("diffuse").string_value;
+              if (scene.textures.find(diffuseTexName) != scene.textures.end())
+              {
+                const tinygltf::Texture& tex = scene.textures.at(diffuseTexName);
+                if (scene.images.find(tex.source) != scene.images.end())
+                {
+                  const tinygltf::Image& image = scene.images.at(tex.source);
+                  size_t s = image.image.size() * sizeof(TextureData);
+
+                  cudaMalloc(&dev_diffuseTex, s);
+                  cudaMemcpy(dev_diffuseTex, &image.image.at(0), s, cudaMemcpyHostToDevice);
+                  diffuseTexWidth = image.width;
+                  diffuseTexHeight = image.height;
+                  checkCUDAError("Set Texture Image data");
+                }
+              }
+            }
+            // TODO: write your code for other materails
+            // You may have to take a look at tinygltfloader
+            // You can also use the above code loading diffuse material as a start point 
+          }
+          // ---------Node hierarchy transform--------
+          cudaDeviceSynchronize();
+          dim3 numBlocksNodeTransform((numVertices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
+          _nodeMatrixTransform << <numBlocksNodeTransform, numThreadsPerBlock >> >(
+            numVertices,
+            dev_position,
+            dev_normal,
+            matrix,
+            matrixNormal);
+          checkCUDAError("Node hierarchy transformation");
+          // at the end of the for loop of primitive
+          // push dev pointers to map
+          primitiveVector.push_back(PrimitiveDevBufPointers{
+            primitive.mode,
+            primitiveType,
+            numPrimitives,
+            numIndices,
+            numVertices,
+            dev_indices,
+            dev_position,
+            dev_normal,
+            dev_texcoord0,
+            dev_diffuseTex,
+            diffuseTexWidth,
+            diffuseTexHeight,
+            dev_vertexOut //VertexOut
+          });
+          totalNumPrimitives += numPrimitives;
+        } // for each primitive
+      } // for each mesh
+    } // for each node
+  }
+  // 3. Malloc for dev_primitives
+  {
+    cudaMalloc(&dev_primitives, totalNumPrimitives * sizeof(Primitive));
+  }
+  // Finally, cudaFree raw dev_bufferViews
+  {
+    std::map<std::string, BufferByte*>::const_iterator it(bufferViewDevPointers.begin());
+    std::map<std::string, BufferByte*>::const_iterator itEnd(bufferViewDevPointers.end());
+    //bufferViewDevPointers
+    for (; it != itEnd; it++)
+    {
+      cudaFree(it->second);
+    }
+    checkCUDAError("Free BufferView Device Mem");
+  }
+
+    //copy over pointer and primitives
+    ObjectData object_data;
+    object_data.dev_primitives = dev_primitives;
+    object_data.totalNumPrimitives = totalNumPrimitives;
+    objects.push_back(object_data);
 }
 
 
+__global__
 
-__global__ 
 void _vertexTransformAndAssembly(
-	int numVertices, 
-	PrimitiveDevBufPointers primitive, 
-	glm::mat4 MVP, glm::mat4 MV, glm::mat3 MV_normal, 
-	int width, int height) {
-
-	// vertex id
-	int vid = (blockIdx.x * blockDim.x) + threadIdx.x;
-	if (vid < numVertices) {
-
-		// TODO: Apply vertex transformation here
-		// Multiply the MVP matrix for each vertex position, this will transform everything into clipping space
-		// Then divide the pos by its w element to transform into NDC space
-		// Finally transform x and y to viewport space
-
-		// TODO: Apply vertex assembly here
-		// Assemble all attribute arraies into the primitive array
-		
-	}
-}
+  int numVertices,
+  PrimitiveDevBufPointers primitive,
+  glm::mat4 MVP, glm::mat4 MV, glm::mat3 MV_normal,
+  int width, int height)
+{
+  // vertex id
+  int vid = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (vid < numVertices)
+  {
+    //textures
+    primitive.dev_verticesOut[vid].dev_diffuseTex = 0;
+
+    //check if textures exist
+    if (primitive.dev_diffuseTex)
+    {
+      primitive.dev_verticesOut[vid].dev_diffuseTex = primitive.dev_diffuseTex;
+      primitive.dev_verticesOut[vid].texWidth = primitive.diffuseTexWidth;
+      primitive.dev_verticesOut[vid].texHeight = primitive.diffuseTexHeight;
+      primitive.dev_verticesOut[vid].texcoord0 = primitive.dev_texcoord0[vid];
+    }
 
+    // TODO: Apply vertex transformation here
+    // Multiply the MVP matrix for each vertex position, this will transform everything into clipping space
+    // Then divide the pos by its w element to transform into NDC space
+    // Finally transform x and y to viewport space
+
+    //clip
+    primitive.dev_verticesOut[vid].pos = MVP * glm::vec4(primitive.dev_position[vid], 1.0f);
+
+    //ndc
+    primitive.dev_verticesOut[vid].pos /= primitive.dev_verticesOut[vid].pos.w;
+
+    //screen space
+    const float width_ndc = static_cast<float>(width) * 0.5f;
+    const float height_ndc = static_cast<float>(height) * 0.5f;
+    primitive.dev_verticesOut[vid].pos.x = width_ndc * (primitive.dev_verticesOut[vid].pos.x + 1.0f);
+    primitive.dev_verticesOut[vid].pos.y = height_ndc * (1.0f - primitive.dev_verticesOut[vid].pos.y);
+    primitive.dev_verticesOut[vid].pos.z = 0.5f * (1.0f + primitive.dev_verticesOut[vid].pos.z);
+
+    // TODO: Apply vertex assembly here
+    // Assemble all attribute arraies into the primitive array
+    primitive.dev_verticesOut[vid].eyeNor = MV_normal * primitive.dev_normal[vid];
+    primitive.dev_verticesOut[vid].eyeNor = glm::normalize(primitive.dev_verticesOut[vid].eyeNor);
+    primitive.dev_verticesOut[vid].eyePos = glm::vec3(MV * glm::vec4(primitive.dev_position[vid], 1.0f));
+  }
+}
 
 
 static int curPrimitiveBeginId = 0;
 
-__global__ 
-void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_primitives, PrimitiveDevBufPointers primitive) {
-
-	// index id
-	int iid = (blockIdx.x * blockDim.x) + threadIdx.x;
-
-	if (iid < numIndices) {
+__global__
 
-		// TODO: uncomment the following code for a start
-		// This is primitive assembly for triangles
+void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_primitives,
+                        PrimitiveDevBufPointers primitive)
+{
+  // index id
+  int iid = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (iid < numIndices)
+  {
+    // TODO: uncomment the following code for a start
+    // This is primitive assembly for triangles
+    int pid; // id for cur primitives vector
+    if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES)
+    {
+      pid = iid / (int)primitive.primitiveType;
+      dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
+        = primitive.dev_verticesOut[primitive.dev_indices[iid]];
+    }
+    // TODO: other primitive types (point, line)
+  }
+}
 
-		//int pid;	// id for cur primitives vector
-		//if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) {
-		//	pid = iid / (int)primitive.primitiveType;
-		//	dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
-		//		= primitive.dev_verticesOut[primitive.dev_indices[iid]];
-		//}
+__global__ void backface_cull(int totalPrimitives, glm::vec3 camera, Primitive* primitives)
+{
+  int vid = blockIdx.x * blockDim.x + threadIdx.x;
+  if (vid < totalPrimitives)
+  {
+    glm::vec3 triangle_pos[3] =
+    {
+      glm::vec3(primitives[vid].v[0].pos),
+      glm::vec3(primitives[vid].v[1].pos),
+      glm::vec3(primitives[vid].v[2].pos)
+    };
+
+    glm::vec3 dir_1 = triangle_pos[0] - triangle_pos[1];
+    glm::vec3 dir_2 = triangle_pos[2] - triangle_pos[2];
+    glm::vec3 triangle_normal = glm::cross(dir_1, dir_2);
+
+    primitives[vid].backface_culled = false;
+
+    if(glm::dot(camera, triangle_normal) < 0.0f)
+    {
+      primitives[vid].backface_culled = true;
+    }
+  }
+}
 
+//stream compaction for backface culling
+struct HostDeviceSteamCompactionCallback {
+  __host__ __device__ bool operator()(const Primitive &p) {
+    return !p.backface_culled;
+  };
+};
 
-		// TODO: other primitive types (point, line)
-	}
-	
+__global__ void rasterize_triangles(int totalPrimitives, int width, int height, int* depths,
+                                    Primitive* primitives, Fragment* fragments)
+{
+  int vid = blockIdx.x * blockDim.x + threadIdx.x;
+  if (vid < totalPrimitives)
+  {
+    glm::vec3 triangle_pos[3] =
+    {
+      glm::vec3(primitives[vid].v[0].pos),
+      glm::vec3(primitives[vid].v[1].pos),
+      glm::vec3(primitives[vid].v[2].pos)
+    };
+
+    glm::vec2 triangle_texcoords[3] =
+    {
+      primitives[vid].v[0].texcoord0,
+      primitives[vid].v[1].texcoord0,
+      primitives[vid].v[2].texcoord0,
+    };
+
+    int texture_width = primitives[vid].v[0].texWidth;
+    int texture_height = primitives[vid].v[0].texHeight;
+
+    //for correct color interpolation
+    glm::vec3 triangle_colors[3] =
+    {
+      glm::vec3(1.0f, 0.0f, 0.0f),
+      glm::vec3(0.0f, 1.0f, 0.0f),
+      glm::vec3(0.0f, 0.0f, 1.0f),
+    };
+
+    glm::vec3 eye_pos[3] =
+    {
+      primitives[vid].v[0].eyePos,
+      primitives[vid].v[1].eyePos,
+      primitives[vid].v[2].eyePos,
+    };
+    
+    glm::vec3 eye_normal[3] =
+    {
+      primitives[vid].v[0].eyeNor,
+      primitives[vid].v[1].eyeNor,
+      primitives[vid].v[2].eyeNor,
+    };
+
+    //get aabb
+    AABB triangle_aabb = getAABBForTriangle(triangle_pos);
+
+    //clamp between screen size
+    triangle_aabb = [width, height](int min_x, int max_x, int min_y, int max_y)
+    {
+      AABB result{};
+      result.min.x = glm::clamp(min_x, 0, width - 1);
+      result.max.x = glm::clamp(max_x, 0, width - 1);
+      result.min.y = glm::clamp(min_y, 0, height - 1);
+      result.max.y = glm::clamp(max_y, 0, height - 1);
+      return result;
+    }(triangle_aabb.min.x, triangle_aabb.max.x,
+      triangle_aabb.min.y, triangle_aabb.max.y);
+
+    //scanline using baycentric
+    for (int x = triangle_aabb.min.x; x <= triangle_aabb.max.x; x++)
+    {
+      for (int y = triangle_aabb.min.y; y <= triangle_aabb.max.y; y++)
+      {
+        //caclulate baycentric (if pixel is on triangle)
+        const glm::vec2 pixel_space{x, y};
+        const glm::vec3 barycentric_coordinate = calculateBarycentricCoordinate(triangle_pos, pixel_space);
+
+        if(isBarycentricCoordInBounds(barycentric_coordinate))
+        {
+          float depth = -getZAtCoordinate(barycentric_coordinate, triangle_pos);
+          float depth_in_int = depth * 1000.0f;
+          int pixel = y * width + x;
+
+          //depth test (get the pixel closest)
+          const int old_depth = atomicMin(&depths[pixel], depth_in_int);
+
+          //fragment shading
+
+          //check if depth was closer (draw pixel on top)
+          if(old_depth != depths[pixel])
+          {
+            float eye_pos1_z = eye_pos[0].z;
+            float eye_pos2_z = eye_pos[1].z;
+            float eye_pos3_z = eye_pos[2].z;
+            float bary_correct_x = barycentric_coordinate.x / eye_pos1_z;
+            float bary_correct_y = barycentric_coordinate.y / eye_pos2_z;
+            float bary_correct_z = barycentric_coordinate.z / eye_pos3_z;
+            float perspective_correct_z = 1.0f / (bary_correct_x + bary_correct_y + bary_correct_z);
+
+            //debugging depth
+            //fragments[pixel].color = glm::vec3(depth);
+
+            //normals
+            //fragments[pixel].color = ;
+
+            //perspective correct normal
+            const glm::vec3 perspective_correct_eye_normal = 
+              (
+              barycentric_coordinate.x * (eye_normal[0] / eye_pos1_z) +
+              barycentric_coordinate.y * (eye_normal[1] / eye_pos2_z) +
+              barycentric_coordinate.z * (eye_normal[2] / eye_pos3_z)
+              ) * perspective_correct_z;
+
+            fragments[pixel].eyeNor = perspective_correct_eye_normal;
+
+            //textures 
+
+            //perspective correct texture coordinate
+            const glm::vec2 perspective_correct_texcoord = 
+              (
+              barycentric_coordinate.x * (triangle_texcoords[0] / eye_pos1_z) +
+              barycentric_coordinate.y * (triangle_texcoords[1] / eye_pos2_z) +
+              barycentric_coordinate.z * (triangle_texcoords[2] / eye_pos3_z)
+              ) * perspective_correct_z;
+
+            fragments[pixel].texcoord0 = perspective_correct_texcoord;
+
+            TextureData* diffuse_texture = primitives[vid].v->dev_diffuseTex;
+            fragments[pixel].dev_diffuseTex = diffuse_texture;
+            if(diffuse_texture)
+            {
+              auto sample_texture = [&](int u, int v)
+              {
+                int v_height = v * texture_width;
+                int u_v_index = 3 * (u + v_height);
+                glm::vec3 texture_color =
+                {
+                  diffuse_texture[u_v_index],
+                  diffuse_texture[u_v_index + 1],
+                  diffuse_texture[u_v_index + 2]
+                };
+                //put in range 0 -> 1
+                texture_color /= 255.0f;
+                return texture_color;
+              };
+
+              //bilinear
+#ifdef BILINEAR_FILTERING
+              float u_float = static_cast<float>(texture_width) * perspective_correct_texcoord[0];
+              float v_float = static_cast<float>(texture_height) * perspective_correct_texcoord[1];
+
+              //4 points
+              int u_int = static_cast<int>(glm::floor(u_float));
+              int v_int = static_cast<int>(glm::floor(v_float));
+              int u_int_plus_one = glm::clamp(u_int + 1, 0, texture_width - 1);
+              int v_int_plus_one = glm::clamp(v_int + 1, 0, texture_height - 1);
+
+              //calculate difference (will be used in mixing
+              float u_diff = u_float - static_cast<float>(u_int);
+              float v_diff = v_float - static_cast<float>(v_int);
+
+              //sample 4 points (bilinear mix between them)
+              const auto sample_mix_1 = glm::mix(sample_texture(u_int, v_int), sample_texture(u_int, v_int_plus_one), v_diff);
+              const auto sample_mix_2 = glm::mix(sample_texture(u_int_plus_one, v_int), sample_texture(u_int_plus_one, v_int_plus_one), v_diff);
+              const auto sample_mix_final = glm::mix(sample_mix_1, sample_mix_2, u_diff);
+
+              fragments[pixel].color = sample_mix_final;
+#else
+              //not bilinear
+              int u = texture_width * perspective_correct_texcoord[0];
+              int v = texture_height * perspective_correct_texcoord[1];
+              fragments[pixel].color = sample_texture(u, v);
+#endif
+            } 
+            //force color triangle interpolation (no texture)
+            else
+            {
+              //perspective correct color
+              const glm::vec3 perspective_correct_color =
+              (
+                barycentric_coordinate.x * (triangle_colors[0] / eye_pos1_z) +
+                barycentric_coordinate.y * (triangle_colors[1] / eye_pos2_z) +
+                barycentric_coordinate.z * (triangle_colors[2] / eye_pos3_z)
+              ) * perspective_correct_z;
+              fragments[pixel].color = perspective_correct_color;
+            }
+#ifdef COLOR_TRIANGLE_INTERPOLATION
+            //perspective correct color
+            const glm::vec3 perspective_correct_color = 
+              (
+              barycentric_coordinate.x * (triangle_colors[0] / eye_pos1_z) +
+              barycentric_coordinate.y * (triangle_colors[1] / eye_pos2_z) +
+              barycentric_coordinate.z * (triangle_colors[2] / eye_pos3_z)
+              ) * perspective_correct_z;
+
+            fragments[pixel].color = perspective_correct_color;
+#endif
+          }
+        }
+      }
+    }
+  }
 }
 
-
+int sideLength2d = 8;
+dim3 blockSize2d(sideLength2d, sideLength2d);
+dim3 blockCount2d((width - 1) / blockSize2d.x + 1,
+                  (height - 1) / blockSize2d.y + 1);
 
 /**
  * Perform rasterization.
  */
-void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal) {
-    int sideLength2d = 8;
-    dim3 blockSize2d(sideLength2d, sideLength2d);
-    dim3 blockCount2d((width  - 1) / blockSize2d.x + 1,
-		(height - 1) / blockSize2d.y + 1);
-
-	// Execute your rasterization pipeline here
-	// (See README for rasterization pipeline outline.)
-
-	// Vertex Process & primitive assembly
-	{
-		curPrimitiveBeginId = 0;
-		dim3 numThreadsPerBlock(128);
-
-		auto it = mesh2PrimitivesMap.begin();
-		auto itEnd = mesh2PrimitivesMap.end();
-
-		for (; it != itEnd; ++it) {
-			auto p = (it->second).begin();	// each primitive
-			auto pEnd = (it->second).end();
-			for (; p != pEnd; ++p) {
-				dim3 numBlocksForVertices((p->numVertices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
-				dim3 numBlocksForIndices((p->numIndices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
-
-				_vertexTransformAndAssembly << < numBlocksForVertices, numThreadsPerBlock >> >(p->numVertices, *p, MVP, MV, MV_normal, width, height);
-				checkCUDAError("Vertex Processing");
-				cudaDeviceSynchronize();
-				_primitiveAssembly << < numBlocksForIndices, numThreadsPerBlock >> >
-					(p->numIndices, 
-					curPrimitiveBeginId, 
-					dev_primitives, 
-					*p);
-				checkCUDAError("Primitive Assembly");
-
-				curPrimitiveBeginId += p->numPrimitives;
-			}
-		}
-
-		checkCUDAError("Vertex Processing and Primitive Assembly");
-	}
-	
-	cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
-	initDepth << <blockCount2d, blockSize2d >> >(width, height, dev_depth);
-	
-	// TODO: rasterize
-
-
-
-    // Copy depthbuffer colors into framebuffer
-	render << <blockCount2d, blockSize2d >> >(width, height, dev_fragmentBuffer, dev_framebuffer);
-	checkCUDAError("fragment shader");
-    // Copy framebuffer into OpenGL buffer for OpenGL previewing
-    sendImageToPBO<<<blockCount2d, blockSize2d>>>(pbo, width, height, dev_framebuffer);
-    checkCUDAError("copy render result to pbo");
+void rasterize(uchar4* pbo, const glm::mat4& MVP, const glm::mat4& MV, const glm::mat3 MV_normal, glm::vec3& camera_pos)
+{
+  blockCount2d = dim3((width - 1) / blockSize2d.x + 1,
+                  (height - 1) / blockSize2d.y + 1);
+
+  // Execute your rasterization pipeline here
+  // (See README for rasterization pipeline outline.)
+  // Vertex Process & primitive assembly
+  {
+    curPrimitiveBeginId = 0;
+    dim3 numThreadsPerBlock(128);
+    auto it = mesh2PrimitivesMap.begin();
+    auto itEnd = mesh2PrimitivesMap.end();
+    for (; it != itEnd; ++it)
+    {
+      auto p = (it->second).begin(); // each primitive
+      auto pEnd = (it->second).end();
+      for (; p != pEnd; ++p)
+      {
+        dim3 numBlocksForVertices((p->numVertices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
+        dim3 numBlocksForIndices((p->numIndices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
+#ifdef PRINT_CLOCK
+        clock_now = std::chrono::high_resolution_clock::now();
+#endif
+        _vertexTransformAndAssembly << < numBlocksForVertices, numThreadsPerBlock >> >(
+          p->numVertices, *p, MVP, MV, MV_normal, width, height);
+        checkCUDAError("Vertex Processing");
+        cudaDeviceSynchronize();
+#ifdef PRINT_CLOCK
+        std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - clock_now).count() << std::endl;
+#endif
+
+#ifdef PRINT_CLOCK
+        clock_now = std::chrono::high_resolution_clock::now();
+#endif
+        _primitiveAssembly << < numBlocksForIndices, numThreadsPerBlock >> >
+        (p->numIndices,
+         curPrimitiveBeginId,
+         dev_primitives,
+         *p);
+        checkCUDAError("Primitive Assembly");
+        cudaDeviceSynchronize();
+
+#ifdef PRINT_CLOCK
+        std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - clock_now).count() << std::endl;
+#endif
+        curPrimitiveBeginId += p->numPrimitives;
+      }
+    }
+    checkCUDAError("Vertex Processing and Primitive Assembly");
+  }
+  cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
+  initDepth << <blockCount2d, blockSize2d >> >(width, height, dev_depth);
+
+  const int blockSize1d = 128;
+  int remaining_primitives = totalNumPrimitives;
+  dim3 num_triangles((remaining_primitives + blockSize1d - 1) / blockSize1d);
+
+  //backface culling
+#ifdef BACKFACE_CULLING
+  backface_cull<<<num_triangles, blockSize1d>>>(remaining_primitives, camera_pos, dev_primitives);
+  
+  //stream compact away backface culled triangled using thrust
+  thrust::device_ptr<Primitive> dev_primitive_ptr_start = thrust::device_pointer_cast(dev_primitives);
+  thrust::device_ptr<Primitive> dev_primitive_ptr_end = thrust::device_pointer_cast(
+    dev_primitives + remaining_primitives);
+
+  //perform stream compaction  
+  thrust::device_ptr<Primitive> new_dev_primitive_end = thrust::partition(
+    dev_primitive_ptr_start, dev_primitive_ptr_end, HostDeviceSteamCompactionCallback());
+  
+  Primitive* dev_primitive_end = thrust::raw_pointer_cast(new_dev_primitive_end);
+
+  //update the primitive counts
+  remaining_primitives = dev_primitive_end - dev_primitives;
+#endif
+
+  // TODO: rasterize
+#ifdef PRINT_CLOCK
+        clock_now = std::chrono::high_resolution_clock::now();
+#endif
+  rasterize_triangles<<<num_triangles, blockSize1d>>>(remaining_primitives, width, height, dev_depth, dev_primitives, dev_fragmentBuffer);
+        cudaDeviceSynchronize();
+#ifdef PRINT_CLOCK
+        std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - clock_now).count() << std::endl;
+#endif
+  // Copy depthbuffer colors into framebuffer
+  glm::vec3 camera_pos_in_MV = glm::vec3(MV * glm::vec4(camera_pos, 1.0f));
+#ifdef PRINT_CLOCK
+        clock_now = std::chrono::high_resolution_clock::now();
+#endif
+  render << <blockCount2d, blockSize2d >> >(width, height, dev_fragmentBuffer, dev_framebuffer, dev_lights, num_lights, camera_pos_in_MV);
+        cudaDeviceSynchronize();
+#ifdef PRINT_CLOCK
+        std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - clock_now).count() << std::endl;
+#endif
+  checkCUDAError("fragment shader");
+
+}
+
+void zero_frame_buffer()
+{
+  cudaMemset(dev_framebuffer, 0, width * height * sizeof(glm::vec3));
+}
+
+void write_to_pbo(uchar4* pbo)
+{
+  // Copy framebuffer into OpenGL buffer for OpenGL previewing
+  sendImageToPBO<<<blockCount2d, blockSize2d>>>(pbo, width, height, dev_framebuffer);
+  checkCUDAError("copy render result to pbo");
 }
 
 /**
  * Called once at the end of the program to free CUDA memory.
  */
-void rasterizeFree() {
-
-    // deconstruct primitives attribute/indices device buffer
-
-	auto it(mesh2PrimitivesMap.begin());
-	auto itEnd(mesh2PrimitivesMap.end());
-	for (; it != itEnd; ++it) {
-		for (auto p = it->second.begin(); p != it->second.end(); ++p) {
-			cudaFree(p->dev_indices);
-			cudaFree(p->dev_position);
-			cudaFree(p->dev_normal);
-			cudaFree(p->dev_texcoord0);
-			cudaFree(p->dev_diffuseTex);
-
-			cudaFree(p->dev_verticesOut);
-
-			
-			//TODO: release other attributes and materials
-		}
-	}
-
-	////////////
-
-    cudaFree(dev_primitives);
-    dev_primitives = NULL;
-
-	cudaFree(dev_fragmentBuffer);
-	dev_fragmentBuffer = NULL;
-
-    cudaFree(dev_framebuffer);
-    dev_framebuffer = NULL;
-
-	cudaFree(dev_depth);
-	dev_depth = NULL;
-
-    checkCUDAError("rasterize Free");
+void rasterizeFree()
+{
+  // deconstruct primitives attribute/indices device buffer
+  auto it(mesh2PrimitivesMap.begin());
+  auto itEnd(mesh2PrimitivesMap.end());
+  for (; it != itEnd; ++it)
+  {
+    for (auto p = it->second.begin(); p != it->second.end(); ++p)
+    {
+      cudaFree(p->dev_indices);
+      cudaFree(p->dev_position);
+      cudaFree(p->dev_normal);
+      cudaFree(p->dev_texcoord0);
+      cudaFree(p->dev_diffuseTex);
+      cudaFree(p->dev_verticesOut);
+      //TODO: release other attributes and materials
+    }
+  }
+  ////////////
+  // cudaFree(dev_primitives);
+  // dev_primitives = NULL;
+  for(auto& object : objects)
+  {
+    free(object.dev_primitives);
+  }
+  cudaFree(dev_fragmentBuffer);
+  dev_fragmentBuffer = NULL;
+  cudaFree(dev_framebuffer);
+  dev_framebuffer = NULL;
+  cudaFree(dev_depth);
+  dev_depth = NULL;
+  cudaFree(dev_lights);
+  dev_lights = NULL;
+  checkCUDAError("rasterize Free");
 }
diff --git a/src/rasterize.h b/src/rasterize.h
index 560aae9..ba420f3 100644
--- a/src/rasterize.h
+++ b/src/rasterize.h
@@ -11,14 +11,110 @@
 #include <glm/glm.hpp>
 #include <glm/gtc/matrix_transform.hpp>
 #include <glm/gtx/transform.hpp>
+#include <chrono>
 
-namespace tinygltf{
-	class Scene;
+namespace tinygltf {
+class Scene;
 }
 
+namespace {
+typedef unsigned short VertexIndex;
+typedef glm::vec3 VertexAttributePosition;
+typedef glm::vec3 VertexAttributeNormal;
+typedef glm::vec2 VertexAttributeTexcoord;
+typedef unsigned char TextureData;
+
+typedef unsigned char BufferByte;
+
+enum PrimitiveType { Point = 1, Line = 2, Triangle = 3 };
+
+struct VertexOut {
+  glm::vec4 pos;
+
+  // TODO: add new attributes to your VertexOut
+  // The attributes listed below might be useful,
+  // but always feel free to modify on your own
+
+  glm::vec3 eyePos; // eye space position used for shading
+  glm::vec3 eyeNor; // eye space normal used for shading, cuz normal will go
+                    // wrong after perspective transformation
+  // glm::vec3 col;
+  glm::vec2 texcoord0;
+  TextureData *dev_diffuseTex = NULL;
+  int texWidth, texHeight;
+  // ...
+};
+
+struct Primitive {
+  PrimitiveType primitiveType = Triangle; // C++ 11 init
+  VertexOut v[3];
+  bool backface_culled = false;
+};
+
+struct Fragment {
+  glm::vec3 color;
+
+  // TODO: add new attributes to your Fragment
+  // The attributes listed below might be useful,
+  // but always feel free to modify on your own
+
+  glm::vec3 eyePos; // eye space position used for shading
+  glm::vec3 eyeNor;
+  VertexAttributeTexcoord texcoord0;
+  TextureData *dev_diffuseTex;
+
+  // ...
+};
+
+struct PrimitiveDevBufPointers {
+  int primitiveMode; // from tinygltfloader macro
+  PrimitiveType primitiveType;
+  int numPrimitives;
+  int numIndices;
+  int numVertices;
+
+  // Vertex In, const after loaded
+  VertexIndex *dev_indices;
+  VertexAttributePosition *dev_position;
+  VertexAttributeNormal *dev_normal;
+  VertexAttributeTexcoord *dev_texcoord0;
+
+  // Materials, add more attributes when needed
+  TextureData *dev_diffuseTex;
+  int diffuseTexWidth;
+  int diffuseTexHeight;
+  // TextureData* dev_specularTex;
+  // TextureData* dev_normalTex;
+  // ...
+
+  // Vertex Out, vertex used for rasterization, this is changing every frame
+  VertexOut *dev_verticesOut;
+
+  // TODO: add more attributes when needed
+};
+} // namespace
+
+struct ObjectData {
+  int totalNumPrimitives = 0;
+  Primitive *dev_primitives = NULL;
+  bool is_copy = false;
+  glm::vec3 transformation = {0.0f, 0.0f, -10.0f};
+  bool is_deleted = false;
+};
+
+extern std::vector<ObjectData> objects;
+extern std::chrono::time_point<std::chrono::high_resolution_clock> clock_now;
 
 void rasterizeInit(int width, int height);
-void rasterizeSetBuffers(const tinygltf::Scene & scene);
+void rasterizeSetBuffers(const tinygltf::Scene &scene);
 
-void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal);
+void rasterize(uchar4 *pbo, const glm::mat4 &MVP, const glm::mat4 &MV,
+               const glm::mat3 MV_normal, glm::vec3 &camera_pos);
 void rasterizeFree();
+
+void set_scene(int index);
+void copy_object(int index);
+
+void zero_frame_buffer();
+
+void write_to_pbo(uchar4* pbo);
diff --git a/triangle_color.png b/triangle_color.png
new file mode 100644
index 0000000..395702f
Binary files /dev/null and b/triangle_color.png differ
diff --git a/truck1.png b/truck1.png
new file mode 100644
index 0000000..8690397
Binary files /dev/null and b/truck1.png differ
diff --git a/truck2.png b/truck2.png
new file mode 100644
index 0000000..5cf6d59
Binary files /dev/null and b/truck2.png differ
diff --git a/truck3.png b/truck3.png
new file mode 100644
index 0000000..08d03a1
Binary files /dev/null and b/truck3.png differ
diff --git a/truck4.png b/truck4.png
new file mode 100644
index 0000000..463c8fe
Binary files /dev/null and b/truck4.png differ
diff --git a/truck5.png b/truck5.png
new file mode 100644
index 0000000..17ec9a8
Binary files /dev/null and b/truck5.png differ