diff --git a/README.md b/README.md
index ae0896a..e08ea0a 100644
--- a/README.md
+++ b/README.md
@@ -1,184 +1,136 @@
--------------------------------------------------------------------------------
-CIS565: Project 4: CUDA Rasterizer
--------------------------------------------------------------------------------
-Fall 2014
--------------------------------------------------------------------------------
-Due Monday 10/27/2014 @ 12 PM
+﻿-------------------------------------------------------------------------------
+Software Rasterizer implemented using CUDA
 -------------------------------------------------------------------------------
 
 -------------------------------------------------------------------------------
-NOTE:
--------------------------------------------------------------------------------
-This project requires an NVIDIA graphics card with CUDA capability! Any card with CUDA compute capability 1.1 or higher will work fine for this project. For a full list of CUDA capable cards and their compute capability, please consult: http://developer.nvidia.com/cuda/cuda-gpus. If you do not have an NVIDIA graphics card in the machine you are working on, feel free to use any machine in the SIG Lab or in Moore100 labs. All machines in the SIG Lab and Moore100 are equipped with CUDA capable NVIDIA graphics cards. If this too proves to be a problem, please contact Patrick or Karl as soon as possible.
+This is a CUDA based software implementation of a standard rasterized graphics pipeline, very similar to the OpenGL pipeline.
+The following is a quick overview of the structure and features of my rasterizor. Implementation details will be explained later.
+###Pipe-line stages:
 
--------------------------------------------------------------------------------
-INTRODUCTION:
--------------------------------------------------------------------------------
-In this project, you will implement a simplified CUDA based implementation of a standard rasterized graphics pipeline, similar to the OpenGL pipeline. In this project, you will implement vertex shading, primitive assembly, perspective transformation, rasterization, fragment shading, and write the resulting fragments to a framebuffer. More information about the rasterized graphics pipeline can be found in the class slides and in your notes from CIS560.
+* Vertex Shading
+* Primitive Assembly
+* Back-Face Culling
+* Scanline rasterization
+* Fragment Shading
+* Render
 
-The basecode provided includes an OBJ loader and much of the mundane I/O and bookkeeping code. The basecode also includes some functions that you may find useful, described below. The core rasterization pipeline is left for you to implement.
+###Other features:
 
-You MAY NOT use ANY raycasting/raytracing AT ALL in this project, EXCEPT in the fragment shader step. One of the purposes of this project is to see how a rasterization pipeline can generate graphics WITHOUT the need for raycasting! Raycasting may only be used in the fragment shader effect for interesting shading results, but is absolutely not allowed in any other stages of the pipeline.
+* Mouse-based interactive camera
+* Mesh View
+* Vertices View
+* Color interpolation
 
-Also, you MAY NOT use OpenGL ANYWHERE in this project, aside from the given OpenGL code for drawing Pixel Buffer Objects to the screen. Use of OpenGL for any pipeline stage instead of your own custom implementation will result in an incomplete project.
+###Reulsts
 
-Finally, note that while this basecode is meant to serve as a strong starting point for a CUDA rasterizer, you are not required to use this basecode if you wish, and you may also change any part of the basecode specification as you please, so long as the final rendered result is correct.
+[Video Demo] (https://www.youtube.com/watch?v=_TUVPTLyZR0&list=UU8ix41TAtWn-RDD6gC8ml3A)
 
--------------------------------------------------------------------------------
-CONTENTS:
--------------------------------------------------------------------------------
-The Project4 root directory contains the following subdirectories:
-	
-* src/ contains the source code for the project. Both the Windows Visual Studio solution and the OSX makefile reference this folder for all source; the base source code compiles on OSX and Windows without modification.
-* objs/ contains example obj test files: cow.obj, cube.obj, tri.obj.
-* renders/ contains an example render of the given example cow.obj file with a z-depth fragment shader. 
-* windows/ contains a Windows Visual Studio 2010 project and all dependencies needed for building and running on Windows 7.
+Tyra (200,000 faces)
+![](tyra1.jpg)
 
-The Windows and OSX versions of the project build and run exactly the same way as in Project0, Project1, and Project2.
+Cow (5,804 faces) - flat shading
+![] (cowFlat.jpg)
 
--------------------------------------------------------------------------------
-REQUIREMENTS:
--------------------------------------------------------------------------------
-In this project, you are given code for:
+Cow (5,804 faces) - phong shading
+![] (cowPhong.jpg)
+
+Donut - phong
+![] (donutPhong.jpg)
 
-* A library for loading/reading standard Alias/Wavefront .obj format mesh files and converting them to OpenGL style VBOs/IBOs
-* A suggested order of kernels with which to implement the graphics pipeline
-* Working code for CUDA-GL interop
+Donut - mesh
+![] (donutMesh.jpg)
 
-You will need to implement the following stages of the graphics pipeline and features:
+Armadillo (212,000 faces) - camera space normal
+![](armaNormal.jpg)
 
+Dragon (100,000 faces) - vertices
+![](dragonCloud.jpg)
+
+###Implementation Details
 * Vertex Shading
-* Primitive Assembly with support for triangle VBOs/IBOs
-* Perspective Transformation
-* Rasterization through either a scanline or a tiled approach
+
+Takes in vertex buffer, normal buffer and modelView matrix and transform vertices and normals to camera space. CUDA Kernel parallel on vertex i.e. each thread deals with one vertex.
+
+* Primitive Assembly
+
+Takes in vertex buffer, normal buffer and index buffer, then assembles them into primitives as an array of triangles, the normal for the triangle is calculated here and stored in the triangle.
+CUDA Kernel parallel on every three vertices i.e. each thread deals with 3 vertices (1 triangle).
+
+* Back-Face Culling
+
+Given the array of primitives, it removes triangles that faces away from the camera according to its normal. The removal uses Thrust::remove_if for maximum efficiency.
+
+#####Performance impact of back-face culling
+
+Model| No Back-face culling| with Back-face culling
+----- | ----- | -----
+bunny (5,800 faces)| 19 FPS | 26 FPS
+dragon (100,000 faces) |16 FPS | 22 FPS
+Tyra (200,000 faces) | 18 FPS | 24 FPS
+
+From the table, we can see, the performance gain of back-face culling is about 40%, independent of number of faces. This makes sense, about 50% of faces are back faces, and the 10% is probably a result of stream compacting
+the primitive array.
+
+* Scanline rasterization
+
+CUDA Kernel parallel on primitive i.e. each thread deals with one triangle. Several tasks are done in this stage. First of all, it transforms all all vertices and normals of the triangle into NDC space.
+Then, it calculates a bounding box of the triangle, and it loops through all fragments within this box. For each fragment, it calculates its barycentric coordiates in the triangle. Fragment will be discarded if
+it's Not in the triangle (according to its barycentric coordinates). If it's in the triangle, using the barycentric coordinates, a correctly interpolated color and normals will be calculated.
+Finally, it checks the pixel this fragment occupies in the depthbuffer and tests whether this fragment is in front of the one in the buffer, if yes, then we swap in this fragment. While this is happening this position
+in the depth buffer is locked. The lock is created using a bufferIsLockFlag array and CUDA atomic exchange function. This ensures there's no conflict of multiple threads trying to read and write to the same location in the
+depthbuffer.
+
 * Fragment Shading
-* A depth buffer for storing and depth testing fragments
-* Fragment to framebuffer writing
-* A simple lighting/shading scheme, such as Lambert or Blinn-Phong, implemented in the fragment shader
-
-You are also required to implement at least 3 of the following features:
-
-* Additional pipeline stages. Each one of these stages can count as 1 feature:
-   * Geometry shader
-   * Transformation feedback
-   * Back-face culling
-   * Scissor test
-   * Stencil test
-   * Blending
-
-IMPORTANT: For each of these stages implemented, you must also add a section to your README stating what the expected performance impact of that pipeline stage is, and real performance comparisons between your rasterizer with that stage and without.
-
-* Correct color interpolation between points on a primitive
-* Texture mapping WITH texture filtering and perspective correct texture coordinates
-* Support for additional primitices. Each one of these can count as HALF of a feature.
-   * Lines
-   * Line strips
-   * Triangle fans
-   * Triangle strips
-   * Points
-* Anti-aliasing
-* Order-independent translucency using a k-buffer
-* MOUSE BASED interactive camera support. Interactive camera support based only on the keyboard is not acceptable for this feature.
 
--------------------------------------------------------------------------------
-BASE CODE TOUR:
--------------------------------------------------------------------------------
-You will be working primarily in two files: rasterizeKernel.cu, and rasterizerTools.h. Within these files, areas that you need to complete are marked with a TODO comment. Areas that are useful to and serve as hints for optional features are marked with TODO (Optional). Functions that are useful for reference are marked with the comment LOOK.
+It calculates the local illumination (diffuse + specular + ambient) of the given fragment. Up to user's choice, it can either treats each triangle as flat or as smooth by using interpolated normal for each fragment.
+CUDA Kernel parallel on each fragmenti.e. each thread deals with one fragment in the depthbuffer.
 
-* rasterizeKernels.cu contains the core rasterization pipeline. 
-	* A suggested sequence of kernels exists in this file, but you may choose to alter the order of this sequence or merge entire kernels if you see fit. For example, if you decide that doing has benefits, you can choose to merge the vertex shader and primitive assembly kernels, or merge the perspective transform into another kernel. There is not necessarily a right sequence of kernels (although there are wrong sequences, such as placing fragment shading before vertex shading), and you may choose any sequence you want. Please document in your README what sequence you choose and why.
-	* The provided kernels have had their input parameters removed beyond basic inputs such as the framebuffer. You will have to decide what inputs should go into each stage of the pipeline, and what outputs there should be. 
+* Render
 
-* rasterizeTools.h contains various useful tools, including a number of barycentric coordinate related functions that you may find useful in implementing scanline based rasterization...
-	* A few pre-made structs are included for you to use, such as fragment and triangle. A simple rasterizer can be implemented with these structs as is. However, as with any part of the basecode, you may choose to modify, add to, use as-is, or outright ignore them as you see fit.
-	* If you do choose to add to the fragment struct, be sure to include in your README a rationale for why. 
+transfers data from depthbuffer to framebuffer. No other fancy stuff happens here, however, this stage will be very useful if there's translucent surface in the scene.
 
-You will also want to familiarize yourself with:
+###Other features:
 
-* main.cpp, which contains code that transfers VBOs/CBOs/IBOs to the rasterization pipeline. Interactive camera work will also have to be implemented in this file if you choose that feature.
-* utilities.h, which serves as a kitchen-sink of useful functions
+* Mouse-based interactive camera
 
--------------------------------------------------------------------------------
-SOME RESOURCES:
--------------------------------------------------------------------------------
-The following resources may be useful for this project:
-
-* High-Performance Software Rasterization on GPUs
-	* Paper (HPG 2011): http://www.tml.tkk.fi/~samuli/publications/laine2011hpg_paper.pdf
-	* Code: http://code.google.com/p/cudaraster/ Note that looking over this code for reference with regard to the paper is fine, but we most likely will not grant any requests to actually incorporate any of this code into your project.
-	* Slides: http://bps11.idav.ucdavis.edu/talks/08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011.pdf
-* The Direct3D 10 System (SIGGRAPH 2006) - for those interested in doing geometry shaders and transform feedback.
-	* http://133.11.9.3/~takeo/course/2006/media/papers/Direct3D10_siggraph2006.pdf
-* Multi-Fragment Eﬀects on the GPU using the k-Buﬀer - for those who want to do a k-buffer
-	* http://www.inf.ufrgs.br/~comba/papers/2007/kbuffer_preprint.pdf
-* FreePipe: A Programmable, Parallel Rendering Architecture for Efficient Multi-Fragment Effects (I3D 2010)
-	* https://sites.google.com/site/hmcen0921/cudarasterizer
-* Writing A Software Rasterizer In Javascript:
-	* Part 1: http://simonstechblog.blogspot.com/2012/04/software-rasterizer-part-1.html
-	* Part 2: http://simonstechblog.blogspot.com/2012/04/software-rasterizer-part-2.html
+Click and drag to rotate, scroll to zoom in and out. 
 
--------------------------------------------------------------------------------
-NOTES ON GLM:
--------------------------------------------------------------------------------
-This project uses GLM, the GL Math library, for linear algebra. You need to know two important points on how GLM is used in this project:
+W,A,S,D to move camera position.
 
-* In this project, indices in GLM vectors (such as vec3, vec4), are accessed via swizzling. So, instead of v[0], v.x is used, and instead of v[1], v.y is used, and so on and so forth.
-* GLM Matrix operations work fine on NVIDIA Fermi cards and later, but pre-Fermi cards do not play nice with GLM matrices. As such, in this project, GLM matrices are replaced with a custom matrix struct, called a cudaMat4, found in cudaMat4.h. A custom function for multiplying glm::vec4s and cudaMat4s is provided as multiplyMV() in intersections.h.
+F to toggle flat/phong shading.
 
--------------------------------------------------------------------------------
-README
--------------------------------------------------------------------------------
-All students must replace or augment the contents of this Readme.md in a clear 
-manner with the following:
+M to toggle shaded/mesh view.
 
-* A brief description of the project and the specific features you implemented.
-* At least one screenshot of your project running.
-* A 30 second or longer video of your project running.  To create the video you
-  can use http://www.microsoft.com/expression/products/Encoder4_Overview.aspx 
-* A performance evaluation (described in detail below).
+It's implemented using GLFW glfwGetMouseButton,glfwGetCursorPos for getting necessary inputs for setting up the camera according to mouse movement.
+
+* Mesh View
+
+This is done at the rasterization stage, where according to the barycentric coordinates of the fragment, it decides whether it's on an edge. Fragments that are not on any edge will be flagged as discarded.
+
+* Vertices View
+
+This is done in rasterization stage, where it only produces fragments that are the vertices of the given triangle.
+
+* Color interpolation
+
+In the rasterization stage, the color of the fragment is interpolated among the 3 vertex colors of the triangle using barycentric coordinates.
+
+
+
+###Performance Analysis
+Graphics Card: NVIDIA GeForce GTX 660
+
+model: stanford bunny ( 5,804 faces )
+
+![] (chart1.jpg)
+
+From the graph, it's clear that when objects are further from the camera the higher FPS. This is because each triangle takes up less pixels, thus the loop through all pixels for each primitives will be much faster.
+
+From other tests, I also found that the number of faces does not matter very much to the FPS, however the size of each triangle matters the most since the rasterization parallels on each triangle.
+Thus the rasterizor is most efficient when there's a large number of triangles and each covers small area on the screen and overall the sizes of triangles has as little variance as possible.
 
--------------------------------------------------------------------------------
-PERFORMANCE EVALUATION
--------------------------------------------------------------------------------
-The performance evaluation is where you will investigate how to make your CUDA
-programs more efficient using the skills you've learned in class. You must have
-performed at least one experiment on your code to investigate the positive or
-negative effects on performance. 
 
-We encourage you to get creative with your tweaks. Consider places in your code
-that could be considered bottlenecks and try to improve them. 
 
-Each student should provide no more than a one page summary of their
-optimizations along with tables and or graphs to visually explain any
-performance differences.
 
--------------------------------------------------------------------------------
-THIRD PARTY CODE POLICY
--------------------------------------------------------------------------------
-* Use of any third-party code must be approved by asking on Piazza.  If it is approved, all students are welcome to use it.  Generally, we approve use of third-party code that is not a core part of the project.  For example, for the ray tracer, we would approve using a third-party library for loading models, but would not approve copying and pasting a CUDA function for doing refraction.
-* Third-party code must be credited in README.md.
-* Using third-party code without its approval, including using another student's code, is an academic integrity violation, and will result in you receiving an F for the semester.
 
--------------------------------------------------------------------------------
-SELF-GRADING
--------------------------------------------------------------------------------
-* On the submission date, email your grade, on a scale of 0 to 100, to Liam, harmoli+cis565@seas.upenn.edu, with a one paragraph explanation.  Be concise and realistic.  Recall that we reserve 30 points as a sanity check to adjust your grade.  Your actual grade will be (0.7 * your grade) + (0.3 * our grade).  We hope to only use this in extreme cases when your grade does not realistically reflect your work - it is either too high or too low.  In most cases, we plan to give you the exact grade you suggest.
-* Projects are not weighted evenly, e.g., Project 0 doesn't count as much as the path tracer.  We will determine the weighting at the end of the semester based on the size of each project.
-
----
-SUBMISSION
----
-As with the previous project, you should fork this project and work inside of
-your fork. Upon completion, commit your finished project back to your fork, and
-make a pull request to the master repository.  You should include a README.md
-file in the root directory detailing the following
-
-* A brief description of the project and specific features you implemented
-* At least one screenshot of your project running.
-* A link to a video of your raytracer running.
-* Instructions for building and running your project if they differ from the
-  base code.
-* A performance writeup as detailed above.
-* A list of all third-party code used.
-* This Readme file edited as described above in the README section.
 
diff --git a/Thumbs.db b/Thumbs.db
new file mode 100644
index 0000000..ea1fbb2
Binary files /dev/null and b/Thumbs.db differ
diff --git a/armaNormal.jpg b/armaNormal.jpg
new file mode 100644
index 0000000..3c4e3a1
Binary files /dev/null and b/armaNormal.jpg differ
diff --git a/bunny1.jpg b/bunny1.jpg
new file mode 100644
index 0000000..9891aca
Binary files /dev/null and b/bunny1.jpg differ
diff --git a/chart1.jpg b/chart1.jpg
new file mode 100644
index 0000000..ae0b385
Binary files /dev/null and b/chart1.jpg differ
diff --git a/colorInterp.jpg b/colorInterp.jpg
new file mode 100644
index 0000000..b282075
Binary files /dev/null and b/colorInterp.jpg differ
diff --git a/cowFlat.jpg b/cowFlat.jpg
new file mode 100644
index 0000000..f48d528
Binary files /dev/null and b/cowFlat.jpg differ
diff --git a/cowPhong.jpg b/cowPhong.jpg
new file mode 100644
index 0000000..1d5f821
Binary files /dev/null and b/cowPhong.jpg differ
diff --git a/donutFlat.jpg b/donutFlat.jpg
new file mode 100644
index 0000000..93648dc
Binary files /dev/null and b/donutFlat.jpg differ
diff --git a/donutMesh.jpg b/donutMesh.jpg
new file mode 100644
index 0000000..2b731e0
Binary files /dev/null and b/donutMesh.jpg differ
diff --git a/donutPhong.jpg b/donutPhong.jpg
new file mode 100644
index 0000000..15bb687
Binary files /dev/null and b/donutPhong.jpg differ
diff --git a/dragonCloud.jpg b/dragonCloud.jpg
new file mode 100644
index 0000000..f027fa6
Binary files /dev/null and b/dragonCloud.jpg differ
diff --git a/external/src/objUtil/obj.cpp b/external/src/objUtil/obj.cpp
index 22a33aa..906acc2 100644
--- a/external/src/objUtil/obj.cpp
+++ b/external/src/objUtil/obj.cpp
@@ -93,7 +93,8 @@ void obj::buildVBOs(){
 	for(int i=0; i<IBOvec.size(); i++){
 		ibo[i] = IBOvec[i];
 	}
-	setColor(glm::vec3(.4,.4,.4));
+
+	setColor(glm::vec3(.9f,0.6f,0.3f));
 }
 
 void obj::compareMaxMin(float x, float y, float z){
diff --git a/renders/Thumbs.db b/renders/Thumbs.db
new file mode 100644
index 0000000..07e023c
Binary files /dev/null and b/renders/Thumbs.db differ
diff --git a/renders/debug normal.jpg b/renders/debug normal.jpg
new file mode 100644
index 0000000..0ccf50a
Binary files /dev/null and b/renders/debug normal.jpg differ
diff --git a/src/main.cpp b/src/main.cpp
index 13d8e67..ef85dec 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -32,19 +32,128 @@ int main(int argc, char** argv){
   frame = 0;
   seconds = time (NULL);
   fpstracker = 0;
-
   // Launch CUDA/GL
   if (init(argc, argv)) {
     // GLFW main loop
     mainLoop();
   }
-
   return 0;
 }
 
+
+void scroll_callback(GLFWwindow* window, double xoffset, double yoffset)
+{
+	mouseScrollOffset += yoffset;
+}
 void mainLoop() {
-  while(!glfwWindowShouldClose(window)){
+ 
+
+ while(!glfwWindowShouldClose(window)){
+
+	//camera rotation, zoom control using mouse
+	double * mouseX = new double;
+	double * mouseY = new double;
+
+	if(glfwGetMouseButton(window,GLFW_MOUSE_BUTTON_LEFT) == GLFW_PRESS)
+	{
+		glfwGetCursorPos(window,mouseX,mouseY);
+		if(mouseButtonIsDown == false)
+		{
+			mouseButtonIsDown = true;
+			mouseClickedX = *mouseX;
+			mouseClickedY = *mouseY;
+		}
+		mouseDeltaX = *mouseX - mouseClickedX;
+		mouseDeltaY = *mouseY - mouseClickedY;
+
+	}
+
+	else
+	{
+		if(mouseButtonIsDown == true)
+		{
+			rotationX += mouseDeltaX;
+			rotationY += mouseDeltaY;
+			mouseButtonIsDown = false;
+			mouseDeltaX = 0.0f;
+			mouseDeltaY = 0.0f;
+		}
+
+	}
+	delete mouseX,mouseY;
+
+	//camera movement control using keyboard
+	if(glfwGetKey(window,GLFW_KEY_W) == GLFW_PRESS)
+	{
+		deltaZ -= cameraMovementIncrement;
+	}
+	if(glfwGetKey(window,GLFW_KEY_S) == GLFW_PRESS)
+	{
+		deltaZ += cameraMovementIncrement;
+	}
+	if(glfwGetKey(window,GLFW_KEY_A) == GLFW_PRESS)
+	{
+		deltaX -= cameraMovementIncrement;
+	}
+	if(glfwGetKey(window,GLFW_KEY_D) == GLFW_PRESS)
+	{
+		deltaX += cameraMovementIncrement;
+	}
+	
+	if(glfwGetKey(window,GLFW_KEY_F) == GLFW_PRESS)
+	{
+		isFkeyDown = true;
+	}
+	if(glfwGetKey(window,GLFW_KEY_F) == GLFW_RELEASE)
+	{
+		if(isFkeyDown)
+		{		
+			if(!isFlatShading) isFlatShading = 1;
+			else isFlatShading = 0;
+		}
+		isFkeyDown = false;
+	}
+
+	if(glfwGetKey(window,GLFW_KEY_M) == GLFW_PRESS)
+	{
+		isMkeyDown = true;
+	}
+	if(glfwGetKey(window,GLFW_KEY_M) == GLFW_RELEASE)
+	{
+		if(isMkeyDown)
+		{		
+			if(!isMeshView) isMeshView = 1;
+			else isMeshView = 0;
+		}
+		isMkeyDown = false;
+	}
+
+
+	//set up transformations
+	float fov_rad = FOV_DEG * PI / 180.0f;
+	float AR = width / height;
+
+	//glm::mat4 ModelTransform =utilityCore::buildTransformationMatrix(glm::vec3(0.0f),glm::vec3(0.0f,0.0f,0.0f),glm::vec3(1.0f));
+	glm::mat4 ModelTransform =utilityCore::buildTransformationMatrix(glm::vec3(0.0f),glm::vec3(-(rotationY + mouseDeltaY),- (rotationX + mouseDeltaX + 10.0f),0.0f),glm::vec3(0.6f));
+
+	glm::mat4 cameraAimTransform = utilityCore::buildTransformationMatrix(glm::vec3(0.0f),glm::vec3(0.0f),glm::vec3(1.0f));
+	//glm::mat4 cameraAimTransform = utilityCore::buildTransformationMatrix(glm::vec3(0.0f),glm::vec3(-(rotationY + mouseDeltaY),- (rotationX + mouseDeltaX + 10.0f),0.0f),glm::vec3(1.0f));
+	glm::mat4 cameraPosTransform = utilityCore::buildTransformationMatrix(glm::vec3(0.0f + deltaX,-.25f+ deltaZ,(2.0f  + MOUSE_SCROLL_SPEED * mouseScrollOffset)),glm::vec3(0.0f),glm::vec3(1.0f));
+	glm::mat4 ViewTransform = cameraAimTransform *cameraPosTransform;
+	//glm::mat4 ViewTransform =utilityCore::buildTransformationMatrix(glm::vec3(0.0f + deltaX,-.25f,2.0f + deltaZ + MOUSE_SCROLL_SPEED * mouseScrollOffset),glm::vec3(-(rotationY + mouseDeltaY),- (rotationX + mouseDeltaX),0.0f),glm::vec3(1.0f));
+	
+	glmViewTransform = ViewTransform;
+	glmProjectionTransform = glm::perspective((float)45.0f,AR, 1.0f,50.0f);
+	glmMVtransform = ViewTransform * ModelTransform;
+
+	//construct light
+	Light.position = glm::vec3(7.0f,2.0f, -10.0f);
+	Light.diffColor = glm::vec3(1.0f);
+	Light.specColor = glm::vec3(1.0f);
+	Light.specExp = 20;
+	Light.ambColor = glm::vec3(0.2f,0.6f,0.3f);
     glfwPollEvents();
+
     runCuda();
 
     time_t seconds2 = time (NULL);
@@ -68,6 +177,7 @@ void mainLoop() {
     glDrawElements(GL_TRIANGLES, 6,  GL_UNSIGNED_SHORT, 0);
     glfwSwapBuffers(window);
   }
+
   glfwDestroyWindow(window);
   glfwTerminate();
 }
@@ -84,22 +194,23 @@ void runCuda(){
   vbo = mesh->getVBO();
   vbosize = mesh->getVBOsize();
 
-  float newcbo[] = {0.0, 1.0, 0.0, 
-                    0.0, 0.0, 1.0, 
-                    1.0, 0.0, 0.0};
-  cbo = newcbo;
-  cbosize = 9;
+  cbo = mesh->getCBO();
+  cbosize = mesh->getCBOsize();
 
   ibo = mesh->getIBO();
   ibosize = mesh->getIBOsize();
 
+  nbo = mesh->getNBO();
+  nbosize = mesh->getNBOsize();
+
   cudaGLMapBufferObject((void**)&dptr, pbo);
-  cudaRasterizeCore(dptr, glm::vec2(width, height), frame, vbo, vbosize, cbo, cbosize, ibo, ibosize);
+  cudaRasterizeCore(dptr, glm::vec2(width, height), frame, vbo, vbosize, cbo, cbosize, ibo, ibosize,nbo,nbosize, glmViewTransform, glmProjectionTransform,glmMVtransform,Light, isFlatShading,isMeshView);
   cudaGLUnmapBufferObject(pbo);
 
   vbo = NULL;
   cbo = NULL;
   ibo = NULL;
+  nbo = NULL;
 
   frame++;
   fpstracker++;
@@ -119,7 +230,7 @@ bool init(int argc, char* argv[]) {
 
   width = 800;
   height = 800;
-  window = glfwCreateWindow(width, height, "CIS 565 Pathtracer", NULL, NULL);
+  window = glfwCreateWindow(width, height, "CUDA rasterizer", NULL, NULL);
   if (!window){
       glfwTerminate();
       return false;
@@ -145,6 +256,8 @@ bool init(int argc, char* argv[]) {
   glUseProgram(passthroughProgram);
   glActiveTexture(GL_TEXTURE0);
 
+  glfwSetScrollCallback(window, scroll_callback);
+
   return true;
 }
 
diff --git a/src/main.h b/src/main.h
index 8999110..1002c20 100644
--- a/src/main.h
+++ b/src/main.h
@@ -3,6 +3,7 @@
 
 #ifndef MAIN_H
 #define MAIN_H
+#include <stdlib.h>
 
 #include <GL/glew.h>
 #include <GLFW/glfw3.h>
@@ -12,10 +13,10 @@
 #include <fstream>
 #include <glm/glm.hpp>
 #include <glslUtil/glslUtility.hpp>
+#include "glm/gtc/matrix_transform.hpp"
 #include <iostream>
 #include <objUtil/objloader.h>
 #include <sstream>
-#include <stdlib.h>
 #include <string>
 #include <time.h>
 
@@ -25,6 +26,36 @@
 
 using namespace std;
 
+#define FOV_DEG 30
+#define MOUSE_SCROLL_SPEED 0.1f
+
+light Light;
+
+//transformations
+glm::mat4 glmViewTransform;
+glm::mat4 glmProjectionTransform;
+glm::mat4 glmMVtransform;
+
+//mouse control stuff
+bool mouseButtonIsDown = false;
+float mouseScrollOffset = 0.0f;
+double mouseClickedX = 0.0f;
+double mouseClickedY = 0.0f;
+double rotationX = 0.0f;
+double rotationY = 0.0f;
+double mouseDeltaX = 0.0f;
+double mouseDeltaY = 0.0f;
+//toggle view
+bool isFkeyDown = false;
+int isFlatShading = false;
+bool isMkeyDown = false;
+int isMeshView = false;
+
+//keyboard control
+double deltaX = 0.0f;
+double deltaZ = 0.0f;
+double cameraMovementIncrement = 0.015f;
+
 //-------------------------------
 //------------GL STUFF-----------
 //-------------------------------
@@ -49,6 +80,8 @@ float* cbo;
 int cbosize;
 int* ibo;
 int ibosize;
+float* nbo;
+int nbosize;
 
 //-------------------------------
 //----------CUDA STUFF-----------
diff --git a/src/rasterizeKernels.cu b/src/rasterizeKernels.cu
index 10b0000..975085e 100644
--- a/src/rasterizeKernels.cu
+++ b/src/rasterizeKernels.cu
@@ -8,21 +8,49 @@
 #include "rasterizeKernels.h"
 #include "rasterizeTools.h"
 
+#include <thrust/remove.h>
+#include <thrust/count.h>
+#include <thrust/device_ptr.h>
+
+#define BLOCK_SIZE 16
+#define DEBUG_VERTICES 0
+#define DEBUG_NORMALS 0
+#define DEBUG_DEPTH 0
+#define SPECULAR_EXP 6
+#define COLOR_INTERPOLATION_MODE 0
+#define BACKFACE_CULLING 0
+#define SHADING_RATE 1.0f
+
 glm::vec3* framebuffer;
 fragment* depthbuffer;
+int * depthBufferLock;
 float* device_vbo;
 float* device_cbo;
 int* device_ibo;
+float * device_nbo;
 triangle* primitives;
 
+cudaMat4 * projectionTransform;
+cudaMat4 * MVtransform;
+cudaMat4 * MVPtransform;
+
 void checkCUDAError(const char *msg) {
   cudaError_t err = cudaGetLastError();
   if( cudaSuccess != err) {
     fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) ); 
-    exit(EXIT_FAILURE); 
+   // exit(EXIT_FAILURE); 
   }
 } 
 
+//fast initializor for int array
+__global__ void initiateArray(int * array, int val, int num)
+{
+    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+	if(index < num)
+	{
+		array[index] = val;
+	}
+}
 //Handy dandy little hashing function that provides seeds for random number generation
 __host__ __device__ unsigned int hash(unsigned int a){
     a = (a+0x7ed55d16) + (a<<12);
@@ -88,8 +116,6 @@ __global__ void clearDepthBuffer(glm::vec2 resolution, fragment* buffer, fragmen
     int index = x + (y * resolution.x);
     if(x<=resolution.x && y<=resolution.y){
       fragment f = frag;
-      f.position.x = x;
-      f.position.y = y;
       buffer[index] = f;
     }
 }
@@ -128,34 +154,283 @@ __global__ void sendImageToPBO(uchar4* PBOpos, glm::vec2 resolution, glm::vec3*
   }
 }
 
-//TODO: Implement a vertex shader
-__global__ void vertexShadeKernel(float* vbo, int vbosize){
+//vertex shader
+__global__ void vertexShadeKernel(float* vbo, int vbosize, float * nbo, int nbosize, cudaMat4 * MV){
   int index = (blockIdx.x * blockDim.x) + threadIdx.x;
   if(index<vbosize/3){
+
+	  glm::vec4 newP = multiplyMV4(*MV,glm::vec4(vbo[3*index],vbo[3*index + 1],vbo[3*index +2], 1.0f)); 
+	  glm::vec4 newN = glm::normalize(multiplyMV4(*MV,glm::vec4(nbo[3*index],nbo[3*index + 1],nbo[3*index +2], 0.0f))); 
+	  
+	  vbo[3*index] = newP.x;
+	  vbo[3*index+1] = newP.y;
+	  vbo[3*index+2] = newP.z;
+
+	  nbo[3*index] = newN.x;
+	  nbo[3*index+1] = newN.y;
+	  nbo[3*index+2] = newN.z;
   }
 }
 
-//TODO: Implement primative assembly
-__global__ void primitiveAssemblyKernel(float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize, triangle* primitives){
+//Primative assembly
+__global__ void primitiveAssemblyKernel(float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize, float * nbo, int nbosize, triangle* primitives){
   int index = (blockIdx.x * blockDim.x) + threadIdx.x;
   int primitivesCount = ibosize/3;
   if(index<primitivesCount){
+
+	  int i0,i1,i2;
+
+	  i0 = 3 * ibo[3*index];
+	  i1 = 3 * ibo[3*index + 1];
+	  i2 = 3 * ibo[3*index + 2];
+
+	  primitives[index].p0 = glm::vec3(vbo[i0],vbo[i0+1],vbo[i0+2]);
+	  primitives[index].p1 = glm::vec3(vbo[i1],vbo[i1+1],vbo[i1+2]);
+	  primitives[index].p2 = glm::vec3(vbo[i2],vbo[i2+1],vbo[i2+2]);
+#if(COLOR_INTERPOLATION_MODE)
+	  primitives[index].c0 = glm::vec3(1.0f,0.0f,0.0f);
+	  primitives[index].c1 = glm::vec3(0.0f,1.0f,0.0f);
+	  primitives[index].c2 = glm::vec3(0.0f,0.0f,1.0f);
+#else
+	  primitives[index].c0 = glm::vec3(cbo[i0],cbo[i0+1],cbo[i0+2]);
+	  primitives[index].c1 = glm::vec3(cbo[i1],cbo[i1+1],cbo[i1+2]);
+	  primitives[index].c2 = glm::vec3(cbo[i2],cbo[i2+1],cbo[i2+2]);
+#endif
+
+	  primitives[index].n0 = glm::vec3(nbo[i0],nbo[i0+1],nbo[i0+2]);
+	  primitives[index].n1 = glm::vec3(nbo[i1],nbo[i1+1],nbo[i1+2]);
+	  primitives[index].n2 = glm::vec3(nbo[i2],nbo[i2+1],nbo[i2+2]);
+
+	  primitives[index].flatNormal = glm::normalize(glm::cross(primitives[index].p1 - primitives[index].p0, primitives[index].p2 - primitives[index].p1));
+	  
+
   }
 }
 
-//TODO: Implement a rasterization method, such as scanline.
-__global__ void rasterizationKernel(triangle* primitives, int primitivesCount, fragment* depthbuffer, glm::vec2 resolution){
-  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
-  if(index<primitivesCount){
-  }
+struct isBackFace
+{
+	__host__ __device__ bool operator()(const triangle tri)
+	{
+		if(tri.flatNormal.z > 0.001f) return true;
+		return false;
+	}
+};
+
+//back-face cull
+int backFaceCull(triangle * primitives, int num)
+{
+	thrust::device_ptr<triangle> dev_primitives(primitives);
+	int ret = thrust::remove_if(dev_primitives,dev_primitives + num, isBackFace()) - dev_primitives;
+	return ret;
+}
+
+//rasterization
+__global__ void rasterizationKernel(triangle* primitives, int primitivesCount, fragment* depthbuffer, int * depthBufferLock, glm::vec2 resolution, cudaMat4 * Ptransform, int isFlatShading, int isMeshView){
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+	if(index<primitivesCount){
+		triangle originalTri =  primitives[index];
+		triangle tri = primitives[index];
+		
+		//to clip coordinates
+		glm::vec4 clipP0 = multiplyMV4(*Ptransform,glm::vec4(tri.p0,1.0f));
+		glm::vec4 clipP1 = multiplyMV4(*Ptransform,glm::vec4(tri.p1,1.0f));
+		glm::vec4 clipP2 = multiplyMV4(*Ptransform,glm::vec4(tri.p2,1.0f));
+		//to NDC
+		tri.p0 = glm::vec3(clipP0.x / clipP0.w,clipP0.y / clipP0.w,clipP0.z / clipP0.w);
+		tri.p1 = glm::vec3(clipP1.x / clipP1.w,clipP1.y / clipP1.w,clipP1.z / clipP1.w);
+		tri.p2 = glm::vec3(clipP2.x / clipP2.w,clipP2.y / clipP2.w,clipP2.z / clipP2.w);
+
+#if(DEBUG_VERTICES)
+		//Only display vertices
+		int x0,y0,x1,y1,x2,y2, P0,P1,P2;
+		x0 = tri.p0.x * 0.5f * resolution.x + 0.5f * resolution.x;
+		x1 = tri.p1.x * 0.5f * resolution.x + 0.5f * resolution.x;
+		x2 = tri.p2.x * 0.5f * resolution.x + 0.5f * resolution.x;
+
+		y0 = tri.p0.y * 0.5f * resolution.y + 0.5f * resolution.y;
+		y1 = tri.p1.y * 0.5f * resolution.y + 0.5f * resolution.y;
+		y2 = tri.p2.y * 0.5f * resolution.y + 0.5f * resolution.y;
+
+		//reversed y 
+		P0 = x0 + (y0) * resolution.x;
+		P1 = x1 + (y1) * resolution.x;
+		P2 = x2 + (y2) * resolution.x;
+
+		int totalPixel = resolution.x * resolution.y;
+		  
+		if(x0 > 0 && y0 > 0 && x0 < resolution.x && y0 < resolution.y) depthbuffer[P0].color = tri.c0;
+		if(x1 > 0 && y1 > 0 && x1 < resolution.x && y1 < resolution.y) depthbuffer[P1].color = tri.c1;
+		if(x2 > 0 && y2 > 0 && x2 < resolution.x && y2 < resolution.y) depthbuffer[P2].color = tri.c2;
+		return;
+
+#else
+
+		//Full rasterization
+		float epsilon = 0.035f; // for mesh line
+
+		int totalPixel = resolution.x * resolution.y;
+		float halfResoX =  0.5f * (float) resolution.x;
+		float halfResoY =  0.5f * (float) resolution.y;
+
+		glm::vec3 Min,Max;
+		getAABBForTriangle(tri,Min,Max);
+		float pixelWidth = 1.0f/(float) resolution.x;
+		float pixelHeight = 1.0f/(float) resolution.y;
+
+
+		//loop thru all pixels in the bounding box
+		for(float i = 0;i < (Max.x - Min.x)/pixelWidth + 1.0f; i+=SHADING_RATE)
+		{
+			for(float j = 0;j <(Max.y - Min.y)/pixelHeight + 1.0f; j+=SHADING_RATE)
+			{
+
+				glm::vec2 pixelPos = glm::vec2(Min.x + i * pixelWidth, Min.y + j * pixelHeight);
+				glm::vec3 pixelBaryPos = calculateBarycentricCoordinate(tri, pixelPos);
+				
+				fragment frag;
+				frag.isEmpty = false;
+				frag.isFlat = false;
+
+				//not in triangle
+				if(!isBarycentricCoordInBounds(pixelBaryPos))
+				{
+					continue;
+				}
+				//in triangle
+				else
+				{
+					int x,y, pixelIndex;
+					//viewport transformation
+					x = pixelPos.x * halfResoX + halfResoX;
+					y = pixelPos.y  * halfResoY+ halfResoY;
+					if(x < 0 || y < 0 || x > resolution.x || y > resolution.y) continue;
+
+					pixelIndex = x + y * resolution.x;
+
+					//calculate fragment positions in both NDC and camera space
+					frag.position = pixelBaryPos.x * tri.p0 + pixelBaryPos.y * tri.p1 + pixelBaryPos.z * tri.p2;
+					frag.cameraSpacePosition = pixelBaryPos.x * originalTri.p0 + pixelBaryPos.y * originalTri.p1 + pixelBaryPos.z * originalTri.p2;
+
+					//calculate normal, either flat or Phong
+					if(isFlatShading) frag.normal = originalTri.flatNormal; 
+					else frag.normal = pixelBaryPos.x * originalTri.n0 + pixelBaryPos.y * originalTri.n1 + pixelBaryPos.z * originalTri.n2;
+
+					frag.color = pixelBaryPos.x * tri.c0 + pixelBaryPos.y * tri.c1 + pixelBaryPos.z * tri.c2;
+
+					if(isMeshView)
+					{
+						float Lyz = glm::length(tri.p1 - tri.p2);float Lxz = glm::length(tri.p0 - tri.p2);float Lxy = glm::length(tri.p0 - tri.p1);
+						if(abs(pixelBaryPos.x)  < epsilon ||abs(pixelBaryPos.y)  < epsilon ||abs(pixelBaryPos.z)  < epsilon)
+						{
+							frag.color = glm::vec3(0.0f,0.0f,1.0f);
+							frag.isFlat = true;
+						}
+
+						else
+						{
+							frag.isEmpty = true;
+							frag.position.z = - 10000.0f;
+						}
+					}
+
+					if(frag.isEmpty) frag.position.z = - 10000.0f;
+
+					//test depth in the buffer and swap if greater, have to lock when testing
+					bool shouldWait = true;
+					while(shouldWait)
+					{
+						if( atomicExch(&depthBufferLock[pixelIndex],1) == 0)
+						{
+
+							if(frag.position.z > depthbuffer[pixelIndex].position.z) //TODO change to atomic compare
+							{
+								depthbuffer[pixelIndex] = frag;
+							}
+							shouldWait = false;
+							depthBufferLock[pixelIndex] = 0;
+						}
+					}
+
+
+				}
+			}
+		}
+#endif
+	}
 }
 
-//TODO: Implement a fragment shader
-__global__ void fragmentShadeKernel(fragment* depthbuffer, glm::vec2 resolution){
+
+//fragment shader
+__global__ void fragmentShadeKernel(fragment* depthbuffer, light rawLight, glm::vec2 resolution,cudaMat4 viewTransform){
   int x = (blockIdx.x * blockDim.x) + threadIdx.x;
   int y = (blockIdx.y * blockDim.y) + threadIdx.y;
   int index = x + (y * resolution.x);
   if(x<=resolution.x && y<=resolution.y){
+
+	  fragment f = depthbuffer[index];
+
+//debug views
+#if(DEBUG_VERTICES) 
+	  return;
+#endif
+#if(DEBUG_NORMALS)
+	  if(f.isEmpty) depthbuffer[index].color = glm::vec3(0.0f);
+
+	  else 
+	  {
+		  glm::vec3 normalColor = depthbuffer[index].normal;
+		  normalColor.z *= -1.0f;
+	      depthbuffer[index].color = normalColor;
+	  }
+	  return;
+#endif
+#if(DEBUG_DEPTH)
+	  if(f.isEmpty) depthbuffer[index].color = glm::vec3(0.0f);
+	  else depthbuffer[index].color = glm::vec3((1.5f - depthbuffer[index].cameraSpacePosition.z)/1.5f);
+	  return;
+#endif
+
+	  float diffCoe = 0.60f;
+	  float specCoe = 0.35f;
+	  float ambCoe = glm::clamp(1.0f - diffCoe - specCoe,0.0f,1.0f);
+
+	  light Light = rawLight;
+
+	  Light.position = multiplyMV(viewTransform, glm::vec4(rawLight.position,1.0f)); 
+
+
+	  if(f.isEmpty) 
+	  {
+		  depthbuffer[index].color = Light.ambColor;
+		  return;
+	  }
+
+	  if(f.isFlat)
+	  {
+		  depthbuffer[index].color = f.coverage * f.color + (1.0f - f.coverage) * Light.ambColor;
+		  return;
+	  }
+
+	  glm::vec3 surfacePos = f.cameraSpacePosition;
+	  glm::vec3 surfaceNormal = f.normal;
+
+	  glm::vec3 L = glm::normalize(Light.position - surfacePos);
+
+	  //diffuse shading
+	  float diffCom = glm::dot(L,surfaceNormal);
+	  diffCom = glm::clamp(diffCom,0.0f,1.0f);
+
+	  //specular
+	  glm::vec3 R = glm::normalize(glm::reflect(-L,surfaceNormal));
+	  glm::vec3 V = - glm::normalize(surfacePos);
+
+	  float specCom;
+	  if(glm::dot(L,surfaceNormal) <0.0f) specCom = 0.0f; 
+	  else specCom = pow( glm::dot( V, R), Light.specExp);
+	  specCom = glm::clamp(specCom,0.0f,1.0f);
+
+	  depthbuffer[index].color = (f.coverage * diffCoe * diffCom * Light.diffColor * f.color  +  specCoe * specCom * Light.specColor + ambCoe * Light.ambColor) + (1.0f - f.coverage) * Light.ambColor;
+
   }
 }
 
@@ -171,88 +446,147 @@ __global__ void render(glm::vec2 resolution, fragment* depthbuffer, glm::vec3* f
   }
 }
 
-// Wrapper for the __global__ call that sets up the kernel calls and does a ton of memory management
-void cudaRasterizeCore(uchar4* PBOpos, glm::vec2 resolution, float frame, float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize){
+// Wrapper for the __global__ call that sets up the kernel calls and memory management
+void cudaRasterizeCore(uchar4* PBOpos, glm::vec2 resolution, float frame, float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize, float * nbo, int nbosize,glm::mat4 glmViewTransform,glm::mat4 glmProjectionTransform, glm::mat4 glmMVtransform,light Light, int isFlatShading, int isMeshView){
+	
+	projectionTransform = new cudaMat4;
+	MVtransform = new cudaMat4;
+	MVPtransform = new cudaMat4;
+
+	cudaMat4 * dev_projectionTransform;
+	cudaMat4 * dev_MVtransform;
+	cudaMat4 * dev_MVPtransform;
+  
+	*projectionTransform = utilityCore::glmMat4ToCudaMat4(glmProjectionTransform);
+	*MVtransform = utilityCore::glmMat4ToCudaMat4(glmMVtransform);
+	*MVPtransform =utilityCore::glmMat4ToCudaMat4(glmProjectionTransform * glmMVtransform);
+
+	cudaMalloc((void**) & dev_projectionTransform, sizeof(cudaMat4));
+	cudaMalloc((void**) & dev_MVtransform, sizeof(cudaMat4));
+	cudaMalloc((void**) & dev_MVPtransform, sizeof(cudaMat4));
+
+	cudaMemcpy(dev_projectionTransform,projectionTransform,sizeof(cudaMat4),cudaMemcpyHostToDevice);
+	cudaMemcpy(dev_MVtransform,MVtransform,sizeof(cudaMat4),cudaMemcpyHostToDevice);
+	cudaMemcpy(dev_MVPtransform,MVPtransform,sizeof(cudaMat4),cudaMemcpyHostToDevice);
+
+	cudaMat4 inverseMV = utilityCore::glmMat4ToCudaMat4(glm::inverse(glmMVtransform));
+	cudaMat4 inverseViewTransform = utilityCore::glmMat4ToCudaMat4(glm::inverse(glmViewTransform));
+	cudaMat4 viewTransform = utilityCore::glmMat4ToCudaMat4(glmViewTransform);
+
 
-  // set up crucial magic
-  int tileSize = 8;
-  dim3 threadsPerBlock(tileSize, tileSize);
-  dim3 fullBlocksPerGrid((int)ceil(float(resolution.x)/float(tileSize)), (int)ceil(float(resolution.y)/float(tileSize)));
+	// set up thread configuration
+	int tileSize = 8;
+	dim3 threadsPerBlock(tileSize, tileSize);
+	dim3 fullBlocksPerGrid((int)ceil(float(resolution.x)/float(tileSize)), (int)ceil(float(resolution.y)/float(tileSize)));
 
-  //set up framebuffer
-  framebuffer = NULL;
-  cudaMalloc((void**)&framebuffer, (int)resolution.x*(int)resolution.y*sizeof(glm::vec3));
+	//set up framebuffer
+	framebuffer = NULL;
+	cudaMalloc((void**)&framebuffer, (int)resolution.x*(int)resolution.y*sizeof(glm::vec3));
   
-  //set up depthbuffer
-  depthbuffer = NULL;
-  cudaMalloc((void**)&depthbuffer, (int)resolution.x*(int)resolution.y*sizeof(fragment));
+	//set up depthbuffer
+	depthbuffer = NULL;
+	cudaMalloc((void**)&depthbuffer, (int)resolution.x*(int)resolution.y*sizeof(fragment));
 
-  //kernel launches to black out accumulated/unaccumlated pixel buffers and clear our scattering states
-  clearImage<<<fullBlocksPerGrid, threadsPerBlock>>>(resolution, framebuffer, glm::vec3(0,0,0));
+	//kernel launches to black out accumulated/unaccumlated pixel buffers and clear our scattering states
+	clearImage<<<fullBlocksPerGrid, threadsPerBlock>>>(resolution, framebuffer, glm::vec3(0,0,0));
   
-  fragment frag;
-  frag.color = glm::vec3(0,0,0);
-  frag.normal = glm::vec3(0,0,0);
-  frag.position = glm::vec3(0,0,-10000);
-  clearDepthBuffer<<<fullBlocksPerGrid, threadsPerBlock>>>(resolution, depthbuffer,frag);
-
-  //------------------------------
-  //memory stuff
-  //------------------------------
-  primitives = NULL;
-  cudaMalloc((void**)&primitives, (ibosize/3)*sizeof(triangle));
-
-  device_ibo = NULL;
-  cudaMalloc((void**)&device_ibo, ibosize*sizeof(int));
-  cudaMemcpy( device_ibo, ibo, ibosize*sizeof(int), cudaMemcpyHostToDevice);
-
-  device_vbo = NULL;
-  cudaMalloc((void**)&device_vbo, vbosize*sizeof(float));
-  cudaMemcpy( device_vbo, vbo, vbosize*sizeof(float), cudaMemcpyHostToDevice);
-
-  device_cbo = NULL;
-  cudaMalloc((void**)&device_cbo, cbosize*sizeof(float));
-  cudaMemcpy( device_cbo, cbo, cbosize*sizeof(float), cudaMemcpyHostToDevice);
-
-  tileSize = 32;
-  int primitiveBlocks = ceil(((float)vbosize/3)/((float)tileSize));
-
-  //------------------------------
-  //vertex shader
-  //------------------------------
-  vertexShadeKernel<<<primitiveBlocks, tileSize>>>(device_vbo, vbosize);
-
-  cudaDeviceSynchronize();
-  //------------------------------
-  //primitive assembly
-  //------------------------------
-  primitiveBlocks = ceil(((float)ibosize/3)/((float)tileSize));
-  primitiveAssemblyKernel<<<primitiveBlocks, tileSize>>>(device_vbo, vbosize, device_cbo, cbosize, device_ibo, ibosize, primitives);
-
-  cudaDeviceSynchronize();
-  //------------------------------
-  //rasterization
-  //------------------------------
-  rasterizationKernel<<<primitiveBlocks, tileSize>>>(primitives, ibosize/3, depthbuffer, resolution);
-
-  cudaDeviceSynchronize();
-  //------------------------------
-  //fragment shader
-  //------------------------------
-  fragmentShadeKernel<<<fullBlocksPerGrid, threadsPerBlock>>>(depthbuffer, resolution);
-
-  cudaDeviceSynchronize();
-  //------------------------------
-  //write fragments to framebuffer
-  //------------------------------
-  render<<<fullBlocksPerGrid, threadsPerBlock>>>(resolution, depthbuffer, framebuffer);
-  sendImageToPBO<<<fullBlocksPerGrid, threadsPerBlock>>>(PBOpos, resolution, framebuffer);
-
-  cudaDeviceSynchronize();
-
-  kernelCleanup();
-
-  checkCUDAError("Kernel failed!");
+	fragment frag;
+	frag.color = glm::vec3(0,0,0);
+	frag.normal = glm::vec3(0,0,0);
+	frag.position = glm::vec3(0,0,-10000);
+	frag.cameraSpacePosition = glm::vec3(0,0,-10000);
+	frag.isEmpty = true;
+	frag.isFlat = false;
+	frag.coverage = 1.0f;
+	clearDepthBuffer<<<fullBlocksPerGrid, threadsPerBlock>>>(resolution, depthbuffer,frag);
+
+	//------------------------------
+	//memory stuff
+	//------------------------------
+	primitives = NULL;
+	cudaMalloc((void**)&primitives, (ibosize/3)*sizeof(triangle));
+
+	device_ibo = NULL;
+	cudaMalloc((void**)&device_ibo, ibosize*sizeof(int));
+	cudaMemcpy( device_ibo, ibo, ibosize*sizeof(int), cudaMemcpyHostToDevice);
+
+	device_vbo = NULL;
+	cudaMalloc((void**)&device_vbo, vbosize*sizeof(float));
+	cudaMemcpy( device_vbo, vbo, vbosize*sizeof(float), cudaMemcpyHostToDevice);
+
+	device_cbo = NULL;
+	cudaMalloc((void**)&device_cbo, cbosize*sizeof(float));
+	cudaMemcpy( device_cbo, cbo, cbosize*sizeof(float), cudaMemcpyHostToDevice);
+
+	device_nbo = NULL;
+	cudaMalloc((void**)&device_nbo, nbosize*sizeof(float));
+	cudaMemcpy( device_nbo, nbo, nbosize*sizeof(float), cudaMemcpyHostToDevice);
+
+	int depthBufferLockSize = resolution.x * resolution.y;
+	depthBufferLock = NULL;
+	cudaMalloc((void**)&depthBufferLock, depthBufferLockSize * sizeof(int));
+	initiateArray<<<dim3(ceil((float)depthBufferLockSize/((float)512))),dim3(512)>>>(depthBufferLock,0,depthBufferLockSize);
+
+	int primitiveNum(ibosize/3);
+	tileSize = 32;
+	int primitiveBlocks = ceil(((float)vbosize/((float)3))/((float)tileSize));
+
+	//------------------------------
+	//vertex shader
+	//------------------------------
+	vertexShadeKernel<<<primitiveBlocks, tileSize>>>(device_vbo, vbosize,device_nbo,nbosize,dev_MVtransform);
+	checkCUDAError("vertex shader failed!");
+
+	cudaDeviceSynchronize();
+	//------------------------------
+	//primitive assembly
+	//------------------------------
+	primitiveBlocks = ceil(((float)ibosize/3)/((float)tileSize));
+	primitiveAssemblyKernel<<<primitiveBlocks, tileSize>>>(device_vbo, vbosize, device_cbo, cbosize, device_ibo, ibosize, device_nbo,nbosize, primitives);
+	checkCUDAError("primitive assembly failed!");
+
+	cudaDeviceSynchronize();
+
+#if(BACKFACE_CULLING)
+	//------------------------------
+	//Back Face Cull
+	//------------------------------
+	primitiveNum = backFaceCull(primitives,ibosize/3);
+	primitiveBlocks = ceil(((float)primitiveNum)/((float)tileSize));
+#endif
+
+	//------------------------------
+	//rasterization
+	//------------------------------
+	rasterizationKernel<<<primitiveBlocks, tileSize>>>(primitives, primitiveNum, depthbuffer, depthBufferLock, resolution,dev_projectionTransform, isFlatShading, isMeshView);
+	checkCUDAError("rasterization failed!");
+
+	cudaDeviceSynchronize();
+	//------------------------------
+	//fragment shader
+	//------------------------------
+	fragmentShadeKernel<<<fullBlocksPerGrid, threadsPerBlock>>>(depthbuffer, Light, resolution,viewTransform);
+	checkCUDAError("fragment shader failed!");
+
+	cudaDeviceSynchronize();
+	//------------------------------
+	//write fragments to framebuffer
+	//------------------------------
+	render<<<fullBlocksPerGrid, threadsPerBlock>>>(resolution, depthbuffer, framebuffer);
+	sendImageToPBO<<<fullBlocksPerGrid, threadsPerBlock>>>(PBOpos, resolution, framebuffer);
+
+	cudaDeviceSynchronize();
+
+	kernelCleanup();
+
+	cudaFree(dev_projectionTransform);
+	cudaFree(dev_MVtransform);
+	cudaFree(dev_MVPtransform);
+
+	delete projectionTransform,MVtransform,MVPtransform;
+
+
+	checkCUDAError("cuda core failed!");
 }
 
 void kernelCleanup(){
@@ -260,7 +594,9 @@ void kernelCleanup(){
   cudaFree( device_vbo );
   cudaFree( device_cbo );
   cudaFree( device_ibo );
+  cudaFree( device_nbo );
   cudaFree( framebuffer );
   cudaFree( depthbuffer );
+  cudaFree( depthBufferLock );
 }
 
diff --git a/src/rasterizeKernels.h b/src/rasterizeKernels.h
index 784be17..6da6d3c 100644
--- a/src/rasterizeKernels.h
+++ b/src/rasterizeKernels.h
@@ -10,7 +10,16 @@
 #include <cmath>
 #include "glm/glm.hpp"
 
+
+struct light{
+	glm::vec3 position;
+	glm::vec3 diffColor;
+	glm::vec3 specColor;
+	int specExp;
+	glm::vec3 ambColor;
+};
+
 void kernelCleanup();
-void cudaRasterizeCore(uchar4* pos, glm::vec2 resolution, float frame, float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize);
+void cudaRasterizeCore(uchar4* pos, glm::vec2 resolution, float frame, float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize, float * nbo, int nbosize,glm::mat4 glmViewTransform, glm::mat4 glmProjectionTransform, glm::mat4 glmMVtransform,light Light,int isFlatShading, int isMeshView);
 
 #endif //RASTERIZEKERNEL_H
diff --git a/src/rasterizeTools.h b/src/rasterizeTools.h
index e9b5dcc..28773cf 100644
--- a/src/rasterizeTools.h
+++ b/src/rasterizeTools.h
@@ -16,14 +16,23 @@ struct triangle {
   glm::vec3 c0;
   glm::vec3 c1;
   glm::vec3 c2;
+  glm::vec3 n0;
+  glm::vec3 n1;
+  glm::vec3 n2;
+  glm::vec3 flatNormal;
 };
 
 struct fragment{
   glm::vec3 color;
   glm::vec3 normal;
   glm::vec3 position;
+  glm::vec3 cameraSpacePosition;
+  bool isEmpty;
+  bool isFlat;
+  float coverage;
 };
 
+
 //Multiplies a cudaMat4 matrix and a vec4
 __host__ __device__ glm::vec3 multiplyMV(cudaMat4 m, glm::vec4 v){
   glm::vec3 r(1,1,1);
@@ -33,6 +42,16 @@ __host__ __device__ glm::vec3 multiplyMV(cudaMat4 m, glm::vec4 v){
   return r;
 }
 
+//Multiplies a cudaMat4 matrix and a vec4 and return vec4
+__host__ __device__ glm::vec4 multiplyMV4(cudaMat4 m, glm::vec4 v){
+  glm::vec4 r(1,1,1,1);
+  r.x = (m.x.x*v.x)+(m.x.y*v.y)+(m.x.z*v.z)+(m.x.w*v.w);
+  r.y = (m.y.x*v.x)+(m.y.y*v.y)+(m.y.z*v.z)+(m.y.w*v.w);
+  r.z = (m.z.x*v.x)+(m.z.y*v.y)+(m.z.z*v.z)+(m.z.w*v.w);
+  r.w = (m.w.x*v.x)+(m.w.y*v.y)+(m.w.z*v.z)+(m.w.w*v.w);
+  return r;
+}
+
 //LOOK: finds the axis aligned bounding box for a given triangle
 __host__ __device__ void getAABBForTriangle(triangle tri, glm::vec3& minpoint, glm::vec3& maxpoint){
   minpoint = glm::vec3(min(min(tri.p0.x, tri.p1.x),tri.p2.x), 
diff --git a/tyra1.jpg b/tyra1.jpg
new file mode 100644
index 0000000..a171762
Binary files /dev/null and b/tyra1.jpg differ
diff --git a/windows/PROJ4_Rasterizer/PROJ4_Rasterizer/PROJ4_Rasterizer.vcxproj b/windows/PROJ4_Rasterizer/PROJ4_Rasterizer/PROJ4_Rasterizer.vcxproj
index f640485..06f6166 100644
--- a/windows/PROJ4_Rasterizer/PROJ4_Rasterizer/PROJ4_Rasterizer.vcxproj
+++ b/windows/PROJ4_Rasterizer/PROJ4_Rasterizer/PROJ4_Rasterizer.vcxproj
@@ -28,7 +28,7 @@
   </PropertyGroup>
   <Import Project="$(VCTargetsPath)\Microsoft.Cpp.props" />
   <ImportGroup Label="ExtensionSettings">
-    <Import Project="$(VCTargetsPath)\BuildCustomizations\CUDA 5.5.props" />
+    <Import Project="$(VCTargetsPath)\BuildCustomizations\CUDA 6.5.props" />
   </ImportGroup>
   <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
     <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
@@ -65,6 +65,12 @@
       <AdditionalLibraryDirectories>$(SolutionDir)..\..\external\lib\win\GLFW;$(SolutionDir)..\..\external\lib\win\GL;%(AdditionalLibraryDirectories)</AdditionalLibraryDirectories>
       <AdditionalDependencies>cudart.lib;glew32s.lib;glfw3.lib;opengl32.lib;kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies)</AdditionalDependencies>
     </Link>
+    <CudaCompile>
+      <GPUDebugInfo>false</GPUDebugInfo>
+    </CudaCompile>
+    <CudaCompile>
+      <GenerateLineInfo>false</GenerateLineInfo>
+    </CudaCompile>
   </ItemDefinitionGroup>
   <ItemGroup>
     <ClInclude Include="..\..\..\src\cudaMat4.h" />
@@ -87,6 +93,6 @@
   </ItemGroup>
   <Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
   <ImportGroup Label="ExtensionTargets">
-    <Import Project="$(VCTargetsPath)\BuildCustomizations\CUDA 5.5.targets" />
+    <Import Project="$(VCTargetsPath)\BuildCustomizations\CUDA 6.5.targets" />
   </ImportGroup>
 </Project>
\ No newline at end of file