diff --git a/README.md b/README.md
index ae0896a..16ba874 100644
--- a/README.md
+++ b/README.md
@@ -1,184 +1,97 @@
 -------------------------------------------------------------------------------
 CIS565: Project 4: CUDA Rasterizer
 -------------------------------------------------------------------------------
-Fall 2014
+In this project, I implemented a simplified CUDA-based version of a standard rasterized graphics pipeline, similar to the OpenGL pipeline, including vertex shading, primitive assembly, perspective transformation, rasterization, fragment shading, and writing the resulting fragments to a framebuffer.
+
 -------------------------------------------------------------------------------
-Due Monday 10/27/2014 @ 12 PM
+Basic Features:
 -------------------------------------------------------------------------------
+I have implemented the following stages of the graphics pipeline and basic features:
+
+* Vertex Shading
+* Primitive Assembly with support for triangle VBOs/IBOs
+* Perspective Transformation
+* Rasterization through either a scanline or a tiled approach
+* Fragment Shading
+* A depth buffer for storing and depth testing fragments
+* Fragment to framebuffer writing
+* A simple Lambert (diffuse) lighting scheme with an ambient term, implemented in the fragment shader
 
 -------------------------------------------------------------------------------
-NOTE:
+Demo
 -------------------------------------------------------------------------------
-This project requires an NVIDIA graphics card with CUDA capability! Any card with CUDA compute capability 1.1 or higher will work fine for this project. For a full list of CUDA capable cards and their compute capability, please consult: http://developer.nvidia.com/cuda/cuda-gpus. If you do not have an NVIDIA graphics card in the machine you are working on, feel free to use any machine in the SIG Lab or in Moore100 labs. All machines in the SIG Lab and Moore100 are equipped with CUDA capable NVIDIA graphics cards. If this too proves to be a problem, please contact Patrick or Karl as soon as possible.
-
+[![ScreenShot](https://github.com/liying3/Project4-Rasterizer/blob/1ebcda9d585eceb58bd538cd547539c2d57c92e0/result/video.JPG)](http://youtu.be/Kxfwf9KqjOw)
 -------------------------------------------------------------------------------
-INTRODUCTION:
+Extra Features:
 -------------------------------------------------------------------------------
-In this project, you will implement a simplified CUDA based implementation of a standard rasterized graphics pipeline, similar to the OpenGL pipeline. In this project, you will implement vertex shading, primitive assembly, perspective transformation, rasterization, fragment shading, and write the resulting fragments to a framebuffer. More information about the rasterized graphics pipeline can be found in the class slides and in your notes from CIS560.
 
-The basecode provided includes an OBJ loader and much of the mundane I/O and bookkeeping code. The basecode also includes some functions that you may find useful, described below. The core rasterization pipeline is left for you to implement.
+* Back-face culling
 
-You MAY NOT use ANY raycasting/raytracing AT ALL in this project, EXCEPT in the fragment shader step. One of the purposes of this project is to see how a rasterization pipeline can generate graphics WITHOUT the need for raycasting! Raycasting may only be used in the fragment shader effect for interesting shading results, but is absolutely not allowed in any other stages of the pipeline.
+Back-face culling removes the primitives that are not visible because they face away from the camera. This makes rendering faster by reducing the number of polygons that have to be rasterized.
+To determine whether a face is invisible, compute the dot product of the view direction and the face normal. If the value is greater than zero, the face points away from the camera and is culled. Stream compaction is then used to remove the back-facing primitives so that no threads are spent on them.
 
-Also, you MAY NOT use OpenGL ANYWHERE in this project, aside from the given OpenGL code for drawing Pixel Buffer Objects to the screen. Use of OpenGL for any pipeline stage instead of your own custom implementation will result in an incomplete project.
+* Correct color interpolation between points on a primitive
 
-Finally, note that while this basecode is meant to serve as a strong starting point for a CUDA rasterizer, you are not required to use this basecode if you wish, and you may also change any part of the basecode specification as you please, so long as the final rendered result is correct.
+For each primitive, only the colors of its vertices are given. Interpolation gives the color of any point inside the primitive.
+The method is simple: the barycentric coordinates of each point are already calculated and sum to 1, so they can be used directly as the interpolation weights for the vertex colors.
 
--------------------------------------------------------------------------------
-CONTENTS:
--------------------------------------------------------------------------------
-The Project4 root directory contains the following subdirectories:
-	
-* src/ contains the source code for the project. Both the Windows Visual Studio solution and the OSX makefile reference this folder for all source; the base source code compiles on OSX and Windows without modification.
-* objs/ contains example obj test files: cow.obj, cube.obj, tri.obj.
-* renders/ contains an example render of the given example cow.obj file with a z-depth fragment shader. 
-* windows/ contains a Windows Visual Studio 2010 project and all dependencies needed for building and running on Windows 7.
+The color interpolation result for a triangle primitive is shown below.
 
-The Windows and OSX versions of the project build and run exactly the same way as in Project0, Project1, and Project2.
+![ScreenShot](https://github.com/liying3/Project4-Rasterizer/blob/master/result/color%20interpolate.PNG)
 
--------------------------------------------------------------------------------
-REQUIREMENTS:
--------------------------------------------------------------------------------
-In this project, you are given code for:
+* Anti-aliasing
 
-* A library for loading/reading standard Alias/Wavefront .obj format mesh files and converting them to OpenGL style VBOs/IBOs
-* A suggested order of kernels with which to implement the graphics pipeline
-* Working code for CUDA-GL interop
+A simple anti-aliasing method is to average the colors of neighbouring pixels. In this program, each output pixel is the average of the original pixel and its 8 surrounding neighbours (a 3x3 box filter).
 
-You will need to implement the following stages of the graphics pipeline and features:
+The left image is without anti-aliasing and the right one is with anti-aliasing. It smooths the edges but also slightly blurs the rest of the image.
 
-* Vertex Shading
-* Primitive Assembly with support for triangle VBOs/IBOs
-* Perspective Transformation
-* Rasterization through either a scanline or a tiled approach
-* Fragment Shading
-* A depth buffer for storing and depth testing fragments
-* Fragment to framebuffer writing
-* A simple lighting/shading scheme, such as Lambert or Blinn-Phong, implemented in the fragment shader
+![ScreenShot](https://github.com/liying3/Project4-Rasterizer/blob/master/result/antialiasing2.PNG)
 
-You are also required to implement at least 3 of the following features:
+![ScreenShot](https://github.com/liying3/Project4-Rasterizer/blob/master/result/antialias.PNG)
 
-* Additional pipeline stages. Each one of these stages can count as 1 feature:
-   * Geometry shader
-   * Transformation feedback
-   * Back-face culling
-   * Scissor test
-   * Stencil test
-   * Blending
+-------------------------------------------------------------------------------
+PERFORMANCE EVALUATION
+-------------------------------------------------------------------------------
 
-IMPORTANT: For each of these stages implemented, you must also add a section to your README stating what the expected performance impact of that pipeline stage is, and real performance comparisons between your rasterizer with that stage and without.
+* Visual Performance
 
-* Correct color interpolation between points on a primitive
-* Texture mapping WITH texture filtering and perspective correct texture coordinates
-* Support for additional primitices. Each one of these can count as HALF of a feature.
-   * Lines
-   * Line strips
-   * Triangle fans
-   * Triangle strips
-   * Points
-* Anti-aliasing
-* Order-independent translucency using a k-buffer
-* MOUSE BASED interactive camera support. Interactive camera support based only on the keyboard is not acceptable for this feature.
 
--------------------------------------------------------------------------------
-BASE CODE TOUR:
--------------------------------------------------------------------------------
-You will be working primarily in two files: rasterizeKernel.cu, and rasterizerTools.h. Within these files, areas that you need to complete are marked with a TODO comment. Areas that are useful to and serve as hints for optional features are marked with TODO (Optional). Functions that are useful for reference are marked with the comment LOOK.
+When multiple threads rasterize primitives in parallel, they can read and write the same depth-buffer entry at the same time, which causes read/write conflicts.
 
-* rasterizeKernels.cu contains the core rasterization pipeline. 
-	* A suggested sequence of kernels exists in this file, but you may choose to alter the order of this sequence or merge entire kernels if you see fit. For example, if you decide that doing has benefits, you can choose to merge the vertex shader and primitive assembly kernels, or merge the perspective transform into another kernel. There is not necessarily a right sequence of kernels (although there are wrong sequences, such as placing fragment shading before vertex shading), and you may choose any sequence you want. Please document in your README what sequence you choose and why.
-	* The provided kernels have had their input parameters removed beyond basic inputs such as the framebuffer. You will have to decide what inputs should go into each stage of the pipeline, and what outputs there should be. 
+Here is a cube rendered without any locking. In the image, some back-face colors overwrite the front-face colors.
 
-* rasterizeTools.h contains various useful tools, including a number of barycentric coordinate related functions that you may find useful in implementing scanline based rasterization...
-	* A few pre-made structs are included for you to use, such as fragment and triangle. A simple rasterizer can be implemented with these structs as is. However, as with any part of the basecode, you may choose to modify, add to, use as-is, or outright ignore them as you see fit.
-	* If you do choose to add to the fragment struct, be sure to include in your README a rationale for why. 
+![ScreenShot](https://github.com/liying3/Project4-Rasterizer/blob/master/result/orgCube.PNG)
 
-You will also want to familiarize yourself with:
+Back-face culling removes some of these conflicts, but it does not guarantee a correct result: for a face whose normal is perpendicular to the view direction, the test cannot determine whether the face is visible.
 
-* main.cpp, which contains code that transfers VBOs/CBOs/IBOs to the rasterization pipeline. Interactive camera work will also have to be implemented in this file if you choose that feature.
-* utilities.h, which serves as a kitchen-sink of useful functions
+Here is a cube rendered with back-face culling. In the image, the top and bottom faces still overwrite the front-face color.
 
--------------------------------------------------------------------------------
-SOME RESOURCES:
--------------------------------------------------------------------------------
-The following resources may be useful for this project:
-
-* High-Performance Software Rasterization on GPUs
-	* Paper (HPG 2011): http://www.tml.tkk.fi/~samuli/publications/laine2011hpg_paper.pdf
-	* Code: http://code.google.com/p/cudaraster/ Note that looking over this code for reference with regard to the paper is fine, but we most likely will not grant any requests to actually incorporate any of this code into your project.
-	* Slides: http://bps11.idav.ucdavis.edu/talks/08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011.pdf
-* The Direct3D 10 System (SIGGRAPH 2006) - for those interested in doing geometry shaders and transform feedback.
-	* http://133.11.9.3/~takeo/course/2006/media/papers/Direct3D10_siggraph2006.pdf
-* Multi-Fragment Eﬀects on the GPU using the k-Buﬀer - for those who want to do a k-buffer
-	* http://www.inf.ufrgs.br/~comba/papers/2007/kbuffer_preprint.pdf
-* FreePipe: A Programmable, Parallel Rendering Architecture for Efficient Multi-Fragment Effects (I3D 2010)
-	* https://sites.google.com/site/hmcen0921/cudarasterizer
-* Writing A Software Rasterizer In Javascript:
-	* Part 1: http://simonstechblog.blogspot.com/2012/04/software-rasterizer-part-1.html
-	* Part 2: http://simonstechblog.blogspot.com/2012/04/software-rasterizer-part-2.html
+![ScreenShot](https://github.com/liying3/Project4-Rasterizer/blob/master/result/backFace.PNG)
 
--------------------------------------------------------------------------------
-NOTES ON GLM:
--------------------------------------------------------------------------------
-This project uses GLM, the GL Math library, for linear algebra. You need to know two important points on how GLM is used in this project:
+So a per-pixel lock or an atomic function is needed. In my program, I added an isLocked attribute to the fragment struct. However, my implementation is not guaranteed to lock each pixel, so the cube is still not completely correct.
 
-* In this project, indices in GLM vectors (such as vec3, vec4), are accessed via swizzling. So, instead of v[0], v.x is used, and instead of v[1], v.y is used, and so on and so forth.
-* GLM Matrix operations work fine on NVIDIA Fermi cards and later, but pre-Fermi cards do not play nice with GLM matrices. As such, in this project, GLM matrices are replaced with a custom matrix struct, called a cudaMat4, found in cudaMat4.h. A custom function for multiplying glm::vec4s and cudaMat4s is provided as multiplyMV() in intersections.h.
+![ScreenShot](https://github.com/liying3/Project4-Rasterizer/blob/master/result/lockCube.PNG)
 
--------------------------------------------------------------------------------
-README
--------------------------------------------------------------------------------
-All students must replace or augment the contents of this Readme.md in a clear 
-manner with the following:
+* Time Efficiency
+The table and histogram below show the timing and FPS. Back-face culling improves the time efficiency slightly, while anti-aliasing increases the frame time considerably.
+
+![ScreenShot](https://github.com/liying3/Project4-Rasterizer/blob/master/result/table.JPG)
+
+![ScreenShot](https://github.com/liying3/Project4-Rasterizer/blob/master/result/chart.JPG)
 
-* A brief description of the project and the specific features you implemented.
-* At least one screenshot of your project running.
-* A 30 second or longer video of your project running.  To create the video you
-  can use http://www.microsoft.com/expression/products/Encoder4_Overview.aspx 
-* A performance evaluation (described in detail below).
 
 -------------------------------------------------------------------------------
-PERFORMANCE EVALUATION
+Rendering Result
 -------------------------------------------------------------------------------
-The performance evaluation is where you will investigate how to make your CUDA
-programs more efficient using the skills you've learned in class. You must have
-performed at least one experiment on your code to investigate the positive or
-negative effects on performance. 
 
-We encourage you to get creative with your tweaks. Consider places in your code
-that could be considered bottlenecks and try to improve them. 
+In the following image, color represents the face normal:
 
-Each student should provide no more than a one page summary of their
-optimizations along with tables and or graphs to visually explain any
-performance differences.
+![ScreenShot](https://github.com/liying3/Project4-Rasterizer/blob/master/result/normal.PNG)
 
--------------------------------------------------------------------------------
-THIRD PARTY CODE POLICY
--------------------------------------------------------------------------------
-* Use of any third-party code must be approved by asking on Piazza.  If it is approved, all students are welcome to use it.  Generally, we approve use of third-party code that is not a core part of the project.  For example, for the ray tracer, we would approve using a third-party library for loading models, but would not approve copying and pasting a CUDA function for doing refraction.
-* Third-party code must be credited in README.md.
-* Using third-party code without its approval, including using another student's code, is an academic integrity violation, and will result in you receiving an F for the semester.
+With shading added:
 
--------------------------------------------------------------------------------
-SELF-GRADING
--------------------------------------------------------------------------------
-* On the submission date, email your grade, on a scale of 0 to 100, to Liam, harmoli+cis565@seas.upenn.edu, with a one paragraph explanation.  Be concise and realistic.  Recall that we reserve 30 points as a sanity check to adjust your grade.  Your actual grade will be (0.7 * your grade) + (0.3 * our grade).  We hope to only use this in extreme cases when your grade does not realistically reflect your work - it is either too high or too low.  In most cases, we plan to give you the exact grade you suggest.
-* Projects are not weighted evenly, e.g., Project 0 doesn't count as much as the path tracer.  We will determine the weighting at the end of the semester based on the size of each project.
-
----
-SUBMISSION
----
-As with the previous project, you should fork this project and work inside of
-your fork. Upon completion, commit your finished project back to your fork, and
-make a pull request to the master repository.  You should include a README.md
-file in the root directory detailing the following
-
-* A brief description of the project and specific features you implemented
-* At least one screenshot of your project running.
-* A link to a video of your raytracer running.
-* Instructions for building and running your project if they differ from the
-  base code.
-* A performance writeup as detailed above.
-* A list of all third-party code used.
-* This Readme file edited as described above in the README section.
+![ScreenShot](https://github.com/liying3/Project4-Rasterizer/blob/master/result/shading.PNG)
+
+With a color assigned to each vertex:
 
+![ScreenShot](https://github.com/liying3/Project4-Rasterizer/blob/master/result/withut%20light.PNG)
\ No newline at end of file
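Note on the back-face culling described in the README above: the test and the compaction step can be sketched on the CPU as follows. This is only an illustration under stated assumptions, not the kernel from this change. `Vec3`, `Tri`, `isBackFacing` and `cullBackFaces` are hypothetical names, and `std::remove_if` stands in for a GPU-side compaction such as `thrust::remove_if`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal stand-in for glm::vec3 so the sketch has no dependencies.
struct Vec3 { float x, y, z; };

inline float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Hypothetical triangle record holding only the face normal used by the test.
struct Tri { Vec3 n; };

// A face is back-facing when its normal points the same way as the view
// direction (eye towards center), i.e. the dot product is greater than zero.
inline bool isBackFacing(const Tri& t, const Vec3& viewDir) {
    return dot(t.n, viewDir) > 0.0f;
}

// Compaction: drop back-facing triangles, keep the order of the rest,
// and return how many triangles survive.
inline std::size_t cullBackFaces(std::vector<Tri>& tris, const Vec3& viewDir) {
    assert(dot(viewDir, viewDir) > 0.0f);  // a zero view direction is meaningless
    auto newEnd = std::remove_if(tris.begin(), tris.end(),
        [&](const Tri& t) { return isBackFacing(t, viewDir); });
    tris.erase(newEnd, tris.end());
    return tris.size();
}
```

A face whose normal is exactly perpendicular to the view direction gives a dot product of zero and is kept, which matches the edge case the README mentions in its performance section.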
diff --git a/result/PROJ4_Rasterizer.wmv b/result/PROJ4_Rasterizer.wmv
new file mode 100644
index 0000000..0a30152
Binary files /dev/null and b/result/PROJ4_Rasterizer.wmv differ
diff --git a/result/antialias.PNG b/result/antialias.PNG
new file mode 100644
index 0000000..f7170fb
Binary files /dev/null and b/result/antialias.PNG differ
diff --git a/result/antialiasing2.PNG b/result/antialiasing2.PNG
new file mode 100644
index 0000000..60821de
Binary files /dev/null and b/result/antialiasing2.PNG differ
diff --git a/result/backFace.PNG b/result/backFace.PNG
new file mode 100644
index 0000000..c6e2535
Binary files /dev/null and b/result/backFace.PNG differ
diff --git a/result/chart.JPG b/result/chart.JPG
new file mode 100644
index 0000000..cbb03b3
Binary files /dev/null and b/result/chart.JPG differ
diff --git a/result/color interpolate.PNG b/result/color interpolate.PNG
new file mode 100644
index 0000000..522fce7
Binary files /dev/null and b/result/color interpolate.PNG differ
diff --git a/result/cube.PNG b/result/cube.PNG
new file mode 100644
index 0000000..88185a4
Binary files /dev/null and b/result/cube.PNG differ
diff --git a/result/lockCube.PNG b/result/lockCube.PNG
new file mode 100644
index 0000000..2b83dd3
Binary files /dev/null and b/result/lockCube.PNG differ
diff --git a/result/normal.PNG b/result/normal.PNG
new file mode 100644
index 0000000..a9e71f1
Binary files /dev/null and b/result/normal.PNG differ
diff --git a/result/orgCube.PNG b/result/orgCube.PNG
new file mode 100644
index 0000000..427ba57
Binary files /dev/null and b/result/orgCube.PNG differ
diff --git a/result/shading.PNG b/result/shading.PNG
new file mode 100644
index 0000000..2eb75fb
Binary files /dev/null and b/result/shading.PNG differ
diff --git a/result/table.JPG b/result/table.JPG
new file mode 100644
index 0000000..5f53e2e
Binary files /dev/null and b/result/table.JPG differ
diff --git a/result/withut light.PNG b/result/withut light.PNG
new file mode 100644
index 0000000..63626a6
Binary files /dev/null and b/result/withut light.PNG differ
diff --git a/src/main.cpp b/src/main.cpp
index 13d8e67..2c2b668 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -84,22 +84,42 @@ void runCuda(){
   vbo = mesh->getVBO();
   vbosize = mesh->getVBOsize();
 
-  float newcbo[] = {0.0, 1.0, 0.0, 
-                    0.0, 0.0, 1.0, 
-                    1.0, 0.0, 0.0};
+  float newcbo[] = {1.0, 0.0, 0.0, 
+                    0.0, 1.0, 0.0, 
+                    0.0, 0.0, 1.0};
   cbo = newcbo;
   cbosize = 9;
 
   ibo = mesh->getIBO();
   ibosize = mesh->getIBOsize();
 
+  nbo = mesh->getNBO();
+  nbosize = mesh->getNBOsize();
+
+  //Rx += 1.0 * PI / 180.0;
+  Ry += 1.0 * PI / 180.0;
+  //Ry = PI/4.0;
+
+  float viewLength = glm::length(center - eye);
+  eye.x = center.x + viewLength * cos(Rx) * cos(Ry);
+  eye.y = center.y + viewLength * sin(Rx);
+  eye.z = center.z + viewLength * cos(Rx) * sin(Ry);
+
+  glm::mat4 modelView = glm::lookAt(eye, center, up);
+  glm::mat4 projection = glm::perspective(fovy, width / (float)height, 0.1f, 100.0f);
+  glm::mat4 mvp = viewPort * projection * modelView;
+  modelViewProj.x = mvp[0];	modelViewProj.y = mvp[1];	modelViewProj.z = mvp[2];	modelViewProj.w = mvp[3];
+
+  glm::vec4 lightP = projection * modelView * glm::vec4(lightPos, 1.0f);
   cudaGLMapBufferObject((void**)&dptr, pbo);
-  cudaRasterizeCore(dptr, glm::vec2(width, height), frame, vbo, vbosize, cbo, cbosize, ibo, ibosize);
+  cudaRasterizeCore(dptr, glm::vec2(width, height), frame, vbo, vbosize, cbo, cbosize, ibo, ibosize, nbo, nbosize, modelViewProj,/* viewPort,*/ glm::vec3(lightP.x, lightP.y,lightP.z), lightRGB,
+					true, center-eye, false);
   cudaGLUnmapBufferObject(pbo);
 
   vbo = NULL;
   cbo = NULL;
   ibo = NULL;
+  nbo = NULL;
 
   frame++;
   fpstracker++;
@@ -138,12 +158,16 @@ bool init(int argc, char* argv[]) {
   initTextures();
   initCuda();
   initPBO();
-  
+
   GLuint passthroughProgram;
   passthroughProgram = initShader();
 
   glUseProgram(passthroughProgram);
   glActiveTexture(GL_TEXTURE0);
+  
+  //interactive
+  //glutMouseFunc(onMouseButton);
+  //glutMotionFunc(onMouseDrag);
 
   return true;
 }
@@ -281,4 +305,44 @@ void keyCallback(GLFWwindow* window, int key, int scancode, int action, int mods
     if(key == GLFW_KEY_ESCAPE && action == GLFW_PRESS){
         glfwSetWindowShouldClose(window, GL_TRUE);
     }
+}
+
+
+void onMouseButton(int button, int state, int x, int y)
+{
+	if (button == GLUT_LEFT_BUTTON) {
+		if (state == GLUT_DOWN) {
+			lastMousePos.x = x;
+			lastMousePos.y = y;
+
+			mouseMode = TransMode;
+		}
+	}
+	else if (button == GLUT_RIGHT_BUTTON) {
+		if (state == GLUT_DOWN) {
+			lastMousePos.x = x;
+			lastMousePos.y = y;
+
+			mouseMode = RotateMode;
+		}
+	}
+}
+
+void onMouseDrag(int x, int y)
+{
+	float dx = (x - lastMousePos.x) * translateStep;
+	float dy = (y - lastMousePos.y) * translateStep;
+
+	if (mouseMode == TransMode) {
+		eye.x += dx;
+		eye.y += dy;
+		center.x += dx;
+		center.y += dy;
+
+		lastMousePos.x = x;
+		lastMousePos.y = y;
+	}
+	else if (mouseMode == RotateMode) {
+
+	}
 }
\ No newline at end of file
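Note on the camera update in `runCuda()` above: the eye is placed on a sphere around `center`, with `Rx` acting as elevation and `Ry` as azimuth. The same formula can be isolated in plain C++ as a minimal sketch. `Vec3`, `orbitEye` and `dist` are hypothetical names introduced here, and angles are assumed to be in radians.

```cpp
#include <cassert>
#include <cmath>

// Minimal stand-in for glm::vec3.
struct Vec3 { float x, y, z; };

// Euclidean distance between two points.
inline float dist(const Vec3& a, const Vec3& b) {
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Place the eye on a sphere of the given radius around `center`.
// rx is the elevation angle and ry the azimuth angle, both in radians.
inline Vec3 orbitEye(const Vec3& center, float radius, float rx, float ry) {
    assert(radius > 0.0f);
    return { center.x + radius * std::cos(rx) * std::cos(ry),
             center.y + radius * std::sin(rx),
             center.z + radius * std::cos(rx) * std::sin(ry) };
}
```

Because the offset is a unit direction scaled by `radius`, the distance from eye to center stays constant as the angles change, so incrementing `ry` every frame orbits the model without zooming.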
diff --git a/src/main.h b/src/main.h
index 8999110..0b5a675 100644
--- a/src/main.h
+++ b/src/main.h
@@ -5,12 +5,14 @@
 #define MAIN_H
 
 #include <GL/glew.h>
+#include <GL/glut.h>
 #include <GLFW/glfw3.h>
 
 #include <cuda_runtime.h>
 #include <cuda_gl_interop.h>
 #include <fstream>
 #include <glm/glm.hpp>
+#include <glm/gtc/matrix_transform.hpp>
 #include <glslUtil/glslUtility.hpp>
 #include <iostream>
 #include <objUtil/objloader.h>
@@ -49,6 +51,8 @@ float* cbo;
 int cbosize;
 int* ibo;
 int ibosize;
+float* nbo;
+int nbosize;
 
 //-------------------------------
 //----------CUDA STUFF-----------
@@ -56,6 +60,31 @@ int ibosize;
 
 int width = 800; int height = 800;
 
+//-------------------------------
+//----------Camera STUFF-----------
+//-------------------------------
+
+glm::vec3 view(0, 0, -1);
+glm::vec3 up(0, 1, 0);
+glm::vec3 eye(0,0,1.5f);
+glm::vec3 center(0, 0.3, 0);
+float fovy = 60;
+float Rx = 0;
+float Ry = 0;
+
+cudaMat4 modelViewProj;
+cudaMat4 inverseMVP;
+glm::mat4 viewPort(-width/2.0, 0, 0, 0,  0, -height/2.0, 0, 0,  0, 0, 0.5, 0,  width/2.0, height/2.0, 0.5, 1.0);
+//glm::mat4 viewPort(-1, 0, 0, 0,  0, -1, 0, 0,  0, 0, 1, 0,  width/2.0, height/2.0, 1, 1);
+glm::vec3 lightPos(10, 10, 10);
+glm::vec3 lightRGB(1, 1, 1);
+
+//interaction
+int mouseMode = 0;
+enum MOUSEMODE{None, TransMode, RotateMode};
+glm::vec2 lastMousePos(0, 0);
+float translateStep = 1.0f/256.0f;
+
 //-------------------------------
 //-------------MAIN--------------
 //-------------------------------
@@ -85,6 +114,7 @@ void initTextures();
 void initVAO();
 GLuint initShader();
 
+
 //-------------------------------
 //---------CLEANUP STUFF---------
 //-------------------------------
@@ -100,4 +130,18 @@ void mainLoop();
 void errorCallback(int error, const char *description);
 void keyCallback(GLFWwindow *window, int key, int scancode, int action, int mods);
 
+//------------------------------
+//------- Helper ---------------
+//------------------------------
+
+//void computeProjection(float fovy, float aspect, float zNear, float zFar);
+//void computeViewMat(glm::vec3 );
+
+//------------------------------
+//------- Interactive ----------
+//------------------------------
+void onMouseButton(int button, int state, int x, int y);
+
+void onMouseDrag(int x, int y);
+
 #endif
\ No newline at end of file
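Note on the `viewPort` matrix declared in main.h above: `glm::mat4` takes its 16 scalars in column-major order, so the matrix scales and offsets normalized device coordinates into pixel coordinates and flips both the x and y axes. Written out per component it looks like the sketch below. `Vec3` and `ndcToScreen` are hypothetical names, and the input is assumed to be already divided by w.

```cpp
#include <cassert>
#include <cmath>

// Minimal stand-in for glm::vec3.
struct Vec3 { float x, y, z; };

// Map normalized device coordinates in [-1, 1] to window coordinates,
// following the column-major viewPort matrix:
//   x_screen = -w/2 * x + w/2
//   y_screen = -h/2 * y + h/2
//   z_screen = 0.5 * z + 0.5
inline Vec3 ndcToScreen(const Vec3& ndc, int width, int height) {
    assert(width > 0 && height > 0);
    float halfW = width / 2.0f;
    float halfH = height / 2.0f;
    return { -halfW * ndc.x + halfW,
             -halfH * ndc.y + halfH,
             0.5f * ndc.z + 0.5f };
}
```

With the 800x800 window used here, the NDC origin lands on pixel (400, 400) at depth 0.5, and the NDC corner (1, 1, 1) lands on pixel (0, 0) at depth 1.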
diff --git a/src/rasterizeKernels.cu b/src/rasterizeKernels.cu
index 10b0000..21cc62e 100644
--- a/src/rasterizeKernels.cu
+++ b/src/rasterizeKernels.cu
@@ -7,12 +7,15 @@
 #include <thrust/random.h>
 #include "rasterizeKernels.h"
 #include "rasterizeTools.h"
+#include <thrust/device_ptr.h>
+#include <thrust/remove.h>
 
 glm::vec3* framebuffer;
 fragment* depthbuffer;
 float* device_vbo;
 float* device_cbo;
 int* device_ibo;
+float* device_nbo;
 triangle* primitives;
 
 void checkCUDAError(const char *msg) {
@@ -129,36 +132,143 @@ __global__ void sendImageToPBO(uchar4* PBOpos, glm::vec2 resolution, glm::vec3*
 }
 
 //TODO: Implement a vertex shader
-__global__ void vertexShadeKernel(float* vbo, int vbosize){
+__global__ void vertexShadeKernel(float* vbo, int vbosize, cudaMat4 modelView){
   int index = (blockIdx.x * blockDim.x) + threadIdx.x;
   if(index<vbosize/3){
+	  glm::vec3 v(vbo[index*3], vbo[index*3+1], vbo[index*3+2]);
+
+	  v = multiplyMV(modelView, glm::vec4(v, 1.0));
+
+	  //v = modelView * v;
+	  //v /= v.w;
+	  //glm::vec4 v2 = viewPort * glm::vec4(v,1.0);
+
+	  vbo[index*3] = v.x;
+	  vbo[index*3+1] = v.y;
+	  vbo[index*3+2] = v.z;
   }
 }
 
 //TODO: Implement primative assembly
-__global__ void primitiveAssemblyKernel(float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize, triangle* primitives){
+__global__ void primitiveAssemblyKernel(float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize, float* nbo, int nbosize, triangle* primitives, 
+										bool backFaceCulling, glm::vec3 viewDir){
   int index = (blockIdx.x * blockDim.x) + threadIdx.x;
   int primitivesCount = ibosize/3;
   if(index<primitivesCount){
+	  int vi1 = ibo[index*3];
+	  int vi2 = ibo[index*3+1];
+	  int vi3 = ibo[index*3+2];
+
+	  primitives[index].p0 = glm::vec3(vbo[vi1*3], vbo[vi1*3+1], vbo[vi1*3+2]);
+	  primitives[index].p1 = glm::vec3(vbo[vi2*3], vbo[vi2*3+1], vbo[vi2*3+2]);
+	  primitives[index].p2 = glm::vec3(vbo[vi3*3], vbo[vi3*3+1], vbo[vi3*3+2]);
+
+	  primitives[index].c0 = glm::vec3(cbo[0], cbo[1], cbo[2]);
+	  primitives[index].c1 = glm::vec3(cbo[3], cbo[4], cbo[5]);
+	  primitives[index].c2 = glm::vec3(cbo[6], cbo[7], cbo[8]);
+
+	  glm::vec3 n1(nbo[vi1*3], nbo[vi1*3+1], nbo[vi1*3+2]);
+	  glm::vec3 n2(nbo[vi2*3], nbo[vi2*3+1], nbo[vi2*3+2]);
+	  glm::vec3 n3(nbo[vi3*3], nbo[vi3*3+1], nbo[vi3*3+2]);
+
+	  primitives[index].n = glm::normalize((n1+n2+n3) / 3.0f);
+
+	  //back-face culling
+	  if (backFaceCulling)
+	  {
+		  if (glm::dot(primitives[index].n, viewDir) > 0)
+			  primitives[index].n = glm::vec3(0,0,0);
+	  }
   }
 }
 
 //TODO: Implement a rasterization method, such as scanline.
 __global__ void rasterizationKernel(triangle* primitives, int primitivesCount, fragment* depthbuffer, glm::vec2 resolution){
   int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
   if(index<primitivesCount){
+	//  if (primitives[index].n.x != 0 || primitives[index].n.y != 0 || primitives[index].n.z != 0)
+	//  {
+		  triangle primitive = primitives[index];
+		  glm::vec3 minpoint, maxpoint;
+		  getAABBForTriangle(primitive, minpoint, maxpoint);
+
+		  minpoint.x = (minpoint.x < 0) ? 0 : minpoint.x;
+		  maxpoint.x = (maxpoint.x > resolution.x - 1) ? resolution.x - 1 : maxpoint.x;
+		  minpoint.y = (minpoint.y < 0) ? 0 : minpoint.y;
+		  maxpoint.y = (maxpoint.y > resolution.y - 1) ? resolution.y - 1 : maxpoint.y;
+
+		  for (int y = minpoint.y; y <= maxpoint.y; y++)
+		  {
+			  for(int x = minpoint.x; x <= maxpoint.x; x++)
+			  {
+				int depthbufferId = x + y * resolution.x;
+
+				glm::vec2 center(x+0.5, y+0.5);
+				glm::vec3 barycentricP = calculateBarycentricCoordinate(primitive, center);
+
+				float z = getZAtCoordinate(barycentricP, primitive);
+
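+				//attempted per-pixel lock: this flag is not set atomically, so it does not actually wait and races between threads remain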
+				while (!depthbuffer[depthbufferId].isLocked)
+					depthbuffer[depthbufferId].isLocked = true;
+
+				if (z > depthbuffer[depthbufferId].position.z  && isBarycentricCoordInBounds(barycentricP))
+				{
+					depthbuffer[depthbufferId].position = primitive.p0 * barycentricP.x + primitive.p1 * barycentricP.y + primitive.p2 * barycentricP.z;
+					depthbuffer[depthbufferId].normal = primitive.n;
+					depthbuffer[depthbufferId].color = primitive.c0 * barycentricP.x + primitive.c1 * barycentricP.y + primitive.c2 * barycentricP.z;
+				}
+				depthbuffer[depthbufferId].isLocked = false;
+
+			  }
+		  }
+	//  }
+	//  else;
   }
 }
 
 //TODO: Implement a fragment shader
-__global__ void fragmentShadeKernel(fragment* depthbuffer, glm::vec2 resolution){
+__global__ void fragmentShadeKernel(fragment* depthbuffer, glm::vec2 resolution, glm::vec3 lightPos, glm::vec3 lightRGB){
   int x = (blockIdx.x * blockDim.x) + threadIdx.x;
   int y = (blockIdx.y * blockDim.y) + threadIdx.y;
   int index = x + (y * resolution.x);
   if(x<=resolution.x && y<=resolution.y){
+
+	  fragment frag = depthbuffer[index];
+
+	  glm::vec3 incident = glm::normalize(frag.position - lightPos);
+	  float cosI = glm::dot(incident, frag.normal);
+	  cosI = cosI > 0.0 ? 0.0 : -cosI;
+	  frag.color = frag.color * cosI * lightRGB * 0.7f + frag.color * 0.3f;
+
+	  /*frag.color.r = abs(frag.normal.x);
+	  frag.color.g = abs(frag.normal.y);
+	  frag.color.b = abs(frag.normal.z);*/
+
+	  depthbuffer[index] = frag;
   }
 }
 
+__global__ void antiAliasKernel(fragment* depthbuffer, glm::vec2 resolution){
+	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+	int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+	int index = x + (y * resolution.x);
+	if(x<resolution.x && y<resolution.y){
+		glm::vec3 newColor(0.0f);
+		if (x >= 1 && y >= 1 && x<resolution.x-1 && y<resolution.y-1) {
+			for (int i = x-1; i <= x+1; i++){
+				for (int j = y-1; j <= y+1; j++){
+					int id = i + (j * resolution.x);
+					newColor += depthbuffer[id].color;
+				}
+			}
+			newColor /= 9.0f;
+			depthbuffer[index].color = newColor; //border pixels keep their original color
+		}
+	}
+}
+
 //Writes fragment colors to the framebuffer
 __global__ void render(glm::vec2 resolution, fragment* depthbuffer, glm::vec3* framebuffer){
 
@@ -172,7 +282,9 @@ __global__ void render(glm::vec2 resolution, fragment* depthbuffer, glm::vec3* f
 }
 
 // Wrapper for the __global__ call that sets up the kernel calls and does a ton of memory management
-void cudaRasterizeCore(uchar4* PBOpos, glm::vec2 resolution, float frame, float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize){
+void cudaRasterizeCore(uchar4* PBOpos, glm::vec2 resolution, float frame, float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize, float* nbo, int nbosize,
+						cudaMat4 modelViewProj, glm::vec3 lightPos, glm::vec3 lightRGB,
+						bool backFaceCulling, glm::vec3 vdir, bool antiAlias){
 
   // set up crucial magic
   int tileSize = 8;
@@ -194,6 +306,7 @@ void cudaRasterizeCore(uchar4* PBOpos, glm::vec2 resolution, float frame, float*
   frag.color = glm::vec3(0,0,0);
   frag.normal = glm::vec3(0,0,0);
   frag.position = glm::vec3(0,0,-10000);
+  frag.isLocked = false;
   clearDepthBuffer<<<fullBlocksPerGrid, threadsPerBlock>>>(resolution, depthbuffer,frag);
 
   //------------------------------
@@ -214,22 +327,42 @@ void cudaRasterizeCore(uchar4* PBOpos, glm::vec2 resolution, float frame, float*
   cudaMalloc((void**)&device_cbo, cbosize*sizeof(float));
   cudaMemcpy( device_cbo, cbo, cbosize*sizeof(float), cudaMemcpyHostToDevice);
 
+  device_nbo = NULL;
+  cudaMalloc((void**)&device_nbo, nbosize*sizeof(float));
+  cudaMemcpy( device_nbo, nbo, nbosize*sizeof(float), cudaMemcpyHostToDevice);
+
   tileSize = 32;
   int primitiveBlocks = ceil(((float)vbosize/3)/((float)tileSize));
 
   //------------------------------
   //vertex shader
   //------------------------------
-  vertexShadeKernel<<<primitiveBlocks, tileSize>>>(device_vbo, vbosize);
+  vertexShadeKernel<<<primitiveBlocks, tileSize>>>(device_vbo, vbosize, modelViewProj);
 
   cudaDeviceSynchronize();
   //------------------------------
   //primitive assembly
   //------------------------------
+
+  //vdir = multiplyMV(inverseMV, glm::vec4(vdir,1));
+
   primitiveBlocks = ceil(((float)ibosize/3)/((float)tileSize));
-  primitiveAssemblyKernel<<<primitiveBlocks, tileSize>>>(device_vbo, vbosize, device_cbo, cbosize, device_ibo, ibosize, primitives);
+  primitiveAssemblyKernel<<<primitiveBlocks, tileSize>>>(device_vbo, vbosize, device_cbo, cbosize, device_ibo, ibosize, device_nbo, nbosize, primitives,
+							backFaceCulling, vdir);
 
   cudaDeviceSynchronize();
+
+  //stream compact
+  if (backFaceCulling)
+  {
+	  //remove the triangles that isInvisible() flags (zero normal)
+	  thrust::device_ptr<triangle> beginItr(primitives);
+	  thrust::device_ptr<triangle> endItr = beginItr + ibosize/3;
+	  endItr = thrust::remove_if(beginItr, endItr, isInvisible());
+	  int numPrimitives = (int)(endItr - beginItr);
+	  primitiveBlocks = ceil(((float)numPrimitives)/((float)tileSize));
+  }
+
   //------------------------------
   //rasterization
   //------------------------------
@@ -239,9 +372,17 @@ void cudaRasterizeCore(uchar4* PBOpos, glm::vec2 resolution, float frame, float*
   //------------------------------
   //fragment shader
   //------------------------------
-  fragmentShadeKernel<<<fullBlocksPerGrid, threadsPerBlock>>>(depthbuffer, resolution);
+  
+  fragmentShadeKernel<<<fullBlocksPerGrid, threadsPerBlock>>>(depthbuffer, resolution, lightPos, lightRGB );
 
   cudaDeviceSynchronize();
+
+  if (antiAlias)
+  {
+	  antiAliasKernel<<<fullBlocksPerGrid, threadsPerBlock>>>(depthbuffer, resolution);
+
+	  cudaDeviceSynchronize();
+  }
   //------------------------------
   //write fragments to framebuffer
   //------------------------------
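Review note on the per-pixel lock in the rasterization hunk above: the `while (!isLocked) isLocked = true;` loop reads and writes a plain `bool`, so it cannot guarantee exclusive access to a pixel. The usual remedy is a compare-and-swap on an integer flag. Below is a minimal host-side C++ sketch of that idea, assuming a hypothetical `Pixel` type with an atomic `lock` field (neither is in the patch); `compare_exchange_weak` stands in for what `atomicCAS(&lock, 0, 1)` would do in a kernel.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the patch's `fragment`; only depth and lock are kept.
struct Pixel {
    float z = -10000.0f;        // same "far" sentinel the patch clears position.z to
    std::atomic<int> lock{0};   // 0 = free, 1 = held (assumption: int flag, not bool)
};

// Acquire the pixel, run the depth test, release.
// Larger z wins, matching the `z > depthbuffer[...].position.z` test in the patch.
inline void depthTestedWrite(Pixel& p, float z) {
    bool done = false;
    while (!done) {
        int expected = 0;
        // Succeeds only if the lock was free; the swap to 1 is a single atomic step.
        if (p.lock.compare_exchange_weak(expected, 1, std::memory_order_acquire)) {
            if (z > p.z) {
                p.z = z;        // critical section: the only place z is written
            }
            p.lock.store(0, std::memory_order_release);
            done = true;
        }
    }
}
```

Keeping the critical section inside the successful branch, rather than spinning first and writing afterwards, is the shape usually recommended for GPU code as well, since threads of one warp that spin outside the branch can stall each other.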
diff --git a/src/rasterizeKernels.h b/src/rasterizeKernels.h
index 784be17..65bc076 100644
--- a/src/rasterizeKernels.h
+++ b/src/rasterizeKernels.h
@@ -9,8 +9,11 @@
 #include <cuda.h>
 #include <cmath>
 #include "glm/glm.hpp"
+#include "cudaMat4.h"
 
 void kernelCleanup();
-void cudaRasterizeCore(uchar4* pos, glm::vec2 resolution, float frame, float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize);
+void cudaRasterizeCore(uchar4* pos, glm::vec2 resolution, float frame, float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize, float* nbo, int nbosize,
+						cudaMat4 modelViewProj, glm::vec3 lightPos, glm::vec3 lightRGB,
+						bool backFaceCulling, glm::vec3 vdir, bool antiAlias);
 
 #endif //RASTERIZEKERNEL_H
diff --git a/src/rasterizeTools.h b/src/rasterizeTools.h
index e9b5dcc..cfbe975 100644
--- a/src/rasterizeTools.h
+++ b/src/rasterizeTools.h
@@ -16,20 +16,36 @@ struct triangle {
   glm::vec3 c0;
   glm::vec3 c1;
   glm::vec3 c2;
+  glm::vec3 n;
 };
 
 struct fragment{
   glm::vec3 color;
   glm::vec3 normal;
   glm::vec3 position;
+  bool isLocked;
+};
+
+struct isInvisible {
+	__host__ __device__
+	bool operator()(const triangle& primitive)
+	{
+		return (primitive.n.x == 0 && primitive.n.y == 0 && primitive.n.z == 0);
+	}
 };
 
 //Multiplies a cudaMat4 matrix and a vec4
 __host__ __device__ glm::vec3 multiplyMV(cudaMat4 m, glm::vec4 v){
   glm::vec3 r(1,1,1);
-  r.x = (m.x.x*v.x)+(m.x.y*v.y)+(m.x.z*v.z)+(m.x.w*v.w);
+  /*r.x = (m.x.x*v.x)+(m.x.y*v.y)+(m.x.z*v.z)+(m.x.w*v.w);
   r.y = (m.y.x*v.x)+(m.y.y*v.y)+(m.y.z*v.z)+(m.y.w*v.w);
-  r.z = (m.z.x*v.x)+(m.z.y*v.y)+(m.z.z*v.z)+(m.z.w*v.w);
+  r.z = (m.z.x*v.x)+(m.z.y*v.y)+(m.z.z*v.z)+(m.z.w*v.w);*/
+
+  r.x = (m.x.x*v.x)+(m.y.x*v.y)+(m.z.x*v.z)+(m.w.x*v.w);
+  r.y = (m.x.y*v.x)+(m.y.y*v.y)+(m.z.y*v.z)+(m.w.y*v.w);
+  r.z = (m.x.z*v.x)+(m.y.z*v.y)+(m.z.z*v.z)+(m.w.z*v.w);
+  float w = (m.x.w*v.x)+(m.y.w*v.y)+(m.z.w*v.z)+(m.w.w*v.w);
+  r = (w > 0.001) ? r/w : r;
   return r;
 }
 
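Review note on the `multiplyMV` change above: the new version reads each `cudaMat4` member as a column and divides by the resulting `w`. The sketch below reproduces that arithmetic in plain C++ so it can be checked on the host. `Vec3`, `Vec4`, `Mat4` and `transformPoint` are hypothetical stand-ins for the glm and `cudaMat4` types, not names from the project.

```cpp
#include <cassert>

struct Vec3 { float x, y, z; };
struct Vec4 { float x, y, z, w; };
// Assumption: like the patched multiplyMV, members x, y, z, w are treated as columns.
struct Mat4 { Vec4 x, y, z, w; };

// Column-major matrix * vector followed by the perspective divide.
inline Vec3 transformPoint(const Mat4& m, const Vec4& v) {
    Vec3 r;
    r.x = m.x.x * v.x + m.y.x * v.y + m.z.x * v.z + m.w.x * v.w;
    r.y = m.x.y * v.x + m.y.y * v.y + m.z.y * v.z + m.w.y * v.w;
    r.z = m.x.z * v.x + m.y.z * v.y + m.z.z * v.z + m.w.z * v.w;
    float w = m.x.w * v.x + m.y.w * v.y + m.z.w * v.z + m.w.w * v.w;
    // Same guard as the patch: divide only when w is safely positive.
    if (w > 0.001f) {
        r.x /= w;
        r.y /= w;
        r.z /= w;
    }
    return r;
}
```

With a translation stored in the fourth column, a point with `w = 1` is shifted by that column. A matrix whose bottom-right entry is 2 halves every component. Note that the guard leaves points with negative `w` (behind the camera) undivided, so callers may want to clip those separately.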
diff --git a/windows/PROJ4_Rasterizer/PROJ4_Rasterizer/PROJ4_Rasterizer.vcxproj b/windows/PROJ4_Rasterizer/PROJ4_Rasterizer/PROJ4_Rasterizer.vcxproj
index f640485..7a6dcc4 100644
--- a/windows/PROJ4_Rasterizer/PROJ4_Rasterizer/PROJ4_Rasterizer.vcxproj
+++ b/windows/PROJ4_Rasterizer/PROJ4_Rasterizer/PROJ4_Rasterizer.vcxproj
@@ -28,7 +28,7 @@
   </PropertyGroup>
   <Import Project="$(VCTargetsPath)\Microsoft.Cpp.props" />
   <ImportGroup Label="ExtensionSettings">
-    <Import Project="$(VCTargetsPath)\BuildCustomizations\CUDA 5.5.props" />
+    <Import Project="$(VCTargetsPath)\BuildCustomizations\CUDA 6.5.props" />
   </ImportGroup>
   <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
     <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
@@ -87,6 +87,6 @@
   </ItemGroup>
   <Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
   <ImportGroup Label="ExtensionTargets">
-    <Import Project="$(VCTargetsPath)\BuildCustomizations\CUDA 5.5.targets" />
+    <Import Project="$(VCTargetsPath)\BuildCustomizations\CUDA 6.5.targets" />
   </ImportGroup>
 </Project>
\ No newline at end of file