diff --git a/Performance.xls b/Performance.xls
new file mode 100644
index 0000000..3a99d0b
Binary files /dev/null and b/Performance.xls differ
diff --git a/README.md b/README.md
index ae0896a..e2d6f88 100644
--- a/README.md
+++ b/README.md
@@ -1,184 +1,63 @@
 -------------------------------------------------------------------------------
 CIS565: Project 4: CUDA Rasterizer
 -------------------------------------------------------------------------------
-Fall 2014
--------------------------------------------------------------------------------
-Due Monday 10/27/2014 @ 12 PM
--------------------------------------------------------------------------------
+All required features are implemented:
 
--------------------------------------------------------------------------------
-NOTE:
--------------------------------------------------------------------------------
-This project requires an NVIDIA graphics card with CUDA capability! Any card with CUDA compute capability 1.1 or higher will work fine for this project. For a full list of CUDA capable cards and their compute capability, please consult: http://developer.nvidia.com/cuda/cuda-gpus. If you do not have an NVIDIA graphics card in the machine you are working on, feel free to use any machine in the SIG Lab or in Moore100 labs. All machines in the SIG Lab and Moore100 are equipped with CUDA capable NVIDIA graphics cards. If this too proves to be a problem, please contact Patrick or Karl as soon as possible.
-
--------------------------------------------------------------------------------
-INTRODUCTION:
--------------------------------------------------------------------------------
-In this project, you will implement a simplified CUDA based implementation of a standard rasterized graphics pipeline, similar to the OpenGL pipeline. In this project, you will implement vertex shading, primitive assembly, perspective transformation, rasterization, fragment shading, and write the resulting fragments to a framebuffer. More information about the rasterized graphics pipeline can be found in the class slides and in your notes from CIS560.
-
-The basecode provided includes an OBJ loader and much of the mundane I/O and bookkeeping code. The basecode also includes some functions that you may find useful, described below. The core rasterization pipeline is left for you to implement.
-
-You MAY NOT use ANY raycasting/raytracing AT ALL in this project, EXCEPT in the fragment shader step. One of the purposes of this project is to see how a rasterization pipeline can generate graphics WITHOUT the need for raycasting! Raycasting may only be used in the fragment shader effect for interesting shading results, but is absolutely not allowed in any other stages of the pipeline.
-
-Also, you MAY NOT use OpenGL ANYWHERE in this project, aside from the given OpenGL code for drawing Pixel Buffer Objects to the screen. Use of OpenGL for any pipeline stage instead of your own custom implementation will result in an incomplete project.
-
-Finally, note that while this basecode is meant to serve as a strong starting point for a CUDA rasterizer, you are not required to use this basecode if you wish, and you may also change any part of the basecode specification as you please, so long as the final rendered result is correct.
-
--------------------------------------------------------------------------------
-CONTENTS:
--------------------------------------------------------------------------------
-The Project4 root directory contains the following subdirectories:
+	* Vertex Shading
 	
-* src/ contains the source code for the project. Both the Windows Visual Studio solution and the OSX makefile reference this folder for all source; the base source code compiles on OSX and Windows without modification.
-* objs/ contains example obj test files: cow.obj, cube.obj, tri.obj.
-* renders/ contains an example render of the given example cow.obj file with a z-depth fragment shader. 
-* windows/ contains a Windows Visual Studio 2010 project and all dependencies needed for building and running on Windows 7.
-
-The Windows and OSX versions of the project build and run exactly the same way as in Project0, Project1, and Project2.
-
--------------------------------------------------------------------------------
-REQUIREMENTS:
--------------------------------------------------------------------------------
-In this project, you are given code for:
-
-* A library for loading/reading standard Alias/Wavefront .obj format mesh files and converting them to OpenGL style VBOs/IBOs
-* A suggested order of kernels with which to implement the graphics pipeline
-* Working code for CUDA-GL interop
-
-You will need to implement the following stages of the graphics pipeline and features:
-
-* Vertex Shading
-* Primitive Assembly with support for triangle VBOs/IBOs
-* Perspective Transformation
-* Rasterization through either a scanline or a tiled approach
-* Fragment Shading
-* A depth buffer for storing and depth testing fragments
-* Fragment to framebuffer writing
-* A simple lighting/shading scheme, such as Lambert or Blinn-Phong, implemented in the fragment shader
-
-You are also required to implement at least 3 of the following features:
-
-* Additional pipeline stages. Each one of these stages can count as 1 feature:
-   * Geometry shader
-   * Transformation feedback
-   * Back-face culling
-   * Scissor test
-   * Stencil test
-   * Blending
-
-IMPORTANT: For each of these stages implemented, you must also add a section to your README stating what the expected performance impact of that pipeline stage is, and real performance comparisons between your rasterizer with that stage and without.
-
-* Correct color interpolation between points on a primitive
-* Texture mapping WITH texture filtering and perspective correct texture coordinates
-* Support for additional primitices. Each one of these can count as HALF of a feature.
-   * Lines
-   * Line strips
-   * Triangle fans
-   * Triangle strips
-   * Points
-* Anti-aliasing
-* Order-independent translucency using a k-buffer
-* MOUSE BASED interactive camera support. Interactive camera support based only on the keyboard is not acceptable for this feature.
-
--------------------------------------------------------------------------------
-BASE CODE TOUR:
--------------------------------------------------------------------------------
-You will be working primarily in two files: rasterizeKernel.cu, and rasterizerTools.h. Within these files, areas that you need to complete are marked with a TODO comment. Areas that are useful to and serve as hints for optional features are marked with TODO (Optional). Functions that are useful for reference are marked with the comment LOOK.
-
-* rasterizeKernels.cu contains the core rasterization pipeline. 
-	* A suggested sequence of kernels exists in this file, but you may choose to alter the order of this sequence or merge entire kernels if you see fit. For example, if you decide that doing has benefits, you can choose to merge the vertex shader and primitive assembly kernels, or merge the perspective transform into another kernel. There is not necessarily a right sequence of kernels (although there are wrong sequences, such as placing fragment shading before vertex shading), and you may choose any sequence you want. Please document in your README what sequence you choose and why.
-	* The provided kernels have had their input parameters removed beyond basic inputs such as the framebuffer. You will have to decide what inputs should go into each stage of the pipeline, and what outputs there should be. 
-
-* rasterizeTools.h contains various useful tools, including a number of barycentric coordinate related functions that you may find useful in implementing scanline based rasterization...
-	* A few pre-made structs are included for you to use, such as fragment and triangle. A simple rasterizer can be implemented with these structs as is. However, as with any part of the basecode, you may choose to modify, add to, use as-is, or outright ignore them as you see fit.
-	* If you do choose to add to the fragment struct, be sure to include in your README a rationale for why. 
-
-You will also want to familiarize yourself with:
+	* Primitive Assembly with support for triangle VBOs/IBOs
+	
+	* Perspective Transformation
+	
+	* Rasterization through either a scanline or a tiled approach
+	
+	* Fragment Shading
+	
+	* A depth buffer for storing and depth testing fragments
+	
+	* Fragment to framebuffer writing
+	
+	* A simple lighting/shading scheme, such as Lambert or Blinn-Phong, implemented in the fragment shader
 
-* main.cpp, which contains code that transfers VBOs/CBOs/IBOs to the rasterization pipeline. Interactive camera work will also have to be implemented in this file if you choose that feature.
-* utilities.h, which serves as a kitchen-sink of useful functions
+The three additional features are:
 
--------------------------------------------------------------------------------
-SOME RESOURCES:
--------------------------------------------------------------------------------
-The following resources may be useful for this project:
+	* Back-face culling
 
-* High-Performance Software Rasterization on GPUs
-	* Paper (HPG 2011): http://www.tml.tkk.fi/~samuli/publications/laine2011hpg_paper.pdf
-	* Code: http://code.google.com/p/cudaraster/ Note that looking over this code for reference with regard to the paper is fine, but we most likely will not grant any requests to actually incorporate any of this code into your project.
-	* Slides: http://bps11.idav.ucdavis.edu/talks/08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011.pdf
-* The Direct3D 10 System (SIGGRAPH 2006) - for those interested in doing geometry shaders and transform feedback.
-	* http://133.11.9.3/~takeo/course/2006/media/papers/Direct3D10_siggraph2006.pdf
-* Multi-Fragment Eﬀects on the GPU using the k-Buﬀer - for those who want to do a k-buffer
-	* http://www.inf.ufrgs.br/~comba/papers/2007/kbuffer_preprint.pdf
-* FreePipe: A Programmable, Parallel Rendering Architecture for Efficient Multi-Fragment Effects (I3D 2010)
-	* https://sites.google.com/site/hmcen0921/cudarasterizer
-* Writing A Software Rasterizer In Javascript:
-	* Part 1: http://simonstechblog.blogspot.com/2012/04/software-rasterizer-part-1.html
-	* Part 2: http://simonstechblog.blogspot.com/2012/04/software-rasterizer-part-2.html
+	* Correct color interpolation between points on a primitive
 
--------------------------------------------------------------------------------
-NOTES ON GLM:
--------------------------------------------------------------------------------
-This project uses GLM, the GL Math library, for linear algebra. You need to know two important points on how GLM is used in this project:
-
-* In this project, indices in GLM vectors (such as vec3, vec4), are accessed via swizzling. So, instead of v[0], v.x is used, and instead of v[1], v.y is used, and so on and so forth.
-* GLM Matrix operations work fine on NVIDIA Fermi cards and later, but pre-Fermi cards do not play nice with GLM matrices. As such, in this project, GLM matrices are replaced with a custom matrix struct, called a cudaMat4, found in cudaMat4.h. A custom function for multiplying glm::vec4s and cudaMat4s is provided as multiplyMV() in intersections.h.
+	* Mouse-based interactive camera support
+	
+---------------------------------------------------------------------------------
+Performance Analysis
+---------------------------------------------------------------------------------
+All the data are listed in the Performance.xls file. 
+Initially, I expected back-face culling to raise the FPS by at least 20% to 40%, since it leaves only one half to one third of the original model's faces. 
 
--------------------------------------------------------------------------------
-README
--------------------------------------------------------------------------------
-All students must replace or augment the contents of this Readme.md in a clear 
-manner with the following:
+However, the actual improvement is small. Culling is implemented as a stream compaction of the primitives that removes those whose normals face away from the camera. It may be that the stream compaction costs more than rendering the back faces would. 
 
-* A brief description of the project and the specific features you implemented.
-* At least one screenshot of your project running.
-* A 30 second or longer video of your project running.  To create the video you
-  can use http://www.microsoft.com/expression/products/Encoder4_Overview.aspx 
-* A performance evaluation (described in detail below).
+As the number of primitives increases, the render time first decreases and then increases. I think the initial decrease occurs because, with few primitives, many threads access the same triangle at the same time, which lowers performance. The later increase occurs because the number of primitives to test keeps growing. 
 
--------------------------------------------------------------------------------
-PERFORMANCE EVALUATION
--------------------------------------------------------------------------------
-The performance evaluation is where you will investigate how to make your CUDA
-programs more efficient using the skills you've learned in class. You must have
-performed at least one experiment on your code to investigate the positive or
-negative effects on performance. 
+Because memory is limited, I could not test cases with even more primitives. 
 
-We encourage you to get creative with your tweaks. Consider places in your code
-that could be considered bottlenecks and try to improve them. 
+![Alt text](https://github.com/chiwsy/Project4-Rasterizer/blob/master/renders/CullingPersentage.png)
+Culling Scale
 
-Each student should provide no more than a one page summary of their
-optimizations along with tables and or graphs to visually explain any
-performance differences.
+![Alt text](https://github.com/chiwsy/Project4-Rasterizer/blob/master/renders/Performance.png)
+Performance
 
--------------------------------------------------------------------------------
-THIRD PARTY CODE POLICY
--------------------------------------------------------------------------------
-* Use of any third-party code must be approved by asking on Piazza.  If it is approved, all students are welcome to use it.  Generally, we approve use of third-party code that is not a core part of the project.  For example, for the ray tracer, we would approve using a third-party library for loading models, but would not approve copying and pasting a CUDA function for doing refraction.
-* Third-party code must be credited in README.md.
-* Using third-party code without its approval, including using another student's code, is an academic integrity violation, and will result in you receiving an F for the semester.
+---------------------------------------------------------------------------------
+Render Result
+---------------------------------------------------------------------------------
+![Alt text](https://github.com/chiwsy/Project4-Rasterizer/blob/master/renders/ColorInterpolation.png)
+Color interpolation
 
--------------------------------------------------------------------------------
-SELF-GRADING
--------------------------------------------------------------------------------
-* On the submission date, email your grade, on a scale of 0 to 100, to Liam, harmoli+cis565@seas.upenn.edu, with a one paragraph explanation.  Be concise and realistic.  Recall that we reserve 30 points as a sanity check to adjust your grade.  Your actual grade will be (0.7 * your grade) + (0.3 * our grade).  We hope to only use this in extreme cases when your grade does not realistically reflect your work - it is either too high or too low.  In most cases, we plan to give you the exact grade you suggest.
-* Projects are not weighted evenly, e.g., Project 0 doesn't count as much as the path tracer.  We will determine the weighting at the end of the semester based on the size of each project.
+![Alt text](https://github.com/chiwsy/Project4-Rasterizer/blob/master/renders/NoAA.png)
+Moiré pattern
 
----
-SUBMISSION
----
-As with the previous project, you should fork this project and work inside of
-your fork. Upon completion, commit your finished project back to your fork, and
-make a pull request to the master repository.  You should include a README.md
-file in the root directory detailing the following
+![Alt text](https://github.com/chiwsy/Project4-Rasterizer/blob/master/renders/cow.png)
+![Alt text](https://github.com/chiwsy/Project4-Rasterizer/blob/master/renders/cowHighPoly.png)
+Cow test case
 
-* A brief description of the project and specific features you implemented
-* At least one screenshot of your project running.
-* A link to a video of your raytracer running.
-* Instructions for building and running your project if they differ from the
-  base code.
-* A performance writeup as detailed above.
-* A list of all third-party code used.
-* This Readme file edited as described above in the README section.
+[Check out the video on YouTube](https://www.youtube.com/watch?v=wuN5sA_rS_E)
 
+[Check out the video on YouTube](https://www.youtube.com/watch?v=4dI_E-bop1A)
\ No newline at end of file
diff --git a/objs/cube.obj b/objs/cube.obj
deleted file mode 100644
index 1375c11..0000000
--- a/objs/cube.obj
+++ /dev/null
@@ -1,28 +0,0 @@
-v -0.1 -0.1 -0.1
-v -0.1 -0.1 0.1
-v -0.1 0.1 0.1
-v -0.1 0.1 -0.1
-v 0.1 0.1 -0.1
-v 0.1 0.1 0.1
-v 0.1 -0.1 0.1
-v 0.1 -0.1 -0.1
-
-vn 1 0 0
-vn 0 1 0
-vn 0 0 1
-vn -1 0 0
-vn 0 -1 0
-vn 0 0 -1
-
-f 1//4 2//4 4//4
-f 4//4 2//4 3//4
-f 8//6 1//6 5//6
-f 5//6 1//6 4//6
-f 4//2 3//2 5//2
-f 5//2 3//2 6//2
-f 5//1 6//1 8//1
-f 8//1 6//1 7//1
-f 2//3 7//3 3//3
-f 3//3 7//3 6//3
-f 8//5 7//5 1//5
-f 1//5 7//5 2//5
\ No newline at end of file
diff --git a/renders/ColorInterpolation.png b/renders/ColorInterpolation.png
new file mode 100644
index 0000000..a1e6d45
Binary files /dev/null and b/renders/ColorInterpolation.png differ
diff --git a/renders/ColorInterpolation2.png b/renders/ColorInterpolation2.png
new file mode 100644
index 0000000..9800bb6
Binary files /dev/null and b/renders/ColorInterpolation2.png differ
diff --git a/renders/CullingPersentage.png b/renders/CullingPersentage.png
new file mode 100644
index 0000000..1fc5d6a
Binary files /dev/null and b/renders/CullingPersentage.png differ
diff --git a/renders/NoAA.png b/renders/NoAA.png
new file mode 100644
index 0000000..0ab8828
Binary files /dev/null and b/renders/NoAA.png differ
diff --git a/renders/Performance.png b/renders/Performance.png
new file mode 100644
index 0000000..b84404f
Binary files /dev/null and b/renders/Performance.png differ
diff --git a/renders/Video1.avi b/renders/Video1.avi
new file mode 100644
index 0000000..447419d
Binary files /dev/null and b/renders/Video1.avi differ
diff --git a/renders/Video2.avi b/renders/Video2.avi
new file mode 100644
index 0000000..b891a39
Binary files /dev/null and b/renders/Video2.avi differ
diff --git a/renders/cow.png b/renders/cow.png
new file mode 100644
index 0000000..13d70e0
Binary files /dev/null and b/renders/cow.png differ
diff --git a/renders/cowHighPoly.png b/renders/cowHighPoly.png
new file mode 100644
index 0000000..56b0b12
Binary files /dev/null and b/renders/cowHighPoly.png differ
diff --git a/src/Macros.h b/src/Macros.h
new file mode 100644
index 0000000..61fb040
--- /dev/null
+++ b/src/Macros.h
@@ -0,0 +1,5 @@
+#ifndef MACROS_H
+#define MACROS_H
+#define CullingFlag	//Back-face culling
+#define AA 2		//anti-aliasing
+#endif
\ No newline at end of file
diff --git a/src/main.cpp b/src/main.cpp
index 13d8e67..7f7f6ac 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -2,6 +2,7 @@
 // Written by Yining Karl Li, Copyright (c) 2012 University of Pennsylvania
 
 #include "main.h"
+#include "rasterizeTools.h"
 
 //-------------------------------
 //-------------MAIN--------------
@@ -9,67 +10,67 @@
 
 int main(int argc, char** argv){
 
-  bool loadedScene = false;
-  for(int i=1; i<argc; i++){
-    string header; string data;
-    istringstream liness(argv[i]);
-    getline(liness, header, '='); getline(liness, data, '=');
-    if(strcmp(header.c_str(), "mesh")==0){
-      //renderScene = new scene(data);
-      mesh = new obj();
-      objLoader* loader = new objLoader(data, mesh);
-      mesh->buildVBOs();
-      delete loader;
-      loadedScene = true;
-    }
-  }
-
-  if(!loadedScene){
-    cout << "Usage: mesh=[obj file]" << endl;
-    return 0;
-  }
-
-  frame = 0;
-  seconds = time (NULL);
-  fpstracker = 0;
-
-  // Launch CUDA/GL
-  if (init(argc, argv)) {
-    // GLFW main loop
-    mainLoop();
-  }
-
-  return 0;
+	bool loadedScene = false;
+	for (int i = 1; i < argc; i++){
+		string header; string data;
+		istringstream liness(argv[i]);
+		getline(liness, header, '='); getline(liness, data, '=');
+		if (strcmp(header.c_str(), "mesh") == 0){
+			//renderScene = new scene(data);
+			mesh = new obj();
+			objLoader* loader = new objLoader(data, mesh);
+			mesh->buildVBOs();
+			delete loader;
+			loadedScene = true;
+		}
+	}
+
+	if (!loadedScene){
+		cout << "Usage: mesh=[obj file]" << endl;
+		return 0;
+	}
+
+	frame = 0;
+	seconds = time(NULL);
+	fpstracker = 0;
+
+	// Launch CUDA/GL
+	if (init(argc, argv)) {
+		// GLFW main loop
+		mainLoop();
+	}
+
+	return 0;
 }
 
 void mainLoop() {
-  while(!glfwWindowShouldClose(window)){
-    glfwPollEvents();
-    runCuda();
+	while (!glfwWindowShouldClose(window)){
+		glfwPollEvents();
+		runCuda();
 
-    time_t seconds2 = time (NULL);
+		time_t seconds2 = time(NULL);
 
-    if(seconds2-seconds >= 1){
+		if (seconds2 - seconds >= 1){
 
-        fps = fpstracker/(seconds2-seconds);
-        fpstracker = 0;
-        seconds = seconds2;
-    }
+			fps = fpstracker / (seconds2 - seconds);
+			fpstracker = 0;
+			seconds = seconds2;
+		}
 
-    string title = "CIS565 Rasterizer | " + utilityCore::convertIntToString((int)fps) + " FPS";
+		string title = "CIS565 Rasterizer | " + utilityCore::convertIntToString((int)fps) + " FPS";
 		glfwSetWindowTitle(window, title.c_str());
-    
-    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
-    glBindTexture(GL_TEXTURE_2D, displayImage);
-    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
-    glClear(GL_COLOR_BUFFER_BIT);   
-
-    // VAO, shader program, and texture already bound
-    glDrawElements(GL_TRIANGLES, 6,  GL_UNSIGNED_SHORT, 0);
-    glfwSwapBuffers(window);
-  }
-  glfwDestroyWindow(window);
-  glfwTerminate();
+
+		glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
+		glBindTexture(GL_TEXTURE_2D, displayImage);
+		glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
+		glClear(GL_COLOR_BUFFER_BIT);
+
+		// VAO, shader program, and texture already bound
+		glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_SHORT, 0);
+		glfwSwapBuffers(window);
+	}
+	glfwDestroyWindow(window);
+	glfwTerminate();
 }
 
 //-------------------------------
@@ -77,161 +78,171 @@ void mainLoop() {
 //-------------------------------
 
 void runCuda(){
-  // Map OpenGL buffer object for writing from CUDA on a single GPU
-  // No data is moved (Win & Linux). When mapped to CUDA, OpenGL should not use this buffer
-  dptr=NULL;
+	// Map OpenGL buffer object for writing from CUDA on a single GPU
+	// No data is moved (Win & Linux). When mapped to CUDA, OpenGL should not use this buffer
+	dptr = NULL;
+
+	vbo = mesh->getVBO();
+	vbosize = mesh->getVBOsize();
+
+	/*float newcbo[] = {	0.0, 1.0, 0.0,
+						0.0, 0.0, 1.0,
+						1.0, 0.0, 0.0 };*/
 
-  vbo = mesh->getVBO();
-  vbosize = mesh->getVBOsize();
+	float newcbo[] = {	1.0f, 1.0f, 1.0f,
+						1.0f, 1.0f, 1.0f,
+						1.0f, 1.0f, 1.0f };
 
-  float newcbo[] = {0.0, 1.0, 0.0, 
-                    0.0, 0.0, 1.0, 
-                    1.0, 0.0, 0.0};
-  cbo = newcbo;
-  cbosize = 9;
+	cbo = newcbo;
+	cbosize = 9;
 
-  ibo = mesh->getIBO();
-  ibosize = mesh->getIBOsize();
+	ibo = mesh->getIBO();
+	ibosize = mesh->getIBOsize();
 
-  cudaGLMapBufferObject((void**)&dptr, pbo);
-  cudaRasterizeCore(dptr, glm::vec2(width, height), frame, vbo, vbosize, cbo, cbosize, ibo, ibosize);
-  cudaGLUnmapBufferObject(pbo);
+	nbo = mesh->getNBO();
+	nbosize = mesh->getNBOsize();
+	cudaGLMapBufferObject((void**)&dptr, pbo);
+	
+	//cam.PMat = glm::perspective(cam.fov.y, float(width / height), cam.depth.x, cam.depth.y);
+	cudaRasterizeCore(dptr, cam, frame, vbo, vbosize, cbo, cbosize, ibo, ibosize,nbo,nbosize);
+	cudaGLUnmapBufferObject(pbo);
 
-  vbo = NULL;
-  cbo = NULL;
-  ibo = NULL;
+	vbo = NULL;
+	cbo = NULL;
+	ibo = NULL;
 
-  frame++;
-  fpstracker++;
+	frame++;
+	fpstracker++;
 
 }
-  
+
 //-------------------------------
 //----------SETUP STUFF----------
 //-------------------------------
 
 bool init(int argc, char* argv[]) {
-  glfwSetErrorCallback(errorCallback);
-
-  if (!glfwInit()) {
-      return false;
-  }
-
-  width = 800;
-  height = 800;
-  window = glfwCreateWindow(width, height, "CIS 565 Pathtracer", NULL, NULL);
-  if (!window){
-      glfwTerminate();
-      return false;
-  }
-  glfwMakeContextCurrent(window);
-  glfwSetKeyCallback(window, keyCallback);
-
-  // Set up GL context
-  glewExperimental = GL_TRUE;
-  if(glewInit()!=GLEW_OK){
-    return false;
-  }
-
-  // Initialize other stuff
-  initVAO();
-  initTextures();
-  initCuda();
-  initPBO();
-  
-  GLuint passthroughProgram;
-  passthroughProgram = initShader();
-
-  glUseProgram(passthroughProgram);
-  glActiveTexture(GL_TEXTURE0);
-
-  return true;
+	glfwSetErrorCallback(errorCallback);
+
+	if (!glfwInit()) {
+		return false;
+	}
+
+	width = 800;
+	height = 800;
+	window = glfwCreateWindow(width, height, "CIS 565 Rasterizer", NULL, NULL);
+	if (!window){
+		glfwTerminate();
+		return false;
+	}
+	glfwMakeContextCurrent(window);
+	glfwSetKeyCallback(window, keyCallback);
+	glfwSetMouseButtonCallback(window,mouseClick);
+	glfwSetScrollCallback(window, mouseWheel);
+	// Set up GL context
+	glewExperimental = GL_TRUE;
+	if (glewInit() != GLEW_OK){
+		return false;
+	}
+
+	// Initialize other stuff
+	initVAO();
+	initTextures();
+	initCuda();
+	initPBO();
+
+	GLuint passthroughProgram;
+	passthroughProgram = initShader();
+
+	glUseProgram(passthroughProgram);
+	glActiveTexture(GL_TEXTURE0);
+
+	return true;
 }
 
 void initPBO(){
-  // set up vertex data parameter
-  int num_texels = width*height;
-  int num_values = num_texels * 4;
-  int size_tex_data = sizeof(GLubyte) * num_values;
-    
-  // Generate a buffer ID called a PBO (Pixel Buffer Object)
-  glGenBuffers(1, &pbo);
+	// set up vertex data parameter
+	int num_texels = width*height;
+	int num_values = num_texels * 4;
+	int size_tex_data = sizeof(GLubyte) * num_values;
+
+	// Generate a buffer ID called a PBO (Pixel Buffer Object)
+	glGenBuffers(1, &pbo);
 
-  // Make this the current UNPACK buffer (OpenGL is state-based)
-  glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
+	// Make this the current UNPACK buffer (OpenGL is state-based)
+	glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
 
-  // Allocate data for the buffer. 4-channel 8-bit image
-  glBufferData(GL_PIXEL_UNPACK_BUFFER, size_tex_data, NULL, GL_DYNAMIC_COPY);
-  cudaGLRegisterBufferObject(pbo);
+	// Allocate data for the buffer. 4-channel 8-bit image
+	glBufferData(GL_PIXEL_UNPACK_BUFFER, size_tex_data, NULL, GL_DYNAMIC_COPY);
+	cudaGLRegisterBufferObject(pbo);
 
 }
 
 void initCuda(){
-  // Use device with highest Gflops/s
-  cudaGLSetGLDevice(0);
+	// Use CUDA device 0 for OpenGL interop
+	cudaGLSetGLDevice(0);
 
-  // Clean up on program exit
-  atexit(cleanupCuda);
+	// Clean up on program exit
+	atexit(cleanupCuda);
 }
 
 void initTextures(){
-    glGenTextures(1, &displayImage);
-    glBindTexture(GL_TEXTURE_2D, displayImage);
-    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
-    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
-    glTexImage2D( GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0, GL_BGRA,
-        GL_UNSIGNED_BYTE, NULL);
+	glGenTextures(1, &displayImage);
+	glBindTexture(GL_TEXTURE_2D, displayImage);
+	glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
+	glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
+	glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0, GL_BGRA,
+		GL_UNSIGNED_BYTE, NULL);
 }
 
 void initVAO(void){
-    GLfloat vertices[] =
-    { 
-        -1.0f, -1.0f, 
-         1.0f, -1.0f, 
-         1.0f,  1.0f, 
-        -1.0f,  1.0f, 
-    };
-
-    GLfloat texcoords[] = 
-    { 
-        1.0f, 1.0f,
-        0.0f, 1.0f,
-        0.0f, 0.0f,
-        1.0f, 0.0f
-    };
-
-    GLushort indices[] = { 0, 1, 3, 3, 1, 2 };
-
-    GLuint vertexBufferObjID[3];
-    glGenBuffers(3, vertexBufferObjID);
-    
-    glBindBuffer(GL_ARRAY_BUFFER, vertexBufferObjID[0]);
-    glBufferData(GL_ARRAY_BUFFER, sizeof(vertices), vertices, GL_STATIC_DRAW);
-    glVertexAttribPointer((GLuint)positionLocation, 2, GL_FLOAT, GL_FALSE, 0, 0); 
-    glEnableVertexAttribArray(positionLocation);
-
-    glBindBuffer(GL_ARRAY_BUFFER, vertexBufferObjID[1]);
-    glBufferData(GL_ARRAY_BUFFER, sizeof(texcoords), texcoords, GL_STATIC_DRAW);
-    glVertexAttribPointer((GLuint)texcoordsLocation, 2, GL_FLOAT, GL_FALSE, 0, 0);
-    glEnableVertexAttribArray(texcoordsLocation);
-
-    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, vertexBufferObjID[2]);
-    glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(indices), indices, GL_STATIC_DRAW);
+	GLfloat vertices[] =
+	{
+		-1.0f, -1.0f,
+		1.0f, -1.0f,
+		1.0f, 1.0f,
+		-1.0f, 1.0f,
+	};
+
+	GLfloat texcoords[] =
+	{
+		1.0f, 1.0f,
+		0.0f, 1.0f,
+		0.0f, 0.0f,
+		1.0f, 0.0f
+	};
+
+	GLushort indices[] = { 0, 1, 3, 3, 1, 2 };
+
+	GLuint vertexBufferObjID[3];
+	glGenBuffers(3, vertexBufferObjID);
+
+	glBindBuffer(GL_ARRAY_BUFFER, vertexBufferObjID[0]);
+	glBufferData(GL_ARRAY_BUFFER, sizeof(vertices), vertices, GL_STATIC_DRAW);
+	glVertexAttribPointer((GLuint)positionLocation, 2, GL_FLOAT, GL_FALSE, 0, 0);
+	glEnableVertexAttribArray(positionLocation);
+
+	glBindBuffer(GL_ARRAY_BUFFER, vertexBufferObjID[1]);
+	glBufferData(GL_ARRAY_BUFFER, sizeof(texcoords), texcoords, GL_STATIC_DRAW);
+	glVertexAttribPointer((GLuint)texcoordsLocation, 2, GL_FLOAT, GL_FALSE, 0, 0);
+	glEnableVertexAttribArray(texcoordsLocation);
+
+	glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, vertexBufferObjID[2]);
+	glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(indices), indices, GL_STATIC_DRAW);
 }
 
 
 GLuint initShader() {
-  const char *attribLocations[] = { "Position", "Tex" };
-  GLuint program = glslUtility::createDefaultProgram(attribLocations, 2);
-  GLint location;
-  
-  glUseProgram(program);
-  if ((location = glGetUniformLocation(program, "u_image")) != -1)
-  {
-    glUniform1i(location, 0);
-  }
-
-  return program;
+	const char *attribLocations[] = { "Position", "Tex" };
+	GLuint program = glslUtility::createDefaultProgram(attribLocations, 2);
+	GLint location;
+
+	glUseProgram(program);
+	if ((location = glGetUniformLocation(program, "u_image")) != -1)
+	{
+		glUniform1i(location, 0);
+	}
+
+	return program;
 }
 
 //-------------------------------
@@ -239,34 +250,34 @@ GLuint initShader() {
 //-------------------------------
 
 void cleanupCuda(){
-  if(pbo) deletePBO(&pbo);
-  if(displayImage) deleteTexture(&displayImage);
+	if (pbo) deletePBO(&pbo);
+	if (displayImage) deleteTexture(&displayImage);
 }
 
 void deletePBO(GLuint* pbo){
-  if (pbo) {
-    // unregister this buffer object with CUDA
-    cudaGLUnregisterBufferObject(*pbo);
-    
-    glBindBuffer(GL_ARRAY_BUFFER, *pbo);
-    glDeleteBuffers(1, pbo);
-    
-    *pbo = (GLuint)NULL;
-  }
+	if (pbo) {
+		// unregister this buffer object with CUDA
+		cudaGLUnregisterBufferObject(*pbo);
+
+		glBindBuffer(GL_ARRAY_BUFFER, *pbo);
+		glDeleteBuffers(1, pbo);
+
+		*pbo = (GLuint)NULL;
+	}
 }
 
 void deleteTexture(GLuint* tex){
-    glDeleteTextures(1, tex);
-    *tex = (GLuint)NULL;
+	glDeleteTextures(1, tex);
+	*tex = (GLuint)NULL;
 }
- 
+
 void shut_down(int return_code){
-  kernelCleanup();
-  cudaDeviceReset();
-  #ifdef __APPLE__
-  glfwTerminate();
-  #endif
-  exit(return_code);
+	kernelCleanup();
+	cudaDeviceReset();
+#ifdef __APPLE__
+	glfwTerminate();
+#endif
+	exit(return_code);
 }
 
 //------------------------------
@@ -274,11 +285,81 @@ void shut_down(int return_code){
 //------------------------------
 
 void errorCallback(int error, const char* description){
-    fputs(description, stderr);
+	fputs(description, stderr);
 }
 
 void keyCallback(GLFWwindow* window, int key, int scancode, int action, int mods){
-    if(key == GLFW_KEY_ESCAPE && action == GLFW_PRESS){
-        glfwSetWindowShouldClose(window, GL_TRUE);
-    }
-}
\ No newline at end of file
+	if (key == GLFW_KEY_ESCAPE && action == GLFW_PRESS){
+		glfwSetWindowShouldClose(window, GL_TRUE);
+	}
+}
+
+void mouseClick(GLFWwindow* window, int button, int action, int mods)
+{
+	double xpos, ypos;
+	glfwGetCursorPos(window, &xpos, &ypos);
+	double dx, dy;
+	dx = (double)(xpos - mouse_old_x);
+	dy = (double)(ypos - mouse_old_y);
+	if (action == GLFW_RELEASE && button == GLFW_MOUSE_BUTTON_LEFT){
+		viewPhi += dx * 0.002;
+		viewTheta -= dy * 0.002;
+		viewTheta = glm::clamp(viewTheta, float(1e-6), float(PI - (1e-6)));
+		cam.pos = glm::vec3(r*sin(viewTheta)*cos(viewPhi), r*cos(viewTheta), r*sin(viewTheta)*sin(viewPhi));
+		cam.view = glm::normalize(-cam.pos);
+		cam.update();
+	}
+	else if (action == GLFW_RELEASE && button == GLFW_MOUSE_BUTTON_RIGHT){
+		cam.pos.z += 0.002*dy;
+		cam.pos.z += 0.002*dy;
+		cam.update();
+
+	}
+
+	mouse_old_x = xpos;
+	mouse_old_y = ypos;
+	//mouse_old_x = x;
+	//mouse_old_y = y;
+}
+
+void mouseWheel(GLFWwindow* window, double x, double y){
+	r -= y>0 ? 0.1f : -0.1f;
+	r = glm::clamp(r, cam.depth.x, cam.depth.y);
+	cam.pos = glm::vec3(r*sin(viewTheta)*cos(viewPhi), r*cos(viewTheta), r*sin(viewTheta)*sin(viewPhi));
+	cam.view = glm::normalize(-cam.pos);
+	cam.update();
+}
+
+void mouseMotion(int x, int y)
+{
+	float dx, dy;
+	dx = (float)(x - mouse_old_x);
+	dy = (float)(y - mouse_old_y);
+
+	if (button_mask & 0x01)
+	{// left button
+		viewPhi -= dx * 0.002f;
+		viewTheta -= dy * 0.002f;
+		viewTheta = glm::clamp(viewTheta, float(1e-6), float(PI - (1e-6)));
+		cam.pos = glm::vec3(r*sin(viewTheta)*sin(viewPhi), r*cos(viewTheta) + (cam.view.y*cam.depth.x+cam.pos.y), r*sin(viewTheta)*cos(viewPhi));
+		cam.update();
+		
+	}
+	if (button_mask & 0x02)
+	{// middle button
+		cam.pos.y += 0.02f*dy;
+		cam.pos.y += 0.02f*dy;
+		cam.update();
+	}
+
+	mouse_old_x = x;
+	mouse_old_y = y;
+}
+
+//void mouseWheel(int button, int dir, int x, int y)
+//{
+//	r -= dir>0 ? 0.1f : -0.1f;
+//	r = glm::clamp(r, cam.depth.x, cam.depth.y);
+//	cam.pos = glm::vec3(r*sin(viewTheta)*sin(viewPhi), r*cos(viewTheta) + (cam.view.y*cam.depth.x + cam.pos.y), r*sin(viewTheta)*cos(viewPhi));
+//	cam.update();
+//}
\ No newline at end of file
diff --git a/src/main.h b/src/main.h
index 8999110..8d9ad7a 100644
--- a/src/main.h
+++ b/src/main.h
@@ -7,6 +7,9 @@
 #include <GL/glew.h>
 #include <GLFW/glfw3.h>
 
+#include <GL/glew.h>
+#include <GL/freeglut.h>
+
 #include <cuda_runtime.h>
 #include <cuda_gl_interop.h>
 #include <fstream>
@@ -50,12 +53,14 @@ int cbosize;
 int* ibo;
 int ibosize;
 
+float* nbo;
+int nbosize;
 //-------------------------------
 //----------CUDA STUFF-----------
 //-------------------------------
 
 int width = 800; int height = 800;
-
+Camera cam(glm::vec2(width, height));
 //-------------------------------
 //-------------MAIN--------------
 //-------------------------------
@@ -68,11 +73,23 @@ int main(int argc, char** argv);
 
 void runCuda();
 
+//-------------------------------
+//----------Mouse Control--------
+//-------------------------------
+
+double mouse_old_x, mouse_old_y;
+unsigned char button_mask = 0x00;
+
+float viewPhi = 0.0f;
+float viewTheta = PI/2.0f;
+float r = glm::length(cam.pos);
+
 #ifdef __APPLE__
-	void display();
+void display();
 #else
-	void display();
-	void keyboard(unsigned char key, int x, int y);
+void display();
+void keyboard(unsigned char key, int x, int y);
+
 #endif
 
 //-------------------------------
@@ -99,5 +116,8 @@ void deleteTexture(GLuint* tex);
 void mainLoop();
 void errorCallback(int error, const char *description);
 void keyCallback(GLFWwindow *window, int key, int scancode, int action, int mods);
-
+void mouseClick(GLFWwindow* window, int button, int action, int mods);
+void mouseMotion(int x, int y);
+//void mouseWheel(int button, int dir, int x, int y);
+void mouseWheel(GLFWwindow* window, double x, double y);
 #endif
\ No newline at end of file
diff --git a/src/rasterizeKernels.cu b/src/rasterizeKernels.cu
index 10b0000..c796f2f 100644
--- a/src/rasterizeKernels.cu
+++ b/src/rasterizeKernels.cu
@@ -1,266 +1,578 @@
 // CIS565 CUDA Rasterizer: A simple rasterization pipeline for Patrick Cozzi's CIS565: GPU Computing at the University of Pennsylvania
 // Written by Yining Karl Li, Copyright (c) 2012 University of Pennsylvania
 
+
 #include <stdio.h>
 #include <cuda.h>
 #include <cmath>
 #include <thrust/random.h>
+#include <thrust/device_ptr.h>
+#include <thrust/device_vector.h>
+#include <thrust/remove.h>
 #include "rasterizeKernels.h"
-#include "rasterizeTools.h"
+//#include "rasterizeTools.h"
+
+#include "Macros.h"
 
 glm::vec3* framebuffer;
 fragment* depthbuffer;
 float* device_vbo;
+float* locvbo;
 float* device_cbo;
+float* device_nbo;
 int* device_ibo;
+float* dBuff;
+int* dBuffLock;
 triangle* primitives;
 
 void checkCUDAError(const char *msg) {
-  cudaError_t err = cudaGetLastError();
-  if( cudaSuccess != err) {
-    fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) ); 
-    exit(EXIT_FAILURE); 
-  }
-} 
+	cudaError_t err = cudaGetLastError();
+	if (cudaSuccess != err) {
+		fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err));
+		exit(EXIT_FAILURE);
+	}
+}
+
+__device__ bool DevepsilonCheck(float a, float b){
+	if (fabsf(a - b) < 1e-9f){ // compare the values themselves, so that a and -a are not treated as equal
+		return true;
+	}
+	else{
+		return false;
+	}
+}
 
 //Handy dandy little hashing function that provides seeds for random number generation
 __host__ __device__ unsigned int hash(unsigned int a){
-    a = (a+0x7ed55d16) + (a<<12);
-    a = (a^0xc761c23c) ^ (a>>19);
-    a = (a+0x165667b1) + (a<<5);
-    a = (a+0xd3a2646c) ^ (a<<9);
-    a = (a+0xfd7046c5) + (a<<3);
-    a = (a^0xb55a4f09) ^ (a>>16);
-    return a;
+	a = (a + 0x7ed55d16) + (a << 12);
+	a = (a ^ 0xc761c23c) ^ (a >> 19);
+	a = (a + 0x165667b1) + (a << 5);
+	a = (a + 0xd3a2646c) ^ (a << 9);
+	a = (a + 0xfd7046c5) + (a << 3);
+	a = (a ^ 0xb55a4f09) ^ (a >> 16);
+	return a;
 }
 
 //Writes a given fragment to a fragment buffer at a given location
 __host__ __device__ void writeToDepthbuffer(int x, int y, fragment frag, fragment* depthbuffer, glm::vec2 resolution){
-  if(x<resolution.x && y<resolution.y){
-    int index = (y*resolution.x) + x;
-    depthbuffer[index] = frag;
-  }
+	if (x < resolution.x && y < resolution.y){
+		int index = (y*resolution.x) + x;
+		depthbuffer[index] = frag;
+	}
 }
 
 //Reads a fragment from a given location in a fragment buffer
 __host__ __device__ fragment getFromDepthbuffer(int x, int y, fragment* depthbuffer, glm::vec2 resolution){
-  if(x<resolution.x && y<resolution.y){
-    int index = (y*resolution.x) + x;
-    return depthbuffer[index];
-  }else{
-    fragment f;
-    return f;
-  }
+	if (x < resolution.x && y < resolution.y){
+		int index = (y*resolution.x) + x;
+		return depthbuffer[index];
+	}
+	else{
+		fragment f;
+		return f;
+	}
 }
 
 //Writes a given pixel to a pixel buffer at a given location
 __host__ __device__ void writeToFramebuffer(int x, int y, glm::vec3 value, glm::vec3* framebuffer, glm::vec2 resolution){
-  if(x<resolution.x && y<resolution.y){
-    int index = (y*resolution.x) + x;
-    framebuffer[index] = value;
-  }
+	if (x < resolution.x && y < resolution.y){
+		int index = (y*resolution.x) + x;
+		framebuffer[index] = value;
+	}
 }
 
 //Reads a pixel from a pixel buffer at a given location
 __host__ __device__ glm::vec3 getFromFramebuffer(int x, int y, glm::vec3* framebuffer, glm::vec2 resolution){
-  if(x<resolution.x && y<resolution.y){
-    int index = (y*resolution.x) + x;
-    return framebuffer[index];
-  }else{
-    return glm::vec3(0,0,0);
-  }
+	if (x < resolution.x && y < resolution.y){
+		int index = (y*resolution.x) + x;
+		return framebuffer[index];
+	}
+	else{
+		return glm::vec3(0, 0, 0);
+	}
+}
+
+__host__ __device__ float lcalculateSignedArea(triangle tri){
+	return 0.5*((tri.p1.x - tri.p0.x)*(tri.p2.y - tri.p0.y) - (tri.p2.x - tri.p0.x)*(tri.p1.y - tri.p0.y));
+}
+
+__host__ __device__ float lcalculateBarycentricCoordinateValue(glm::vec2 a, glm::vec2 b, glm::vec2 c, triangle tri){
+	triangle baryTri;
+	baryTri.p0 = glm::vec3(a, 0); baryTri.p1 = glm::vec3(b, 0); baryTri.p2 = glm::vec3(c, 0);
+	return lcalculateSignedArea(baryTri) / lcalculateSignedArea(tri);
+}
+
+__host__ __device__ glm::vec3 lcalculateBarycentricCoordinate(triangle tri, glm::vec2 point){
+	float beta = lcalculateBarycentricCoordinateValue(glm::vec2(tri.p0.x, tri.p0.y), point, glm::vec2(tri.p2.x, tri.p2.y), tri);
+	float gamma = lcalculateBarycentricCoordinateValue(glm::vec2(tri.p0.x, tri.p0.y), glm::vec2(tri.p1.x, tri.p1.y), point, tri);
+	float alpha = 1.0 - beta - gamma;
+	return glm::vec3(alpha, beta, gamma);
+}
+
+__host__ __device__ bool lisBarycentricCoordInBounds(glm::vec3 barycentricCoord){
+	return barycentricCoord.x >= 0.0 && barycentricCoord.x <= 1.0 &&
+		barycentricCoord.y >= 0.0 && barycentricCoord.y <= 1.0 &&
+		barycentricCoord.z >= 0.0 && barycentricCoord.z <= 1.0;
+}
+
+__host__ __device__ void lgetAABBForTriangle(triangle tri, glm::vec3& minpoint, glm::vec3& maxpoint){
+	minpoint = glm::vec3(min(min(tri.p0.x, tri.p1.x), tri.p2.x),
+		min(min(tri.p0.y, tri.p1.y), tri.p2.y),
+		min(min(tri.p0.z, tri.p1.z), tri.p2.z));
+	maxpoint = glm::vec3(max(max(tri.p0.x, tri.p1.x), tri.p2.x),
+		max(max(tri.p0.y, tri.p1.y), tri.p2.y),
+		max(max(tri.p0.z, tri.p1.z), tri.p2.z));
 }
 
 //Kernel that clears a given pixel buffer with a given color
 __global__ void clearImage(glm::vec2 resolution, glm::vec3* image, glm::vec3 color){
-    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
-    int y = (blockIdx.y * blockDim.y) + threadIdx.y;
-    int index = x + (y * resolution.x);
-    if(x<=resolution.x && y<=resolution.y){
-      image[index] = color;
-    }
+	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+	int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+	int index = x + (y * resolution.x);
+	if (x < resolution.x && y < resolution.y){ // strict bounds: x == width or y == height would index outside the image
+		image[index] = color;
+	}
 }
 
 //Kernel that clears a given fragment buffer with a given fragment
-__global__ void clearDepthBuffer(glm::vec2 resolution, fragment* buffer, fragment frag){
-    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
-    int y = (blockIdx.y * blockDim.y) + threadIdx.y;
-    int index = x + (y * resolution.x);
-    if(x<=resolution.x && y<=resolution.y){
-      fragment f = frag;
-      f.position.x = x;
-      f.position.y = y;
-      buffer[index] = f;
-    }
+__global__ void clearDepthBuffer(Camera cam, fragment* buffer, fragment frag,float* dBuff, int* dBuffLock){
+	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+	int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+	glm::vec2 resolution = cam.reso;
+	int index = x + (y * resolution.x);
+	if (x < resolution.x && y < resolution.y){ // strict bounds: x == width or y == height would index outside the buffers
+		fragment f = frag;
+		f.position.x = x;
+		f.position.y = y;
+		buffer[index] = f;
+
+		dBuff[index] = cam.depth.y;
+		dBuffLock[index] = 0;
+	}
 }
 
 //Kernel that writes the image to the OpenGL PBO directly. 
 __global__ void sendImageToPBO(uchar4* PBOpos, glm::vec2 resolution, glm::vec3* image){
-  
-  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
-  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
-  int index = x + (y * resolution.x);
-  
-  if(x<=resolution.x && y<=resolution.y){
-
-      glm::vec3 color;      
-      color.x = image[index].x*255.0;
-      color.y = image[index].y*255.0;
-      color.z = image[index].z*255.0;
-
-      if(color.x>255){
-        color.x = 255;
-      }
-
-      if(color.y>255){
-        color.y = 255;
-      }
-
-      if(color.z>255){
-        color.z = 255;
-      }
-      
-      // Each thread writes one pixel location in the texture (textel)
-      PBOpos[index].w = 0;
-      PBOpos[index].x = color.x;     
-      PBOpos[index].y = color.y;
-      PBOpos[index].z = color.z;
-  }
+
+	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+	int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+	int index = x + (y * resolution.x);
+
+	if (x < resolution.x && y < resolution.y){ // strict bounds: x == width or y == height would index outside the PBO
+
+		glm::vec3 color;
+		color.x = image[index].x*255.0;
+		color.y = image[index].y*255.0;
+		color.z = image[index].z*255.0;
+
+		if (color.x > 255){
+			color.x = 255;
+		}
+
+		if (color.y > 255){
+			color.y = 255;
+		}
+
+		if (color.z > 255){
+			color.z = 255;
+		}
+
+		// Each thread writes one pixel location in the texture (texel)
+		PBOpos[index].w = 0;
+		PBOpos[index].x = color.x;
+		PBOpos[index].y = color.y;
+		PBOpos[index].z = color.z;
+	}
 }
 
 //TODO: Implement a vertex shader
-__global__ void vertexShadeKernel(float* vbo, int vbosize){
-  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
-  if(index<vbosize/3){
-  }
+__global__ void vertexShadeKernel(Camera cam, float* vbo, int vbosize, float* nbo, int nbosize, float* locvbo){
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+	if (index < vbosize / 3){
+		glm::vec4 v(vbo[3 * index], vbo[3 * index + 1], vbo[3 * index + 2], 1.0f);
+		glm::vec4 n(nbo[3 * index], nbo[3 * index + 1], nbo[3 * index + 2], 0.0f); // normals are directions, so w = 0
+		v = cam.World2LocalMat*v;
+		locvbo[3 * index] = v.x;
+		locvbo[3 * index + 1] = v.y;
+		locvbo[3 * index + 2] = v.z;
+
+		v = cam.PMat*v;
+		v /= v.w;
+		vbo[3 * index] = cam.reso.x*.5f*(v.x + 1.0f);
+		vbo[3 * index + 1] = cam.reso.y*.5f*(v.y + 1.0f);
+		vbo[3 * index + 2] = (cam.depth.y - cam.depth.x)*.5f*v.z + (cam.depth.y + cam.depth.x)*.5f;
+
+		n = glm::transpose(cam.local2WorldMat)*n;
+		nbo[3 * index] = n.x;
+		nbo[3 * index + 1] = n.y;
+		nbo[3 * index + 2] = n.z;
+	}
 }
 
 //TODO: Implement primative assembly
-__global__ void primitiveAssemblyKernel(float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize, triangle* primitives){
-  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
-  int primitivesCount = ibosize/3;
-  if(index<primitivesCount){
-  }
+__global__ void primitiveAssemblyKernel(float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize, triangle* primitives, float * locvbo, float*nbo,Camera cam){
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+	int primitivesCount = ibosize / 3;
+	if (index < primitivesCount){
+		//primitive assign (screen-space positions)
+		primitives[index].p0.x = vbo[3 * ibo[3 * index]];
+		primitives[index].p0.y = vbo[3 * ibo[3 * index] + 1];
+		primitives[index].p0.z = vbo[3 * ibo[3 * index] + 2];
+
+		primitives[index].p1.x = vbo[3 * ibo[3 * index + 1]];
+		primitives[index].p1.y = vbo[3 * ibo[3 * index + 1] + 1];
+		primitives[index].p1.z = vbo[3 * ibo[3 * index + 1] + 2];
+
+		primitives[index].p2.x = vbo[3 * ibo[3 * index + 2]];
+		primitives[index].p2.y = vbo[3 * ibo[3 * index + 2] + 1];
+		primitives[index].p2.z = vbo[3 * ibo[3 * index + 2] + 2];
+		//local coordinates assign
+		primitives[index].locp0.x = locvbo[3 * ibo[3 * index]];
+		primitives[index].locp0.y = locvbo[3 * ibo[3 * index] + 1];
+		primitives[index].locp0.z = locvbo[3 * ibo[3 * index] + 2];
+
+		primitives[index].locp1.x = locvbo[3 * ibo[3 * index + 1]];
+		primitives[index].locp1.y = locvbo[3 * ibo[3 * index + 1] + 1];
+		primitives[index].locp1.z = locvbo[3 * ibo[3 * index + 1] + 2];
+
+		primitives[index].locp2.x = locvbo[3 * ibo[3 * index + 2]];
+		primitives[index].locp2.y = locvbo[3 * ibo[3 * index + 2] + 1];
+		primitives[index].locp2.z = locvbo[3 * ibo[3 * index + 2] + 2];
+		//normal assign
+		primitives[index].locn0.x = nbo[3 * ibo[3 * index]];
+		primitives[index].locn0.y = nbo[3 * ibo[3 * index] + 1];
+		primitives[index].locn0.z = nbo[3 * ibo[3 * index] + 2];
+
+		primitives[index].locn1.x = nbo[3 * ibo[3 * index + 1]];
+		primitives[index].locn1.y = nbo[3 * ibo[3 * index + 1] + 1];
+		primitives[index].locn1.z = nbo[3 * ibo[3 * index + 1] + 2];
+
+		primitives[index].locn2.x = nbo[3 * ibo[3 * index + 2]];
+		primitives[index].locn2.y = nbo[3 * ibo[3 * index + 2] + 1];
+		primitives[index].locn2.z = nbo[3 * ibo[3 * index + 2] + 2];
+
+		//color assign
+		primitives[index].c0.x = cbo[0];
+		primitives[index].c0.y = cbo[1];
+		primitives[index].c0.z = cbo[2];
+
+		primitives[index].c1.x = cbo[3];
+		primitives[index].c1.y = cbo[4];
+		primitives[index].c1.z = cbo[5];
+
+		primitives[index].c2.x = cbo[6];
+		primitives[index].c2.y = cbo[7];
+		primitives[index].c2.z = cbo[8];
+
+		primitives[index].CFlag = false;
+
+#ifdef CullingFlag
+		if (lcalculateSignedArea(primitives[index])< -1e-6) primitives[index].CFlag = true; // back facing triangles
+		else    // triangles totally outside of screen
+		{
+			glm::vec3 tMin, tMax;
+			glm::vec2 resolution(cam.reso);
+			lgetAABBForTriangle(primitives[index], tMin, tMax);
+			if (tMin.x > resolution.x ||
+				tMin.y > resolution.y ||
+				tMin.z > cam.depth.y ||
+				tMax.x < 0 ||
+				tMax.y < 0 ||
+				tMax.z < cam.depth.x)
+				primitives[index].CFlag = true;
+		}
+#endif
+	}
 }
 
+
+
+
+
 //TODO: Implement a rasterization method, such as scanline.
-__global__ void rasterizationKernel(triangle* primitives, int primitivesCount, fragment* depthbuffer, glm::vec2 resolution){
-  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
-  if(index<primitivesCount){
-  }
+__global__ void rasterizationKernel(triangle* primitives, int primitivesCount, fragment* depthbuffer, Camera cam, float* dBuff, int* dBuffLock){
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+	if (index < primitivesCount){
+		glm::vec2 resolution = cam.reso;
+		triangle curTri = primitives[index];
+
+		if (DevepsilonCheck(lcalculateSignedArea(curTri), 0.0f)) return;
+		else{
+			glm::vec3 tMin, tMax;
+			lgetAABBForTriangle(curTri, tMin, tMax);
+
+			if (tMin.x > resolution.x ||
+				tMin.y > resolution.y ||
+				tMin.z > cam.depth.y ||
+				tMax.x < 0 ||
+				tMax.y < 0 ||
+				tMax.z < cam.depth.x)
+				return;
+
+			glm::vec2 PC;
+			float depth;
+			glm::vec3 BC;
+			int pixelIndex;
+			for (int j = max(int(tMin.y), 0); j < min(int(tMax.y + 1), int(resolution.y)); j++)
+			{
+				glm::vec2 Q0(tMin.x, float(j + 0.5));
+				glm::vec2 Q1(tMax.x, float(j + 0.5));
+				glm::vec2 u = Q1 - Q0;
+				float s;
+				float t;
+				float minS = 1.0f, maxS = 0.0f;
+
+				glm::vec3 p( curTri.p1 - curTri.p0);
+				glm::vec2 v0(p.x,p.y);
+				p = curTri.p2 - curTri.p1;
+				glm::vec2 v1(p.x, p.y);
+				p = curTri.p0 - curTri.p2;
+				glm::vec2 v2(p.x,p.y);
+
+				glm::vec2 w;
+				if (!DevepsilonCheck(u.x*v0.y - u.y*v0.x, 0)) // not parallel
+				{
+					w = Q0 - glm::vec2(curTri.p0.x, curTri.p0.y);
+					s = (v0.y*w.x - v0.x*w.y) / (v0.x*u.y - v0.y*u.x);
+					t = (u.x*w.y - u.y*w.x) / (u.x*v0.y - u.y*v0.x);
+					if (s > -1e-6 && s < 1 + 1e-6 && t > -1e-6 && t < 1 + 1e-6)
+					{
+						minS = fminf(s, minS);
+						maxS = fmaxf(s, maxS);
+					}
+				}
+
+				if (!DevepsilonCheck(u.x*v1.y - u.y*v1.x, 0)) // not parallel
+				{
+					w = Q0 - glm::vec2(curTri.p1.x, curTri.p1.y);
+					s = (v1.y*w.x - v1.x*w.y) / (v1.x*u.y - v1.y*u.x);
+					t = (u.x*w.y - u.y*w.x) / (u.x*v1.y - u.y*v1.x);
+					if (s > -1e-6 && s < 1 + 1e-6 && t > -1e-6 && t < 1 + 1e-6)
+					{
+						minS = fminf(s, minS);
+						maxS = fmaxf(s, maxS);
+					}
+				}
+
+				if (!DevepsilonCheck(u.x*v2.y - u.y*v2.x, 0)) // not parallel
+				{
+					w = Q0 - glm::vec2(curTri.p2.x, curTri.p2.y);
+					s = (v2.y*w.x - v2.x*w.y) / (v2.x*u.y - v2.y*u.x);
+					t = (u.x*w.y - u.y*w.x) / (u.x*v2.y - u.y*v2.x);
+					if (s > -1e-6 && s < 1 + 1e-6 && t > -1e-6 && t < 1 + 1e-6)
+					{
+						minS = fminf(s, minS);
+						maxS = fmaxf(s, maxS);
+					}
+				}
+
+				for (int i = max(int(tMin.x + minS * u.x), 0); i < min(int(tMin.x + maxS * u.x + 1), int(resolution.x)); i++)
+				{
+					PC = glm::vec2(float(i + 0.5), float(j + 0.5));
+					BC = lcalculateBarycentricCoordinate(curTri, PC);
+					depth = BC.x * curTri.p0.z + BC.y * curTri.p1.z + BC.z * curTri.p2.z;
+					pixelIndex = resolution.x - 1 - i + ((resolution.y - 1 - j) * resolution.x);
+
+					if (lisBarycentricCoordInBounds(BC) && depth > cam.depth.x && depth < cam.depth.y)
+					{
+						bool wait = true;
+
+						while (wait)
+						{
+							if (0 == atomicExch(&dBuffLock[pixelIndex], 1))
+							{
+
+								if (depth < dBuff[pixelIndex])
+								{
+									dBuff[pixelIndex] = depth;
+
+									depthbuffer[pixelIndex].position = BC.x * curTri.locp0 + BC.y * curTri.locp1 + BC.z * curTri.locp2;
+									depthbuffer[pixelIndex].normal = BC.x * curTri.locn0 + BC.y * curTri.locn1 + BC.z * curTri.locn2;
+									depthbuffer[pixelIndex].color = BC.x * curTri.c0 + BC.y * curTri.c1 + BC.z * curTri.c2;
+								}
+								__threadfence(); atomicExch(&dBuffLock[pixelIndex], 0); // make the depth/fragment writes visible, then release the lock
+								wait = false;
+							}
+						}
+					}
+				}
+			}
+
+
+		}
+	}
 }
 
 //TODO: Implement a fragment shader
-__global__ void fragmentShadeKernel(fragment* depthbuffer, glm::vec2 resolution){
-  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
-  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
-  int index = x + (y * resolution.x);
-  if(x<=resolution.x && y<=resolution.y){
-  }
+__global__ void fragmentShadeKernel(fragment* depthbuffer, glm::vec2 resolution, glm::vec3 lPos){
+	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+	int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+	int index = x + (y * resolution.x);
+	if (x < resolution.x && y < resolution.y){ // strict bounds: x == width or y == height would index outside the buffer
+		float specular = 10.0;
+		float ka = 0.3;
+		float kd = 0.7;
+		float ks = 0.1;
+		fragment curFrag = depthbuffer[index];
+
+		glm::vec3 L = glm::normalize(lPos - curFrag.position);
+		glm::vec3 normal = glm::normalize(curFrag.normal);
+		float diffuseTerm = glm::clamp(glm::dot(normal, L), 0.0f, 1.0f);
+
+		glm::vec3 V = glm::normalize(-curFrag.position);
+		glm::vec3 H = glm::normalize(L + V); // Blinn-Phong half vector must be unit length
+
+		float specularTerm = pow(fmaxf(glm::dot(normal, H), 0.0f), specular);
+
+		depthbuffer[index].color = ka*curFrag.color + kd*curFrag.color*diffuseTerm + ks*specularTerm;
+	}
 }
 
+__global__ void AntiAliasing(){}
 //Writes fragment colors to the framebuffer
 __global__ void render(glm::vec2 resolution, fragment* depthbuffer, glm::vec3* framebuffer){
 
-  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
-  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
-  int index = x + (y * resolution.x);
+	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+	int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+	int index = x + (y * resolution.x);
 
-  if(x<=resolution.x && y<=resolution.y){
-    framebuffer[index] = depthbuffer[index].color;
-  }
+	if (x < resolution.x && y < resolution.y){ // strict bounds: x == width or y == height would index outside the buffers
+		framebuffer[index] = depthbuffer[index].color;
+	}
 }
 
 // Wrapper for the __global__ call that sets up the kernel calls and does a ton of memory management
-void cudaRasterizeCore(uchar4* PBOpos, glm::vec2 resolution, float frame, float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize){
-
-  // set up crucial magic
-  int tileSize = 8;
-  dim3 threadsPerBlock(tileSize, tileSize);
-  dim3 fullBlocksPerGrid((int)ceil(float(resolution.x)/float(tileSize)), (int)ceil(float(resolution.y)/float(tileSize)));
-
-  //set up framebuffer
-  framebuffer = NULL;
-  cudaMalloc((void**)&framebuffer, (int)resolution.x*(int)resolution.y*sizeof(glm::vec3));
-  
-  //set up depthbuffer
-  depthbuffer = NULL;
-  cudaMalloc((void**)&depthbuffer, (int)resolution.x*(int)resolution.y*sizeof(fragment));
-
-  //kernel launches to black out accumulated/unaccumlated pixel buffers and clear our scattering states
-  clearImage<<<fullBlocksPerGrid, threadsPerBlock>>>(resolution, framebuffer, glm::vec3(0,0,0));
-  
-  fragment frag;
-  frag.color = glm::vec3(0,0,0);
-  frag.normal = glm::vec3(0,0,0);
-  frag.position = glm::vec3(0,0,-10000);
-  clearDepthBuffer<<<fullBlocksPerGrid, threadsPerBlock>>>(resolution, depthbuffer,frag);
-
-  //------------------------------
-  //memory stuff
-  //------------------------------
-  primitives = NULL;
-  cudaMalloc((void**)&primitives, (ibosize/3)*sizeof(triangle));
-
-  device_ibo = NULL;
-  cudaMalloc((void**)&device_ibo, ibosize*sizeof(int));
-  cudaMemcpy( device_ibo, ibo, ibosize*sizeof(int), cudaMemcpyHostToDevice);
-
-  device_vbo = NULL;
-  cudaMalloc((void**)&device_vbo, vbosize*sizeof(float));
-  cudaMemcpy( device_vbo, vbo, vbosize*sizeof(float), cudaMemcpyHostToDevice);
-
-  device_cbo = NULL;
-  cudaMalloc((void**)&device_cbo, cbosize*sizeof(float));
-  cudaMemcpy( device_cbo, cbo, cbosize*sizeof(float), cudaMemcpyHostToDevice);
-
-  tileSize = 32;
-  int primitiveBlocks = ceil(((float)vbosize/3)/((float)tileSize));
-
-  //------------------------------
-  //vertex shader
-  //------------------------------
-  vertexShadeKernel<<<primitiveBlocks, tileSize>>>(device_vbo, vbosize);
-
-  cudaDeviceSynchronize();
-  //------------------------------
-  //primitive assembly
-  //------------------------------
-  primitiveBlocks = ceil(((float)ibosize/3)/((float)tileSize));
-  primitiveAssemblyKernel<<<primitiveBlocks, tileSize>>>(device_vbo, vbosize, device_cbo, cbosize, device_ibo, ibosize, primitives);
-
-  cudaDeviceSynchronize();
-  //------------------------------
-  //rasterization
-  //------------------------------
-  rasterizationKernel<<<primitiveBlocks, tileSize>>>(primitives, ibosize/3, depthbuffer, resolution);
-
-  cudaDeviceSynchronize();
-  //------------------------------
-  //fragment shader
-  //------------------------------
-  fragmentShadeKernel<<<fullBlocksPerGrid, threadsPerBlock>>>(depthbuffer, resolution);
-
-  cudaDeviceSynchronize();
-  //------------------------------
-  //write fragments to framebuffer
-  //------------------------------
-  render<<<fullBlocksPerGrid, threadsPerBlock>>>(resolution, depthbuffer, framebuffer);
-  sendImageToPBO<<<fullBlocksPerGrid, threadsPerBlock>>>(PBOpos, resolution, framebuffer);
-
-  cudaDeviceSynchronize();
-
-  kernelCleanup();
-
-  checkCUDAError("Kernel failed!");
+void cudaRasterizeCore(uchar4* PBOpos, Camera cam, float frame, float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize, float* nbo, int nbosize){
+	cudaEvent_t start, stop;
+	cudaEventCreate(&start);
+	cudaEventCreate(&stop);
+	cudaEventRecord(start, 0);
+	glm::vec2 resolution = cam.reso;
+	// set up crucial magic
+	int tileSize = 8;
+	dim3 threadsPerBlock(tileSize, tileSize);
+	dim3 fullBlocksPerGrid((int)ceil(float(resolution.x) / float(tileSize)), (int)ceil(float(resolution.y) / float(tileSize)));
+
+	//set up framebuffer
+	framebuffer = NULL;
+	cudaMalloc((void**)&framebuffer, (int)resolution.x*(int)resolution.y*sizeof(glm::vec3));
+
+	//set up depthbuffer
+	depthbuffer = NULL;
+	cudaMalloc((void**)&depthbuffer, (int)resolution.x*(int)resolution.y*sizeof(fragment));
+
+	//kernel launch to black out the framebuffer before rendering
+	clearImage << <fullBlocksPerGrid, threadsPerBlock >> >(resolution, framebuffer, glm::vec3(0, 0, 0));
+
+	
+
+	//------------------------------
+	//memory stuff
+	//------------------------------
+	primitives = NULL;
+	cudaMalloc((void**)&primitives, (ibosize / 3)*sizeof(triangle));
+
+	device_ibo = NULL;
+	cudaMalloc((void**)&device_ibo, ibosize*sizeof(int));
+	cudaMemcpy(device_ibo, ibo, ibosize*sizeof(int), cudaMemcpyHostToDevice);
+
+	device_vbo = NULL;
+	cudaMalloc((void**)&device_vbo, vbosize*sizeof(float));
+	cudaMemcpy(device_vbo, vbo, vbosize*sizeof(float), cudaMemcpyHostToDevice);
+
+	device_cbo = NULL;
+	cudaMalloc((void**)&device_cbo, cbosize*sizeof(float));
+	cudaMemcpy(device_cbo, cbo, cbosize*sizeof(float), cudaMemcpyHostToDevice);
+
+	device_nbo = NULL;
+	cudaMalloc((void**)&device_nbo, nbosize*sizeof(float));
+	cudaMemcpy(device_nbo, nbo, nbosize*sizeof(float), cudaMemcpyHostToDevice);
+
+	locvbo = NULL;
+	cudaMalloc((void**)&locvbo, vbosize*sizeof(float));
+
+	dBuff = NULL;
+	cudaMalloc((void**)&dBuff, int(cam.reso.x*cam.reso.y)*sizeof(float));
+
+	dBuffLock = NULL;
+	cudaMalloc((void**)&dBuffLock, int(cam.reso.x*cam.reso.y)*sizeof(int));
+
+	fragment frag;
+	frag.color = glm::vec3(0, 0, 0);
+	frag.normal = glm::vec3(0, 0, 0);
+	frag.position = glm::vec3(0, 0, -10000);
+	clearDepthBuffer << <fullBlocksPerGrid, threadsPerBlock >> >(cam, depthbuffer, frag, dBuff, dBuffLock);
+
+	tileSize = 32;
+	int primitiveBlocks = ceil(((float)vbosize / 3) / ((float)tileSize));
+
+	//------------------------------
+	//vertex shader
+	//------------------------------
+	vertexShadeKernel << <primitiveBlocks, tileSize >> >(cam, device_vbo, vbosize, device_nbo, nbosize, locvbo);
+
+	cudaDeviceSynchronize();
+	//------------------------------
+	//primitive assembly
+	//------------------------------
+	primitiveBlocks = ceil(((float)ibosize / 3) / ((float)tileSize));
+	primitiveAssemblyKernel << <primitiveBlocks, tileSize >> >(device_vbo, vbosize, device_cbo, cbosize, device_ibo, ibosize, primitives, locvbo, device_nbo,cam);
+
+	cudaDeviceSynchronize();
+	int triCount = ibosize / 3; // number of triangles to rasterize; reduced below when culling is enabled
+#ifdef CullingFlag
+	thrust::device_ptr<triangle> primitive_first = thrust::device_pointer_cast(primitives);
+	thrust::device_ptr<triangle> primitive_last = thrust::remove_if(primitive_first, primitive_first + ibosize / 3, CFlagTrue());
+	printf("\rBefore Culling: %d\t", ibosize / 3);
+	triCount = thrust::distance(primitive_first, primitive_last); // elements past primitive_last are unspecified and must not be rasterized
+	printf("After Culling: %d\t", triCount);
+	cudaDeviceSynchronize();
+#endif
+
+
+	//------------------------------
+	//rasterization
+	//------------------------------
+	rasterizationKernel << <primitiveBlocks, tileSize >> >(primitives, triCount, depthbuffer, cam, dBuff, dBuffLock);
+
+	cudaDeviceSynchronize();
+	//------------------------------
+	//fragment shader
+	//------------------------------
+	fragmentShadeKernel << <fullBlocksPerGrid, threadsPerBlock >> >(depthbuffer, resolution, glm::vec3(-10.0f));
+
+	cudaDeviceSynchronize();
+	//------------------------------
+	//write fragments to framebuffer
+	//------------------------------
+	render << <fullBlocksPerGrid, threadsPerBlock >> >(resolution, depthbuffer, framebuffer);
+#ifdef AA
+#endif
+	sendImageToPBO << <fullBlocksPerGrid, threadsPerBlock >> >(PBOpos, resolution, framebuffer);
+
+	cudaDeviceSynchronize();
+
+	kernelCleanup();
+	cudaEventRecord(stop, 0);
+	cudaEventSynchronize(stop);
+
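+	float milliseconds = 0.0f; // cudaEventElapsedTime reports milliseconds
+	cudaEventElapsedTime(&milliseconds, start, stop);
+	cudaEventDestroy(start); cudaEventDestroy(stop); // events are created on every call, so release them
+	printf("One Loop time:  %f ms", milliseconds);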
+	checkCUDAError("Kernel failed!");
 }
 
 void kernelCleanup(){
-  cudaFree( primitives );
-  cudaFree( device_vbo );
-  cudaFree( device_cbo );
-  cudaFree( device_ibo );
-  cudaFree( framebuffer );
-  cudaFree( depthbuffer );
+	cudaFree(primitives);
+	cudaFree(device_vbo);
+	cudaFree(device_cbo);
+	cudaFree(device_ibo);
+	cudaFree(framebuffer);
+	cudaFree(depthbuffer);
+	cudaFree(device_nbo);
+	cudaFree(locvbo);
+	cudaFree(dBuff);
+	cudaFree(dBuffLock);
 }
 
diff --git a/src/rasterizeKernels.h b/src/rasterizeKernels.h
index 784be17..2dd04e1 100644
--- a/src/rasterizeKernels.h
+++ b/src/rasterizeKernels.h
@@ -1,6 +1,6 @@
 // CIS565 CUDA Rasterizer: A simple rasterization pipeline for Patrick Cozzi's CIS565: GPU Computing at the University of Pennsylvania
 // Written by Yining Karl Li, Copyright (c) 2012 University of Pennsylvania
-
+//#include "rasterizeTools.h"
 #ifndef RASTERIZEKERNEL_H
 #define RASTERIZEKERNEL_H
 
@@ -9,8 +9,9 @@
 #include <cuda.h>
 #include <cmath>
 #include "glm/glm.hpp"
+#include "rasterizeTools.h"
 
 void kernelCleanup();
-void cudaRasterizeCore(uchar4* pos, glm::vec2 resolution, float frame, float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize);
+void cudaRasterizeCore(uchar4* pos, Camera cam, float frame, float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize,float* nbo, int nbosize);
 
 #endif //RASTERIZEKERNEL_H
diff --git a/src/rasterizeTools.cu b/src/rasterizeTools.cu
new file mode 100644
index 0000000..7a953d9
--- /dev/null
+++ b/src/rasterizeTools.cu
@@ -0,0 +1,70 @@
+#include "rasterizeTools.h"
+#ifndef min
+#define min(x,y) ((x)<(y)?(x):(y))
+#endif
+
+#ifndef max
+#define max(x,y) ((x)>(y)?(x):(y))
+#endif
+
+//Multiplies a cudaMat4 matrix and a vec4
+__host__ __device__ glm::vec3 multiplyMV(cudaMat4 m, glm::vec4 v){
+	glm::vec3 r(1, 1, 1);
+	r.x = (m.x.x*v.x) + (m.x.y*v.y) + (m.x.z*v.z) + (m.x.w*v.w);
+	r.y = (m.y.x*v.x) + (m.y.y*v.y) + (m.y.z*v.z) + (m.y.w*v.w);
+	r.z = (m.z.x*v.x) + (m.z.y*v.y) + (m.z.z*v.z) + (m.z.w*v.w);
+	return r;
+}
+
+//LOOK: finds the axis aligned bounding box for a given triangle
+__host__ __device__ void getAABBForTriangle(triangle tri, glm::vec3& minpoint, glm::vec3& maxpoint){
+	minpoint = glm::vec3(min(min(tri.p0.x, tri.p1.x), tri.p2.x),
+		min(min(tri.p0.y, tri.p1.y), tri.p2.y),
+		min(min(tri.p0.z, tri.p1.z), tri.p2.z));
+	maxpoint = glm::vec3(max(max(tri.p0.x, tri.p1.x), tri.p2.x),
+		max(max(tri.p0.y, tri.p1.y), tri.p2.y),
+		max(max(tri.p0.z, tri.p1.z), tri.p2.z));
+}
+
+//LOOK: calculates the signed area of a given triangle
+__host__ __device__ float calculateSignedArea(triangle tri){
+	return 0.5*((tri.p2.x - tri.p0.x)*(tri.p1.y - tri.p0.y) - (tri.p1.x - tri.p0.x)*(tri.p2.y - tri.p0.y));
+}
+
+//LOOK: helper function for calculating barycentric coordinates
+__host__ __device__ float calculateBarycentricCoordinateValue(glm::vec2 a, glm::vec2 b, glm::vec2 c, triangle tri){
+	triangle baryTri;
+	baryTri.p0 = glm::vec3(a, 0); baryTri.p1 = glm::vec3(b, 0); baryTri.p2 = glm::vec3(c, 0);
+	return calculateSignedArea(baryTri) / calculateSignedArea(tri);
+}
+
+//LOOK: calculates barycentric coordinates
+__host__ __device__ glm::vec3 calculateBarycentricCoordinate(triangle tri, glm::vec2 point){
+	float beta = calculateBarycentricCoordinateValue(glm::vec2(tri.p0.x, tri.p0.y), point, glm::vec2(tri.p2.x, tri.p2.y), tri);
+	float gamma = calculateBarycentricCoordinateValue(glm::vec2(tri.p0.x, tri.p0.y), glm::vec2(tri.p1.x, tri.p1.y), point, tri);
+	float alpha = 1.0 - beta - gamma;
+	return glm::vec3(alpha, beta, gamma);
+}
+
+//LOOK: checks if a barycentric coordinate is within the boundaries of a triangle
+__host__ __device__ bool isBarycentricCoordInBounds(glm::vec3 barycentricCoord){
+	return barycentricCoord.x >= 0.0 && barycentricCoord.x <= 1.0 &&
+		barycentricCoord.y >= 0.0 && barycentricCoord.y <= 1.0 &&
+		barycentricCoord.z >= 0.0 && barycentricCoord.z <= 1.0;
+}
+
+//LOOK: for a given barycentric coordinate, return the corresponding z position on the triangle
+__host__ __device__ float getZAtCoordinate(glm::vec3 barycentricCoord, triangle tri){
+	return -(barycentricCoord.x*tri.p0.z + barycentricCoord.y*tri.p1.z + barycentricCoord.z*tri.p2.z);
+}
+
+//perspective view matrix ref: http://www.glprogramming.com/red/appendixf.html
+__host__ __device__ cudaMat4 myFrustum(float left, float right, float bottom, float top, float near, float far){
+	cudaMat4 R;
+	//each row should be:
+	R.x = glm::vec4(2 * near / (right - left), 0, (right + left) / (right - left), 0);
+	R.y = glm::vec4(0, 2 * near / (top - bottom), (top + bottom) / (top - bottom), 0);
+	R.z = glm::vec4(0, 0, -(far + near) / (far - near), -2 * far*near / (far - near));
+	R.w = glm::vec4(0, 0, -1, 0);
+	return R;
+}
\ No newline at end of file
diff --git a/src/rasterizeTools.h b/src/rasterizeTools.h
index e9b5dcc..8483f66 100644
--- a/src/rasterizeTools.h
+++ b/src/rasterizeTools.h
@@ -1,6 +1,9 @@
 // CIS565 CUDA Rasterizer: A simple rasterization pipeline for Patrick Cozzi's CIS565: GPU Computing at the University of Pennsylvania
 // Written by Yining Karl Li, Copyright (c) 2012 University of Pennsylvania
 
+
+//Modified by chiwsy to fix conflicts where the CUDA compiler kept reporting redefinition errors.
+
 #ifndef RASTERIZETOOLS_H
 #define RASTERIZETOOLS_H
 
@@ -8,71 +11,130 @@
 #include "glm/glm.hpp"
 #include "utilities.h"
 #include "cudaMat4.h"
+#include "glm/gtc/matrix_transform.hpp"
 
 struct triangle {
-  glm::vec3 p0;
-  glm::vec3 p1;
-  glm::vec3 p2;
-  glm::vec3 c0;
-  glm::vec3 c1;
-  glm::vec3 c2;
+	glm::vec3 p0;
+	glm::vec3 p1;
+	glm::vec3 p2;
+	glm::vec3 c0;
+	glm::vec3 c1;
+	glm::vec3 c2;
+
+	glm::vec3 locp0;
+	glm::vec3 locp1;
+	glm::vec3 locp2;
+
+	glm::vec3 locn0;
+	glm::vec3 locn1;
+	glm::vec3 locn2;
+
+	bool CFlag;
 };
 
 struct fragment{
-  glm::vec3 color;
-  glm::vec3 normal;
-  glm::vec3 position;
+	glm::vec3 color;
+	glm::vec3 normal;
+	glm::vec3 position;
 };
 
+struct CFlagTrue{
+	__host__ __device__ bool operator()(const triangle tri){
+		return tri.CFlag;
+	}
+
+};
 //Multiplies a cudaMat4 matrix and a vec4
-__host__ __device__ glm::vec3 multiplyMV(cudaMat4 m, glm::vec4 v){
-  glm::vec3 r(1,1,1);
-  r.x = (m.x.x*v.x)+(m.x.y*v.y)+(m.x.z*v.z)+(m.x.w*v.w);
-  r.y = (m.y.x*v.x)+(m.y.y*v.y)+(m.y.z*v.z)+(m.y.w*v.w);
-  r.z = (m.z.x*v.x)+(m.z.y*v.y)+(m.z.z*v.z)+(m.z.w*v.w);
-  return r;
-}
+__host__ __device__ glm::vec3 multiplyMV(cudaMat4 m, glm::vec4 v);
 
 //LOOK: finds the axis aligned bounding box for a given triangle
-__host__ __device__ void getAABBForTriangle(triangle tri, glm::vec3& minpoint, glm::vec3& maxpoint){
-  minpoint = glm::vec3(min(min(tri.p0.x, tri.p1.x),tri.p2.x), 
-        min(min(tri.p0.y, tri.p1.y),tri.p2.y),
-        min(min(tri.p0.z, tri.p1.z),tri.p2.z));
-  maxpoint = glm::vec3(max(max(tri.p0.x, tri.p1.x),tri.p2.x), 
-        max(max(tri.p0.y, tri.p1.y),tri.p2.y),
-        max(max(tri.p0.z, tri.p1.z),tri.p2.z));
-}
+__host__ __device__ void getAABBForTriangle(triangle tri, glm::vec3& minpoint, glm::vec3& maxpoint);
 
 //LOOK: calculates the signed area of a given triangle
-__host__ __device__ float calculateSignedArea(triangle tri){
-  return 0.5*((tri.p2.x - tri.p0.x)*(tri.p1.y - tri.p0.y) - (tri.p1.x - tri.p0.x)*(tri.p2.y - tri.p0.y));
-}
+__host__ __device__ float calculateSignedArea(triangle tri);
 
 //LOOK: helper function for calculating barycentric coordinates
-__host__ __device__ float calculateBarycentricCoordinateValue(glm::vec2 a, glm::vec2 b, glm::vec2 c, triangle tri){
-  triangle baryTri;
-  baryTri.p0 = glm::vec3(a,0); baryTri.p1 = glm::vec3(b,0); baryTri.p2 = glm::vec3(c,0);
-  return calculateSignedArea(baryTri)/calculateSignedArea(tri);
-}
+__host__ __device__ float calculateBarycentricCoordinateValue(glm::vec2 a, glm::vec2 b, glm::vec2 c, triangle tri);
 
 //LOOK: calculates barycentric coordinates
-__host__ __device__ glm::vec3 calculateBarycentricCoordinate(triangle tri, glm::vec2 point){
-  float beta  = calculateBarycentricCoordinateValue(glm::vec2(tri.p0.x,tri.p0.y), point, glm::vec2(tri.p2.x,tri.p2.y), tri);
-  float gamma = calculateBarycentricCoordinateValue(glm::vec2(tri.p0.x,tri.p0.y), glm::vec2(tri.p1.x,tri.p1.y), point, tri);
-  float alpha = 1.0-beta-gamma;
-  return glm::vec3(alpha,beta,gamma);
-}
+__host__ __device__ glm::vec3 calculateBarycentricCoordinate(triangle tri, glm::vec2 point);
 
 //LOOK: checks if a barycentric coordinate is within the boundaries of a triangle
-__host__ __device__ bool isBarycentricCoordInBounds(glm::vec3 barycentricCoord){
-   return barycentricCoord.x >= 0.0 && barycentricCoord.x <= 1.0 &&
-          barycentricCoord.y >= 0.0 && barycentricCoord.y <= 1.0 &&
-          barycentricCoord.z >= 0.0 && barycentricCoord.z <= 1.0;
-}
+__host__ __device__ bool isBarycentricCoordInBounds(glm::vec3 barycentricCoord);
 
 //LOOK: for a given barycentric coordinate, return the corresponding z position on the triangle
-__host__ __device__ float getZAtCoordinate(glm::vec3 barycentricCoord, triangle tri){
-  return -(barycentricCoord.x*tri.p0.z + barycentricCoord.y*tri.p1.z + barycentricCoord.z*tri.p2.z);
-}
+__host__ __device__ float getZAtCoordinate(glm::vec3 barycentricCoord, triangle tri);
+
+
+//Defined by chiwsy
+
+//perspective view matrix ref: http://www.glprogramming.com/red/appendixf.html
+__host__ __device__ cudaMat4 myFrustum(float left, float right, float bottom, float top, float near, float far);
+
+struct Camera
+{
+	glm::vec3 pos;
+	glm::vec3 view;
+	glm::vec3 up;
+	glm::vec2 fov;
+
+	//x for near and y for far
+	glm::vec2 depth;
+
+	glm::vec2 reso;
+	//glm::vec3 ScreenV;
+	//glm::vec3 ScreenH;
+	//cudaMat4 Frustum;
+
+	//Transform from world coordinates to clip coordinates;
+	
+	glm::mat4 PMat;
+	glm::mat4 local2WorldMat;
+	
+	glm::mat4 PMat_inv;
+	glm::mat4 World2LocalMat;
+
+	__host__ __device__ Camera(
+		glm::vec2 r = glm::vec2(0.0f, 0.0f),
+		glm::vec3 p = glm::vec3(0.0f, 0.0f, 3.0f),
+		glm::vec3 v = glm::vec3(0.0f, 0.0f, -1.0f),
+		glm::vec3 u = glm::vec3(0.0f, 1.0f, 0.0f),
+		glm::vec2 f = glm::vec2(45.0f, 45.0f),
+		glm::vec2 d = glm::vec2(1.0f, 1000.0f)
+		) :
+		pos(p),
+		view(glm::normalize(v)),
+		up(glm::normalize(u)),
+		fov(f),
+		depth(d),
+		reso(r)
+	{
+		//MVP matrix is calculated as:			MVP = M_{Cam}^{Clip} * M_{world}^{Cam} * M_{model}^{world}
+		//View-Clip matrix is calculated as:	VP = M_{Cam}^{Clip} * M_{world}^{Cam}
+		
+		//glm::mat4 ViewMat = 
+		////rotation matrix is defined column by column
+		////ViewMat[0] = glm::vec4(glm::normalize(glm::cross(view, up)), 0.0f);
+		////ViewMat[1] = glm::vec4(glm::normalize(up), 0.0f);
+		////ViewMat[2] = glm::vec4(glm::normalize(view), 0.0f);
+		//////translate
+		////ViewMat[3] = glm::vec4(pos, 1.0f);
+		////inverse to get world to local matrix
+		World2LocalMat = glm::lookAt(pos, view*depth.x + pos, up);
+		local2WorldMat = glm::inverse(World2LocalMat);
+		//glm::vec3 viewport = depth.x*view*glm::vec3(tan(fov.x*PI/180.f),tan(fov.y*PI/180.0f),0.0f);
+		PMat = glm::perspective(fov.y, float(reso.x / reso.y), depth.x, depth.y);//utilityCore::cudaMat4ToGlmMat4(myFrustum(-viewport.x, viewport.x, -viewport.y, viewport.y, depth.x, depth.y));
+		PMat_inv = glm::inverse(PMat);
+	}
+
+	__host__ __device__ void update(){
+		World2LocalMat = glm::lookAt(pos, view*depth.x + pos, up);
+		local2WorldMat = glm::inverse(World2LocalMat);
+		//glm::vec3 viewport = depth.x*view*glm::vec3(tan(fov.x*PI/180.f),tan(fov.y*PI/180.0f),0.0f);
+		PMat = glm::perspective(fov.y, float(reso.x / reso.y), depth.x, depth.y);//utilityCore::cudaMat4ToGlmMat4(myFrustum(-viewport.x, viewport.x, -viewport.y, viewport.y, depth.x, depth.y));
+		PMat_inv = glm::inverse(PMat);
+	}
+
+};
 
 #endif
\ No newline at end of file
diff --git a/src/utilities.cpp b/src/utilities.cpp
index 4f8f4d3..c3dbf54 100644
--- a/src/utilities.cpp
+++ b/src/utilities.cpp
@@ -57,6 +57,8 @@ bool utilityCore::epsilonCheck(float a, float b){
     }
 }
 
+
+
 void utilityCore::printCudaMat4(cudaMat4 m){
     utilityCore::printVec4(m.x);
     utilityCore::printVec4(m.y);
diff --git a/src/utilities.h b/src/utilities.h
index 3e6ef6e..ce61a05 100644
--- a/src/utilities.h
+++ b/src/utilities.h
@@ -27,6 +27,7 @@ namespace utilityCore {
     extern bool replaceString(std::string& str, const std::string& from, const std::string& to);
     extern glm::vec3 clampRGB(glm::vec3 color);
     extern bool epsilonCheck(float a, float b);
+	//extern __device__ bool DevepsilonCheck(float a, float b);
     extern std::vector<std::string> tokenizeString(std::string str); 
     extern cudaMat4 glmMat4ToCudaMat4(glm::mat4 a);
     extern glm::mat4 cudaMat4ToGlmMat4(cudaMat4 a);
diff --git a/windows/PROJ4_Rasterizer/PROJ4_Rasterizer.sln b/windows/PROJ4_Rasterizer/PROJ4_Rasterizer.sln
index 0b3969a..fe1d06f 100644
--- a/windows/PROJ4_Rasterizer/PROJ4_Rasterizer.sln
+++ b/windows/PROJ4_Rasterizer/PROJ4_Rasterizer.sln
@@ -1,18 +1,26 @@
 ﻿
-Microsoft Visual Studio Solution File, Format Version 11.00
-# Visual Studio 2010
+Microsoft Visual Studio Solution File, Format Version 12.00
+# Visual Studio 2013
+VisualStudioVersion = 12.0.30723.0
+MinimumVisualStudioVersion = 10.0.40219.1
 Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "PROJ4_Rasterizer", "PROJ4_Rasterizer\PROJ4_Rasterizer.vcxproj", "{6F351C80-6834-4F35-9CB3-C70F56161CBC}"
 EndProject
 Global
 	GlobalSection(SolutionConfigurationPlatforms) = preSolution
 		Debug|Win32 = Debug|Win32
+		Debug|x64 = Debug|x64
 		Release|Win32 = Release|Win32
+		Release|x64 = Release|x64
 	EndGlobalSection
 	GlobalSection(ProjectConfigurationPlatforms) = postSolution
 		{6F351C80-6834-4F35-9CB3-C70F56161CBC}.Debug|Win32.ActiveCfg = Debug|Win32
 		{6F351C80-6834-4F35-9CB3-C70F56161CBC}.Debug|Win32.Build.0 = Debug|Win32
+		{6F351C80-6834-4F35-9CB3-C70F56161CBC}.Debug|x64.ActiveCfg = Debug|x64
+		{6F351C80-6834-4F35-9CB3-C70F56161CBC}.Debug|x64.Build.0 = Debug|x64
 		{6F351C80-6834-4F35-9CB3-C70F56161CBC}.Release|Win32.ActiveCfg = Release|Win32
 		{6F351C80-6834-4F35-9CB3-C70F56161CBC}.Release|Win32.Build.0 = Release|Win32
+		{6F351C80-6834-4F35-9CB3-C70F56161CBC}.Release|x64.ActiveCfg = Release|x64
+		{6F351C80-6834-4F35-9CB3-C70F56161CBC}.Release|x64.Build.0 = Release|x64
 	EndGlobalSection
 	GlobalSection(SolutionProperties) = preSolution
 		HideSolutionNode = FALSE
diff --git a/windows/PROJ4_Rasterizer/PROJ4_Rasterizer/PROJ4_Rasterizer.vcxproj b/windows/PROJ4_Rasterizer/PROJ4_Rasterizer/PROJ4_Rasterizer.vcxproj
index f640485..1e9c4bc 100644
--- a/windows/PROJ4_Rasterizer/PROJ4_Rasterizer/PROJ4_Rasterizer.vcxproj
+++ b/windows/PROJ4_Rasterizer/PROJ4_Rasterizer/PROJ4_Rasterizer.vcxproj
@@ -1,14 +1,22 @@
 ﻿<?xml version="1.0" encoding="utf-8"?>
-<Project DefaultTargets="Build" ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
+<Project DefaultTargets="Build" ToolsVersion="12.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
   <ItemGroup Label="ProjectConfigurations">
     <ProjectConfiguration Include="Debug|Win32">
       <Configuration>Debug</Configuration>
       <Platform>Win32</Platform>
     </ProjectConfiguration>
+    <ProjectConfiguration Include="Debug|x64">
+      <Configuration>Debug</Configuration>
+      <Platform>x64</Platform>
+    </ProjectConfiguration>
     <ProjectConfiguration Include="Release|Win32">
       <Configuration>Release</Configuration>
       <Platform>Win32</Platform>
     </ProjectConfiguration>
+    <ProjectConfiguration Include="Release|x64">
+      <Configuration>Release</Configuration>
+      <Platform>x64</Platform>
+    </ProjectConfiguration>
   </ItemGroup>
   <PropertyGroup Label="Globals">
     <ProjectGuid>{6F351C80-6834-4F35-9CB3-C70F56161CBC}</ProjectGuid>
@@ -19,23 +27,44 @@
     <ConfigurationType>Application</ConfigurationType>
     <UseDebugLibraries>true</UseDebugLibraries>
     <CharacterSet>MultiByte</CharacterSet>
+    <PlatformToolset>v100</PlatformToolset>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="Configuration">
+    <ConfigurationType>Application</ConfigurationType>
+    <UseDebugLibraries>true</UseDebugLibraries>
+    <CharacterSet>MultiByte</CharacterSet>
+    <PlatformToolset>v100</PlatformToolset>
   </PropertyGroup>
   <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'" Label="Configuration">
     <ConfigurationType>Application</ConfigurationType>
     <UseDebugLibraries>false</UseDebugLibraries>
     <WholeProgramOptimization>true</WholeProgramOptimization>
     <CharacterSet>MultiByte</CharacterSet>
+    <PlatformToolset>v120</PlatformToolset>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="Configuration">
+    <ConfigurationType>Application</ConfigurationType>
+    <UseDebugLibraries>false</UseDebugLibraries>
+    <WholeProgramOptimization>true</WholeProgramOptimization>
+    <CharacterSet>MultiByte</CharacterSet>
+    <PlatformToolset>v120</PlatformToolset>
   </PropertyGroup>
   <Import Project="$(VCTargetsPath)\Microsoft.Cpp.props" />
   <ImportGroup Label="ExtensionSettings">
-    <Import Project="$(VCTargetsPath)\BuildCustomizations\CUDA 5.5.props" />
+    <Import Project="$(VCTargetsPath)\BuildCustomizations\CUDA 6.5.props" />
   </ImportGroup>
   <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
     <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
   </ImportGroup>
+  <ImportGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="PropertySheets">
+    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
+  </ImportGroup>
   <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
     <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
   </ImportGroup>
+  <ImportGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="PropertySheets">
+    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
+  </ImportGroup>
   <PropertyGroup Label="UserMacros" />
   <PropertyGroup />
   <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
@@ -49,6 +78,29 @@
       <AdditionalLibraryDirectories>$(SolutionDir)..\..\external\lib\win\GLFW;$(SolutionDir)..\..\external\lib\win\GL;%(AdditionalLibraryDirectories)</AdditionalLibraryDirectories>
       <AdditionalDependencies>cudart.lib;glew32s.lib;glfw3.lib;opengl32.lib;kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies)</AdditionalDependencies>
     </Link>
+    <CudaCompile>
+      <CodeGeneration>compute_20,sm_20</CodeGeneration>
+      <KeepDir>$(CudaIntDir)</KeepDir>
+      <AdditionalOptions />
+    </CudaCompile>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
+    <ClCompile>
+      <WarningLevel>Level3</WarningLevel>
+      <Optimization>Disabled</Optimization>
+      <AdditionalIncludeDirectories>$(SolutionDir)..\..\external\include;%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories>
+    </ClCompile>
+    <Link>
+      <GenerateDebugInformation>true</GenerateDebugInformation>
+      <AdditionalLibraryDirectories>$(SolutionDir)..\..\external\lib\win\GLFW;$(SolutionDir)..\..\external\lib\win\GL;%(AdditionalLibraryDirectories)</AdditionalLibraryDirectories>
+      <AdditionalDependencies>cudart.lib;glew32s.lib;glfw3.lib;opengl32.lib;kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies)</AdditionalDependencies>
+    </Link>
+    <CudaCompile>
+      <CodeGeneration>compute_20,sm_20</CodeGeneration>
+      <KeepDir>$(CudaIntDir)</KeepDir>
+      <AdditionalOptions>
+      </AdditionalOptions>
+    </CudaCompile>
   </ItemDefinitionGroup>
   <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
     <ClCompile>
@@ -66,8 +118,28 @@
       <AdditionalDependencies>cudart.lib;glew32s.lib;glfw3.lib;opengl32.lib;kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies)</AdditionalDependencies>
     </Link>
   </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
+    <ClCompile>
+      <WarningLevel>Level3</WarningLevel>
+      <Optimization>MaxSpeed</Optimization>
+      <FunctionLevelLinking>true</FunctionLevelLinking>
+      <IntrinsicFunctions>true</IntrinsicFunctions>
+      <AdditionalIncludeDirectories>$(SolutionDir)..\..\external\include;%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories>
+    </ClCompile>
+    <Link>
+      <GenerateDebugInformation>true</GenerateDebugInformation>
+      <EnableCOMDATFolding>true</EnableCOMDATFolding>
+      <OptimizeReferences>true</OptimizeReferences>
+      <AdditionalLibraryDirectories>$(SolutionDir)..\..\external\lib\win\GLFW;$(SolutionDir)..\..\external\lib\win\GL;%(AdditionalLibraryDirectories)</AdditionalLibraryDirectories>
+      <AdditionalDependencies>cudart.lib;glew32s.lib;glfw3.lib;opengl32.lib;kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies)</AdditionalDependencies>
+    </Link>
+    <CudaCompile>
+      <TargetMachinePlatform>64</TargetMachinePlatform>
+    </CudaCompile>
+  </ItemDefinitionGroup>
   <ItemGroup>
     <ClInclude Include="..\..\..\src\cudaMat4.h" />
+    <ClInclude Include="..\..\..\src\Macros.h" />
     <ClInclude Include="..\..\..\src\main.h" />
     <ClInclude Include="..\..\..\src\rasterizeKernels.h" />
     <ClInclude Include="..\..\..\src\rasterizeTools.h" />
@@ -78,6 +150,7 @@
     <ClCompile Include="..\..\..\external\src\objUtil\obj.cpp" />
     <ClCompile Include="..\..\..\external\src\objUtil\objloader.cpp" />
     <ClCompile Include="..\..\..\src\main.cpp" />
+    <ClCompile Include="..\..\..\src\rasterizeTools.cu" />
     <ClCompile Include="..\..\..\src\utilities.cpp" />
   </ItemGroup>
   <ItemGroup>
@@ -87,6 +160,6 @@
   </ItemGroup>
   <Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
   <ImportGroup Label="ExtensionTargets">
-    <Import Project="$(VCTargetsPath)\BuildCustomizations\CUDA 5.5.targets" />
+    <Import Project="$(VCTargetsPath)\BuildCustomizations\CUDA 6.5.targets" />
   </ImportGroup>
 </Project>
\ No newline at end of file
diff --git a/windows/PROJ4_Rasterizer/PROJ4_Rasterizer/PROJ4_Rasterizer.vcxproj.filters b/windows/PROJ4_Rasterizer/PROJ4_Rasterizer/PROJ4_Rasterizer.vcxproj.filters
index 6a1d8cf..39c96c2 100644
--- a/windows/PROJ4_Rasterizer/PROJ4_Rasterizer/PROJ4_Rasterizer.vcxproj.filters
+++ b/windows/PROJ4_Rasterizer/PROJ4_Rasterizer/PROJ4_Rasterizer.vcxproj.filters
@@ -30,6 +30,9 @@
     <ClInclude Include="..\..\..\src\utilities.h">
       <Filter>Header Files</Filter>
     </ClInclude>
+    <ClInclude Include="..\..\..\src\Macros.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
   </ItemGroup>
   <ItemGroup>
     <ClCompile Include="..\..\..\src\main.cpp">
@@ -47,6 +50,9 @@
     <ClCompile Include="..\..\..\external\src\objUtil\objloader.cpp">
       <Filter>Source Files</Filter>
     </ClCompile>
+    <ClCompile Include="..\..\..\src\rasterizeTools.cu">
+      <Filter>Source Files</Filter>
+    </ClCompile>
   </ItemGroup>
   <ItemGroup>
     <CudaCompile Include="..\..\..\src\rasterizeKernels.cu">