diff --git a/README.md b/README.md
index ae0896a..96bb1af 100644
--- a/README.md
+++ b/README.md
@@ -1,184 +1,54 @@
--------------------------------------------------------------------------------
-CIS565: Project 4: CUDA Rasterizer
--------------------------------------------------------------------------------
+CIS 565 Project4 : CUDA Rasterizer
+===================
+
 Fall 2014
--------------------------------------------------------------------------------
-Due Monday 10/27/2014 @ 12 PM
--------------------------------------------------------------------------------
-
--------------------------------------------------------------------------------
-NOTE:
--------------------------------------------------------------------------------
-This project requires an NVIDIA graphics card with CUDA capability! Any card with CUDA compute capability 1.1 or higher will work fine for this project. For a full list of CUDA capable cards and their compute capability, please consult: http://developer.nvidia.com/cuda/cuda-gpus. If you do not have an NVIDIA graphics card in the machine you are working on, feel free to use any machine in the SIG Lab or in Moore100 labs. All machines in the SIG Lab and Moore100 are equipped with CUDA capable NVIDIA graphics cards. If this too proves to be a problem, please contact Patrick or Karl as soon as possible.
-
--------------------------------------------------------------------------------
-INTRODUCTION:
--------------------------------------------------------------------------------
-In this project, you will implement a simplified CUDA based implementation of a standard rasterized graphics pipeline, similar to the OpenGL pipeline. In this project, you will implement vertex shading, primitive assembly, perspective transformation, rasterization, fragment shading, and write the resulting fragments to a framebuffer. More information about the rasterized graphics pipeline can be found in the class slides and in your notes from CIS560.
-
-The basecode provided includes an OBJ loader and much of the mundane I/O and bookkeeping code. The basecode also includes some functions that you may find useful, described below. The core rasterization pipeline is left for you to implement.
-
-You MAY NOT use ANY raycasting/raytracing AT ALL in this project, EXCEPT in the fragment shader step. One of the purposes of this project is to see how a rasterization pipeline can generate graphics WITHOUT the need for raycasting! Raycasting may only be used in the fragment shader effect for interesting shading results, but is absolutely not allowed in any other stages of the pipeline.
-
-Also, you MAY NOT use OpenGL ANYWHERE in this project, aside from the given OpenGL code for drawing Pixel Buffer Objects to the screen. Use of OpenGL for any pipeline stage instead of your own custom implementation will result in an incomplete project.
-
-Finally, note that while this basecode is meant to serve as a strong starting point for a CUDA rasterizer, you are not required to use this basecode if you wish, and you may also change any part of the basecode specification as you please, so long as the final rendered result is correct.
-
--------------------------------------------------------------------------------
-CONTENTS:
--------------------------------------------------------------------------------
-The Project4 root directory contains the following subdirectories:
-	
-* src/ contains the source code for the project. Both the Windows Visual Studio solution and the OSX makefile reference this folder for all source; the base source code compiles on OSX and Windows without modification.
-* objs/ contains example obj test files: cow.obj, cube.obj, tri.obj.
-* renders/ contains an example render of the given example cow.obj file with a z-depth fragment shader. 
-* windows/ contains a Windows Visual Studio 2010 project and all dependencies needed for building and running on Windows 7.
-
-The Windows and OSX versions of the project build and run exactly the same way as in Project0, Project1, and Project2.
-
--------------------------------------------------------------------------------
-REQUIREMENTS:
--------------------------------------------------------------------------------
-In this project, you are given code for:
-
-* A library for loading/reading standard Alias/Wavefront .obj format mesh files and converting them to OpenGL style VBOs/IBOs
-* A suggested order of kernels with which to implement the graphics pipeline
-* Working code for CUDA-GL interop
-
-You will need to implement the following stages of the graphics pipeline and features:
-
-* Vertex Shading
-* Primitive Assembly with support for triangle VBOs/IBOs
-* Perspective Transformation
-* Rasterization through either a scanline or a tiled approach
-* Fragment Shading
-* A depth buffer for storing and depth testing fragments
-* Fragment to framebuffer writing
-* A simple lighting/shading scheme, such as Lambert or Blinn-Phong, implemented in the fragment shader
-
-You are also required to implement at least 3 of the following features:
-
-* Additional pipeline stages. Each one of these stages can count as 1 feature:
-   * Geometry shader
-   * Transformation feedback
-   * Back-face culling
-   * Scissor test
-   * Stencil test
-   * Blending
-
-IMPORTANT: For each of these stages implemented, you must also add a section to your README stating what the expected performance impact of that pipeline stage is, and real performance comparisons between your rasterizer with that stage and without.
-
-* Correct color interpolation between points on a primitive
-* Texture mapping WITH texture filtering and perspective correct texture coordinates
-* Support for additional primitices. Each one of these can count as HALF of a feature.
-   * Lines
-   * Line strips
-   * Triangle fans
-   * Triangle strips
-   * Points
-* Anti-aliasing
-* Order-independent translucency using a k-buffer
-* MOUSE BASED interactive camera support. Interactive camera support based only on the keyboard is not acceptable for this feature.
-
--------------------------------------------------------------------------------
-BASE CODE TOUR:
--------------------------------------------------------------------------------
-You will be working primarily in two files: rasterizeKernel.cu, and rasterizerTools.h. Within these files, areas that you need to complete are marked with a TODO comment. Areas that are useful to and serve as hints for optional features are marked with TODO (Optional). Functions that are useful for reference are marked with the comment LOOK.
-
-* rasterizeKernels.cu contains the core rasterization pipeline. 
-	* A suggested sequence of kernels exists in this file, but you may choose to alter the order of this sequence or merge entire kernels if you see fit. For example, if you decide that doing has benefits, you can choose to merge the vertex shader and primitive assembly kernels, or merge the perspective transform into another kernel. There is not necessarily a right sequence of kernels (although there are wrong sequences, such as placing fragment shading before vertex shading), and you may choose any sequence you want. Please document in your README what sequence you choose and why.
-	* The provided kernels have had their input parameters removed beyond basic inputs such as the framebuffer. You will have to decide what inputs should go into each stage of the pipeline, and what outputs there should be. 
-
-* rasterizeTools.h contains various useful tools, including a number of barycentric coordinate related functions that you may find useful in implementing scanline based rasterization...
-	* A few pre-made structs are included for you to use, such as fragment and triangle. A simple rasterizer can be implemented with these structs as is. However, as with any part of the basecode, you may choose to modify, add to, use as-is, or outright ignore them as you see fit.
-	* If you do choose to add to the fragment struct, be sure to include in your README a rationale for why. 
-
-You will also want to familiarize yourself with:
-
-* main.cpp, which contains code that transfers VBOs/CBOs/IBOs to the rasterization pipeline. Interactive camera work will also have to be implemented in this file if you choose that feature.
-* utilities.h, which serves as a kitchen-sink of useful functions
-
--------------------------------------------------------------------------------
-SOME RESOURCES:
--------------------------------------------------------------------------------
-The following resources may be useful for this project:
-
-* High-Performance Software Rasterization on GPUs
-	* Paper (HPG 2011): http://www.tml.tkk.fi/~samuli/publications/laine2011hpg_paper.pdf
-	* Code: http://code.google.com/p/cudaraster/ Note that looking over this code for reference with regard to the paper is fine, but we most likely will not grant any requests to actually incorporate any of this code into your project.
-	* Slides: http://bps11.idav.ucdavis.edu/talks/08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011.pdf
-* The Direct3D 10 System (SIGGRAPH 2006) - for those interested in doing geometry shaders and transform feedback.
-	* http://133.11.9.3/~takeo/course/2006/media/papers/Direct3D10_siggraph2006.pdf
-* Multi-Fragment Eﬀects on the GPU using the k-Buﬀer - for those who want to do a k-buffer
-	* http://www.inf.ufrgs.br/~comba/papers/2007/kbuffer_preprint.pdf
-* FreePipe: A Programmable, Parallel Rendering Architecture for Efficient Multi-Fragment Effects (I3D 2010)
-	* https://sites.google.com/site/hmcen0921/cudarasterizer
-* Writing A Software Rasterizer In Javascript:
-	* Part 1: http://simonstechblog.blogspot.com/2012/04/software-rasterizer-part-1.html
-	* Part 2: http://simonstechblog.blogspot.com/2012/04/software-rasterizer-part-2.html
-
--------------------------------------------------------------------------------
-NOTES ON GLM:
--------------------------------------------------------------------------------
-This project uses GLM, the GL Math library, for linear algebra. You need to know two important points on how GLM is used in this project:
-
-* In this project, indices in GLM vectors (such as vec3, vec4), are accessed via swizzling. So, instead of v[0], v.x is used, and instead of v[1], v.y is used, and so on and so forth.
-* GLM Matrix operations work fine on NVIDIA Fermi cards and later, but pre-Fermi cards do not play nice with GLM matrices. As such, in this project, GLM matrices are replaced with a custom matrix struct, called a cudaMat4, found in cudaMat4.h. A custom function for multiplying glm::vec4s and cudaMat4s is provided as multiplyMV() in intersections.h.
-
--------------------------------------------------------------------------------
-README
--------------------------------------------------------------------------------
-All students must replace or augment the contents of this Readme.md in a clear 
-manner with the following:
-
-* A brief description of the project and the specific features you implemented.
-* At least one screenshot of your project running.
-* A 30 second or longer video of your project running.  To create the video you
-  can use http://www.microsoft.com/expression/products/Encoder4_Overview.aspx 
-* A performance evaluation (described in detail below).
-
--------------------------------------------------------------------------------
-PERFORMANCE EVALUATION
--------------------------------------------------------------------------------
-The performance evaluation is where you will investigate how to make your CUDA
-programs more efficient using the skills you've learned in class. You must have
-performed at least one experiment on your code to investigate the positive or
-negative effects on performance. 
-
-We encourage you to get creative with your tweaks. Consider places in your code
-that could be considered bottlenecks and try to improve them. 
-
-Each student should provide no more than a one page summary of their
-optimizations along with tables and or graphs to visually explain any
-performance differences.
-
--------------------------------------------------------------------------------
-THIRD PARTY CODE POLICY
--------------------------------------------------------------------------------
-* Use of any third-party code must be approved by asking on Piazza.  If it is approved, all students are welcome to use it.  Generally, we approve use of third-party code that is not a core part of the project.  For example, for the ray tracer, we would approve using a third-party library for loading models, but would not approve copying and pasting a CUDA function for doing refraction.
-* Third-party code must be credited in README.md.
-* Using third-party code without its approval, including using another student's code, is an academic integrity violation, and will result in you receiving an F for the semester.
-
--------------------------------------------------------------------------------
-SELF-GRADING
--------------------------------------------------------------------------------
-* On the submission date, email your grade, on a scale of 0 to 100, to Liam, harmoli+cis565@seas.upenn.edu, with a one paragraph explanation.  Be concise and realistic.  Recall that we reserve 30 points as a sanity check to adjust your grade.  Your actual grade will be (0.7 * your grade) + (0.3 * our grade).  We hope to only use this in extreme cases when your grade does not realistically reflect your work - it is either too high or too low.  In most cases, we plan to give you the exact grade you suggest.
-* Projects are not weighted evenly, e.g., Project 0 doesn't count as much as the path tracer.  We will determine the weighting at the end of the semester based on the size of each project.
-
----
-SUBMISSION
----
-As with the previous project, you should fork this project and work inside of
-your fork. Upon completion, commit your finished project back to your fork, and
-make a pull request to the master repository.  You should include a README.md
-file in the root directory detailing the following
-
-* A brief description of the project and specific features you implemented
-* At least one screenshot of your project running.
-* A link to a video of your raytracer running.
-* Instructions for building and running your project if they differ from the
-  base code.
-* A performance writeup as detailed above.
-* A list of all third-party code used.
-* This Readme file edited as described above in the README section.
+
+Author: Dave Kotfis
+
+##Overview
+
+This is a GPU graphics pipeline implemented using CUDA. The pipeline is made up of the following stages:
+
+- Vertex Shader -> transforms points/normals using the Model-View-Projection matrix.
+- Primitive Assembly -> Constructs triangles from vertices and normals.
+- Culling -> Removes primitives that face away from the screen, or are entirely out of the field of view.
+- Geometry Shader -> Provides tesselation of primitives.
+- Rasterization -> Uses a scanline algorithm to turn primitives into fragments.
+- Fragment Shader -> Colors fragments using Phong shading.
+- Rendering -> Takes the front fragment from the depth buffer and stores in the frame buffer to show onscreen.
+
+
+##Progress
+
+The first feature that I could demonstrate was the rasterization step. I implemented a scanline algorithm based on http://www.sunshine2k.de/coding/java/TriangleRasterization/TriangleRasterization.html. This breaks up each triangle in 2, one with a flat bottom and the other with a flat top. Each of these triangles can be rasterized by iterating through each horizontal line, starting with the tip. To simplify the pipeline at this point, I rendered a single triangle with a single color. The result looked like this:
+
+<img src="https://raw.github.com/dkotfis/Project4-Rasterizer/master/renders/raster-triangle.png" "Rasterization Kernel">
+
+Next, I added the vertex shader, primitive assembly, and culling steps so I could render models. My setup keeps all models at the origin of world coordinates, and has a camera that points at the origin. This made it straightforward to later move the camera with mouse controls. The primitive assembly stage loads in a single normal for each triangle from the NBO. This normal would later be used in the fragment shader, so was also passed along through the rasterization stage onto the fragments. I added backface culling through Thrust by performing stream compaction on the primitives by checking the winding order of the vertices in image coordinates. The result of adding these steps in rendering a cow object looked like this:
+
+<img src="https://raw.github.com/dkotfis/Project4-Rasterizer/master/renders/cow_rasterize.png" "Vertex+Primitives">
+
+I then added phong shading to make the models appear 3 dimensional. This adds an ambient, diffuse, and specular component using a light source and a normal vector for each fragment. The result looks like this:
+
+<img src="https://raw.github.com/dkotfis/Project4-Rasterizer/master/renders/cow_phong.png" "Phong Shading">
+
+I next added camera control using the mouse with GLFW so the 3D models can be explored and examined from the application. The mouse control scheme treats the camera's position in spherical coordinates. Clicking and dragging the mouse in the X direction results in incrementally changing the azimuth angle, and the Y direction does the same for the polar angle. To avoid this exploration scheme from breaking down, the polar angle is bound between 0 and PI. Using the scroll wheel on the mouse moves the radial coordinate of the camera incrementally. I added speed factors that were determined experimentally to maximize usability. A video demonstrating camera control can be found in renders/cow_video.wmv.
+
+In cases where a single primitive is very large on screen, the rasterization step for the primitive can take a much longer time to rasterize into fragments. This seemed to be to primary performance bottleneck that I was running into. To solve this, I added a tesselation step after culling that can reduce a triangle into smaller sub-triangles. This can even out the workload for rasterization so that more triangles can be working on drawing the same region of space. I did this by tesselating the triangle by splitting it into 3 subtriangles using a new vertex at the center of the triangle. Additional tessellation is supported by recursing additional layers (implemented as a stack). Thus, a single triangle can be split into 3^N triangles.
+
+##Performance Analysis
+
+Backface Culling - I've found that removing backfacing primitives before rasterization has made very little impact on the rendering speed. For a standard cow object test case, the original 5,804 triangles are culled down by nearly half to 3,014 triangles. However, the performance impact is hardly noticeable, running at 40 FPS independently of whether or not this feature is turned on.
+
+Tesselation - I've compared the run speeds for rendering the cow model at different distances (same viewing angle). For all cases, the speed decreases when getting closer to the model since the rasterization becomes more work for each primitive. 1 level of tesselation increases the number of triangles by 3, which helps reduce the amount of work per triangle at close distances, but hurts at larger distances since more triangles must be executed and each does very little work. Tesselating 2 levels to create 27 times more triangles runs slowly at all distances tested.
+
+<img src="https://raw.github.com/dkotfis/Project4-Rasterizer/master/renders/tesselation_impact.png" "Tesselation Performance">
+
+##Future
+
+- I've notices edge cases where my rasterization algorithm creates bleeding scanlines. This happens when one too many scanlines are generated, due to rounding, and when the triangle slopes are very high.
+- I don't currently support loading in and using materials. The color and reflection properties are just hard coded into the fragment shader.
+- Writing to the depth buffer within rasterization isn't actually thread-safe. I have minimized the distance between the read and write steps, but am not using any locks or atomic functions. I've found that the cases where this causes an issue are rare enough that it has been a low priority to solve.
+- Adaptive level of detail for the tesselation process would be extremely beneficial. Right now, a constant tesselation level applied to all triangles reduces the performance while running with a model far away to boost the performance of the model up close. To get the best of both worlds, it would be better to determine the onscreen area associated with a triangle, and tesselate until the sizes are under an appropriate threshold. This would be complicated to implement, since I would have to do more dynamic GPU memory allocation from within kernels, or allocate extra space up-front and use stream compaction to reduce it afterwards.
+- More sophisticated tesselation methods would be valuable. Mine is simple and fast, but it can only tesselate in powers of 3. There are places where more granularity will be necessary to achieve good performance.
 
diff --git a/external/src/objUtil/obj.cpp b/external/src/objUtil/obj.cpp
index 22a33aa..c4320eb 100644
--- a/external/src/objUtil/obj.cpp
+++ b/external/src/objUtil/obj.cpp
@@ -18,7 +18,7 @@ obj::obj(){
 	defaultColor = glm::vec3(0,0,0);
 	boundingbox = new float[32];
 	maxminSet = false;
-	xmax=0; xmin=0; ymax=0; ymin=0; zmax=0; zmin=0; 
+	xmax=0; xmin=0; ymax=0; ymin=0; zmax=0; zmin=0;
 	
 }
 
@@ -52,7 +52,7 @@ void obj::buildVBOs(){
 			for(int i=2; i<face.size(); i++){
 				glm::vec4 p1 = points[face[i-1]];
 				glm::vec4 p2 = points[face[i]];
-				VBOvec.push_back(p0[0]) ; VBOvec.push_back(p0[1]); VBOvec.push_back(p0[2]); //VBOvec.push_back(1.0f);
+				VBOvec.push_back(p0[0]); VBOvec.push_back(p0[1]); VBOvec.push_back(p0[2]); //VBOvec.push_back(1.0f);
 				VBOvec.push_back(p1[0]); VBOvec.push_back(p1[1]); VBOvec.push_back(p1[2]); //VBOvec.push_back(1.0f);
 				VBOvec.push_back(p2[0]); VBOvec.push_back(p2[1]); VBOvec.push_back(p2[2]); //VBOvec.push_back(1.0f);
 
diff --git a/objs/tri.obj b/objs/tri.obj
index fb38e35..544e699 100644
--- a/objs/tri.obj
+++ b/objs/tri.obj
@@ -1,6 +1,6 @@
 v 0 0 0
 v 0.1 0 0
-v 0 0.1 0
+v 0.1 0.1 0
 
 vn 0 0 1
 
diff --git a/renders/Thumbs.db b/renders/Thumbs.db
new file mode 100644
index 0000000..fec051d
Binary files /dev/null and b/renders/Thumbs.db differ
diff --git a/renders/cow_diffuse_good_2.png b/renders/cow_diffuse_good_2.png
new file mode 100644
index 0000000..51b784b
Binary files /dev/null and b/renders/cow_diffuse_good_2.png differ
diff --git a/renders/cow_phong.png b/renders/cow_phong.png
new file mode 100644
index 0000000..fce2f81
Binary files /dev/null and b/renders/cow_phong.png differ
diff --git a/renders/cow_phong_diffuse.png b/renders/cow_phong_diffuse.png
new file mode 100644
index 0000000..9c91f14
Binary files /dev/null and b/renders/cow_phong_diffuse.png differ
diff --git a/renders/cow_phong_diffuse2.png b/renders/cow_phong_diffuse2.png
new file mode 100644
index 0000000..a51ccc7
Binary files /dev/null and b/renders/cow_phong_diffuse2.png differ
diff --git a/renders/cow_phong_diffuse_fixed.png b/renders/cow_phong_diffuse_fixed.png
new file mode 100644
index 0000000..791840c
Binary files /dev/null and b/renders/cow_phong_diffuse_fixed.png differ
diff --git a/renders/cow_phong_diffuse_good.png b/renders/cow_phong_diffuse_good.png
new file mode 100644
index 0000000..15f17d9
Binary files /dev/null and b/renders/cow_phong_diffuse_good.png differ
diff --git a/renders/cow_phong_pretty.png b/renders/cow_phong_pretty.png
new file mode 100644
index 0000000..522d924
Binary files /dev/null and b/renders/cow_phong_pretty.png differ
diff --git a/renders/cow_rasterize.png b/renders/cow_rasterize.png
new file mode 100644
index 0000000..7aac3af
Binary files /dev/null and b/renders/cow_rasterize.png differ
diff --git a/renders/cow_video.wmv b/renders/cow_video.wmv
new file mode 100644
index 0000000..95685c5
Binary files /dev/null and b/renders/cow_video.wmv differ
diff --git a/renders/diffuse_cube.png b/renders/diffuse_cube.png
new file mode 100644
index 0000000..db63e96
Binary files /dev/null and b/renders/diffuse_cube.png differ
diff --git a/renders/raster-triangle.png b/renders/raster-triangle.png
new file mode 100644
index 0000000..0a98136
Binary files /dev/null and b/renders/raster-triangle.png differ
diff --git a/renders/tesselation_impact.png b/renders/tesselation_impact.png
new file mode 100644
index 0000000..b57c0f1
Binary files /dev/null and b/renders/tesselation_impact.png differ
diff --git a/src/main.cpp b/src/main.cpp
index 13d8e67..598df5e 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -2,6 +2,7 @@
 // Written by Yining Karl Li, Copyright (c) 2012 University of Pennsylvania
 
 #include "main.h"
+#include "glm/gtc/matrix_transform.hpp"
 
 //-------------------------------
 //-------------MAIN--------------
@@ -9,6 +10,8 @@
 
 int main(int argc, char** argv){
 
+
+  /*
   bool loadedScene = false;
   for(int i=1; i<argc; i++){
     string header; string data;
@@ -23,12 +26,44 @@ int main(int argc, char** argv){
       loadedScene = true;
     }
   }
+  */
+
+  // Initialize the model
+  bool loadedScene = false;
+  string data = std::string("C:\\\\Users\\Dave\\Documents\\Github\\Project4-Rasterizer\\objs\\cow.obj");
+  mesh = new obj();
+  objLoader* loader = new objLoader(data, mesh);
+  mesh->buildVBOs();
+  delete loader;
+  loadedScene = true;
 
   if(!loadedScene){
     cout << "Usage: mesh=[obj file]" << endl;
     return 0;
   }
 
+  // Initialize the camera
+  camera_distance = 3.0f;
+  camera_phi = PI/2.0f;
+  camera_theta = 3.0f*PI/2.0f;
+
+  // Initialize the MVP matrix
+  glm::mat4 model = glm::translate(glm::mat4(), glm::vec3(0.0f, 0.0f, 0.0f)) * glm::scale(glm::mat4(), glm::vec3(1.0f));
+  glm::vec3 eye(camera_distance*sin(camera_phi)*cos(camera_theta), camera_distance*cos(camera_phi), camera_distance*sin(camera_phi)*sin(camera_theta));
+  glm::vec3 origin(0.0f, 0.0f, 0.0f);
+  glm::vec3 up(0.0f, 1.0f, 0.0f);
+  glm::mat4 view = glm::lookAt(eye, origin, up);
+  glm::mat4 projection = glm::perspective(45.0f, (float)(width / height), 0.1f, 100.0f);
+  mvp = projection * view * model;
+
+  // Initialize the light source (undirected)
+  light.origin = glm::vec3(-15.0f, -15.0f, -15.0f);
+  light.color = glm::vec3(1.0f, 1.0f, 1.0f);
+  glm::vec4 light_temp = mvp*glm::vec4(light.origin, 1.0f);
+  light.origin.x = light_temp.x/light_temp.w;
+  light.origin.y = light_temp.y/light_temp.w;
+  light.origin.z = light_temp.z/light_temp.w;
+
   frame = 0;
   seconds = time (NULL);
   fpstracker = 0;
@@ -93,13 +128,26 @@ void runCuda(){
   ibo = mesh->getIBO();
   ibosize = mesh->getIBOsize();
 
+  nbo = mesh->getNBO();
+  nbosize = mesh->getNBOsize();
+
+  // Update the MVP matrix
+  glm::mat4 model = glm::translate(glm::mat4(), glm::vec3(0.0f, 0.0f, 0.0f)) * glm::scale(glm::mat4(), glm::vec3(1.0f));
+  glm::vec3 eye(camera_distance*sin(camera_phi)*cos(camera_theta), camera_distance*cos(camera_phi), camera_distance*sin(camera_phi)*sin(camera_theta));
+  glm::vec3 origin(0.0f, 0.0f, 0.0f);
+  glm::vec3 up(0.0f, 1.0f, 0.0f);
+  glm::mat4 view = glm::lookAt(eye, origin, up);
+  glm::mat4 projection = glm::perspective(45.0f, (float)(width / height), 0.1f, 100.0f);
+  mvp = projection * view * model;
+
   cudaGLMapBufferObject((void**)&dptr, pbo);
-  cudaRasterizeCore(dptr, glm::vec2(width, height), frame, vbo, vbosize, cbo, cbosize, ibo, ibosize);
+  cudaRasterizeCore(dptr, glm::vec2(width, height), frame, light, vbo, vbosize, cbo, cbosize, ibo, ibosize, nbo, nbosize, mvp);
   cudaGLUnmapBufferObject(pbo);
 
   vbo = NULL;
   cbo = NULL;
   ibo = NULL;
+  nbo = NULL;
 
   frame++;
   fpstracker++;
@@ -126,6 +174,9 @@ bool init(int argc, char* argv[]) {
   }
   glfwMakeContextCurrent(window);
   glfwSetKeyCallback(window, keyCallback);
+  glfwSetMouseButtonCallback(window, (GLFWmousebuttonfun)mouseButtonCallback);
+  glfwSetCursorPosCallback(window, (GLFWcursorposfun)mousePositionCallback);
+  glfwSetScrollCallback(window, (GLFWscrollfun)mouseScrollCallback);
 
   // Set up GL context
   glewExperimental = GL_TRUE;
@@ -281,4 +332,39 @@ void keyCallback(GLFWwindow* window, int key, int scancode, int action, int mods
     if(key == GLFW_KEY_ESCAPE && action == GLFW_PRESS){
         glfwSetWindowShouldClose(window, GL_TRUE);
     }
+}
+
+void mouseButtonCallback(GLFWwindow* window, int button, int action) {
+
+  if (button == GLFW_MOUSE_BUTTON_LEFT) {
+    mouse_pressed = (action == GLFW_PRESS) ? true : false;
+    if (!mouse_pressed) {
+      x_last = -1;
+      y_last = -1;
+    }
+  }
+
+}
+
+void mousePositionCallback(GLFWwindow* window, double x, double y) {
+
+  if (mouse_pressed) {
+    if (x_last == -1) {
+      x_last = x;
+      y_last = y;
+      return;
+    }
+    camera_theta -= (x-x_last)*0.0005;
+    camera_phi += (y - y_last)*0.0005;
+    if (camera_phi <= 0.0005f) {
+      camera_phi = 0.0005f;
+    } else if (camera_phi >= PI-0.0005f) {
+      camera_phi = PI-0.0005f;
+    }
+  }
+
+}
+
+void mouseScrollCallback(GLFWwindow* window, double xoffset, double yoffset) {
+  camera_distance -= yoffset*0.1f;
 }
\ No newline at end of file
diff --git a/src/main.h b/src/main.h
index 8999110..f440690 100644
--- a/src/main.h
+++ b/src/main.h
@@ -21,6 +21,7 @@
 
 
 #include "rasterizeKernels.h"
+#include "sceneStructs.h"
 #include "utilities.h"
 
 using namespace std;
@@ -40,7 +41,13 @@ GLuint displayImage;
 uchar4 *dptr;
 
 GLFWwindow *window;
+bool mouse_pressed = false;
+double x_last = -1;
+double y_last = -1;
 
+//--------------------------------
+//-----------MODEL STUFF----------
+//--------------------------------
 obj* mesh;
 
 float* vbo;
@@ -49,6 +56,17 @@ float* cbo;
 int cbosize;
 int* ibo;
 int ibosize;
+float* nbo;
+int nbosize;
+
+//-------------------------------
+//----------CAMERA STUFF---------
+//-------------------------------
+glm::mat4 mvp;
+ray light;
+float camera_distance;
+float camera_phi;
+float camera_theta;
 
 //-------------------------------
 //----------CUDA STUFF-----------
@@ -99,5 +117,7 @@ void deleteTexture(GLuint* tex);
 void mainLoop();
 void errorCallback(int error, const char *description);
 void keyCallback(GLFWwindow *window, int key, int scancode, int action, int mods);
-
+void mouseButtonCallback(GLFWwindow* window, int button, int action);
+void mousePositionCallback(GLFWwindow* window, double x, double y);
+void mouseScrollCallback(GLFWwindow* window, double xoffset, double yoffset);
 #endif
\ No newline at end of file
diff --git a/src/rasterizeKernels.cu b/src/rasterizeKernels.cu
index 10b0000..226b30d 100644
--- a/src/rasterizeKernels.cu
+++ b/src/rasterizeKernels.cu
@@ -5,6 +5,8 @@
 #include <cuda.h>
 #include <cmath>
 #include <thrust/random.h>
+#include <thrust/device_ptr.h>
+#include <thrust/copy.h>
 #include "rasterizeKernels.h"
 #include "rasterizeTools.h"
 
@@ -13,7 +15,10 @@ fragment* depthbuffer;
 float* device_vbo;
 float* device_cbo;
 int* device_ibo;
-triangle* primitives;
+float* device_nbo;
+triangle* primitives1;
+triangle* primitives2;
+triangle* primitives3;
 
 void checkCUDAError(const char *msg) {
   cudaError_t err = cudaGetLastError();
@@ -34,16 +39,38 @@ __host__ __device__ unsigned int hash(unsigned int a){
     return a;
 }
 
+//Handy function for clamping between two values;
+__host__ __device__ float clamp(float val, float min, float max) {
+  float result = val;
+  if (val < min) {
+    val = min;
+  } else if (val > max) {
+    val = max;
+  }
+  return val;
+}
+
+//Handy function for reflection
+__host__ __device__ glm::vec3 reflect(glm::vec3 vec_in, glm::vec3 norm) {
+  return (vec_in - 2.0f*glm::dot(vec_in, norm)*norm);
+}
+
 //Writes a given fragment to a fragment buffer at a given location
-__host__ __device__ void writeToDepthbuffer(int x, int y, fragment frag, fragment* depthbuffer, glm::vec2 resolution){
+/*__device__ void writeToDepthbuffer(int x, int y, fragment frag, fragment* depthbuffer, glm::vec2 resolution){
+  int index = x + (y * resolution.x);
   if(x<resolution.x && y<resolution.y){
-    int index = (y*resolution.x) + x;
+    //Depth comparison
+    if (frag.position.z > 0.0f || frag.position.z < depthbuffer[index].position.z) {
+      return;
+    }
+
+    //Add the new fragment to the buffer
     depthbuffer[index] = frag;
   }
-}
+}*/
 
-//Reads a fragment from a given location in a fragment buffer
-__host__ __device__ fragment getFromDepthbuffer(int x, int y, fragment* depthbuffer, glm::vec2 resolution){
+//Reads the fragment from a given location in a depth buffer
+/*__host__ __device__ fragment getFromDepthbuffer(int x, int y, fragment* depthbuffer, glm::vec2 resolution){
   if(x<resolution.x && y<resolution.y){
     int index = (y*resolution.x) + x;
     return depthbuffer[index];
@@ -51,7 +78,7 @@ __host__ __device__ fragment getFromDepthbuffer(int x, int y, fragment* depthbuf
     fragment f;
     return f;
   }
-}
+}*/
 
 //Writes a given pixel to a pixel buffer at a given location
 __host__ __device__ void writeToFramebuffer(int x, int y, glm::vec3 value, glm::vec3* framebuffer, glm::vec2 resolution){
@@ -81,17 +108,17 @@ __global__ void clearImage(glm::vec2 resolution, glm::vec3* image, glm::vec3 col
     }
 }
 
-//Kernel that clears a given fragment buffer with a given fragment
+//Kernel that clears a given depth buffer
 __global__ void clearDepthBuffer(glm::vec2 resolution, fragment* buffer, fragment frag){
-    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
-    int y = (blockIdx.y * blockDim.y) + threadIdx.y;
-    int index = x + (y * resolution.x);
-    if(x<=resolution.x && y<=resolution.y){
-      fragment f = frag;
-      f.position.x = x;
-      f.position.y = y;
-      buffer[index] = f;
-    }
+  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+  int index = x + (y * resolution.x);
+  if(x<=resolution.x && y<=resolution.y){
+    fragment f = frag;
+    f.position.x = x;
+    f.position.y = y;
+    buffer[index] = f;
+	}
 }
 
 //Kernel that writes the image to the OpenGL PBO directly. 
@@ -128,34 +155,389 @@ __global__ void sendImageToPBO(uchar4* PBOpos, glm::vec2 resolution, glm::vec3*
   }
 }
 
-//TODO: Implement a vertex shader
-__global__ void vertexShadeKernel(float* vbo, int vbosize){
+//A vertex shader that does model-view-projection transformation
+__global__ void vertexShadeKernel(float* vbo, int vbosize, float* nbo, int nbosize, glm::mat4 mvp){
   int index = (blockIdx.x * blockDim.x) + threadIdx.x;
   if(index<vbosize/3){
+    //Pull together a vertex
+    glm::vec4 vert(vbo[3*index], vbo[3*index+1], vbo[3*index+2], 1.0f);
+
+    //Transform the vertex
+    glm::vec4 new_vert = mvp*vert;
+
+    //Store it back in the vbo
+    vbo[3*index] = new_vert.x/new_vert.w;
+    vbo[3*index+1] = new_vert.y/new_vert.w;
+    vbo[3*index+2] = new_vert.z/new_vert.w;
+
+    //Pull together a normal
+    glm::vec4 norm(nbo[3*index], nbo[3*index+1], nbo[3*index+2], 1.0f);
+
+    //Transform the normal
+    glm::vec4 new_norm = mvp*norm;
+
+    //Store it back in the nbp
+    nbo[3*index] = new_norm.x/new_norm.w;
+    nbo[3*index+1] = new_norm.y/new_norm.w;
+    nbo[3*index+2] = new_norm.z/new_norm.w;
   }
 }
 
-//TODO: Implement primative assembly
-__global__ void primitiveAssemblyKernel(float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize, triangle* primitives){
+//Implement primative assembly
+__global__ void primitiveAssemblyKernel(float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize, float* nbo, int nbosize, triangle* primitives){
   int index = (blockIdx.x * blockDim.x) + threadIdx.x;
   int primitivesCount = ibosize/3;
   if(index<primitivesCount){
+	  // Add vertices
+    glm::vec3 p0, p1, p2;
+    p0.x = vbo[3 * ibo[3 * index]];
+    p0.y = vbo[3 * ibo[3 * index] + 1];
+    p0.z = vbo[3 * ibo[3 * index] + 2];
+    p1.x = vbo[3 * ibo[3 * index + 1]];
+    p1.y = vbo[3 * ibo[3 * index + 1] + 1];
+    p1.z = vbo[3 * ibo[3 * index + 1] + 2];
+    p2.x = vbo[3 * ibo[3 * index + 2]];
+    p2.y = vbo[3 * ibo[3 * index + 2] + 1];
+    p2.z = vbo[3 * ibo[3 * index + 2] + 2];
+    primitives[index].p0 = p0;
+    primitives[index].p1 = p1;
+    primitives[index].p2 = p2;
+
+	  // Add color
+	  primitives[index].c0.x = cbo[3*ibo[3*index]];
+	  primitives[index].c0.y = cbo[3*ibo[3*index]+1];
+	  primitives[index].c0.z = cbo[3*ibo[3*index]+2];
+	  primitives[index].c1.x = cbo[3*ibo[3*index+1]];
+	  primitives[index].c1.y = cbo[3*ibo[3*index+1]+1];
+	  primitives[index].c1.z = cbo[3*ibo[3*index+1]+2];
+	  primitives[index].c2.x = cbo[3*ibo[3*index+2]];
+	  primitives[index].c2.y = cbo[3*ibo[3*index+2]+1];
+	  primitives[index].c2.z = cbo[3*ibo[3*index+2]+2];
+
+    // Add normal
+    primitives[index].n.x = (nbo[3 * ibo[3 * index]] + nbo[3 * ibo[3 * index+1]] + nbo[3 * ibo[3 * index+2]])/3.0f;
+    primitives[index].n.y = (nbo[3 * ibo[3 * index] + 1] + nbo[3 * ibo[3 * index+1] + 1] + nbo[3 * ibo[3 * index+2] + 1])/3.0f;
+    primitives[index].n.z = (nbo[3 * ibo[3 * index] + 2] + nbo[3 * ibo[3 * index+1] + 2] + nbo[3 * ibo[3 * index+2] + 2])/3.0f;
+    primitives[index].n = glm::normalize(primitives[index].n);
+  }
+}
+
+//Backface culling function
+__host__ __device__ bool isTriangleBackfacing(triangle tri) {
+  float x1 = tri.p1.x - tri.p0.x;
+  float y1 = tri.p1.y - tri.p0.y;
+  float x2 = tri.p2.x - tri.p0.x;
+  float y2 = tri.p2.y - tri.p0.y;
+
+  return ((x1*y2 - y1*x2) > 0.0f);
+}
+
+//Thrust predicate for triangle removal
+struct check_triangle {
+  __host__ __device__
+    bool operator() (const triangle& t) {
+      bool back = isTriangleBackfacing(t);
+      bool p0 = (t.p0.x < -1.0f || t.p0.x > 1.0f) || (t.p0.x < -1.0f || t.p0.x > 1.0f) || (t.p0.z < 0.0f);
+      bool p1 = (t.p1.x < -1.0f || t.p1.x > 1.0f) || (t.p1.x < -1.0f || t.p1.x > 1.0f) || (t.p1.z < 0.0f);
+      bool p2 = (t.p2.x < -1.0f || t.p2.x > 1.0f) || (t.p2.x < -1.0f || t.p2.x > 1.0f) || (t.p2.z < 0.0f);
+      bool outside = (p0 && p1 && p2);
+      return !(back || outside);
+    }
+};
+
+//Kernel to trim primitives before rasterization
+__host__ void culling(triangle* primitives, triangle* new_primitives, int& numPrimitives) {
+  thrust::device_ptr<triangle> in = thrust::device_pointer_cast<triangle>(primitives);
+  thrust::device_ptr<triangle> out = thrust::device_pointer_cast<triangle>(new_primitives);
+  numPrimitives = thrust::copy_if(in, in + numPrimitives, out, check_triangle()) - out;
+}
+
+//Handy function that splits a triangle into 3 sub-triangles
+__host__ __device__ void splitTriangle(triangle* in, triangle* out) {
+  glm::vec3 center_pt = (in->p0 + in->p1 + in->p2)/3.0f;
+
+  //Create triangle 0
+  out[0].p0 = in->p0;
+  out[0].p1 = in->p1;
+  out[0].p2 = center_pt;
+  out[0].n = in->n;
+
+  //Create triangle 1
+  out[1].p0 = in->p1;
+  out[1].p1 = in->p2;
+  out[1].p2 = center_pt;
+  out[1].n = in->n;
+
+  //Create triangle 2
+  out[2].p0 = in->p2;
+  out[2].p1 = in->p0;
+  out[2].p2 = center_pt;
+  out[2].n = in->n;
+
+  //TODO: Pass/interpolate vertex colors
+}
+
+//Tesselation kernel
+__global__ void geometryShadeKernel(triangle* primitives, triangle* new_primitives, int primitivesCount, int tesselationLevel) {
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index<primitivesCount){
+    //Create local memory to create triangles in
+    int size = (int)pow(3.0f, (float)tesselationLevel);
+    triangle* these_tris1 = (triangle*) malloc(size*sizeof(triangle));
+    triangle* these_tris2 = (triangle*) malloc(size*sizeof(triangle));
+    these_tris1[0] = primitives[index];
+
+    //Loop through tesselation levels
+    for (int i = 0; i < tesselationLevel; i++) {
+      int currentNumTris = (int)pow(3.0f, (float)i);
+
+      //Loop through triangles in the previous level and create new ones
+      for (int j = 0; j < currentNumTris; j++) {
+        splitTriangle(&(these_tris1[j]), &(these_tris2[3*j]));
+      }
+
+      //Swap the local memory for the next level
+      triangle* temp = these_tris1;
+      these_tris1 = these_tris2;
+      these_tris2 = temp;
+    }
+
+    //Save off the result into global memory
+    memcpy(&(new_primitives[index*size]), these_tris1, size*sizeof(triangle));
+
+    //Delete local memory
+    free(these_tris1);
+    free(these_tris2);
+  }
+}
+
+//Function for sorting a triangle's vertices in ascending y
+__host__ __device__ void sortTriangleOnY(triangle in, triangle& out) {
+  bool i01 = (in.p0.y < in.p1.y);
+  bool i02 = (in.p0.y < in.p2.y);
+  bool i12 = (in.p1.y < in.p2.y);
+  if (i01 && i02) {
+    //0 is min
+    out.p0 = in.p0;
+    if (i12) {
+      out.p1 = in.p1;
+      out.p2 = in.p2;
+    }
+    else {
+      out.p1 = in.p2;
+      out.p2 = in.p1;
+    }
+  }
+  else if (!i01 && i12) {
+    //1 is min
+    out.p0 = in.p1;
+    if (i02) {
+      out.p1 = in.p0;
+      out.p2 = in.p2;
+    }
+    else {
+      out.p1 = in.p2;
+      out.p2 = in.p0;
+    }
+  }
+  else {
+    //2 is min
+    out.p0 = in.p2;
+    if (i01) {
+      out.p1 = in.p0;
+      out.p2 = in.p1;
+    }
+    else {
+      out.p1 = in.p1;
+      out.p2 = in.p0;
+    }
+  }
+}
+
+__host__ __device__ void fillBottomTriangle(triangle t, fragment* depthbuffer, glm::vec2 resolution) {
+
+  float denom = (t.p2.y - t.p0.y);
+  float invslope1x = (abs(denom)>=1.0f) ? (t.p2.x - t.p0.x) / denom : 0.0f;
+  float invslope1z = (abs(denom)>=1.0f) ? (t.p2.z - t.p0.z) / denom : 0.0f;
+  float invslope2x = (abs(denom)>=1.0f) ? (t.p2.x - t.p1.x) / denom : 0.0f;
+  float invslope2z = (abs(denom)>=1.0f) ? (t.p2.z - t.p1.z) / denom : 0.0f;
+
+  float curx1 = t.p2.x;
+  float curx2 = t.p2.x;
+  float curz1 = t.p2.z;
+  float curz2 = t.p2.z;
+
+  //Loop over scanlines from bottom to top
+  for (int y = ((int) (t.p2.y+0.5f)); y >= ((int) (t.p0.y+0.5f)); y--) {
+
+    curx1 -= invslope1x;
+    curx2 -= invslope2x;
+    curz1 -= invslope1z;
+    curz2 -= invslope2z;
+
+    float z = curz1;
+    denom = (curx2 - curx1);
+    float invslopezx = (abs(denom) >= 0.75f) ? (curz2 - curz1) / denom : 0.0f;
+
+    //Draw scanline
+    for (int x = (int) (min(curx1,curx2)+0.5f); x <= (int) (max(curx1,curx2)+0.5f); x++) {
+      //Make sure the pixel is in the screen
+      if (x < 0 || x >= resolution.x || y < 0 || y >= resolution.y) {
+        continue;
+      }
+
+      //Make the fragment
+      fragment frag;
+      frag.position.x = ((float)x) * 2.0 / resolution.x + 1.0f;
+      frag.position.y = ((float)y) * 2.0 / resolution.y + 1.0f;
+      frag.position.z = z;
+      frag.normal = t.n;
+
+      //Depth comparison
+      int index = x + (y * resolution.x);
+      if (z < 0.0f || z < depthbuffer[index].position.z) {
+        continue;
+      }
+      
+      //Add the new fragment to the buffer
+      depthbuffer[index] = frag;
+
+      z -= invslopezx;
+    }
+
+  }
+}
+
+__host__ __device__ void fillTopTriangle(triangle t, fragment* depthbuffer, glm::vec2 resolution) {
+
+  float denom = (t.p1.y - t.p0.y);
+  float invslope1x = (abs(denom)>=1.0f) ? (t.p1.x - t.p0.x) / denom : 0.0f;
+  float invslope1z = (abs(denom)>=1.0f) ? (t.p1.z - t.p0.z) / denom : 0.0f;
+  float invslope2x = (abs(denom)>=1.0f) ? (t.p2.x - t.p0.x) / denom : 0.0f;
+  float invslope2z = (abs(denom)>=1.0f) ? (t.p2.z - t.p0.z) / denom : 0.0f;
+
+  float curx1 = t.p0.x;
+  float curx2 = t.p0.x;
+  float curz1 = t.p0.z;
+  float curz2 = t.p0.z;
+
+  for (int y = ((int) (t.p0.y+0.5f)); y <= ((int) (t.p2.y+0.5f)); y++) {
+    float z = curz1;
+    denom = (curx2 - curx1);
+    float invslopezx = (abs(denom) >= 0.75f) ? (curz2 - curz1) / denom : 0.0f;
+
+    for (int x = (int) (min(curx1,curx2)+0.5f); x <= (int) (max(curx1,curx2)+0.5f); x++) {
+      //Make sure the pixel is in the screen
+      if (x < 0 || x >= resolution.x || y < 0 || y >= resolution.y) {
+        continue;
+      }
+
+      //Make the fragment
+      fragment frag;
+      frag.position.x = ((float)x) * 2.0 / resolution.x + 1.0f;
+      frag.position.y = ((float)y) * 2.0 / resolution.y + 1.0f;
+      frag.position.z = z;
+      frag.normal = t.n;
+
+      //Depth comparison
+      int index = x + (y * resolution.x);
+      if (z < 0.0f || z < depthbuffer[index].position.z) {
+        continue;
+      }
+
+      //Add the new fragment to the buffer
+      depthbuffer[index] = frag;
+
+      z += invslopezx;
+    }
+    
+    curx1 += invslope1x;
+    curx2 += invslope2x;
+    curz1 += invslope1z;
+    curz2 += invslope2z;
   }
 }
 
-//TODO: Implement a rasterization method, such as scanline.
+//Scanline algorithm for rasterization (inspired by http://www.sunshine2k.de/coding/java/TriangleRasterization/TriangleRasterization.html)
+__host__ __device__ void scanlineTriangle(triangle tri, fragment* depthbuffer, glm::vec2 resolution) {
+  //Sort the vertices by descending y
+  triangle sorted_tri;
+  sortTriangleOnY(tri, sorted_tri);
+
+  //Determine the 4th point to split on
+  glm::vec3 p3 = sorted_tri.p1;
+  float denom = (sorted_tri.p2.y - sorted_tri.p0.y);
+  float t = (abs(denom)>=0.75f) ? (sorted_tri.p1.y - sorted_tri.p0.y) / denom : 0.0f;
+  p3.x = sorted_tri.p0.x + t * (sorted_tri.p2.x - sorted_tri.p0.x);
+  p3.z = sorted_tri.p0.z + t * (sorted_tri.p2.z - sorted_tri.p0.z);
+
+  //Build the two triangles
+  triangle top;
+  top.p0 = sorted_tri.p0;
+  top.p1 = sorted_tri.p1;
+  top.p2 = p3;
+  top.n = tri.n;
+  triangle bottom;
+  bottom.p0 = sorted_tri.p1;
+  bottom.p1 = p3;
+  bottom.p2 = sorted_tri.p2;
+  bottom.n = tri.n;
+
+  //Fill in the two triangles
+  fillTopTriangle(top, depthbuffer, resolution);
+  fillBottomTriangle(bottom, depthbuffer, resolution);
+}
+
+//A rasterization kernel for triangle primitives
 __global__ void rasterizationKernel(triangle* primitives, int primitivesCount, fragment* depthbuffer, glm::vec2 resolution){
   int index = (blockIdx.x * blockDim.x) + threadIdx.x;
   if(index<primitivesCount){
+    //Copy the triangle locally for quick access
+    triangle this_tri = primitives[index];
+
+    //Scale the triangle into window coordinates
+    this_tri.p0.x = resolution.x*(1.0f - this_tri.p0.x)/2;
+    this_tri.p1.x = resolution.x*(1.0f - this_tri.p1.x)/2;
+    this_tri.p2.x = resolution.x*(1.0f - this_tri.p2.x)/2;
+    this_tri.p0.y = resolution.y*(1.0f - this_tri.p0.y)/2;
+    this_tri.p1.y = resolution.y*(1.0f - this_tri.p1.y)/2;
+    this_tri.p2.y = resolution.y*(1.0f - this_tri.p2.y)/2;
+
+    //Generate fragments from a scanline function
+    scanlineTriangle(this_tri, depthbuffer, resolution);
   }
 }
 
-//TODO: Implement a fragment shader
-__global__ void fragmentShadeKernel(fragment* depthbuffer, glm::vec2 resolution){
+//Phong shader
+__global__ void fragmentShadeKernel(fragment* depthbuffer, glm::vec2 resolution, ray light){
   int x = (blockIdx.x * blockDim.x) + threadIdx.x;
   int y = (blockIdx.y * blockDim.y) + threadIdx.y;
   int index = x + (y * resolution.x);
   if(x<=resolution.x && y<=resolution.y){
+    if (depthbuffer[index].position.z > -10000.0f) {
+      //Store the fragment info locally for accessibility
+      glm::vec3 V = depthbuffer[index].position;
+      glm::vec3 N = depthbuffer[index].normal;
+
+      //Compute necessary vectors
+      glm::vec3 L = glm::normalize(light.origin - V);
+      glm::vec3 E = glm::normalize(-V);
+      glm::vec3 R = glm::normalize(reflect(-L,N));
+
+      //Shininess
+      float specPow = 4.0f;
+
+      //Green (TODO: read from material)
+      glm::vec3 green(0.0f, 1.0f, 0.0f);
+
+      //Compute lighting
+      glm::vec3 ambient = 0.1f * green;
+      glm::vec3 diffuse = 0.45f * clamp(glm::dot(N, L), 0.0f, 1.0f) * green;
+      glm::vec3 specular = 0.45f * clamp(pow(max(glm::dot(R,E), 0.0f), specPow), 0.0f, 1.0f) * green;
+      depthbuffer[index].color = ambient + diffuse + specular;
+
+      //depthbuffer[index].color = green;
+    }
   }
 }
 
@@ -172,13 +554,17 @@ __global__ void render(glm::vec2 resolution, fragment* depthbuffer, glm::vec3* f
 }
 
 // Wrapper for the __global__ call that sets up the kernel calls and does a ton of memory management
-void cudaRasterizeCore(uchar4* PBOpos, glm::vec2 resolution, float frame, float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize){
+void cudaRasterizeCore(uchar4* PBOpos, glm::vec2 resolution, float frame, ray light, float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize, float* nbo, int nbosize, glm::mat4 mvp){
 
   // set up crucial magic
   int tileSize = 8;
   dim3 threadsPerBlock(tileSize, tileSize);
   dim3 fullBlocksPerGrid((int)ceil(float(resolution.x)/float(tileSize)), (int)ceil(float(resolution.y)/float(tileSize)));
 
+  tileSize = 32;
+  int primitiveBlocks = ceil(((float)vbosize / 3) / ((float)tileSize));
+  int numPrimitives = ibosize / 3;
+
   //set up framebuffer
   framebuffer = NULL;
   cudaMalloc((void**)&framebuffer, (int)resolution.x*(int)resolution.y*sizeof(glm::vec3));
@@ -186,21 +572,25 @@ void cudaRasterizeCore(uchar4* PBOpos, glm::vec2 resolution, float frame, float*
   //set up depthbuffer
   depthbuffer = NULL;
   cudaMalloc((void**)&depthbuffer, (int)resolution.x*(int)resolution.y*sizeof(fragment));
+  fragment frag;
+  frag.color = glm::vec3(0, 0, 0);
+  frag.normal = glm::vec3(0, 0, 0);
+  frag.position = glm::vec3(0, 0, -10000.0f);
+  clearDepthBuffer << <fullBlocksPerGrid, threadsPerBlock >> >(resolution, depthbuffer, frag);
 
   //kernel launches to black out accumulated/unaccumlated pixel buffers and clear our scattering states
   clearImage<<<fullBlocksPerGrid, threadsPerBlock>>>(resolution, framebuffer, glm::vec3(0,0,0));
-  
-  fragment frag;
-  frag.color = glm::vec3(0,0,0);
-  frag.normal = glm::vec3(0,0,0);
-  frag.position = glm::vec3(0,0,-10000);
-  clearDepthBuffer<<<fullBlocksPerGrid, threadsPerBlock>>>(resolution, depthbuffer,frag);
 
   //------------------------------
   //memory stuff
   //------------------------------
-  primitives = NULL;
-  cudaMalloc((void**)&primitives, (ibosize/3)*sizeof(triangle));
+  primitives1 = NULL;
+  cudaMalloc((void**)&primitives1, numPrimitives*sizeof(triangle));
+  primitives2 = NULL;
+  cudaMalloc((void**)&primitives2, numPrimitives*sizeof(triangle));
+  int tessLevels = 1;
+  primitives3 = NULL;
+  cudaMalloc((void**)&primitives3, (int)pow(3.0f,(float)tessLevels)*numPrimitives*sizeof(triangle));
 
   device_ibo = NULL;
   cudaMalloc((void**)&device_ibo, ibosize*sizeof(int));
@@ -214,32 +604,50 @@ void cudaRasterizeCore(uchar4* PBOpos, glm::vec2 resolution, float frame, float*
   cudaMalloc((void**)&device_cbo, cbosize*sizeof(float));
   cudaMemcpy( device_cbo, cbo, cbosize*sizeof(float), cudaMemcpyHostToDevice);
 
-  tileSize = 32;
-  int primitiveBlocks = ceil(((float)vbosize/3)/((float)tileSize));
-
+  device_nbo = NULL;
+  cudaMalloc((void**)&device_nbo, nbosize*sizeof(float));
+  cudaMemcpy(device_nbo, nbo, nbosize*sizeof(float), cudaMemcpyHostToDevice);
+  
   //------------------------------
   //vertex shader
   //------------------------------
-  vertexShadeKernel<<<primitiveBlocks, tileSize>>>(device_vbo, vbosize);
+  vertexShadeKernel<<<primitiveBlocks, tileSize>>>(device_vbo, vbosize, device_nbo, nbosize, mvp);
 
   cudaDeviceSynchronize();
   //------------------------------
   //primitive assembly
   //------------------------------
-  primitiveBlocks = ceil(((float)ibosize/3)/((float)tileSize));
-  primitiveAssemblyKernel<<<primitiveBlocks, tileSize>>>(device_vbo, vbosize, device_cbo, cbosize, device_ibo, ibosize, primitives);
+  primitiveBlocks = ceil(((float)numPrimitives)/((float)tileSize));
+  primitiveAssemblyKernel<<<primitiveBlocks, tileSize>>>(device_vbo, vbosize, device_cbo, cbosize, device_ibo, ibosize, device_nbo, nbosize, primitives1);
 
   cudaDeviceSynchronize();
+
   //------------------------------
-  //rasterization
+  //culling
+  //------------------------------
+  culling(primitives1, primitives2, numPrimitives);
+  primitiveBlocks = ceil(((float)numPrimitives)/((float)tileSize));
+
+  //------------------------------
+  //geometry shader
   //------------------------------
-  rasterizationKernel<<<primitiveBlocks, tileSize>>>(primitives, ibosize/3, depthbuffer, resolution);
+  geometryShadeKernel<<<primitiveBlocks, tileSize>>>(primitives2, primitives3, numPrimitives, tessLevels);
+  numPrimitives *= (int)pow(3.0f,(float) tessLevels);
+  primitiveBlocks = ceil(((float)numPrimitives) / ((float)tileSize));
 
   cudaDeviceSynchronize();
+
+  //------------------------------
+  //rasterization
   //------------------------------
+  rasterizationKernel<<<primitiveBlocks, tileSize>>>(primitives3, numPrimitives, depthbuffer, resolution);
+
+  cudaDeviceSynchronize();
+
+  //------------------------------ 
   //fragment shader
   //------------------------------
-  fragmentShadeKernel<<<fullBlocksPerGrid, threadsPerBlock>>>(depthbuffer, resolution);
+  fragmentShadeKernel<<<fullBlocksPerGrid, threadsPerBlock>>>(depthbuffer, resolution, light);
 
   cudaDeviceSynchronize();
   //------------------------------
@@ -256,10 +664,13 @@ void cudaRasterizeCore(uchar4* PBOpos, glm::vec2 resolution, float frame, float*
 }
 
 void kernelCleanup(){
-  cudaFree( primitives );
+  cudaFree( primitives1 );
+  cudaFree( primitives2 );
+  cudaFree( primitives3 );
   cudaFree( device_vbo );
   cudaFree( device_cbo );
   cudaFree( device_ibo );
+  cudaFree( device_nbo );
   cudaFree( framebuffer );
   cudaFree( depthbuffer );
 }
diff --git a/src/rasterizeKernels.h b/src/rasterizeKernels.h
index 784be17..98e62b5 100644
--- a/src/rasterizeKernels.h
+++ b/src/rasterizeKernels.h
@@ -10,7 +10,9 @@
 #include <cmath>
 #include "glm/glm.hpp"
 
+#include "sceneStructs.h"
+
 void kernelCleanup();
-void cudaRasterizeCore(uchar4* pos, glm::vec2 resolution, float frame, float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize);
+void cudaRasterizeCore(uchar4* pos, glm::vec2 resolution, float frame, ray light, float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize, float* nbo, int nbosize, glm::mat4 mvp);
 
 #endif //RASTERIZEKERNEL_H
diff --git a/src/rasterizeTools.h b/src/rasterizeTools.h
index e9b5dcc..8124f8d 100644
--- a/src/rasterizeTools.h
+++ b/src/rasterizeTools.h
@@ -16,10 +16,11 @@ struct triangle {
   glm::vec3 c0;
   glm::vec3 c1;
   glm::vec3 c2;
+  glm::vec3 n;
 };
 
 struct fragment{
-  glm::vec3 color;
+	glm::vec3 color;
   glm::vec3 normal;
   glm::vec3 position;
 };
diff --git a/src/sceneStructs.h b/src/sceneStructs.h
new file mode 100644
index 0000000..b4421b9
--- /dev/null
+++ b/src/sceneStructs.h
@@ -0,0 +1,98 @@
+// CIS565 CUDA Pathtracer: A parallel pathtracer for Patrick Cozzi's CIS565: GPU Computing at the University of Pennsylvania
+// Written by Yining Karl Li, Copyright (c) 2012 University of Pennsylvania
+// This file includes code from:
+//       Yining Karl Li's TAKUA Render, a massively parallel pathtracing renderer: http://www.yiningkarlli.com
+
+#ifndef CUDASTRUCTS_H
+#define CUDASTRUCTS_H
+
+#include "glm/glm.hpp"
+#include "cudaMat4.h"
+#include <cuda_runtime.h>
+#include <string>
+
+enum GEOMTYPE{ SPHERE, CUBE, MESH };
+
+struct ray {
+	glm::vec3 origin;
+	glm::vec3 direction;
+	glm::vec3 color;
+};
+
+struct geom {
+	enum GEOMTYPE type;
+	int materialid;
+	int meshid;
+	int frames;
+	glm::vec3* translations;
+	glm::vec3* rotations;
+	glm::vec3* scales;
+	cudaMat4* transforms;
+	cudaMat4* inverseTransforms;
+};
+
+struct staticGeom {
+	enum GEOMTYPE type;
+	int materialid;
+	int meshid;
+	glm::vec3 translation;
+	glm::vec3 rotation;
+	glm::vec3 scale;
+	cudaMat4 transform;
+	cudaMat4 inverseTransform;
+};
+
+struct material{
+	glm::vec3 color;
+	float specularExponent;
+	glm::vec3 specularColor;
+	float hasReflective;
+	float hasRefractive;
+	float indexOfRefraction;
+	float hasScatter;
+	glm::vec3 absorptionCoefficient;
+	float reducedScatterCoefficient;
+	float emittance;
+};
+
+struct mesh {
+	glm::vec3* vertices;
+	int* indices;
+	int numberOfTriangles;
+	int numberOfVertices;
+};
+
+struct worldSizes {
+	int numberOfGeoms;
+	int numberOfMaterials;
+	int numberOfMeshes;
+};
+
+struct worldData{
+	staticGeom* geoms;
+	material* materials;
+	mesh* meshes;
+};
+
+struct cameraData {
+	glm::vec2 resolution;
+	glm::vec3 position;
+	glm::vec3 view;
+	glm::vec3 up;
+	glm::vec2 fov;
+};
+
+struct camera {
+	glm::vec2 resolution;
+	glm::vec3* positions;
+	glm::vec3* views;
+	glm::vec3* ups;
+	int frames;
+	glm::vec2 fov;
+	unsigned int iterations;
+	glm::vec3* image;
+	ray* rayList;
+	std::string imageName;
+};
+
+#endif //CUDASTRUCTS_H
diff --git a/src/utilities.h b/src/utilities.h
index 3e6ef6e..43e4301 100644
--- a/src/utilities.h
+++ b/src/utilities.h
@@ -14,13 +14,13 @@
 #include <vector>
 #include "cudaMat4.h"
 
-const float PI                          =3.1415926535897932384626422832795028841971;
-const float TWO_PI                      =6.2831853071795864769252867665590057683943;
-const float SQRT_OF_ONE_THIRD           =0.5773502691896257645091487805019574556476;
-const float E                           =2.7182818284590452353602874713526624977572;
-const float EPSILON                     =.000000001;
-const float ZERO_ABSORPTION_EPSILON     =0.00001;
-const float RAY_BIAS_AMOUNT             =0.0002;
+const float PI                          =3.1415926535897932384626422832795028841971f;
+const float TWO_PI                      =6.2831853071795864769252867665590057683943f;
+const float SQRT_OF_ONE_THIRD           =0.5773502691896257645091487805019574556476f;
+const float E                           =2.7182818284590452353602874713526624977572f;
+const float EPSILON                     =.000000001f;
+const float ZERO_ABSORPTION_EPSILON     =0.00001f;
+const float RAY_BIAS_AMOUNT             =0.0002f;
 
 namespace utilityCore {
     extern float clamp(float f, float min, float max);
diff --git a/windows/PROJ4_Rasterizer/PROJ4_Rasterizer/PROJ4_Rasterizer.vcxproj b/windows/PROJ4_Rasterizer/PROJ4_Rasterizer/PROJ4_Rasterizer.vcxproj
index f640485..d204aef 100644
--- a/windows/PROJ4_Rasterizer/PROJ4_Rasterizer/PROJ4_Rasterizer.vcxproj
+++ b/windows/PROJ4_Rasterizer/PROJ4_Rasterizer/PROJ4_Rasterizer.vcxproj
@@ -1,5 +1,5 @@
 ﻿<?xml version="1.0" encoding="utf-8"?>
-<Project DefaultTargets="Build" ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
+<Project DefaultTargets="Build" ToolsVersion="12.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
   <ItemGroup Label="ProjectConfigurations">
     <ProjectConfiguration Include="Debug|Win32">
       <Configuration>Debug</Configuration>
@@ -19,16 +19,18 @@
     <ConfigurationType>Application</ConfigurationType>
     <UseDebugLibraries>true</UseDebugLibraries>
     <CharacterSet>MultiByte</CharacterSet>
+    <PlatformToolset>v120</PlatformToolset>
   </PropertyGroup>
   <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'" Label="Configuration">
     <ConfigurationType>Application</ConfigurationType>
     <UseDebugLibraries>false</UseDebugLibraries>
     <WholeProgramOptimization>true</WholeProgramOptimization>
     <CharacterSet>MultiByte</CharacterSet>
+    <PlatformToolset>v120</PlatformToolset>
   </PropertyGroup>
   <Import Project="$(VCTargetsPath)\Microsoft.Cpp.props" />
   <ImportGroup Label="ExtensionSettings">
-    <Import Project="$(VCTargetsPath)\BuildCustomizations\CUDA 5.5.props" />
+    <Import Project="$(VCTargetsPath)\BuildCustomizations\CUDA 6.5.props" />
   </ImportGroup>
   <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
     <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
@@ -71,6 +73,7 @@
     <ClInclude Include="..\..\..\src\main.h" />
     <ClInclude Include="..\..\..\src\rasterizeKernels.h" />
     <ClInclude Include="..\..\..\src\rasterizeTools.h" />
+    <ClInclude Include="..\..\..\src\sceneStructs.h" />
     <ClInclude Include="..\..\..\src\utilities.h" />
   </ItemGroup>
   <ItemGroup>
@@ -87,6 +90,6 @@
   </ItemGroup>
   <Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
   <ImportGroup Label="ExtensionTargets">
-    <Import Project="$(VCTargetsPath)\BuildCustomizations\CUDA 5.5.targets" />
+    <Import Project="$(VCTargetsPath)\BuildCustomizations\CUDA 6.5.targets" />
   </ImportGroup>
 </Project>
\ No newline at end of file