diff --git a/README.md b/README.md
index ae0896a..70f2437 100644
--- a/README.md
+++ b/README.md
@@ -1,28 +1,15 @@
+![AntiAliasing Level 3](https://raw.githubusercontent.com/RTCassidy1/Project4-Rasterizer/master/renders/AALevel3.png)
 -------------------------------------------------------------------------------
 CIS565: Project 4: CUDA Rasterizer
 -------------------------------------------------------------------------------
 Fall 2014
 -------------------------------------------------------------------------------
-Due Monday 10/27/2014 @ 12 PM
--------------------------------------------------------------------------------
-
--------------------------------------------------------------------------------
-NOTE:
--------------------------------------------------------------------------------
-This project requires an NVIDIA graphics card with CUDA capability! Any card with CUDA compute capability 1.1 or higher will work fine for this project. For a full list of CUDA capable cards and their compute capability, please consult: http://developer.nvidia.com/cuda/cuda-gpus. If you do not have an NVIDIA graphics card in the machine you are working on, feel free to use any machine in the SIG Lab or in Moore100 labs. All machines in the SIG Lab and Moore100 are equipped with CUDA capable NVIDIA graphics cards. If this too proves to be a problem, please contact Patrick or Karl as soon as possible.
 
 -------------------------------------------------------------------------------
 INTRODUCTION:
 -------------------------------------------------------------------------------
-In this project, you will implement a simplified CUDA based implementation of a standard rasterized graphics pipeline, similar to the OpenGL pipeline. In this project, you will implement vertex shading, primitive assembly, perspective transformation, rasterization, fragment shading, and write the resulting fragments to a framebuffer. More information about the rasterized graphics pipeline can be found in the class slides and in your notes from CIS560.
-
-The basecode provided includes an OBJ loader and much of the mundane I/O and bookkeeping code. The basecode also includes some functions that you may find useful, described below. The core rasterization pipeline is left for you to implement.
+This is a simplified CUDA-based implementation of a standard rasterized graphics pipeline, similar to the OpenGL pipeline. It implements vertex shading, primitive assembly, perspective transformation, rasterization, and fragment shading, and writes the resulting fragments to a framebuffer.
 
-You MAY NOT use ANY raycasting/raytracing AT ALL in this project, EXCEPT in the fragment shader step. One of the purposes of this project is to see how a rasterization pipeline can generate graphics WITHOUT the need for raycasting! Raycasting may only be used in the fragment shader effect for interesting shading results, but is absolutely not allowed in any other stages of the pipeline.
-
-Also, you MAY NOT use OpenGL ANYWHERE in this project, aside from the given OpenGL code for drawing Pixel Buffer Objects to the screen. Use of OpenGL for any pipeline stage instead of your own custom implementation will result in an incomplete project.
-
-Finally, note that while this basecode is meant to serve as a strong starting point for a CUDA rasterizer, you are not required to use this basecode if you wish, and you may also change any part of the basecode specification as you please, so long as the final rendered result is correct.
 
 -------------------------------------------------------------------------------
 CONTENTS:
@@ -31,54 +18,14 @@ The Project4 root directory contains the following subdirectories:
 	
 * src/ contains the source code for the project. Both the Windows Visual Studio solution and the OSX makefile reference this folder for all source; the base source code compiles on OSX and Windows without modification.
 * objs/ contains example obj test files: cow.obj, cube.obj, tri.obj.
-* renders/ contains an example render of the given example cow.obj file with a z-depth fragment shader. 
-* windows/ contains a Windows Visual Studio 2010 project and all dependencies needed for building and running on Windows 7.
-
-The Windows and OSX versions of the project build and run exactly the same way as in Project0, Project1, and Project2.
+* renders/ contains three videos of the rasterizer in action and a screenshot rendered at anti-aliasing level 3 (AALevel3.png).
 
 -------------------------------------------------------------------------------
-REQUIREMENTS:
+ADDITIONAL FEATURES:
 -------------------------------------------------------------------------------
-In this project, you are given code for:
-
-* A library for loading/reading standard Alias/Wavefront .obj format mesh files and converting them to OpenGL style VBOs/IBOs
-* A suggested order of kernels with which to implement the graphics pipeline
-* Working code for CUDA-GL interop
-
-You will need to implement the following stages of the graphics pipeline and features:
-
-* Vertex Shading
-* Primitive Assembly with support for triangle VBOs/IBOs
-* Perspective Transformation
-* Rasterization through either a scanline or a tiled approach
-* Fragment Shading
-* A depth buffer for storing and depth testing fragments
-* Fragment to framebuffer writing
-* A simple lighting/shading scheme, such as Lambert or Blinn-Phong, implemented in the fragment shader
-
-You are also required to implement at least 3 of the following features:
-
-* Additional pipeline stages. Each one of these stages can count as 1 feature:
-   * Geometry shader
-   * Transformation feedback
-   * Back-face culling
-   * Scissor test
-   * Stencil test
-   * Blending
-
-IMPORTANT: For each of these stages implemented, you must also add a section to your README stating what the expected performance impact of that pipeline stage is, and real performance comparisons between your rasterizer with that stage and without.
-
 * Correct color interpolation between points on a primitive
-* Texture mapping WITH texture filtering and perspective correct texture coordinates
-* Support for additional primitices. Each one of these can count as HALF of a feature.
-   * Lines
-   * Line strips
-   * Triangle fans
-   * Triangle strips
-   * Points
+* Back-Face Culling
 * Anti-aliasing
-* Order-independent translucency using a k-buffer
-* MOUSE BASED interactive camera support. Interactive camera support based only on the keyboard is not acceptable for this feature.
 
 -------------------------------------------------------------------------------
 BASE CODE TOUR:
@@ -99,86 +46,25 @@ You will also want to familiarize yourself with:
 * utilities.h, which serves as a kitchen-sink of useful functions
 
 -------------------------------------------------------------------------------
-SOME RESOURCES:
--------------------------------------------------------------------------------
-The following resources may be useful for this project:
-
-* High-Performance Software Rasterization on GPUs
-	* Paper (HPG 2011): http://www.tml.tkk.fi/~samuli/publications/laine2011hpg_paper.pdf
-	* Code: http://code.google.com/p/cudaraster/ Note that looking over this code for reference with regard to the paper is fine, but we most likely will not grant any requests to actually incorporate any of this code into your project.
-	* Slides: http://bps11.idav.ucdavis.edu/talks/08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011.pdf
-* The Direct3D 10 System (SIGGRAPH 2006) - for those interested in doing geometry shaders and transform feedback.
-	* http://133.11.9.3/~takeo/course/2006/media/papers/Direct3D10_siggraph2006.pdf
-* Multi-Fragment Eﬀects on the GPU using the k-Buﬀer - for those who want to do a k-buffer
-	* http://www.inf.ufrgs.br/~comba/papers/2007/kbuffer_preprint.pdf
-* FreePipe: A Programmable, Parallel Rendering Architecture for Efficient Multi-Fragment Effects (I3D 2010)
-	* https://sites.google.com/site/hmcen0921/cudarasterizer
-* Writing A Software Rasterizer In Javascript:
-	* Part 1: http://simonstechblog.blogspot.com/2012/04/software-rasterizer-part-1.html
-	* Part 2: http://simonstechblog.blogspot.com/2012/04/software-rasterizer-part-2.html
-
--------------------------------------------------------------------------------
-NOTES ON GLM:
--------------------------------------------------------------------------------
-This project uses GLM, the GL Math library, for linear algebra. You need to know two important points on how GLM is used in this project:
-
-* In this project, indices in GLM vectors (such as vec3, vec4), are accessed via swizzling. So, instead of v[0], v.x is used, and instead of v[1], v.y is used, and so on and so forth.
-* GLM Matrix operations work fine on NVIDIA Fermi cards and later, but pre-Fermi cards do not play nice with GLM matrices. As such, in this project, GLM matrices are replaced with a custom matrix struct, called a cudaMat4, found in cudaMat4.h. A custom function for multiplying glm::vec4s and cudaMat4s is provided as multiplyMV() in intersections.h.
-
--------------------------------------------------------------------------------
-README
+ADDITIONAL FEATURES TOUR:
 -------------------------------------------------------------------------------
-All students must replace or augment the contents of this Readme.md in a clear 
-manner with the following:
-
-* A brief description of the project and the specific features you implemented.
-* At least one screenshot of your project running.
-* A 30 second or longer video of your project running.  To create the video you
-  can use http://www.microsoft.com/expression/products/Encoder4_Overview.aspx 
-* A performance evaluation (described in detail below).
+* Correct color interpolation between points on a primitive
+	* My triangles support per-vertex color. In the rasterization kernel, each fragment's color is the sum of the three vertex colors weighted by the fragment's barycentric coordinates within the triangle.
+	* As an aside, when I first implemented this I had a sign error and ended up with "front-face culling". While not a desirable feature, it made a funny video that can be found here: https://www.youtube.com/watch?v=q9GIzXXPtGc&feature=youtu.be
+* Back-Face Culling
+	* I included Back-Face culling in my primitive assembly and not as a separate feature.  I augmented my triangle struct to have a field indicating whether or not it had been culled.  
+	* In the Rasterization Kernel if a triangle has been culled the Kernel returns immediately.
+* Anti-Aliasing
+	* This is the most in-depth feature. In cudaRasterizeCore() the variable AALevel sets the anti-aliasing level. At 1 the rasterizer runs as normal; any larger value supersamples the entire rasterization process by that factor in both the x and y directions. The render kernel then downsamples the fragments back to the requested resolution, weighting the subpixels with a Gaussian distribution.
 
 -------------------------------------------------------------------------------
 PERFORMANCE EVALUATION
 -------------------------------------------------------------------------------
-The performance evaluation is where you will investigate how to make your CUDA
-programs more efficient using the skills you've learned in class. You must have
-performed at least one experiment on your code to investigate the positive or
-negative effects on performance. 
-
-We encourage you to get creative with your tweaks. Consider places in your code
-that could be considered bottlenecks and try to improve them. 
-
-Each student should provide no more than a one page summary of their
-optimizations along with tables and or graphs to visually explain any
-performance differences.
-
--------------------------------------------------------------------------------
-THIRD PARTY CODE POLICY
--------------------------------------------------------------------------------
-* Use of any third-party code must be approved by asking on Piazza.  If it is approved, all students are welcome to use it.  Generally, we approve use of third-party code that is not a core part of the project.  For example, for the ray tracer, we would approve using a third-party library for loading models, but would not approve copying and pasting a CUDA function for doing refraction.
-* Third-party code must be credited in README.md.
-* Using third-party code without its approval, including using another student's code, is an academic integrity violation, and will result in you receiving an F for the semester.
-
--------------------------------------------------------------------------------
-SELF-GRADING
--------------------------------------------------------------------------------
-* On the submission date, email your grade, on a scale of 0 to 100, to Liam, harmoli+cis565@seas.upenn.edu, with a one paragraph explanation.  Be concise and realistic.  Recall that we reserve 30 points as a sanity check to adjust your grade.  Your actual grade will be (0.7 * your grade) + (0.3 * our grade).  We hope to only use this in extreme cases when your grade does not realistically reflect your work - it is either too high or too low.  In most cases, we plan to give you the exact grade you suggest.
-* Projects are not weighted evenly, e.g., Project 0 doesn't count as much as the path tracer.  We will determine the weighting at the end of the semester based on the size of each project.
-
----
-SUBMISSION
----
-As with the previous project, you should fork this project and work inside of
-your fork. Upon completion, commit your finished project back to your fork, and
-make a pull request to the master repository.  You should include a README.md
-file in the root directory detailing the following
-
-* A brief description of the project and specific features you implemented
-* At least one screenshot of your project running.
-* A link to a video of your raytracer running.
-* Instructions for building and running your project if they differ from the
-  base code.
-* A performance writeup as detailed above.
-* A list of all third-party code used.
-* This Readme file edited as described above in the README section.
-
+The biggest performance hit comes from anti-aliasing. Without anti-aliasing my rasterizer rendered the cow at 60 fps, which I used as a baseline. With anti-aliasing set to 2x that dropped to 10 fps, at 3x it was 4-5 fps, and at 5x it was 1-2 fps.
+* I had hoped that back-face culling would help with this, but I did not get any performance gains. I think this is because of my implementation: a culled triangle is still submitted to the rasterization kernel, where its thread returns immediately, but a warp only finishes early if all of its threads do, and there were probably very few warps containing ONLY back-facing triangles.
+	* What I need to do in a future iteration is use stream compaction to remove the culled triangles so they never reach the rasterization kernel at all.
+* There is also room to improve the anti-aliasing. Currently I supersample the entire image, but if I supersampled only the edges I would produce far fewer fragments.
+* Another quick improvement would be in the downsampling algorithm. Currently I calculate the Gaussian weight for every subpixel of every pixel. The weights could be computed ahead of time and passed to the kernel, which would then just read them instead of recomputing them on every frame for every subpixel.
+* I also could try moving the subpixel fragments' color values to shared memory. With anti-aliasing, multiple pixels sample the same fragments, so if I set this up correctly I could probably eliminate a lot of global memory reads.
+I have two more videos of the renderer, one with AntiAliasing on 3x: https://www.youtube.com/watch?v=uRSzpbR4ZaQ&feature=youtu.be
+and one without AntiAliasing: https://www.youtube.com/watch?v=J8bXx7zOvN0&feature=youtu.be
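Review note on the performance section of the README above: two of the proposed improvements (stream compaction of culled triangles, and precomputing the Gaussian downsampling weights) can be sketched on the host in plain C++. This is only an illustration under stated assumptions, not the author's implementation. `Tri`, `compactCulled`, and `buildGaussianWeights` are hypothetical names that do not appear in the diff. On the device, the compaction step would more likely use a library routine such as Thrust's `remove_if` or `copy_if`.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the rasterizer's triangle struct; only the
// culled flag matters for this sketch.
struct Tri {
  int id;
  bool culled;
};

// Stream compaction on the host: move the surviving triangles to the front,
// keep their relative order, and return how many survived.
std::size_t compactCulled(std::vector<Tri>& tris) {
  auto firstCulled = std::stable_partition(
      tris.begin(), tris.end(), [](const Tri& t) { return !t.culled; });
  return static_cast<std::size_t>(firstCulled - tris.begin());
}

// Precompute normalized Gaussian weights for a (2*radius+1)^2 window.
// The weights sum to 1, so a kernel reading this table needs no per-pixel
// normalization.
std::vector<float> buildGaussianWeights(int radius, float sigma) {
  const int width = 2 * radius + 1;
  std::vector<float> weights(static_cast<std::size_t>(width) * width);
  float total = 0.0f;
  for (int dy = -radius; dy <= radius; ++dy) {
    for (int dx = -radius; dx <= radius; ++dx) {
      const float w = std::exp(-(dx * dx + dy * dy) / (2.0f * sigma * sigma));
      weights[static_cast<std::size_t>(dy + radius) * width + (dx + radius)] = w;
      total += w;
    }
  }
  for (float& w : weights) w /= total;
  return weights;
}
```

In the rasterizer, the count returned by the compaction step would size the rasterization launch, and the weight table would be built once for a given AALevel (radius AALevel, sigma AALevel / 3) and copied to the device. Because the depth buffer is padded by 2*AALevel in each dimension, every pixel reads a full window, so one shared table is valid for the whole image.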
diff --git a/renders/AALevel3.png b/renders/AALevel3.png
new file mode 100644
index 0000000..e0800ff
Binary files /dev/null and b/renders/AALevel3.png differ
diff --git a/renders/AALevel3.tiff b/renders/AALevel3.tiff
new file mode 100644
index 0000000..b7bda5b
Binary files /dev/null and b/renders/AALevel3.tiff differ
diff --git a/renders/AntiAliasLevel3.mp4 b/renders/AntiAliasLevel3.mp4
new file mode 100644
index 0000000..7edc0d6
Binary files /dev/null and b/renders/AntiAliasLevel3.mp4 differ
diff --git a/renders/AntiAliasOff.mp4 b/renders/AntiAliasOff.mp4
new file mode 100644
index 0000000..1b5d4d2
Binary files /dev/null and b/renders/AntiAliasOff.mp4 differ
diff --git a/renders/CullWrongSide.mp4 b/renders/CullWrongSide.mp4
new file mode 100644
index 0000000..84c6b82
Binary files /dev/null and b/renders/CullWrongSide.mp4 differ
diff --git a/src/main.cpp b/src/main.cpp
index 13d8e67..8733197 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -28,6 +28,9 @@ int main(int argc, char** argv){
     cout << "Usage: mesh=[obj file]" << endl;
     return 0;
   }
+  
+  //Setup Camera
+  cam = camera();
 
   frame = 0;
   seconds = time (NULL);
@@ -57,7 +60,7 @@ void mainLoop() {
     }
 
     string title = "CIS565 Rasterizer | " + utilityCore::convertIntToString((int)fps) + " FPS";
-		glfwSetWindowTitle(window, title.c_str());
+    glfwSetWindowTitle(window, title.c_str());
     
     glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
     glBindTexture(GL_TEXTURE_2D, displayImage);
@@ -92,9 +95,12 @@ void runCuda(){
 
   ibo = mesh->getIBO();
   ibosize = mesh->getIBOsize();
+  
+  nbo = mesh->getNBO();//get normals
+  nbosize = mesh->getNBOsize();//get normal length
 
   cudaGLMapBufferObject((void**)&dptr, pbo);
-  cudaRasterizeCore(dptr, glm::vec2(width, height), frame, vbo, vbosize, cbo, cbosize, ibo, ibosize);
+  cudaRasterizeCore(dptr, glm::vec2(width, height), frame, vbo, vbosize, cbo, cbosize, ibo, ibosize, nbo, nbosize, cam);
   cudaGLUnmapBufferObject(pbo);
 
   vbo = NULL;
diff --git a/src/main.h b/src/main.h
index 8999110..b3b0405 100644
--- a/src/main.h
+++ b/src/main.h
@@ -23,6 +23,8 @@
 #include "rasterizeKernels.h"
 #include "utilities.h"
 
+
+
 using namespace std;
 
 //-------------------------------
@@ -49,6 +51,11 @@ float* cbo;
 int cbosize;
 int* ibo;
 int ibosize;
+float* nbo;   //added
+int nbosize;  //added
+
+//newstuff:
+camera cam;
 
 //-------------------------------
 //----------CUDA STUFF-----------
diff --git a/src/rasterizeKernels.cu b/src/rasterizeKernels.cu
index 10b0000..e882d96 100644
--- a/src/rasterizeKernels.cu
+++ b/src/rasterizeKernels.cu
@@ -8,10 +8,12 @@
 #include "rasterizeKernels.h"
 #include "rasterizeTools.h"
 
+
 glm::vec3* framebuffer;
 fragment* depthbuffer;
 float* device_vbo;
 float* device_cbo;
+float* device_nbo;//new
 int* device_ibo;
 triangle* primitives;
 
@@ -90,6 +92,7 @@ __global__ void clearDepthBuffer(glm::vec2 resolution, fragment* buffer, fragmen
       fragment f = frag;
       f.position.x = x;
       f.position.y = y;
+      f.position.z = 100000.0f; //invalid depth
       buffer[index] = f;
     }
 }
@@ -127,68 +130,294 @@ __global__ void sendImageToPBO(uchar4* PBOpos, glm::vec2 resolution, glm::vec3*
       PBOpos[index].z = color.z;
   }
 }
+//Gaussian Kernel
+__host__ __device__ float gaussSample(float inX, float inY, float sigma){
+  float xSquared = inX * inX;
+  float ySquared = inY * inY;
+  float sigmaSquared = sigma * sigma;
+  //2D Gaussian: exp(-(x^2 + y^2) / (2 * sigma^2)) / (2 * PI * sigma^2)
+  float exponent = (xSquared + ySquared)/(2.0f * sigmaSquared);
+  float e = expf(-exponent);
+  return e / (2 * PI * sigmaSquared);
+  
+}
 
 //TODO: Implement a vertex shader
-__global__ void vertexShadeKernel(float* vbo, int vbosize){
+__global__ void vertexShadeKernel(float* vbo, int vbosize, float* nbo, int nbosize, glm::mat4 View, glm::mat4 modelTransform){
   int index = (blockIdx.x * blockDim.x) + threadIdx.x;
   if(index<vbosize/3){
+  //convert from model space into eye/view space
+    glm::vec4 vert;
+    glm::vec4 norm;
+    vert = glm::vec4(vbo[(index * 3) + 0], vbo[(index * 3) + 1], vbo[(index * 3) + 2], 1.0f);//point
+    norm = glm::vec4(nbo[(index * 3) + 0], nbo[(index * 3) + 1], nbo[(index * 3) + 2], 0.0f);//vector
+    vert = View * modelTransform * vert;
+    norm = View * modelTransform * norm;
+    //no barrier needed here: each thread reads and writes only its own vertex and normal
+    vbo[(index * 3) + 0] = vert.x;
+    vbo[(index * 3) + 1] = vert.y;
+    vbo[(index * 3) + 2] = vert.z;
+    nbo[(index * 3) + 0] = norm.x;
+    nbo[(index * 3) + 1] = norm.y;
+    nbo[(index * 3) + 2] = norm.z;
+    
+    
   }
 }
 
 //TODO: Implement primative assembly
-__global__ void primitiveAssemblyKernel(float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize, triangle* primitives){
+__global__ void primitiveAssemblyKernel(float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize, float* nbo, int nbosize, triangle* primitives, glm::mat4 projection){
   int index = (blockIdx.x * blockDim.x) + threadIdx.x;
   int primitivesCount = ibosize/3;
   if(index<primitivesCount){
+    //assemble the primitives!
+    //At this point vertices and normals are in eye space.
+    //not using ibo, not sure what problem is with ibo but it looks funky
+    int vboindex = index * 9;
+    int nboindex = vboindex;
+    
+    
+    glm::vec4 p0 = glm::vec4(0,0,0,1.0f);
+    glm::vec4 p1 = glm::vec4(0,0,0,1.0f);
+    glm::vec4 p2 = glm::vec4(0,0,0,1.0f);
+    
+    glm::vec3 c0 = glm::vec3(1,0,0);
+    glm::vec3 c1 = glm::vec3(0,1,0);
+    glm::vec3 c2 = glm::vec3(0,0,1);
+    
+    glm::vec3 n0 = glm::vec3(1.0f,0   ,0   );
+    glm::vec3 n1 = glm::vec3(0   ,1.0f,0   );
+    glm::vec3 n2 = glm::vec3(0   ,0   ,1.0f);
+    //get points
+    p0.x = vbo[vboindex];
+    p0.y = vbo[vboindex+ 1];
+    p0.z = vbo[vboindex + 2];
+    p1.x = vbo[vboindex + 3];
+    p1.y = vbo[vboindex + 4];
+    p1.z = vbo[vboindex + 5];
+    p2.x = vbo[vboindex + 6];
+    p2.y = vbo[vboindex + 7];
+    p2.z = vbo[vboindex + 8];
+    
+    
+    
+    //get colors (nothing here yet)
+    
+    //get normals
+    n0.x = nbo[nboindex];
+    n0.y = nbo[nboindex + 1];
+    n0.z = nbo[nboindex + 2];
+    n1.x = nbo[nboindex + 3];
+    n1.y = nbo[nboindex + 4];
+    n1.z = nbo[nboindex + 5];
+    n2.x = nbo[nboindex + 6];
+    n2.y = nbo[nboindex + 7];
+    n2.z = nbo[nboindex + 8];
+    //leave the normals in eye space for shading
+    
+    ////////////////////
+    //Back Face Culling!
+    ////////////////////
+    
+    bool back0;
+    bool back1;
+    bool back2;
+    back0 = glm::dot(n0,-glm::vec3(p0.x,p0.y,p0.z)) < 0;
+    back1 = glm::dot(n1,-glm::vec3(p1.x,p1.y,p1.z)) < 0;
+    back2 = glm::dot(n2,-glm::vec3(p2.x,p2.y,p2.z)) < 0;
+    if(back0 && back1 && back2){
+      primitives[index].culled = true;
+      return;//no need to do more work
+    }
+    
+    
+    //transform points into clip space
+    p0 = projection * p0;
+    p1 = projection * p1;
+    p2 = projection * p2;
+    
+    glm::vec3 transP0 = glm::vec3(p0.x/p0.w, p0.y/p0.w, p0.z/p0.w); 
+    glm::vec3 transP1 = glm::vec3(p1.x/p1.w, p1.y/p1.w, p1.z/p1.w); 
+    glm::vec3 transP2 = glm::vec3(p2.x/p2.w, p2.y/p2.w, p2.z/p2.w); 
+    
+    
+    //place triangles
+    primitives[index].p0 = transP0;
+    primitives[index].p1 = transP1;
+    primitives[index].p2 = transP2;
+    primitives[index].c0 = c0;
+    primitives[index].c1 = c1;
+    primitives[index].c2 = c2;
+    primitives[index].n0 = n0;
+    primitives[index].n1 = n1;
+    primitives[index].n2 = n2;
+    primitives[index].culled = false;
+
   }
 }
 
+
 //TODO: Implement a rasterization method, such as scanline.
-__global__ void rasterizationKernel(triangle* primitives, int primitivesCount, fragment* depthbuffer, glm::vec2 resolution){
+__global__ void rasterizationKernel(triangle* primitives, int primitivesCount, fragment* depthbuffer, glm::vec2 resolution){ //, glm::vec3* devbbox){
   int index = (blockIdx.x * blockDim.x) + threadIdx.x;
   if(index<primitivesCount){
+    //convert to screen coordinates
+    triangle tri = primitives[index];
+    if(tri.culled == true){//did I cull this triangle?
+      return;
+    }
+    clipToScreen(tri, resolution);
+    //get bounding box for triangle
+    glm::vec3 bbMin;
+    glm::vec3 bbMax;
+    getAABBForTriangle(tri, bbMin, bbMax);
+    
+    /* if (index == 0){
+      devbbox[0] = tri.p0;
+      devbbox[1] = tri.p1;
+    } */
+    
+    int minX, minY, maxX, maxY;
+    minX = max(floor(bbMin.x), 0.0f);
+    minY = max(floor(bbMin.y), 0.0f);
+    maxX = min(ceil( bbMax.x), resolution.x);
+    maxY = min(ceil( bbMax.y), resolution.y);
+    
+    glm::vec2 coords;
+    glm::vec3 baryCords;
+    //iterate through the box to see in or out
+    for  (int i = minX; i < maxX; i++){
+      for(int j = minY; j < maxY; j++){
+        coords.x = float(i);
+        coords.y = float(j);
+        baryCords = calculateBarycentricCoordinate(tri, coords);
+        if(isBarycentricCoordInBounds(baryCords)){
+          fragment frag;
+          frag.position.x = coords.x;
+          frag.position.y = coords.y;
+          frag.position.z = tri.p0.z * baryCords.x + tri.p1.z * baryCords.y + tri.p2.z * baryCords.z;
+          
+          int dbIndex = i + (j * resolution.x);
+          fragment frag2 = depthbuffer[dbIndex];
+          
+          //need to figure how to do this atomically
+          if(frag.position.z >= 0.0f && frag.position.z <= 1.0f && frag.position.z < frag2.position.z){
+            frag.color  = tri.c0 * baryCords.x + tri.c1 * baryCords.y + tri.c2 * baryCords.z;
+            frag.normal = tri.n0 * baryCords.x + tri.n1 * baryCords.y + tri.n2 * baryCords.z;
+            depthbuffer[dbIndex] = frag;
+          }
+          
+        }
+      }
+    }
+    
   }
 }
 
 //TODO: Implement a fragment shader
-__global__ void fragmentShadeKernel(fragment* depthbuffer, glm::vec2 resolution){
+__global__ void fragmentShadeKernel(fragment* depthbuffer, glm::vec2 resolution, glm::vec3 lightPos){
   int x = (blockIdx.x * blockDim.x) + threadIdx.x;
   int y = (blockIdx.y * blockDim.y) + threadIdx.y;
   int index = x + (y * resolution.x);
   if(x<=resolution.x && y<=resolution.y){
+    if (depthbuffer[index].position.z == 100000.0f){
+      depthbuffer[index].color = glm::vec3((float)x/resolution.x, (float)y/resolution.y, 1.0f);
+    }else{
+      fragment frag = depthbuffer[index];
+      //Lambert
+      
+      glm::vec3 fragPos  = glm::vec3((frag.position.x / resolution.x) * 2 - 1, (frag.position.y / resolution.y) * 2 - 1, frag.position.z);
+      glm::vec3 lightDir = glm::normalize(lightPos - fragPos);
+      float diffuse = max(glm::dot(glm::normalize(frag.normal),lightDir), 0.0f); //interpolated normals are not unit length
+      depthbuffer[index].color = frag.color * diffuse;
+      
+      //Depth Render
+      //depthbuffer[index].color = frag.position.z * glm::vec3(1,1,1);
+    }
   }
 }
 
-//Writes fragment colors to the framebuffer
-__global__ void render(glm::vec2 resolution, fragment* depthbuffer, glm::vec3* framebuffer){
+
+
+//MODIFIED  Downsamples and then prints
+__global__ void render(glm::vec2 resolution, fragment* depthbuffer, glm::vec3* framebuffer, float AALevel){
 
   int x = (blockIdx.x * blockDim.x) + threadIdx.x;
   int y = (blockIdx.y * blockDim.y) + threadIdx.y;
   int index = x + (y * resolution.x);
 
   if(x<=resolution.x && y<=resolution.y){
-    framebuffer[index] = depthbuffer[index].color;
+    if(AALevel == 1.0f){
+      framebuffer[index] = depthbuffer[index].color;
+      return;
+    }
+    float dbX = x * AALevel + AALevel;
+    float dbY = y * AALevel + AALevel;
+    float tot = 0.0f;
+    glm::vec2 dbRes = resolution;
+    dbRes.x = resolution.x * AALevel + 2*AALevel;
+    dbRes.y = resolution.y * AALevel + 2*AALevel;
+    glm::vec3 color = glm::vec3(0,0,0);
+    for(int i = int(dbX) - int(AALevel); i < int(dbX) + int(AALevel) +1; i++){
+      for(int j = int(dbY) - int(AALevel); j < int(dbY) + int(AALevel) + 1; j ++){
+        if(i == int(dbX) && j == int(dbY)){
+          float gauss = gaussSample(0, 0, (AALevel)/3.0f); 
+          tot += gauss;
+          color += depthbuffer[i + (j * int(dbRes.x))].color * gauss;
+        }else{
+          float gauss = gaussSample((float(i) - dbX), (float(j) - dbY),  (AALevel)/3.0f);
+          color += depthbuffer[i + (j * int(dbRes.x))].color * gauss; 
+          tot += gauss;
+        }
+      }
+    }
+    framebuffer[index] = color * (1.0f / tot);
   }
 }
 
-// Wrapper for the __global__ call that sets up the kernel calls and does a ton of memory management
-void cudaRasterizeCore(uchar4* PBOpos, glm::vec2 resolution, float frame, float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize){
 
+
+// Wrapper for the __global__ call that sets up the kernel calls and does a ton of memory management
+void cudaRasterizeCore(uchar4* PBOpos, glm::vec2 resolution, float frame, float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize, float* nbo, int nbosize, camera cam){
+  
+  //---------------
+  //AA level
+  //---------------
+  int AALevel = 3;
+  
+  //set up framebuffer
+  framebuffer = NULL;
+  cudaMalloc((void**)&framebuffer, (int)resolution.x*(int)resolution.y*sizeof(glm::vec3));
+  
   // set up crucial magic
   int tileSize = 8;
   dim3 threadsPerBlock(tileSize, tileSize);
   dim3 fullBlocksPerGrid((int)ceil(float(resolution.x)/float(tileSize)), (int)ceil(float(resolution.y)/float(tileSize)));
+  
+  //kernel launches to black out accumulated/unaccumulated pixel buffers and clear our scattering states
+  clearImage<<<fullBlocksPerGrid, threadsPerBlock>>>(resolution, framebuffer, glm::vec3(0,0,0));
+  
+  glm::vec2 screenRes = resolution; //output resolution; stays initialized when AALevel == 1
+  //superscale:
+  if (AALevel > 1){
+    screenRes = resolution;
+    resolution.x = resolution.x * AALevel + 2*AALevel;
+    resolution.y = resolution.y * AALevel + 2*AALevel;
+  }
+  // set up crucial magic
+  tileSize = 8;
+  threadsPerBlock = dim3(tileSize, tileSize);
+  fullBlocksPerGrid = dim3((int)ceil(float(resolution.x)/float(tileSize)), (int)ceil(float(resolution.y)/float(tileSize)));
 
-  //set up framebuffer
-  framebuffer = NULL;
-  cudaMalloc((void**)&framebuffer, (int)resolution.x*(int)resolution.y*sizeof(glm::vec3));
   
+  
+
   //set up depthbuffer
   depthbuffer = NULL;
   cudaMalloc((void**)&depthbuffer, (int)resolution.x*(int)resolution.y*sizeof(fragment));
+  
 
-  //kernel launches to black out accumulated/unaccumlated pixel buffers and clear our scattering states
-  clearImage<<<fullBlocksPerGrid, threadsPerBlock>>>(resolution, framebuffer, glm::vec3(0,0,0));
+  
   
   fragment frag;
   frag.color = glm::vec3(0,0,0);
@@ -199,6 +428,7 @@ void cudaRasterizeCore(uchar4* PBOpos, glm::vec2 resolution, float frame, float*
   //------------------------------
   //memory stuff
   //------------------------------
+
   primitives = NULL;
   cudaMalloc((void**)&primitives, (ibosize/3)*sizeof(triangle));
 
@@ -214,44 +444,141 @@ void cudaRasterizeCore(uchar4* PBOpos, glm::vec2 resolution, float frame, float*
   cudaMalloc((void**)&device_cbo, cbosize*sizeof(float));
   cudaMemcpy( device_cbo, cbo, cbosize*sizeof(float), cudaMemcpyHostToDevice);
 
+  //normal buffer
+  device_nbo = NULL;
+  cudaMalloc((void**)&device_nbo, nbosize*sizeof(float));
+  cudaMemcpy( device_nbo, nbo, nbosize*sizeof(float), cudaMemcpyHostToDevice);
+
   tileSize = 32;
   int primitiveBlocks = ceil(((float)vbosize/3)/((float)tileSize));
+  
+  //Stuff I'll need in the process/
+  // view matrix (glm::lookAt takes eye, center, up; cam.forward is passed as the look-at target)
+  glm::mat4 View = glm::lookAt(cam.position, cam.forward, cam.up);
+  
+  // projection matrix
+  float aspectRatio = (float)resolution.x / (float)resolution.y;
+  glm::mat4 Projection = glm::perspective(cam.fovy, aspectRatio, cam.nearClip, cam.farClip);
+  
+  // model transform matrix
+  glm::vec3 translation  = glm::vec3(0.0f,0.0f,0.0f);
+  glm::vec3 rotation     = glm::vec3(0.0f,frame,0.0f);
+  glm::vec3 scale        = glm::vec3(1.0f,1.0f,1.0f);
+  glm::mat4 modelTransform = utilityCore::buildTransformationMatrix(translation, rotation, scale);
+  
+  //Light  Make this editable
+  glm::vec4 Light = glm::vec4(2,5,0,1);
+  Light = View * Light;
+  glm::vec3 lightPos = glm::vec3(Light.x, Light.y, Light.z); //in view space (only View was applied)
+  
+  /*
+  std::cout << "vbo length" << vbosize << "\n";
+  std::cout << "cbo length" << cbosize << "\n";
+  std::cout << "ibo length" << ibosize << "\n";
+  for(int i = 0; i < ibosize; i ++){
+    std::cout << ibo[i] << " , ";
+  }
+  std::cout << "\n";
+  exit(0);
+  */
+
 
   //------------------------------
-  //vertex shader
+  //vertex shader 
+  // convert to view coordinates
+  // displacement mapping
   //------------------------------
-  vertexShadeKernel<<<primitiveBlocks, tileSize>>>(device_vbo, vbosize);
-
+  vertexShadeKernel<<<primitiveBlocks, tileSize>>>(device_vbo, vbosize, device_nbo, nbosize, View, modelTransform);
   cudaDeviceSynchronize();
+  
   //------------------------------
   //primitive assembly
+  // assign points & colors to triangle vertices
+  // backface culling
   //------------------------------
   primitiveBlocks = ceil(((float)ibosize/3)/((float)tileSize));
-  primitiveAssemblyKernel<<<primitiveBlocks, tileSize>>>(device_vbo, vbosize, device_cbo, cbosize, device_ibo, ibosize, primitives);
-
+  primitiveAssemblyKernel<<<primitiveBlocks, tileSize>>>(device_vbo, vbosize, device_cbo, cbosize, device_ibo, ibosize, device_nbo, nbosize, primitives, Projection);
   cudaDeviceSynchronize();
+
+
+  /*
+  triangle* primitives2 = new triangle[(ibosize/3)];
+  std::cout << "trying to memcopy \n";
+  cudaMemcpy(primitives2, primitives, (ibosize/3)*sizeof(triangle), cudaMemcpyDeviceToHost);
+  utilityCore::printVec3(primitives2[0].p0);
+  utilityCore::printVec3(primitives2[0].p1);
+  utilityCore::printVec3(primitives2[0].p2);
+  //utilityCore::printVec3(primitives2[0].n0);
+  //utilityCore::printVec3(primitives2[0].n1);
+  //utilityCore::printVec3(primitives2[0].n2);
+  
+  triangle tri = primitives2[0];
+  triangle newTri = primitives2[0];
+  
+  newTri.p0.x = ((tri.p0.x + 1) / 2) * resolution.x;
+  newTri.p0.y = ((tri.p0.y + 1) / 2) * resolution.y;
+  
+  newTri.p1.x = ((tri.p1.x + 1) / 2) * resolution.x;
+  newTri.p1.y = ((tri.p1.y + 1) / 2) * resolution.y;
+  
+  newTri.p2.x = ((tri.p2.x + 1) / 2) * resolution.x;
+  newTri.p2.y = ((tri.p2.y + 1) / 2) * resolution.y; 
+  
+  std::cout << "Converted to screen coords \n";
+  std::cout << "Resolution" << resolution.x << " , " << resolution.y << "\n";
+  
+  utilityCore::printVec3(newTri.p0);
+  utilityCore::printVec3(newTri.p1);
+  utilityCore::printVec3(newTri.p2);
+  
+  
+  
+  //make temp memory for tests
+  glm::vec3* boundBox = new glm::vec3[2];
+  glm::vec3* dev_bbox;
+  cudaMalloc((void**)&dev_bbox, 2*sizeof(glm::vec3));
+  */
+  
   //------------------------------
   //rasterization
   //------------------------------
+  //std::cout << "Try to Rasterize \n";
   rasterizationKernel<<<primitiveBlocks, tileSize>>>(primitives, ibosize/3, depthbuffer, resolution);
-
   cudaDeviceSynchronize();
+  
+  
+  /*
+  cudaMemcpy(boundBox, dev_bbox, 2*sizeof(glm::vec3), cudaMemcpyDeviceToHost);
+  std::cout << "Bounding Box coords \n";
+  utilityCore::printVec3(boundBox[0]);
+  utilityCore::printVec3(boundBox[1]);
+  */
+
+
+
   //------------------------------
   //fragment shader
   //------------------------------
-  fragmentShadeKernel<<<fullBlocksPerGrid, threadsPerBlock>>>(depthbuffer, resolution);
+  //std::cout << "Try to shade \n";
+  fragmentShadeKernel<<<fullBlocksPerGrid, threadsPerBlock>>>(depthbuffer, resolution, lightPos);
 
   cudaDeviceSynchronize();
   //------------------------------
   //write fragments to framebuffer
   //------------------------------
-  render<<<fullBlocksPerGrid, threadsPerBlock>>>(resolution, depthbuffer, framebuffer);
+  //Downsample:
+  if (AALevel > 1){
+    resolution = screenRes;
+  }
+  render<<<fullBlocksPerGrid, threadsPerBlock>>>(resolution, depthbuffer, framebuffer, float(AALevel));
   sendImageToPBO<<<fullBlocksPerGrid, threadsPerBlock>>>(PBOpos, resolution, framebuffer);
 
   cudaDeviceSynchronize();
 
   kernelCleanup();
 
+  //exit(0);
+
   checkCUDAError("Kernel failed!");
 }
 
diff --git a/src/rasterizeKernels.h b/src/rasterizeKernels.h
index 784be17..1ebfc6c 100644
--- a/src/rasterizeKernels.h
+++ b/src/rasterizeKernels.h
@@ -9,8 +9,30 @@
 #include <cuda.h>
 #include <cmath>
 #include "glm/glm.hpp"
+#include "glm/gtc/matrix_transform.hpp"
+
+//Additional Objects I think would be useful
+struct camera{
+  glm::vec3 position;
+  glm::vec3 up;
+  glm::vec3 forward;
+  float fovy;
+  float nearClip;
+  float farClip;
+  camera(){//initialize camera to default values
+    position = glm::vec3(1,.25,1.25);
+    up       = glm::vec3(0,-1,0);
+    forward  = glm::vec3(0,0.25,0);
+    fovy     = 45.0f;
+    nearClip = 0.1f;
+    farClip  = 8.0f;
+  }
+};
+
 
 void kernelCleanup();
-void cudaRasterizeCore(uchar4* pos, glm::vec2 resolution, float frame, float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize);
+void cudaRasterizeCore(uchar4* pos, glm::vec2 resolution, float frame, float* vbo, int vbosize, float* cbo, int cbosize, int* ibo, int ibosize, float* nbo, int nbosize, camera cam);
+
+
 
 #endif //RASTERIZEKERNEL_H
diff --git a/src/rasterizeTools.h b/src/rasterizeTools.h
index e9b5dcc..371832d 100644
--- a/src/rasterizeTools.h
+++ b/src/rasterizeTools.h
@@ -16,6 +16,10 @@ struct triangle {
   glm::vec3 c0;
   glm::vec3 c1;
   glm::vec3 c2;
+  glm::vec3 n0;
+  glm::vec3 n1;
+  glm::vec3 n2;
+  bool culled;
 };
 
 struct fragment{
@@ -24,6 +28,8 @@ struct fragment{
   glm::vec3 position;
 };
 
+
+
 //Multiplies a cudaMat4 matrix and a vec4
 __host__ __device__ glm::vec3 multiplyMV(cudaMat4 m, glm::vec4 v){
   glm::vec3 r(1,1,1);
@@ -75,4 +81,25 @@ __host__ __device__ float getZAtCoordinate(glm::vec3 barycentricCoord, triangle
   return -(barycentricCoord.x*tri.p0.z + barycentricCoord.y*tri.p1.z + barycentricCoord.z*tri.p2.z);
 }
 
+//convert clip coordinates to screen coordinates
+__host__ __device__ void clipToScreen(triangle& tri, glm::vec2 resolution){
+  tri.p0.x = ((tri.p0.x + 1) / 2) * resolution.x;
+  tri.p0.y = ((tri.p0.y + 1) / 2) * resolution.y;
+  
+  tri.p1.x = ((tri.p1.x + 1) / 2) * resolution.x;
+  tri.p1.y = ((tri.p1.y + 1) / 2) * resolution.y;
+  
+  tri.p2.x = ((tri.p2.x + 1) / 2) * resolution.x;
+  tri.p2.y = ((tri.p2.y + 1) / 2) * resolution.y; 
+}
+
+//convert screen coordinates to clip coordinates
+__host__ __device__ glm::vec3 screenToClip(glm::vec3 pos, glm::vec2 resolution){
+  glm::vec3 newPos;
+  newPos.x = (pos.x / resolution.x) * 2 - 1;
+  newPos.y = (pos.y / resolution.y) * 2 - 1;
+  newPos.z = pos.z;
+  return newPos;
+}
+
 #endif
\ No newline at end of file