diff --git a/README.md b/README.md
index 110697c..70b8d90 100644
--- a/README.md
+++ b/README.md
@@ -3,11 +3,51 @@ CUDA Path Tracer
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 3**
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Yan Wu
+* [LinknedIn](https://www.linkedin.com/in/yan-wu-a71270159/)
+* Tested on: Windows 10, i7-8750H @ 2.20GHz 16GB, GTX 1060 6GB (Personal Laptop)
 
-### (TODO: Your README)
+### Project Description
+ In this project, I implemented a CUDA-based path tracer capable of rendering images. Since in this class we are concerned with working in GPU programming, performance, and the generation of actual beautiful images (and not with mundane programming tasks like I/O), this project includes base code for loading a scene description file, described below, and various other things that generally make up a framework for previewing and saving images.
 
-*DO NOT* leave the README to the last minute! It is a crucial part of the
-project, and we will not be able to grade you without a good README.
+### Features
+* Basic Features:
+   - A shading kernel with BSDF evaluation for:
+      - Ideal Diffuse surfaces (using provided cosine-weighted scatter function, see below.) [PBRT 8.3].
+      - Perfectly specular-reflective (mirrored) surfaces (e.g. using glm::reflect).
+      - See notes on diffuse/specular in scatterRay and on imperfect specular below.
+   - Path continuation/termination using Stream Compaction
+   - Implemented a means of making rays/pathSegments/intersections contiguous in memory by material type
+      - Sort the rays/path segments so that rays/paths interacting with the same material are contiguous in memory
+   - A toggleable option to cache the first bounce intersections for re-use across all subsequent iterations.
 
+* Extra Features:
+   - Refraction (e.g. glass/water) [PBRT 8.2] with Frensel effects using Schlick's approximation [PBRT 8.5]. 
+   - Physically-based depth-of-field (by jittering rays within an aperture) [PBRT 6.2.3].
+   - Antialiasing.
+   - More exciting features in the future...
+ 
+### Result and comparison
+* Original scene without anything: <br />
+   <img src="images/initial.gif" width = "50%"> <br />
+* Without implementing Stream Compaction(So slow!): <br />
+   <img src="images/withoutCompaction.gif" width = "50%"> <br />
+   We can see that even those useless paths are taken into count. By using stream compaction we can get a way more efficient outcome.
+
+* With fraction vs Without fraction:<br />
+   <img src="images/withoutRefract.png" width = "45%"> <img src="images/sorting slow.png" width = "45%"> <br />
+   More realistic with refraction implemented!
+* With Anti-Aliasing vs Non-Anti-Aliasing:<br />
+    <img src="images/no_antialiasing.png" width = "45%"> <img src="images/antialiasing1608samp.png" width = "45%"> <br />
+    <img src="images/aliasCompare.PNG" width = "80%"> <br />
+    If we check carefully at the baseline of the walls, especially on the back wall, we could see that the anti-aliasing feature is working.
+* With DOF(Depth of Field) vs No DOF:<br />
+    <img src="images/withoutDOF.png" width = "45%"> <img src="images/0.5DOF20.png" width = "45%"><br />
+    Depth of Field implemented great, we can totally see the difference as distance changes. The program with DOF only executed 433 loop so the right picture is a little too blurry.
+    
+### Bloopers
+* Here are some interesting bugs I encountered when I was trying to implement schlick's approximation(left) and DOF(right).<br />
+    <img src="images/schlick wrong 2.png" width = "45%"> <img src="images/dof10samp364.png" width = "45%"><br />
+
+
+  
diff --git a/images/0.5DOF20.png b/images/0.5DOF20.png
new file mode 100644
index 0000000..947d978
Binary files /dev/null and b/images/0.5DOF20.png differ
diff --git a/images/0.5dofsamp131.png b/images/0.5dofsamp131.png
new file mode 100644
index 0000000..98215f0
Binary files /dev/null and b/images/0.5dofsamp131.png differ
diff --git a/images/aliasCompare.PNG b/images/aliasCompare.PNG
new file mode 100644
index 0000000..4b8225d
Binary files /dev/null and b/images/aliasCompare.PNG differ
diff --git a/images/antialiasing1608samp.png b/images/antialiasing1608samp.png
new file mode 100644
index 0000000..6a18fe6
Binary files /dev/null and b/images/antialiasing1608samp.png differ
diff --git a/images/antialiasing420samp.png b/images/antialiasing420samp.png
new file mode 100644
index 0000000..f79b0f7
Binary files /dev/null and b/images/antialiasing420samp.png differ
diff --git a/images/cache 3194sample.png b/images/cache 3194sample.png
new file mode 100644
index 0000000..1f287ec
Binary files /dev/null and b/images/cache 3194sample.png differ
diff --git a/images/cache.png b/images/cache.png
new file mode 100644
index 0000000..c8c4e4f
Binary files /dev/null and b/images/cache.png differ
diff --git a/images/cornell.2018-10-02_23-37-45z.267samp.png b/images/cornell.2018-10-02_23-37-45z.267samp.png
new file mode 100644
index 0000000..830dff3
Binary files /dev/null and b/images/cornell.2018-10-02_23-37-45z.267samp.png differ
diff --git a/images/dof10samp364.png b/images/dof10samp364.png
new file mode 100644
index 0000000..ce7026f
Binary files /dev/null and b/images/dof10samp364.png differ
diff --git a/images/dof16samp358.png b/images/dof16samp358.png
new file mode 100644
index 0000000..8a89a5d
Binary files /dev/null and b/images/dof16samp358.png differ
diff --git a/images/dof20samp598.png b/images/dof20samp598.png
new file mode 100644
index 0000000..583ac95
Binary files /dev/null and b/images/dof20samp598.png differ
diff --git a/images/initial.gif b/images/initial.gif
new file mode 100644
index 0000000..27d49c5
Binary files /dev/null and b/images/initial.gif differ
diff --git a/images/no_antialiasing.png b/images/no_antialiasing.png
new file mode 100644
index 0000000..27f0c7f
Binary files /dev/null and b/images/no_antialiasing.png differ
diff --git a/images/schlick working.png b/images/schlick working.png
new file mode 100644
index 0000000..3007c9f
Binary files /dev/null and b/images/schlick working.png differ
diff --git a/images/schlick wrong 1.png b/images/schlick wrong 1.png
new file mode 100644
index 0000000..8bad2c5
Binary files /dev/null and b/images/schlick wrong 1.png differ
diff --git a/images/schlick wrong 2.png b/images/schlick wrong 2.png
new file mode 100644
index 0000000..aeb955e
Binary files /dev/null and b/images/schlick wrong 2.png differ
diff --git a/images/sorting slow.png b/images/sorting slow.png
new file mode 100644
index 0000000..fd815e0
Binary files /dev/null and b/images/sorting slow.png differ
diff --git a/images/withoutCompaction.gif b/images/withoutCompaction.gif
new file mode 100644
index 0000000..9547edc
Binary files /dev/null and b/images/withoutCompaction.gif differ
diff --git a/images/withoutDOF.png b/images/withoutDOF.png
new file mode 100644
index 0000000..583ac95
Binary files /dev/null and b/images/withoutDOF.png differ
diff --git a/images/withoutRefract.png b/images/withoutRefract.png
new file mode 100644
index 0000000..589ef1b
Binary files /dev/null and b/images/withoutRefract.png differ
diff --git a/scenes/cornell.txt b/scenes/cornell.txt
index 83ff820..122ebe4 100644
--- a/scenes/cornell.txt
+++ b/scenes/cornell.txt
@@ -1,4 +1,4 @@
-// Emissive material (light)
+﻿// Emissive material (light)
 MATERIAL 0
 RGB         1 1 1
 SPECEX      0
diff --git a/scenes/cornell_new.txt b/scenes/cornell_new.txt
new file mode 100644
index 0000000..3074208
--- /dev/null
+++ b/scenes/cornell_new.txt
@@ -0,0 +1,197 @@
+// Emissive material (light)
+MATERIAL 0
+RGB         1 1 1
+SPECEX      0
+SPECRGB     0 0 0
+REFL        0
+REFR        0
+REFRIOR     0
+EMITTANCE   5
+
+// Diffuse white
+MATERIAL 1
+RGB         .98 .98 .98
+SPECEX      0
+SPECRGB     0 0 0
+REFL        0
+REFR        0
+REFRIOR     0
+EMITTANCE   0
+
+// Diffuse red
+MATERIAL 2
+RGB         .85 .35 .35
+SPECEX      0
+SPECRGB     0 0 0
+REFL        0
+REFR        0
+REFRIOR     0
+EMITTANCE   0
+
+// Diffuse green
+MATERIAL 3
+RGB         .35 .85 .35
+SPECEX      0
+SPECRGB     0 0 0
+REFL        0
+REFR        0
+REFRIOR     0
+EMITTANCE   0
+
+// Specular white
+MATERIAL 4
+RGB         .98 .98 .98
+SPECEX      0
+SPECRGB     .98 .98 .98
+REFL        1
+REFR        0
+REFRIOR     0
+EMITTANCE   0
+
+// Camera
+CAMERA
+RES         800 800
+FOVY        45
+ITERATIONS  5000
+DEPTH       8
+FILE        cornell
+EYE         0.0 5 10.5
+LOOKAT      0 5 0
+UP          0 1 0
+
+
+// Ceiling light
+OBJECT 0
+cube
+material 0
+TRANS       0 10 0
+ROTAT       0 0 0
+SCALE       3 .3 3
+
+// Floor
+OBJECT 1
+cube
+material 1
+TRANS       0 0 0
+ROTAT       0 0 0
+SCALE       10 .01 10
+
+// Ceiling
+OBJECT 2
+cube
+material 1
+TRANS       0 10 0
+ROTAT       0 0 90
+SCALE       .01 10 10
+
+// Back wall
+OBJECT 3
+cube
+material 1
+TRANS       0 5 -15
+ROTAT       0 90 0
+SCALE       .01 10 10
+
+// Left wall
+OBJECT 4
+cube
+material 2
+TRANS       -5 5 0
+ROTAT       0 0 0
+SCALE       .01 10 10
+
+// Right wall
+OBJECT 5
+cube
+material 3
+TRANS       5 5 0
+ROTAT       0 0 0
+SCALE       .01 10 10
+
+// Sphere
+OBJECT 6
+sphere
+material 4
+TRANS       1 3 -2
+ROTAT       0 0 0
+SCALE       2 2 2
+
+// Floor
+OBJECT 7
+cube
+material 1
+TRANS       0 0 -10
+ROTAT       0 0 0
+SCALE       10 .01 10
+
+// Ceiling
+OBJECT 8
+cube
+material 1
+TRANS       0 10 -10
+ROTAT       0 0 90
+SCALE       .01 10 10
+
+// Left wall
+OBJECT 9
+cube
+material 2
+TRANS       -5 5 -10
+ROTAT       0 0 0
+SCALE       .01 10 10
+
+// Right wall
+OBJECT 10
+cube
+material 3
+TRANS       5 5 -10
+ROTAT       0 0 0
+SCALE       .01 10 10
+
+// Sphere
+OBJECT 11
+sphere
+material 4
+TRANS       1 3 2
+ROTAT       0 0 0
+SCALE       2 2 2
+
+// Sphere
+OBJECT 12
+sphere
+material 4
+TRANS       1 3 -6
+ROTAT       0 0 0
+SCALE       2 2 2
+
+// Sphere
+OBJECT 13
+sphere
+material 4
+TRANS       1 3 -9
+ROTAT       0 0 0
+SCALE       2 2 2
+
+// Sphere
+OBJECT 14
+sphere
+material 4
+TRANS       1 3 -12
+ROTAT       0 0 0
+SCALE       2 2 2
+
+// Ceiling light
+OBJECT 15
+cube
+material 0
+TRANS       5 5 -5
+ROTAT       0 0 0
+SCALE       .3 0.3 15
+
+// Ceiling light
+OBJECT 16
+cube
+material 0
+TRANS       -5 5 -5
+ROTAT       0 0 0
+SCALE       .3 0.3 15
\ No newline at end of file
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index a1cb3fb..3ca4297 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -19,5 +19,5 @@ set(SOURCE_FILES
 
 cuda_add_library(src
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_61
     )
diff --git a/src/interactions.h b/src/interactions.h
index 5ce3628..64761cb 100644
--- a/src/interactions.h
+++ b/src/interactions.h
@@ -76,4 +76,47 @@ void scatterRay(
     // TODO: implement this.
     // A basic implementation of pure-diffuse shading will just call the
     // calculateRandomDirectionInHemisphere defined above.
+	thrust::uniform_real_distribution<float> u01(0, 1);
+	float probability = u01(rng);
+	if (m.hasRefractive) {
+		float costheta = glm::dot(pathSegment.ray.direction, normal);
+		float n1 = costheta < 0 ? 1 : m.indexOfRefraction;
+		float n2 = costheta < 0 ? m.indexOfRefraction : 1;
+		float eta = n1 / n2;
+		float R0 = powf((n1 - n2) / (n1 + n2), 2);
+		float R_theta = R0 + (1 - R0)*powf(1 - glm::abs(costheta), 5);
+		if (probability > R_theta) {
+			pathSegment.ray.direction = glm::refract(pathSegment.ray.direction, normal, eta);
+			if(costheta >= 0) pathSegment.color *= m.color;
+		}
+		else {
+			pathSegment.ray.direction = glm::reflect(pathSegment.ray.direction, normal);
+			pathSegment.color *= m.specular.color;
+		}
+	}
+	else if (probability > m.hasReflective) {
+		pathSegment.color *= m.color;
+		pathSegment.ray.direction = calculateRandomDirectionInHemisphere(normal, rng);
+	}
+	else {
+		// specular reflection
+		pathSegment.color *= m.specular.color;
+		pathSegment.ray.direction = glm::reflect(pathSegment.ray.direction, normal);
+	}
+	pathSegment.ray.origin = intersect + 0.01f * pathSegment.ray.direction;
+}
+
+__host__ __device__
+void scatterRay1(
+	PathSegment & pathSegment,
+	glm::vec3 intersect,
+	glm::vec3 normal,
+	const Material &m,
+	thrust::default_random_engine &rng) {
+	// TODO: implement this.
+	// A basic implementation of pure-diffuse shading will just call the
+	// calculateRandomDirectionInHemisphere defined above.
+	pathSegment.ray.direction = calculateRandomDirectionInHemisphere(normal, rng);
+	pathSegment.color *= m.color;
+	pathSegment.ray.origin = intersect + normal * 0.01f;
 }
diff --git a/src/pathtrace.cu b/src/pathtrace.cu
index c1ec122..804ef7d 100644
--- a/src/pathtrace.cu
+++ b/src/pathtrace.cu
@@ -13,8 +13,15 @@
 #include "pathtrace.h"
 #include "intersections.h"
 #include "interactions.h"
+#include "device_launch_parameters.h"
+#include <thrust/device_vector.h>
+#include <thrust/host_vector.h>
 
 #define ERRORCHECK 1
+#define CACHE 1
+#define SORT 1
+#define ANTIALIASING 0
+#define DEPTHOFFIELD 0
 
 #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
 #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
@@ -38,6 +45,18 @@ void checkCUDAErrorFn(const char *msg, const char *file, int line) {
 #endif
 }
 
+struct checkRemainingBounces {
+	__host__ __device__ bool operator()(const PathSegment &pathSegment) {
+		return pathSegment.remainingBounces > 0;
+	}
+};
+
+struct sortMaterialID {
+	__host__ __device__ bool operator()(const ShadeableIntersection &o1, ShadeableIntersection &o2) {
+		return o1.materialId < o2.materialId;
+	}
+};
+
 __host__ __device__
 thrust::default_random_engine makeSeededRandomEngine(int iter, int index, int depth) {
     int h = utilhash((1 << 31) | (depth << 22) | iter) ^ utilhash(index);
@@ -75,6 +94,7 @@ static PathSegment * dev_paths = NULL;
 static ShadeableIntersection * dev_intersections = NULL;
 // TODO: static variables for device memory, any extra info you need, etc
 // ...
+static ShadeableIntersection * cache = NULL;
 
 void pathtraceInit(Scene *scene) {
     hst_scene = scene;
@@ -97,6 +117,9 @@ void pathtraceInit(Scene *scene) {
 
     // TODO: initialize any extra device memeory you need
 
+	cudaMalloc(&cache, pixelcount * sizeof(ShadeableIntersection));
+	cudaMemset(cache, 0, pixelcount * sizeof(ShadeableIntersection));
+
     checkCUDAError("pathtraceInit");
 }
 
@@ -107,7 +130,7 @@ void pathtraceFree() {
   	cudaFree(dev_materials);
   	cudaFree(dev_intersections);
     // TODO: clean up any extra device memory you created
-
+	cudaFree(cache);
     checkCUDAError("pathtraceFree");
 }
 
@@ -119,6 +142,24 @@ void pathtraceFree() {
 * motion blur - jitter rays "in time"
 * lens effect - jitter ray origin positions based on a lens
 */
+
+__device__ glm::fvec2 ConcentricSampleDisk(const glm::fvec2 &u) {
+	glm::fvec2 uOffSet = 2.f * u - glm::fvec2(1.f, 1.f);
+	if(uOffSet.x == 0.f && uOffSet.y == 0.f) {
+		return glm::fvec2(0.f);
+	}
+	float theta, r;
+	if (glm::abs(uOffSet.x) > glm::abs(uOffSet.y)) {
+		r = uOffSet.x;
+		theta = PI / 4 * (uOffSet.y / uOffSet.x);
+	}
+	else {
+		r = uOffSet.y;
+		theta = PI / 2 - PI / 4 * (uOffSet.x / uOffSet.y);
+	}
+	return r * glm::fvec2(cos(theta), sin(theta));
+}
+
 __global__ void generateRayFromCamera(Camera cam, int iter, int traceDepth, PathSegment* pathSegments)
 {
 	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
@@ -129,14 +170,57 @@ __global__ void generateRayFromCamera(Camera cam, int iter, int traceDepth, Path
 		PathSegment & segment = pathSegments[index];
 
 		segment.ray.origin = cam.position;
-    segment.color = glm::vec3(1.0f, 1.0f, 1.0f);
+		segment.color = glm::vec3(1.0f, 1.0f, 1.0f);
 
 		// TODO: implement antialiasing by jittering the ray
+#if ANTIALIASING
+		thrust::default_random_engine rng = makeSeededRandomEngine(iter, index, 0);
+		thrust::uniform_real_distribution<float> u01(0, 1);
+
+		segment.ray.direction = glm::normalize(cam.view
+			- cam.right * cam.pixelLength.x * ((float)x + u01(rng) - (float)cam.resolution.x * 0.5f)
+			- cam.up * cam.pixelLength.y * ((float)y + u01(rng) - (float)cam.resolution.y * 0.5f)
+		);
+#else
 		segment.ray.direction = glm::normalize(cam.view
 			- cam.right * cam.pixelLength.x * ((float)x - (float)cam.resolution.x * 0.5f)
 			- cam.up * cam.pixelLength.y * ((float)y - (float)cam.resolution.y * 0.5f)
 			);
+#endif
 
+#if DEPTHOFFIELD
+		float lensRadius = 1.f;
+		float focalDistance = 20.f;
+		if (lensRadius > 0.f) {
+			// on PBRT page 374
+			// sample point on lens
+			thrust::default_random_engine rng = makeSeededRandomEngine(iter, index, 0);
+			thrust::uniform_real_distribution<float> u01(0, 1);
+			glm::fvec2 pLens = lensRadius * ConcentricSampleDisk(glm::fvec2(u01, u01));
+			// compute
+			glm::vec3 pFocus = segment.ray.direction * glm::abs(focalDistance / segment.ray.direction.z) + segment.ray.origin;
+
+			//update
+			segment.ray.origin += cam.right * pLens.x + cam.up * pLens.y;
+			segment.ray.direction = glm::normalize(pFocus - segment.ray.origin);
+		}
+		//float lensRadius = 1.f;
+		//float focalDistance = 10.f;
+		//if (lensRadius > 0.f) {
+		//	// on PBRT page 374
+		//	// sample point on lens
+		//	thrust::default_random_engine rng = makeSeededRandomEngine(iter, index, 0);
+		//	thrust::uniform_real_distribution<float> u01(-1, 1);
+		//	float pLensx = focalDistance * u01(rng);
+		//	float pLensy = focalDistance * u01(rng);
+		//	// compute
+		//	glm::vec3 pFocus = segment.ray.direction * focalDistance + segment.ray.origin;
+
+		//	//update
+		//	segment.ray.origin += cam.right * pLensx + cam.up * pLensy;
+		//	segment.ray.direction = glm::normalize(pFocus - segment.ray.origin);
+		//}
+#endif
 		segment.pixelIndex = index;
 		segment.remainingBounces = traceDepth;
 	}
@@ -208,6 +292,7 @@ __global__ void computeIntersections(
 			intersections[path_index].t = t_min;
 			intersections[path_index].materialId = geoms[hit_geom_index].materialid;
 			intersections[path_index].surfaceNormal = normal;
+			intersections[path_index].intersect = intersect_point;
 		}
 	}
 }
@@ -246,14 +331,17 @@ __global__ void shadeFakeMaterial (
       // If the material indicates that the object was a light, "light" the ray
       if (material.emittance > 0.0f) {
         pathSegments[idx].color *= (materialColor * material.emittance);
+		pathSegments[idx].remainingBounces = 0;
       }
       // Otherwise, do some pseudo-lighting computation. This is actually more
       // like what you would expect from shading in a rasterizer like OpenGL.
       // TODO: replace this! you should be able to start with basically a one-liner
       else {
-        float lightTerm = glm::dot(intersection.surfaceNormal, glm::vec3(0.0f, 1.0f, 0.0f));
+        /*float lightTerm = glm::dot(intersection.surfaceNormal, glm::vec3(0.0f, 1.0f, 0.0f));
         pathSegments[idx].color *= (materialColor * lightTerm) * 0.3f + ((1.0f - intersection.t * 0.02f) * materialColor) * 0.7f;
-        pathSegments[idx].color *= u01(rng); // apply some noise because why not
+        pathSegments[idx].color *= u01(rng);*/ // apply some noise because why not
+		  scatterRay1(pathSegments[idx], intersection.intersect, intersection.surfaceNormal, material, rng);
+		  pathSegments[idx].remainingBounces--;
       }
     // If there was no intersection, color the ray black.
     // Lots of renderers use 4 channel color, RGBA, where A = alpha, often
@@ -261,6 +349,7 @@ __global__ void shadeFakeMaterial (
     // This can be useful for post-processing and image compositing.
     } else {
       pathSegments[idx].color = glm::vec3(0.0f);
+	  pathSegments[idx].remainingBounces = 0;
     }
   }
 }
@@ -277,6 +366,8 @@ __global__ void finalGather(int nPaths, glm::vec3 * image, PathSegment * iterati
 	}
 }
 
+
+
 /**
  * Wrapper for the __global__ call that sets up the kernel calls and does a ton
  * of memory management
@@ -332,6 +423,7 @@ void pathtrace(uchar4 *pbo, int frame, int iter) {
 	int depth = 0;
 	PathSegment* dev_path_end = dev_paths + pixelcount;
 	int num_paths = dev_path_end - dev_paths;
+	int initial_num_paths = num_paths;
 
 	// --- PathSegment Tracing Stage ---
 	// Shoot ray into scene, bounce between objects, push shading chunks
@@ -344,6 +436,32 @@ void pathtrace(uchar4 *pbo, int frame, int iter) {
 
 	// tracing
 	dim3 numblocksPathSegmentTracing = (num_paths + blockSize1d - 1) / blockSize1d;
+#if CACHE
+	if (iter == 1 && depth == 0) {
+		computeIntersections << <numblocksPathSegmentTracing, blockSize1d >> > (
+			depth
+			, num_paths
+			, dev_paths
+			, dev_geoms
+			, hst_scene->geoms.size()
+			, dev_intersections
+			);
+		cudaMemcpy(cache, dev_intersections, pixelcount * sizeof(ShadeableIntersection), cudaMemcpyDeviceToDevice);
+	}
+	else if (depth == 0) {
+		cudaMemcpy(dev_intersections, cache, pixelcount * sizeof(ShadeableIntersection), cudaMemcpyDeviceToDevice);
+	}
+	else {
+		computeIntersections << <numblocksPathSegmentTracing, blockSize1d >> > (
+			depth
+			, num_paths
+			, dev_paths
+			, dev_geoms
+			, hst_scene->geoms.size()
+			, dev_intersections
+			);
+	}
+# else
 	computeIntersections <<<numblocksPathSegmentTracing, blockSize1d>>> (
 		depth
 		, num_paths
@@ -352,6 +470,7 @@ void pathtrace(uchar4 *pbo, int frame, int iter) {
 		, hst_scene->geoms.size()
 		, dev_intersections
 		);
+#endif
 	checkCUDAError("trace one bounce");
 	cudaDeviceSynchronize();
 	depth++;
@@ -360,25 +479,32 @@ void pathtrace(uchar4 *pbo, int frame, int iter) {
 	// TODO:
 	// --- Shading Stage ---
 	// Shade path segments based on intersections and generate new rays by
-  // evaluating the BSDF.
-  // Start off with just a big kernel that handles all the different
-  // materials you have in the scenefile.
-  // TODO: compare between directly shading the path segments and shading
-  // path segments that have been reshuffled to be contiguous in memory.
-
-  shadeFakeMaterial<<<numblocksPathSegmentTracing, blockSize1d>>> (
-    iter,
-    num_paths,
-    dev_intersections,
-    dev_paths,
-    dev_materials
-  );
-  iterationComplete = true; // TODO: should be based off stream compaction results.
+	// evaluating the BSDF.
+	// Start off with just a big kernel that handles all the different
+	// materials you have in the scenefile.
+	// TODO: compare between directly shading the path segments and shading
+	// path segments that have been reshuffled to be contiguous in memory.
+
+#if SORT
+	thrust::sort_by_key(thrust::device, dev_intersections, dev_intersections + num_paths, dev_paths, sortMaterialID());
+#endif
+	
+	shadeFakeMaterial<<<numblocksPathSegmentTracing, blockSize1d>>> (
+		iter,
+		num_paths,
+		dev_intersections,
+		dev_paths,
+		dev_materials
+	);
+	// Stream Compaction
+	PathSegment* dev_path_end_2 = thrust::partition(thrust::device, dev_paths, dev_paths + num_paths, checkRemainingBounces());
+	num_paths = dev_path_end_2 - dev_paths;
+	if(num_paths <= 0) iterationComplete = true; // TODO: should be based off stream compaction results.
 	}
 
   // Assemble this iteration and apply it to the image
   dim3 numBlocksPixels = (pixelcount + blockSize1d - 1) / blockSize1d;
-	finalGather<<<numBlocksPixels, blockSize1d>>>(num_paths, dev_image, dev_paths);
+	finalGather<<<numBlocksPixels, blockSize1d>>>(initial_num_paths, dev_image, dev_paths);
 
     ///////////////////////////////////////////////////////////////////////////
 
diff --git a/src/sceneStructs.h b/src/sceneStructs.h
index b38b820..0baa259 100644
--- a/src/sceneStructs.h
+++ b/src/sceneStructs.h
@@ -49,6 +49,7 @@ struct Camera {
     glm::vec3 right;
     glm::vec2 fov;
     glm::vec2 pixelLength;
+
 };
 
 struct RenderState {
@@ -73,4 +74,5 @@ struct ShadeableIntersection {
   float t;
   glm::vec3 surfaceNormal;
   int materialId;
+  glm::vec3 intersect;
 };
diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt
index ac358c9..4bb0dc2 100644
--- a/stream_compaction/CMakeLists.txt
+++ b/stream_compaction/CMakeLists.txt
@@ -1,7 +1,17 @@
 set(SOURCE_FILES
+    "common.h"
+    "common.cu"
+    "cpu.h"
+    "cpu.cu"
+    "naive.h"
+    "naive.cu"
+    "efficient.h"
+    "efficient.cu"
+    "thrust.h"
+    "thrust.cu"
     )
 
 cuda_add_library(stream_compaction
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_61
     )
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
new file mode 100644
index 0000000..ea8e47a
--- /dev/null
+++ b/stream_compaction/common.cu
@@ -0,0 +1,48 @@
+#include "common.h"
+#include "device_launch_parameters.h"
+
+void checkCUDAErrorFn(const char *msg, const char *file, int line) {
+    cudaError_t err = cudaGetLastError();
+    if (cudaSuccess == err) {
+        return;
+    }
+
+    fprintf(stderr, "CUDA error");
+    if (file) {
+        fprintf(stderr, " (%s:%d)", file, line);
+    }
+    fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err));
+    exit(EXIT_FAILURE);
+}
+
+
+namespace StreamCompaction {
+    namespace Common {
+
+        /**
+         * Maps an array to an array of 0s and 1s for stream compaction. Elements
+         * which map to 0 will be removed, and elements which map to 1 will be kept.
+         */
+        __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
+            // TODO
+			int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+			if (index >= n) return;
+			bools[index] = idata[index] == 0 ? 0 : 1;
+        }
+
+        /**
+         * Performs scatter on an array. That is, for each element in idata,
+         * if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
+         */
+        __global__ void kernScatter(int n, int *odata,
+                const int *idata, const int *bools, const int *indices) {
+            // TODO
+			int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+			if (index >= n) return;
+			if (bools[index] == 1) {
+				odata[indices[index]] = idata[index];
+			}
+        }
+
+    }
+}
diff --git a/stream_compaction/common.h b/stream_compaction/common.h
new file mode 100644
index 0000000..996997e
--- /dev/null
+++ b/stream_compaction/common.h
@@ -0,0 +1,132 @@
+#pragma once
+
+#include <cuda.h>
+#include <cuda_runtime.h>
+
+#include <cstdio>
+#include <cstring>
+#include <cmath>
+#include <algorithm>
+#include <chrono>
+#include <stdexcept>
+
+#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
+#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
+
+/**
+ * Check for CUDA errors; print and exit if there was a problem.
+ */
+void checkCUDAErrorFn(const char *msg, const char *file = NULL, int line = -1);
+
+inline int ilog2(int x) {
+    int lg = 0;
+    while (x >>= 1) {
+        ++lg;
+    }
+    return lg;
+}
+
+inline int ilog2ceil(int x) {
+    return x == 1 ? 0 : ilog2(x - 1) + 1;
+}
+
+namespace StreamCompaction {
+    namespace Common {
+        __global__ void kernMapToBoolean(int n, int *bools, const int *idata);
+
+        __global__ void kernScatter(int n, int *odata,
+                const int *idata, const int *bools, const int *indices);
+
+	    /**
+	    * This class is used for timing the performance
+	    * Uncopyable and unmovable
+        *
+        * Adapted from WindyDarian(https://github.com/WindyDarian)
+	    */
+	    class PerformanceTimer
+	    {
+	    public:
+		    PerformanceTimer()
+		    {
+			    cudaEventCreate(&event_start);
+			    cudaEventCreate(&event_end);
+		    }
+
+		    ~PerformanceTimer()
+		    {
+			    cudaEventDestroy(event_start);
+			    cudaEventDestroy(event_end);
+		    }
+
+		    void startCpuTimer()
+		    {
+			    if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); }
+			    cpu_timer_started = true;
+
+			    time_start_cpu = std::chrono::high_resolution_clock::now();
+		    }
+
+		    void endCpuTimer()
+		    {
+			    time_end_cpu = std::chrono::high_resolution_clock::now();
+
+			    if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); }
+
+			    std::chrono::duration<double, std::milli> duro = time_end_cpu - time_start_cpu;
+			    prev_elapsed_time_cpu_milliseconds =
+				    static_cast<decltype(prev_elapsed_time_cpu_milliseconds)>(duro.count());
+
+			    cpu_timer_started = false;
+		    }
+
+		    void startGpuTimer()
+		    {
+			    if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); }
+			    gpu_timer_started = true;
+
+			    cudaEventRecord(event_start);
+		    }
+
+		    void endGpuTimer()
+		    {
+			    cudaEventRecord(event_end);
+			    cudaEventSynchronize(event_end);
+
+			    if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); }
+
+			    cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end);
+			    gpu_timer_started = false;
+		    }
+
+		    float getCpuElapsedTimeForPreviousOperation() //noexcept //(damn I need VS 2015
+		    {
+			    return prev_elapsed_time_cpu_milliseconds;
+		    }
+
+		    float getGpuElapsedTimeForPreviousOperation() //noexcept
+		    {
+			    return prev_elapsed_time_gpu_milliseconds;
+		    }
+
+		    // remove copy and move functions
+		    PerformanceTimer(const PerformanceTimer&) = delete;
+		    PerformanceTimer(PerformanceTimer&&) = delete;
+		    PerformanceTimer& operator=(const PerformanceTimer&) = delete;
+		    PerformanceTimer& operator=(PerformanceTimer&&) = delete;
+
+	    private:
+		    cudaEvent_t event_start = nullptr;
+		    cudaEvent_t event_end = nullptr;
+
+		    using time_point_t = std::chrono::high_resolution_clock::time_point;
+		    time_point_t time_start_cpu;
+		    time_point_t time_end_cpu;
+
+		    bool cpu_timer_started = false;
+		    bool gpu_timer_started = false;
+
+		    float prev_elapsed_time_cpu_milliseconds = 0.f;
+		    float prev_elapsed_time_gpu_milliseconds = 0.f;
+	    };
+    }
+}
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
new file mode 100644
index 0000000..0f5093c
--- /dev/null
+++ b/stream_compaction/cpu.cu
@@ -0,0 +1,74 @@
+#include <cstdio>
+#include "cpu.h"
+
+#include "common.h"
+
+namespace StreamCompaction {
+    namespace CPU {
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+	        static PerformanceTimer timer;
+	        return timer;
+        }
+
+        /**
+         * CPU scan (prefix sum).
+         * For performance analysis, this is supposed to be a simple for loop.
+         * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
+         */
+        void scan(int n, int *odata, const int *idata) {
+	        //timer().startCpuTimer();
+            // TODO	
+			odata[0] = 0;
+			for (int i = 1; i < n; ++i) {
+				odata[i] = odata[i - 1] + idata[i - 1];
+			}
+	        //timer().endCpuTimer();
+        }
+
+        /**
+         * CPU stream compaction without using the scan function.
+         *
+         * @returns the number of elements remaining after compaction.
+         */
+        int compactWithoutScan(int n, int *odata, const int *idata) {
+	        timer().startCpuTimer();
+            // TODO
+			int index = 0;
+			for (int i = 0; i < n; i++) {
+				if (idata[i] != 0) {
+					odata[index++] = idata[i];
+				}
+			}
+	        timer().endCpuTimer();
+            return index;
+        }
+
+        /**
+         * CPU stream compaction using scan and scatter, like the parallel version.
+         *
+         * @returns the number of elements remaining after compaction.
+         */
+        int compactWithScan(int n, int *odata, const int *idata) {
+			// allocate space for two middle status arrays
+			int *temp1 = (int*)malloc(n * sizeof(int));
+			int *temp2 = (int*)malloc(n * sizeof(int));
+			timer().startCpuTimer();
+	        // TODO
+			for (int i = 0; i < n; ++i) {
+				temp1[i] = idata[i] == 0 ? 0 : 1;
+			}
+			scan(n, temp2, temp1);
+			int index = 0;
+			for (int i = 0; i < n; ++i) {
+				if (temp1[i] == 1) {
+					odata[temp2[i]] = idata[i];
+					index++;
+				}
+			}
+	        timer().endCpuTimer();
+            return index;
+        }
+    }
+}
diff --git a/stream_compaction/cpu.h b/stream_compaction/cpu.h
new file mode 100644
index 0000000..236ce11
--- /dev/null
+++ b/stream_compaction/cpu.h
@@ -0,0 +1,15 @@
+#pragma once
+
+#include "common.h"
+
+namespace StreamCompaction {
+    namespace CPU {
+	    StreamCompaction::Common::PerformanceTimer& timer();
+
+        void scan(int n, int *odata, const int *idata);
+
+        int compactWithoutScan(int n, int *odata, const int *idata);
+
+        int compactWithScan(int n, int *odata, const int *idata);
+    }
+}
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
new file mode 100644
index 0000000..1bfcb30
--- /dev/null
+++ b/stream_compaction/efficient.cu
@@ -0,0 +1,112 @@
+#include <cuda.h>
+#include <cuda_runtime.h>
+#include "common.h"
+#include "efficient.h"
+#include "device_launch_parameters.h"
+#include "../src/scene.h"
+#define blockSize 128
+
+
+namespace StreamCompaction {
+    namespace Efficient {
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
+        }
+
+        /**
+         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+         */
+		__global__ void upSweep(int n, int k, int *dev) {
+			int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+			if (index >= n) return;
+
+			if ((index % (2 * k) == 0) && (index + (2 * k) <= n))
+				dev[index + (2 * k) - 1] += dev[index + k - 1];
+		}
+
+		__global__ void downSweep(int n, int k, int *idata) {
+			int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+			if (index >= n) return;
+			// need to check boundary
+			if ((index % (2 * k) == 0) && (index + (2 * k) <= n)) {
+				int temp = idata[index + k - 1];
+				idata[index + k - 1] = idata[index + (2 * k) - 1];
+				idata[index + (2 * k) - 1] += temp;
+			}
+		}
+
+        void scan(int n, int *odata, const int *idata) {
+			int *exclusive;
+			int length = pow(2, ilog2ceil(n));
+			cudaMalloc((void**)&exclusive, length * sizeof(int));
+			cudaMemset(exclusive, 0, length * sizeof(int));
+			cudaMemcpy(exclusive, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+			dim3 fullBlocksPerGrid((length + blockSize - 1) / blockSize);
+			//timer().startGpuTimer();
+			// TODO
+			// up-sweep
+			for (int d = 1; d < length; d *= 2) {
+				upSweep<<< fullBlocksPerGrid, blockSize >>>(length, d, exclusive);
+			}
+			cudaMemset(exclusive + length - 1, 0, sizeof(int));
+			// down-sweep
+			for (int d = length / 2; d >= 1; d /= 2) {
+				downSweep<<< fullBlocksPerGrid, blockSize >>>(length, d, exclusive);
+			}
+            //timer().endGpuTimer();
+			cudaMemcpy(odata, exclusive, n * sizeof(int), cudaMemcpyDeviceToHost);
+			cudaFree(exclusive);
+        }
+
+        /**
+         * Performs stream compaction on idata, storing the result into odata.
+         * All zeroes are discarded.
+         *
+         * @param n      The number of elements in idata.
+         * @param odata  The array into which to store elements.
+         * @param idata  The array of elements to compact.
+         * @returns      The number of elements remaining after compaction.
+         */
+        int compact(int n, PathSegment *odata,  PathSegment *idata) {
+			int *bools;
+			int *indices;
+			int *i_aug;
+			int *o_aug;
+			int length = pow(2, ilog2ceil(n));
+			cudaMalloc((void**)&bools, length * sizeof(int));
+			cudaMalloc((void**)&indices, length * sizeof(int));
+			cudaMalloc((void**)&i_aug, n * sizeof(int));
+			cudaMalloc((void**)&o_aug, n * sizeof(int));
+
+			cudaMemset(bools, 0, length * sizeof(int));
+			cudaMemset(indices, 0, length * sizeof(int));
+
+			cudaMemcpy(i_aug, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+			dim3 fullBlocksPerGrid((length + blockSize - 1) / blockSize);
+
+			//timer().startGpuTimer();
+            // TODO
+			StreamCompaction::Common::kernMapToBoolean<<< fullBlocksPerGrid, blockSize >>>(n, bools, i_aug);
+			scan(n, indices, bools);
+			StreamCompaction::Common::kernScatter <<< fullBlocksPerGrid, blockSize >>>(n, o_aug, i_aug, bools, indices);
+            //timer().endGpuTimer();
+			cudaMemcpy(odata, o_aug, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+			int num1 = 0;
+			cudaMemcpy(&num1, &bools[n - 1], sizeof(int), cudaMemcpyDeviceToHost);
+			int num2 = 0;
+			cudaMemcpy(&num2, &indices[n - 1], sizeof(int), cudaMemcpyDeviceToHost);
+			
+			cudaFree(bools);
+			cudaFree(indices);
+			cudaFree(i_aug);
+			cudaFree(o_aug);
+
+            return num1 + num2;
+        }
+    }
+}
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
new file mode 100644
index 0000000..de31059
--- /dev/null
+++ b/stream_compaction/efficient.h
@@ -0,0 +1,13 @@
+#pragma once
+
+#include "common.h"
+
+namespace StreamCompaction {
+    namespace Efficient {
+        StreamCompaction::Common::PerformanceTimer& timer();
+
+        void scan(int n, int *odata, const int *idata);
+
+        int compact(int n, PathSegment *odata, PathSegment*idata);
+    }
+}
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
new file mode 100644
index 0000000..105f38c
--- /dev/null
+++ b/stream_compaction/naive.cu
@@ -0,0 +1,63 @@
+#include <cuda.h>
+#include <cuda_runtime.h>
+#include "common.h"
+#include "naive.h"
+#include "device_launch_parameters.h"
+
+#define blockSize 128
+
+
+namespace StreamCompaction {
+    namespace Naive {
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
+        }
+        // TODO: __global__
+		__global__ void naiveParallelScan(int n, int k, int *odata, int *idata) {
+			int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+			if (index >= n) return;
+			if (index >= k) {
+				odata[index] = idata[index] + idata[index - k];
+			}
+			else {
+				odata[index] = idata[index];
+			}
+		}
+		__global__ void toExclusive(int n, int *odata, int *idata) {
+			int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+			if (index >= n) return;
+			if (index > 0) {
+				odata[index] = idata[index - 1];
+			}
+			else {
+				odata[index] = 0;
+			}
+		}
+        /**
+         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+         */
+        void scan(int n, int *odata, const int *idata) {
+			int* inclusive;
+			int* exclusive;
+			cudaMalloc((int**)&inclusive, n * sizeof(int));
+			cudaMalloc((int**)&exclusive, n * sizeof(int));
+			cudaMemcpy(inclusive, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+			dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+            timer().startGpuTimer();
+            // TODO
+			for (int d = 0; d <= ilog2ceil(n); d++) {
+				naiveParallelScan<<< fullBlocksPerGrid, blockSize >>>(n, pow(2.0, d), exclusive, inclusive);
+				// ping-pong
+				std::swap(exclusive, inclusive);
+			}
+			toExclusive<<< fullBlocksPerGrid, blockSize >>>(n, exclusive, inclusive);
+            timer().endGpuTimer();
+			cudaMemcpy(odata, exclusive, n * sizeof(int), cudaMemcpyDeviceToHost);
+			cudaFree(inclusive);
+			cudaFree(exclusive);
+        }
+    }
+}
diff --git a/stream_compaction/naive.h b/stream_compaction/naive.h
new file mode 100644
index 0000000..37dcb06
--- /dev/null
+++ b/stream_compaction/naive.h
@@ -0,0 +1,11 @@
+#pragma once
+
+#include "common.h"
+
+namespace StreamCompaction {
+    namespace Naive {
+        StreamCompaction::Common::PerformanceTimer& timer();
+
+        void scan(int n, int *odata, const int *idata);
+    }
+}
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
new file mode 100644
index 0000000..9bd128d
--- /dev/null
+++ b/stream_compaction/thrust.cu
@@ -0,0 +1,36 @@
+#include <cuda.h>
+#include <cuda_runtime.h>
+#include <thrust/device_vector.h>
+#include <thrust/host_vector.h>
+#include <thrust/scan.h>
+#include "common.h"
+#include "thrust.h"
+
+namespace StreamCompaction {
+    namespace Thrust {
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
+        }
+        /**
+         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+         */
+        void scan(int n, int *odata, const int *idata) {
+			thrust::device_vector<int> dv_in(idata, idata + n);
+			thrust::device_vector<int> dv_out(odata, odata + n);
+			
+			//timer().startGpuTimer();
+            // TODO use `thrust::exclusive_scan`
+            // example: for device_vectors dv_in and dv_out:
+            // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+			thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            
+			//timer().endGpuTimer();
+			thrust::copy(dv_out.begin(), dv_out.end(), odata);
+
+			
+        }
+    }
+}
diff --git a/stream_compaction/thrust.h b/stream_compaction/thrust.h
new file mode 100644
index 0000000..fe98206
--- /dev/null
+++ b/stream_compaction/thrust.h
@@ -0,0 +1,11 @@
+#pragma once
+
+#include "common.h"
+
+namespace StreamCompaction {
+    namespace Thrust {
+        StreamCompaction::Common::PerformanceTimer& timer();
+
+        void scan(int n, int *odata, const int *idata);
+    }
+}