diff --git a/README.md b/README.md
index 6e02afa..58328fe 100644
--- a/README.md
+++ b/README.md
@@ -1,133 +1,10 @@
-Project-2
-=========
-
-A Study in Parallel Algorithms : Stream Compaction
-
-# INTRODUCTION
-Many of the algorithms you have learned thus far in your career have typically
-been developed from a serial standpoint. When it comes to GPUs, we are mainly
-looking at massively parallel work. Thus, it is necessary to reorient our
-thinking. In this project, we will be implementing a couple different versions
-of prefix sum. We will start with a simple single thread serial CPU version,
-and then move to a naive GPU version. Each part of this homework is meant to
-follow the logic of the previous parts, so please do not do this homework out of
-order.
-
-This project will serve as a stream compaction library that you may use (and
-will want to use) in your
-future projects. For that reason, we suggest you create proper header and CUDA
-files so that you can reuse this code later. You may want to create a separate
-cpp file that contains your main function so that you can test the code you
-write.
-
-# OVERVIEW
-Stream compaction is broken down into two parts: (1) scan, and (2) scatter.
-
-## SCAN
-Scan or prefix sum is the summation of the elements in an array such that the
-resulting array is the summation of the terms before it. Prefix sum can either
-be inclusive, meaning the current term is a summation of all the elements before
-it and itself, or exclusive, meaning the current term is a summation of all
-elements before it excluding itself.
-
-Inclusive:
-
-In : [ 3 4 6 7 9 10 ]
-
-Out : [ 3 7 13 20 29 39 ]
-
-Exclusive
-
-In : [ 3 4 6 7 9 10 ]
-
-Out : [ 0 3 7 13 20 29 ]
-
-Note that the resulting prefix sum will always be n + 1 elements if the input
-array is of length n. Similarly, the first element of the exclusive prefix sum
-will always be 0. In the following sections, all references to prefix sum will
-be to the exclusive version of prefix sum.
-
-## SCATTER
-The scatter section of stream compaction takes the results of the previous scan
-in order to reorder the elements to form a compact array.
-
-For example, let's say we have the following array:
-[ 0 0 3 4 0 6 6 7 0 1 ]
-
-We would only like to consider the non-zero elements in this zero, so we would
-like to compact it into the following array:
-[ 3 4 6 6 7 1 ]
-
-We can perform a transform on input array to transform it into a boolean array:
-
-In : [ 0 0 3 4 0 6 6 7 0 1 ]
-
-Out : [ 0 0 1 1 0 1 1 1 0 1 ]
-
-Performing a scan on the output, we get the following array :
-
-In : [ 0 0 1 1 0 1 1 1 0 1 ]
-
-Out : [ 0 0 0 1 2 2 3 4 5 5 ]
-
-Notice that the output array produces a corresponding index array that we can
-use to create the resulting array for stream compaction.
-
-# PART 1 : REVIEW OF PREFIX SUM
-Given the definition of exclusive prefix sum, please write a serial CPU version
-of prefix sum. You may write this in the cpp file to separate this from the
-CUDA code you will be writing in your .cu file.
-
-# PART 2 : NAIVE PREFIX SUM
-We will now parallelize this the previous section's code. Recall from lecture
-that we can parallelize this using a series of kernel calls. In this portion,
-you are NOT allowed to use shared memory.
-
-### Questions
-* Compare this version to the serial version of exclusive prefix scan. Please
-  include a table of how the runtimes compare on different lengths of arrays.
-* Plot a graph of the comparison and write a short explanation of the phenomenon you
-  see here.
+# PART 2: NAIVE PREFIX SUM
+As the graph shows, the naive GPU scan is faster than the CPU version while
+the array length stays below roughly 1.2 million elements, and its runtime
+degrades as the array grows beyond that. The naive version becomes inefficient
+on large arrays because every pass performs all of its reads and writes
+through global memory.
 # PART 3 : OPTIMIZING PREFIX SUM
-In the previous section we did not take into account shared memory. In the
-previous section, we kept everything in global memory, which is much slower than
-shared memory.
-
-## PART 3a : Write prefix sum for a single block
-Shared memory is accessible to threads of a block. Please write a version of
-prefix sum that works on a single block.
-
-## PART 3b : Generalizing to arrays of any length.
-Taking the previous portion, please write a version that generalizes prefix sum
-to arbitrary length arrays, this includes arrays that will not fit on one block.
-### Questions
-* Compare this version to the parallel prefix sum using global memory.
-* Plot a graph of the comparison and write a short explanation of the phenomenon
-  you see here.
+As the graph shows, the GPU shared-memory scan (green) is always faster than
+both the naive GPU version (red) and the CPU version (blue). The experiment
+used a block size of 1024; with a smaller block size, the shared-memory
+version can run considerably faster. The main reason the shared-memory version
+loses efficiency past about one million elements is that bank conflicts begin
+to occur more frequently, so some threads must wait for their shared-memory
+bank to become available.
 # PART 4 : ADDING SCATTER
-First create a serial version of scatter by expanding the serial version of
-prefix sum. Then create a GPU version of scatter. Combine the function call
-such that, given an array, you can call stream compact and it will compact the
-array for you. Finally, write a version using thrust.
-
-### Questions
-* Compare your version of stream compact to your version using thrust. How do
-  they compare? How might you optimize yours more, or how might thrust's stream
-  compact be optimized.
-
-# EXTRA CREDIT (+10)
-For extra credit, please optimize your prefix sum for work parallelism and to
-deal with bank conflicts. Information on this can be found in the GPU Gems
-chapter listed in the references.
-
-# SUBMISSION
-Please answer all the questions in each of the subsections above and write your
-answers in the README by overwriting the README file. In future projects, we
-expect your analysis to be similar to the one we have led you through in this
-project. Like other projects, please open a pull request and email Harmony.
-# REFERENCES
-"Parallel Prefix Sum (Scan) with CUDA." GPU Gems 3.
+According to the graph, the thrust version looks slowest and my GPU version
+fastest. However, the thrust version may only appear slow because I measured
+it with the CPU clock rather than with GPU events; if thrust handles bank
+conflicts well, it should in fact run faster than my GPU version.
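+
+One way to make that comparison fairer would be to time the thrust call with
+CUDA events so that only device work is measured. The following is a minimal
+sketch, not code from this submission; it assumes the input already lives in
+device memory as `dev_in`, and `timeThrustCompact` is a hypothetical helper
+name:
+
+```cuda
+#include <cuda_runtime.h>
+#include <thrust/copy.h>
+#include <thrust/device_ptr.h>
+
+// Keep strictly positive elements, matching the host-side predicate.
+struct IsPositive {
+    __host__ __device__ bool operator()(const float x) const { return x > 0; }
+};
+
+// Returns the elapsed GPU time, in milliseconds, of one compaction pass.
+float timeThrustCompact(float * dev_in, float * dev_out, int n)
+{
+    thrust::device_ptr<float> in(dev_in), out(dev_out);
+    cudaEvent_t start, stop;
+    cudaEventCreate(&start);
+    cudaEventCreate(&stop);
+    cudaEventRecord(start, 0);               // bracket only the device work
+    thrust::copy_if(in, in + n, out, IsPositive());
+    cudaEventRecord(stop, 0);
+    cudaEventSynchronize(stop);
+    float ms = 0.0f;
+    cudaEventElapsedTime(&ms, start, stop);
+    cudaEventDestroy(start);
+    cudaEventDestroy(stop);
+    return ms;
+}
+```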
diff --git a/StreamCompaction/Debug/StreamCompaction.ilk b/StreamCompaction/Debug/StreamCompaction.ilk
new file mode 100644
index 0000000..440b07d
Binary files /dev/null and b/StreamCompaction/Debug/StreamCompaction.ilk differ
diff --git a/StreamCompaction/Debug/StreamCompaction.pdb b/StreamCompaction/Debug/StreamCompaction.pdb
new file mode 100644
index 0000000..b60ba13
Binary files /dev/null and b/StreamCompaction/Debug/StreamCompaction.pdb differ
diff --git a/StreamCompaction/StreamCompaction.sdf b/StreamCompaction/StreamCompaction.sdf
new file mode 100644
index 0000000..eb745a7
Binary files /dev/null and b/StreamCompaction/StreamCompaction.sdf differ
diff --git a/StreamCompaction/StreamCompaction.sln b/StreamCompaction/StreamCompaction.sln
new file mode 100644
index 0000000..d5442bd
--- /dev/null
+++ b/StreamCompaction/StreamCompaction.sln
@@ -0,0 +1,20 @@
+
+Microsoft Visual Studio Solution File, Format Version 11.00
+# Visual Studio 2010
+Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "StreamCompaction", "StreamCompaction\StreamCompaction.vcxproj", "{BDE1CCCD-C792-40EB-B80B-AA2721029CEA}"
+EndProject
+Global
+	GlobalSection(SolutionConfigurationPlatforms) = preSolution
+		Debug|Win32 = Debug|Win32
+		Release|Win32 = Release|Win32
+	EndGlobalSection
+	GlobalSection(ProjectConfigurationPlatforms) = postSolution
+		{BDE1CCCD-C792-40EB-B80B-AA2721029CEA}.Debug|Win32.ActiveCfg = Debug|Win32
+		{BDE1CCCD-C792-40EB-B80B-AA2721029CEA}.Debug|Win32.Build.0 = Debug|Win32
+		{BDE1CCCD-C792-40EB-B80B-AA2721029CEA}.Release|Win32.ActiveCfg = Release|Win32
+		{BDE1CCCD-C792-40EB-B80B-AA2721029CEA}.Release|Win32.Build.0 = Release|Win32
+	EndGlobalSection
+	GlobalSection(SolutionProperties) = preSolution
+		HideSolutionNode = FALSE
+	EndGlobalSection
+EndGlobal
diff --git a/StreamCompaction/StreamCompaction.suo b/StreamCompaction/StreamCompaction.suo
new file mode 100644
index 0000000..1464817
Binary files /dev/null and b/StreamCompaction/StreamCompaction.suo differ
diff --git a/StreamCompaction/StreamCompaction/Debug/vc100.idb b/StreamCompaction/StreamCompaction/Debug/vc100.idb
new file mode 100644
index 0000000..bd233c6
Binary files /dev/null and b/StreamCompaction/StreamCompaction/Debug/vc100.idb differ
diff --git a/StreamCompaction/StreamCompaction/Debug/vc100.pdb b/StreamCompaction/StreamCompaction/Debug/vc100.pdb
new file mode 100644
index 0000000..34583b5
Binary files /dev/null and b/StreamCompaction/StreamCompaction/Debug/vc100.pdb differ
diff --git a/StreamCompaction/StreamCompaction/StreamCompaction.cu b/StreamCompaction/StreamCompaction/StreamCompaction.cu
new file mode 100644
index 0000000..1969b45
--- /dev/null
+++ b/StreamCompaction/StreamCompaction/StreamCompaction.cu
@@ -0,0 +1,284 @@
+// headers reconstructed from usage (assumed; the original names were lost)
+#include <cuda.h>
+#include <cuda_runtime.h>
+#include <iostream>
+#include <cmath>
+#include "StreamCompaction.h"
+
+using namespace std;
+// part 2: naive prefix sum
+__global__ void writeBuffer(float * in, float * out, int n)
+{
+    int index = blockDim.x * blockIdx.x + threadIdx.x;
+    if(index < n)
+        out[index] = in[index];
+}
+
+__global__ void naive_sum(float * in, float * out, int d, int n)
+{
+    int index = blockDim.x * blockIdx.x + threadIdx.x;
+
+    if (index >= d && index < n)
+        in[index] = out[index - d] + out[index];
+}
+
+__global__ void writeOutput(float * in, float * out, int n)
+{
+    int index = blockDim.x * blockIdx.x + threadIdx.x;
+    if(index < n)
+        out[index+1] = in[index];
+}
+
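+
+// Note: the three kernels above implement one round of the naive scan that
+// the README discusses. writeBuffer snapshots the working array, naive_sum
+// adds in the element d slots to the left, and writeOutput shifts everything
+// right by one to turn the inclusive result into an exclusive scan. Every
+// read and write goes through global memory, which is why the README reports
+// the naive version scaling poorly on large arrays.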
+
+// part 3a: shared sum with single block
+__global__ void single_shared_sum(float * in, float * out, int n)
+{
+    extern __shared__ float tmp[];
+
+    int index = threadIdx.x;
+    tmp[index] = in[index];
+    __syncthreads();
+
+    if(index < n)
+    {
+        for(int i = 1; i < n; i *= 2)
+        {
+            if (index >= i)
+            {
+                tmp[index] = tmp[index] + tmp[index-i];
+            }
+            __syncthreads();
+        }
+    }
+    out[index+1] = tmp[index];
+}
+
+
+// part 3b: shared sum with arbitrary array length
+__global__ void shared_sum( float * in, float * out, int n, float * sum)
+{
+    extern __shared__ float tmp[];
+
+    int index = blockIdx.x * blockDim.x + threadIdx.x;
+    if(index < n)
+        tmp[threadIdx.x] = in[index];
+
+    if(index < n)
+    {
+        // up-sweep (reduce) phase
+        for(int i = 1; i < blockDim.x; i *= 2)
+        {
+            __syncthreads();
+            int j = (threadIdx.x + 1) * 2 * i - 1;
+            if (j < blockDim.x)
+            {
+                tmp[j] += tmp[j-i];
+            }
+        }
+
+        // down-sweep phase
+        for(int i = BlockSize/4; i > 0; i /= 2)
+        {
+            __syncthreads();
+            int j = (threadIdx.x + 1) * 2 * i - 1;
+            if (j + i < blockDim.x)
+            {
+                tmp[j + i] += tmp[j];
+            }
+
+        }
+    }
+
+    __syncthreads();
+    out[index+1] = tmp[threadIdx.x];
+
+    if(threadIdx.x == 0)
+        sum[blockIdx.x] = tmp[BlockSize - 1];
+}
+
+__global__ void add_sum( float * out, int n, float * sum)
+{
+    int index = blockIdx.x * blockDim.x + threadIdx.x;
+    if(index < n && blockIdx.x > 0)
+        out[index] += sum[blockIdx.x];
+}
+
+// part 4: scatter and stream compact
+__global__ void generateBoolArray(float * in, float * out, int n)
+{
+    int index = blockIdx.x * blockDim.x + threadIdx.x;
+    if(index < n)
+    {
+        if(in[index] > 0)
+            out[index] = 1;
+        else
+            out[index] = 0;
+    }
+}
+
+__global__ void scatter(float * in, float * indexArray, float * out, int n)
+{
+    int index = blockIdx.x * blockDim.x + threadIdx.x;
+    if(index < n)
+    {
+        if(in[index] > 0)
+        {
+            int tmp = indexArray[index];
+            out[tmp] = in[index];
+        }
+    }
+}
+
+
+// wrapper functions
+
+void GPU_naive_prefix_sum(float * in, float * out, int n)
+{
+    float * dev_in;
+    float * dev_out;
+    cudaMalloc((void**)&dev_in, n*(sizeof(float)));
+    // n+1 slots: the kernels write out[index+1], one element past index n-1
+    cudaMalloc((void**)&dev_out, (n+1)*(sizeof(float)));
+
+    cudaMemcpy(dev_in, in, n * sizeof(float), cudaMemcpyHostToDevice);
+    dim3 fullBlocksPerGrid((int)ceil(float(n)/float(BlockSize)));
+
+    cudaEvent_t start, stop;
+    float time;
+
+    cudaEventCreate(&start);
+    cudaEventCreate(&stop);
+    cudaEventRecord( start, 0 );
+
+    // launch configurations reconstructed (assumed <<<fullBlocksPerGrid, BlockSize>>>)
+    for(int i = 1; i < n ; i *= 2)
+    {
+        writeBuffer<<<fullBlocksPerGrid, BlockSize>>>(dev_in, dev_out, n);
+        naive_sum<<<fullBlocksPerGrid, BlockSize>>>(dev_in, dev_out, i, n);
+    }
+    writeOutput<<<fullBlocksPerGrid, BlockSize>>>(dev_in, dev_out, n);
+
+
+    cudaEventRecord( stop, 0 );
+    cudaEventSynchronize( stop );
+    cudaEventElapsedTime( &time, start, stop );
+    cudaEventDestroy( start );
+    cudaEventDestroy( stop );
+    cout << "#GPU naive prefix sum performance: "<< time << " ms" << endl;
+
+    cudaMemcpy(out, dev_out, n * sizeof(float), cudaMemcpyDeviceToHost);
+    out[0] = 0;
+    cudaFree(dev_in);
+    cudaFree(dev_out);
+}
+
+void GPU_single_shared_prefix_sum(float * in, float * out, int n)
+{
+    float * dev_in;
+    float * dev_out;
+    cudaMalloc((void**)&dev_in, n*(sizeof(float)));
+    cudaMalloc((void**)&dev_out, (n+1)*(sizeof(float)));
+
+    cudaMemcpy(dev_in, in, n * sizeof(float), cudaMemcpyHostToDevice);
+
+    single_shared_sum<<<1, n, n * sizeof(float)>>>(dev_in, dev_out, n);
+
+    cudaMemcpy(out, dev_out, n * sizeof(float), cudaMemcpyDeviceToHost);
+    out[0] = 0;
+    cudaFree(dev_in);
+    cudaFree(dev_out);
+}
+
+void scan_sum(float * in, float * out, int n)
+{
+    dim3 fullBlocksPerGrid((int)ceil(float(n)/float(BlockSize)));
+
+    // launch configurations reconstructed (assumed)
+    for(int i = 1; i < n ; i *= 2)
+    {
+        writeBuffer<<<fullBlocksPerGrid, BlockSize>>>(in, out, n);
+        naive_sum<<<fullBlocksPerGrid, BlockSize>>>(in, out, i, n);
+    }
+    writeOutput<<<fullBlocksPerGrid, BlockSize>>>(in, out, n);
+}
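+
+// The README attributes the shared-memory version's falloff on large arrays
+// to bank conflicts. The usual remedy, from the GPU Gems 3 scan chapter cited
+// in the README, is to pad shared-memory indices so that strided accesses in
+// shared_sum above land in different banks. A hedged sketch, not part of the
+// original submission (assumes 32-bank hardware; shared_sum's tmp[] accesses
+// and its dynamic shared-memory allocation would all need the offset applied):
+#define LOG_NUM_BANKS 5
+#define CONFLICT_FREE_OFFSET(i) ((i) >> LOG_NUM_BANKS)
+// e.g. tmp[j + CONFLICT_FREE_OFFSET(j)] += tmp[j-i + CONFLICT_FREE_OFFSET(j-i)];
+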
+void GPU_shared_prefix_sum(float * in, float * out, int n)
+{
+    float * dev_in;
+    float * dev_out;
+    float * dev_sum;
+    int grid_size = (int)ceil(float(n)/float(BlockSize));
+
+    cudaMalloc((void**)&dev_in, n*(sizeof(float)));
+    cudaMalloc((void**)&dev_out, (n+1)*(sizeof(float)));
+    cudaMalloc((void**)&dev_sum, (grid_size + 1) * (sizeof(float)));
+
+    cudaMemcpy(dev_in, in, n * sizeof(float), cudaMemcpyHostToDevice);
+    dim3 fullBlocksPerGrid(grid_size);
+
+    cudaEvent_t start, stop;
+    float time;
+
+    cudaEventCreate(&start);
+    cudaEventCreate(&stop);
+    cudaEventRecord( start, 0 );
+
+    // launch configurations reconstructed (assumed); shared_sum needs one
+    // dynamically allocated shared float per thread for its tmp[] array
+    shared_sum<<<fullBlocksPerGrid, BlockSize, BlockSize * sizeof(float)>>>(dev_in, dev_out, n, dev_sum);
+    scan_sum(dev_sum, dev_sum, grid_size);
+    add_sum<<<fullBlocksPerGrid, BlockSize>>>(dev_out,n,dev_sum);
+
+    cudaEventRecord( stop, 0 );
+    cudaEventSynchronize( stop );
+    cudaEventElapsedTime( &time, start, stop );
+    cudaEventDestroy( start );
+    cudaEventDestroy( stop );
+    cout << "#GPU shared prefix sum performance: "<< time << " ms" << endl;
+
+    cudaMemcpy(out, dev_out, n * sizeof(float), cudaMemcpyDeviceToHost);
+    out[0] = 0;
+
+    cudaFree(dev_in);
+    cudaFree(dev_out);
+    cudaFree(dev_sum);
+}
+
+void GPU_stream_compact(float * in, float * out, int n)
+{
+    float * dev_in;
+    float * dev_out;
+    float * dev_boolArray;
+    float * dev_sum;
+    int grid_size = (int)ceil(float(n)/float(BlockSize));
+
+    cudaMalloc((void**)&dev_in, n*(sizeof(float)));
+    cudaMalloc((void**)&dev_out, n*(sizeof(float)));
+    cudaMalloc((void**)&dev_boolArray, (n+1)*(sizeof(float)));
+    cudaMalloc((void**)&dev_sum, (grid_size + 1) * (sizeof(float)));
+    cudaMemcpy(dev_in, in, n * sizeof(float), cudaMemcpyHostToDevice);
+    dim3 fullBlocksPerGrid(grid_size);
+
+    cudaEvent_t start, stop;
+    float time;
+
+    cudaEventCreate(&start);
+    cudaEventCreate(&stop);
+    cudaEventRecord( start, 0 );
+
+    // launch configurations reconstructed (assumed)
+    generateBoolArray<<<fullBlocksPerGrid, BlockSize>>>(dev_in, dev_boolArray, n);
+
+    shared_sum<<<fullBlocksPerGrid, BlockSize, BlockSize * sizeof(float)>>>(dev_boolArray, dev_boolArray, n, dev_sum);
+    scan_sum(dev_sum, dev_sum, grid_size);
+    add_sum<<<fullBlocksPerGrid, BlockSize>>>(dev_boolArray,n,dev_sum);
+
+    scatter<<<fullBlocksPerGrid, BlockSize>>>(dev_in, dev_boolArray, dev_out, n);
+
+
+    cudaEventRecord( stop, 0 );
+    cudaEventSynchronize( stop );
+    cudaEventElapsedTime( &time, start, stop );
+    cudaEventDestroy( start );
+    cudaEventDestroy( stop );
+    cout << "#GPU stream compact performance: "<< time << " ms" << endl;
+
+    cudaMemcpy(out, dev_out, n * sizeof(float), cudaMemcpyDeviceToHost);
+
+    cudaFree(dev_in);
+    cudaFree(dev_out);
+    cudaFree(dev_sum);
+    cudaFree(dev_boolArray);
+}
+
diff --git a/StreamCompaction/StreamCompaction/StreamCompaction.h b/StreamCompaction/StreamCompaction/StreamCompaction.h
new file mode 100644
index 0000000..d276594
--- /dev/null
+++ b/StreamCompaction/StreamCompaction/StreamCompaction.h
@@ -0,0 +1,13 @@
+#ifndef STREAM_H
+#define STREAM_H
+
+// headers reconstructed (assumed)
+#include <cuda.h>
+#include <cuda_runtime.h>
+
+#define BlockSize 1024
+
+void GPU_naive_prefix_sum(float * in, float * out, int n);
+void GPU_single_shared_prefix_sum(float * in, float * out, int n);
+void GPU_shared_prefix_sum(float * in, float * out, int n);
+void GPU_stream_compact(float * in, float * out, int n);
+#endif
\ No newline at end of file
diff --git a/StreamCompaction/StreamCompaction/StreamCompaction.vcxproj b/StreamCompaction/StreamCompaction/StreamCompaction.vcxproj
new file mode 100644
index 0000000..9583624
--- /dev/null
+++ b/StreamCompaction/StreamCompaction/StreamCompaction.vcxproj
@@ -0,0 +1,80 @@
+
+
+
+
+    Debug
+    Win32
+
+
+    Release
+    Win32
+
+
+
+  {BDE1CCCD-C792-40EB-B80B-AA2721029CEA}
+  StreamCompaction
+
+
+
+  Application
+  true
+  MultiByte
+
+
+  Application
+  false
+  true
+  MultiByte
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+    Level3
+    Disabled
+
+
+    true
+    kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;cudart.lib;%(AdditionalDependencies)
+
+
+    $(CudaToolkitIncludeDir)
+
+
+
+
+    Level3
+    MaxSpeed
+    true
+    true
+
+
+    true
+    true
+    true
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/StreamCompaction/StreamCompaction/StreamCompaction.vcxproj.filters b/StreamCompaction/StreamCompaction/StreamCompaction.vcxproj.filters
new file mode 100644
index 0000000..3dc46ce
--- /dev/null
+++ b/StreamCompaction/StreamCompaction/StreamCompaction.vcxproj.filters
@@ -0,0 +1,32 @@
+
+
+
+
+    {4FC737F1-C7A5-4376-A066-2A32D752A2FF}
+    cpp;c;cc;cxx;def;odl;idl;hpj;bat;asm;asmx
+
+
+    {93995380-89BD-4b04-88EB-625FBE52EBFB}
+    h;hpp;hxx;hm;inl;inc;xsd
+
+
+    {67DA6AB6-F800-4c08-8B7A-83BB121AAD01}
+    rc;ico;cur;bmp;dlg;rc2;rct;bin;rgs;gif;jpg;jpeg;jpe;resx;tiff;tif;png;wav;mfcribbon-ms
+
+
+
+
+    Source Files
+
+
+
+
+    Source Files
+
+
+
+
+    Header Files
+
+
\ No newline at end of file
diff --git a/StreamCompaction/StreamCompaction/StreamCompaction.vcxproj.user b/StreamCompaction/StreamCompaction/StreamCompaction.vcxproj.user
new file mode 100644
index 0000000..ace9a86
--- /dev/null
+++ b/StreamCompaction/StreamCompaction/StreamCompaction.vcxproj.user
@@ -0,0 +1,3 @@
+
+
+
\ No newline at end of file
diff --git a/StreamCompaction/StreamCompaction/main.cpp b/StreamCompaction/StreamCompaction/main.cpp
new file mode 100644
index 0000000..462dfd6
--- /dev/null
+++ b/StreamCompaction/StreamCompaction/main.cpp
@@ -0,0 +1,154 @@
+// headers reconstructed from usage (assumed; the original names were lost)
+#include <cstdio>
+#include <ctime>
+#include "StreamCompaction.h"
+#include <iostream>
+#include <cuda.h>
+#include <cuda_runtime.h>
+#include <thrust/copy.h>
+
+using namespace std;
+
+const int NUM = 10000;
+const int SINGLE_SHARED_NUM = 500;
+
+void CPU_prefix_sum(float * in, float * out, int n)
+{
+    out[0] = 0;
+    for(int i = 1; i < n; ++i)
+    {
+        out[i] = out[i-1] + in[i-1];
+    }
+}
+
+void CPU_scatter(float * in, float * out, int n)
+{
+    float * tmp = new float[n];
+    for(int i = 0; i < n; ++i)
+    {
+        if(in[i] > 0)
+            tmp[i] = 1;
+        else
+            tmp[i] = 0;
+    }
+
+    CPU_prefix_sum(tmp,out,n);
+    delete[] tmp;
+}
+
+float * CPU_stream_compact(float * in, int n)
+{
+    float * tmp = new float[n];
+    int compact_size = 0;
+    CPU_scatter(in,tmp,n);
+
+    // exclusive scan: only add 1 when the last input element is kept
+    compact_size = (int)tmp[n-1] + (in[n-1] > 0 ? 1 : 0);
+    float * ret = new float[compact_size];
+
+
+    for(int i = 0; i < n; ++i)
+    {
+        if(in[i] > 0)
+        {
+            int index = tmp[i];
+            ret[index] = in[i];
+        }
+    }
+    delete[] tmp;
+    return ret;
+}
+
+// predicate for copy_if: keep strictly positive elements
+__device__ __host__ bool isZero(const float x)
+{
+    if(x > 0)
+        return true;
+    else
+        return false;
+}
+
+float * thrust_stream_compact(float * in, int n)
+{
+    float * tmp = new float[n];
+
+    thrust::copy_if(in,in+n,tmp,isZero);
+
+
+    return tmp;
+}
+
+int main()
+{
+    double cpu_start;
+    double cpu_stop;
+
+    float * out1 = new float[NUM];
+    float * out2 = new float[NUM];
+    float * out3 = new float[NUM];
+    float * out4 = new float[NUM];
+    float * in = new float[NUM];
+
+    for( int i = 0; i < NUM; i++)
+    {
+        in[i] = 2;
+    }
+
+    //1: serial prefix sum
+    cpu_start = clock();
+    CPU_prefix_sum(in, out1, NUM);
+    cpu_stop = clock();
+    printf ("#CPU prefix sum performance: %f ms.\n",1000.0* (double)((cpu_stop-cpu_start)/(double)CLOCKS_PER_SEC));
+
+    //for(int i = 0; i < NUM ; ++i)
+    //cout << out1[NUM-1] << endl;
+
+    //2: naive parallel prefix sum
+
+    GPU_naive_prefix_sum(in, out2, NUM);
+
+    //for(int i = 0; i < NUM ; ++i)
+    //cout << out2[NUM-1] << endl;
+
+    //3a: parallel prefix sum with shared memory for a single block
+    //GPU_single_shared_prefix_sum(in, out3, SINGLE_SHARED_NUM);
+    //for(int i = 0; i < SINGLE_SHARED_NUM ; ++i)
+    //cout << out3[SINGLE_SHARED_NUM-1] << endl;
+
+    //3b: parallel prefix sum with shared memory for arbitrary array length
+
+    GPU_shared_prefix_sum(in,out4,NUM);
+
+
+    //for(int i = 0; i < NUM ; ++i)
+    //cout << out4[NUM-1] << endl;
+
+    //test CPU stream compact
+    cpu_start = clock();
+    float * out5 = CPU_stream_compact( in, NUM);
+    cpu_stop = clock();
+    printf ("#CPU stream compact performance: %f ms.\n",1000.0*(double)((cpu_stop-cpu_start)/(double)CLOCKS_PER_SEC));
+    /*int i = 0;
+    while(out5[i]>0)
+    {
+        cout << out5[i] << endl;
+        i++;
+    }*/
+
+    // test GPU stream compact
+    float * out6 = new float[NUM];
+    GPU_stream_compact(in, out6,NUM);
+
+    // test thrust version; note this measures with the CPU clock, so it also
+    // counts host-side overhead (see README)
+    cpu_start = clock();
+    float * out7 = thrust_stream_compact(in, NUM);
+    cpu_stop = clock();
+    printf ("#Thrust stream compact performance: %f ms.\n",1000.0*(double)((cpu_stop-cpu_start)/(double)CLOCKS_PER_SEC));
+
+
+    // arrays allocated with new[] must be released with delete[], not free()
+    delete[] out1;
+    delete[] out2;
+    delete[] out3;
+    delete[] out4;
+    delete[] out5;
+    delete[] out6;
+    delete[] out7;
+    delete[] in;
+    int c;
+    cin >> c;
+}
\ No newline at end of file
diff --git a/StreamCompaction/StreamCompaction/vc100.pdb b/StreamCompaction/StreamCompaction/vc100.pdb
new file mode 100644
index 0000000..3304277
Binary files /dev/null and b/StreamCompaction/StreamCompaction/vc100.pdb differ
diff --git a/StreamCompaction/ipch/streamcompaction-d9f77619/streamcompaction-924485.ipch b/StreamCompaction/ipch/streamcompaction-d9f77619/streamcompaction-924485.ipch
new file mode 100644
index 0000000..3f5b73c
Binary files /dev/null and b/StreamCompaction/ipch/streamcompaction-d9f77619/streamcompaction-924485.ipch differ
diff --git a/prefix_sum.jpg b/prefix_sum.jpg
new file mode 100644
index 0000000..c7a8d1d
Binary files /dev/null and b/prefix_sum.jpg differ
diff --git a/stream_compact.jpg b/stream_compact.jpg
new file mode 100644
index 0000000..687b86e
Binary files /dev/null and b/stream_compact.jpg differ