CIS565-Fall-2014 · jianqiaol · Sep 29, 2014
diff --git a/README.md b/README.md
@@ -3,125 +3,17 @@ Project-2
 
 A Study in Parallel Algorithms : Stream Compaction
 
-# INTRODUCTION
-Many of the algorithms you have learned thus far in your career have typically
-been developed from a serial standpoint.  When it comes to GPUs, we are mainly
-looking at massively parallel work.  Thus, it is necessary to reorient our
-thinking.  In this project, we will be implementing a couple different versions
-of prefix sum.  We will start with a simple single thread serial CPU version,
-and then move to a naive GPU version.  Each part of this homework is meant to
-follow the logic of the previous parts, so please do not do this homework out of
-order.
 
-This project will serve as a stream compaction library that you may use (and
-will want to use) in your
-future projects.  For that reason, we suggest you create proper header and CUDA
-files so that you can reuse this code later.  You may want to create a separate
-cpp file that contains your main function so that you can test the code you
-write.
+Comparison of the CPU and GPU versions for Part 2 and Part 3:
+![](https://raw.githubusercontent.com/jianqiaol/Project2-StreamCompaction/master/project2.png)
 
-# OVERVIEW
-Stream compaction is broken down into two parts: (1) scan, and (2) scatter.
+Analysis: The CPU version of scan is actually very fast, and I think my GPU implementation is far from optimal. Still, the shared-memory version performs better than the naive implementation as the input size increases. There are also lots of bank conflicts. My functions behave strangely: with the same input, they return different outputs. About half the time the result is correct; the rest of the time it goes wrong from some index onward. I am still trying to figure out why. For the naive implementation this only happens when the input size is larger than the block size, which I think is because threads cannot synchronize across blocks. But the shared-memory version is only correct about half the time, even when the input size is smaller than the block size.
 
-## SCAN
-Scan or prefix sum is the summation of the elements in an array such that the
-resulting array is the summation of the terms before it.  Prefix sum can either
-be inclusive, meaning the current term is a summation of all the elements before
-it and itself, or exclusive, meaning the current term is a summation of all
-elements before it excluding itself. 
 
-Inclusive:
-
-In : [ 3 4 6 7 9 10 ]
-
-Out : [ 3 7 13 20 29 39 ]
-
-Exclusive
-
-In : [ 3 4 6 7 9 10 ]
-
-Out : [ 0 3 7 13 20 29 ]
-
-Note that the resulting prefix sum will always be n + 1 elements if the input
-array is of length n.  Similarly, the first element of the exclusive prefix sum
-will always be 0.  In the following sections, all references to prefix sum will
-be to the exclusive version of prefix sum.
-
-## SCATTER
-The scatter section of stream compaction takes the results of the previous scan
-in order to reorder the elements to form a compact array.
-
-For example, let's say we have the following array:
-[ 0 0 3 4 0 6 6 7 0 1 ]
-
-We would only like to consider the non-zero elements in this zero, so we would
-like to compact it into the following array:
-[ 3 4 6 6 7 1 ]
-
-We can perform a transform on input array to transform it into a boolean array:
-
-In :  [ 0 0 3 4 0 6 6 7 0 1 ]
-
-Out : [ 0 0 1 1 0 1 1 1 0 1 ]
-
-Performing a scan on the output, we get the following array :
-
-In :  [ 0 0 1 1 0 1 1 1 0 1 ]
-
-Out : [ 0 0 0 1 2 2 3 4 5 5 ]
-
-Notice that the output array produces a corresponding index array that we can
-use to create the resulting array for stream compaction. 
-
-# PART 1 : REVIEW OF PREFIX SUM
-Given the definition of exclusive prefix sum, please write a serial CPU version
-of prefix sum.  You may write this in the cpp file to separate this from the
-CUDA code you will be writing in your .cu file. 
-
-# PART 2 : NAIVE PREFIX SUM
-We will now parallelize this the previous section's code.  Recall from lecture
-that we can parallelize this using a series of kernel calls.  In this portion,
-you are NOT allowed to use shared memory.
-
-### Questions 
-* Compare this version to the serial version of exclusive prefix scan. Please
-  include a table of how the runtimes compare on different lengths of arrays.
-* Plot a graph of the comparison and write a short explanation of the phenomenon you
-  see here.
-
-# PART 3 : OPTIMIZING PREFIX SUM
-In the previous section we did not take into account shared memory.  In the
-previous section, we kept everything in global memory, which is much slower than
-shared memory.
-
-## PART 3a : Write prefix sum for a single block
-Shared memory is accessible to threads of a block. Please write a version of
-prefix sum that works on a single block.  
-
-## PART 3b : Generalizing to arrays of any length.
-Taking the previous portion, please write a version that generalizes prefix sum
-to arbitrary length arrays, this includes arrays that will not fit on one block.
-
-### Questions
-* Compare this version to the parallel prefix sum using global memory.
-* Plot a graph of the comparison and write a short explanation of the phenomenon
-  you see here.
 
 # PART 4 : ADDING SCATTER
-First create a serial version of scatter by expanding the serial version of
-prefix sum.  Then create a GPU version of scatter.  Combine the function call
-such that, given an array, you can call stream compact and it will compact the
-array for you.  Finally, write a version using thrust. 
-
-### Questions
-* Compare your version of stream compact to your version using thrust.  How do
-  they compare?  How might you optimize yours more, or how might thrust's stream
-  compact be optimized.
-
-# EXTRA CREDIT (+10)
-For extra credit, please optimize your prefix sum for work parallelism and to
-deal with bank conflicts.  Information on this can be found in the GPU Gems
-chapter listed in the references.  
+Since my scan does not always work, I cannot verify my scatter. But I think scatter itself is not hard once scan is implemented. I haven't been able to use thrust yet, but I expect thrust's stream compaction to be faster. To optimize my code, I would start by fixing the bank conflicts and implementing the balanced-tree algorithm.
+
 
 # SUBMISSION
 Please answer all the questions in each of the subsections above and write your

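Reviewer note on the nondeterministic results described in the analysis: one common cause of this symptom in a naive scan is a pass that reads and writes the same array, so a thread can pick up a neighbor's value that was already updated in the same pass. Whether that is what happens in kernel.cu is an assumption, since the kernel is not shown here. The sketch below is a sequential C++ simulation (not the author's CUDA code; `naive_scan_exclusive` is a hypothetical name) of the naive scan with two ping-pong buffers, where each outer iteration stands in for one kernel launch.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sequential simulation of the naive (Hillis-Steele) scan.
// The outer loop stands in for successive kernel launches and the inner
// loop for the threads of one launch. Every pass reads only from `src`
// and writes only to `dst`, so no "thread" can observe a value that was
// produced during the same pass.
std::vector<int> naive_scan_exclusive(const std::vector<int>& input)
{
    const std::size_t n = input.size();
    std::vector<int> src(n, 0), dst(n, 0);

    // Shift right by one so the inclusive passes below yield an exclusive scan.
    for (std::size_t i = 1; i < n; ++i)
        src[i] = input[i - 1];

    for (std::size_t offset = 1; offset < n; offset *= 2)
    {
        for (std::size_t i = 0; i < n; ++i)
            dst[i] = (i >= offset) ? src[i] + src[i - offset] : src[i];
        std::swap(src, dst);  // the next pass reads what this pass wrote
    }
    return src;
}
```

With the README's example input `[3 4 6 7 9 10]` this returns `[0 3 7 13 20 29]`. In CUDA the same idea would mean separate input and output buffers that are swapped between launches, or between `__syncthreads()`-delimited steps when the data lives in shared memory.
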
diff --git a/StreamCompaction/Debug/StreamCompaction.ilk b/StreamCompaction/Debug/StreamCompaction.ilk
diff --git a/StreamCompaction/Debug/StreamCompaction.pdb b/StreamCompaction/Debug/StreamCompaction.pdb
diff --git a/StreamCompaction/Release/StreamCompaction.pdb b/StreamCompaction/Release/StreamCompaction.pdb
diff --git a/StreamCompaction/StreamCompaction.sln b/StreamCompaction/StreamCompaction.sln
@@ -0,0 +1,26 @@
+
+Microsoft Visual Studio Solution File, Format Version 11.00
+# Visual Studio 2010
+Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "StreamCompaction", "StreamCompaction\StreamCompaction.vcxproj", "{9D07329B-B2F5-4BD4-B48C-1BA049B34AB0}"
+EndProject
+Global
+	GlobalSection(SolutionConfigurationPlatforms) = preSolution
+		Debug|Win32 = Debug|Win32
+		Debug|x64 = Debug|x64
+		Release|Win32 = Release|Win32
+		Release|x64 = Release|x64
+	EndGlobalSection
+	GlobalSection(ProjectConfigurationPlatforms) = postSolution
+		{9D07329B-B2F5-4BD4-B48C-1BA049B34AB0}.Debug|Win32.ActiveCfg = Debug|Win32
+		{9D07329B-B2F5-4BD4-B48C-1BA049B34AB0}.Debug|Win32.Build.0 = Debug|Win32
+		{9D07329B-B2F5-4BD4-B48C-1BA049B34AB0}.Debug|x64.ActiveCfg = Debug|x64
+		{9D07329B-B2F5-4BD4-B48C-1BA049B34AB0}.Debug|x64.Build.0 = Debug|x64
+		{9D07329B-B2F5-4BD4-B48C-1BA049B34AB0}.Release|Win32.ActiveCfg = Release|Win32
+		{9D07329B-B2F5-4BD4-B48C-1BA049B34AB0}.Release|Win32.Build.0 = Release|Win32
+		{9D07329B-B2F5-4BD4-B48C-1BA049B34AB0}.Release|x64.ActiveCfg = Release|x64
+		{9D07329B-B2F5-4BD4-B48C-1BA049B34AB0}.Release|x64.Build.0 = Release|x64
+	EndGlobalSection
+	GlobalSection(SolutionProperties) = preSolution
+		HideSolutionNode = FALSE
+	EndGlobalSection
+EndGlobal
diff --git a/StreamCompaction/StreamCompaction.suo b/StreamCompaction/StreamCompaction.suo
diff --git a/StreamCompaction/StreamCompaction/CPU_StreamCompaction.h b/StreamCompaction/StreamCompaction/CPU_StreamCompaction.h
@@ -0,0 +1,52 @@
+#ifndef CPU_STREAMCOMPACTION_H_
+#define CPU_STREAMCOMPACTION_H_
+#include <stdlib.h>
+#include <stdio.h>
+#include <ctime>
+#include <iostream>
+#include <math.h>
+#include <cstdlib>
+using namespace std;
+
+// Check the result against a reference. This function is from Shehzan's code in the profiling and debugging lab.
+inline void postprocess(const int *ref, const int *res, int n)
+{
+    bool passed = true;
+    for (int i = 0; i < n; i++)
+    {
+        if (res[i] != ref[i])
+        {
+            printf("ID:%d \t Res:%d \t Ref:%d\n", i, res[i], ref[i]);
+			for(int j=0;j<n;j++)
+				cout<<ref[j]<<" "<<res[j]<<endl;
+            printf("%25s\n", "*** FAILED ***");
+            passed = false;
+            break;
+        }
+    }
+    if(passed)
+        printf("Post process check passed!!\n");
+}
+
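+// Serial exclusive prefix sum: output[i] = input[0] + ... + input[i-1], with output[0] = 0.
+inline void scan_CPU(int *input,int *output,int n)
+{
+	if(n>0) output[0]=0;	// guard: nothing to write for an empty array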
+	for(int i=1;i<n;i++)
+		output[i]=output[i-1]+input[i-1];
+}
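+// Serial stream compaction: copies the non-zero elements of input to the front of output, in order.
+inline void scatter_CPU(int *input,int *output, int n)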
+{
+	int k=0;
+	for(int i=0;i<n;i++)
+	{
+		if(input[i]!=0)
+		{
+			output[k]=input[i];
+			k+=1;
+		}
+	}
+
+}
+#endif
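
Reviewer note: `scatter_CPU` above compacts with a running counter, which is inherently serial. As a minimal sketch of how scan feeds scatter (the flag, scan, scatter steps from the assignment text), the following C++ computes each kept element's destination from an exclusive scan of a 0/1 flag array. `exclusive_scan` and `compact_nonzero` are hypothetical names for illustration and are not part of the original header.

```cpp
#include <cstddef>
#include <vector>

// Exclusive prefix sum, using the same recurrence as scan_CPU in the header.
std::vector<int> exclusive_scan(const std::vector<int>& in)
{
    std::vector<int> out(in.size(), 0);
    for (std::size_t i = 1; i < in.size(); ++i)
        out[i] = out[i - 1] + in[i - 1];
    return out;
}

// Hypothetical helper: stream compaction expressed as flag -> scan -> scatter,
// so each element's destination is known independently of the others.
std::vector<int> compact_nonzero(const std::vector<int>& in)
{
    const std::size_t n = in.size();
    if (n == 0)
        return {};

    // Step 1: mark the elements to keep.
    std::vector<int> flags(n);
    for (std::size_t i = 0; i < n; ++i)
        flags[i] = (in[i] != 0) ? 1 : 0;

    // Step 2: the exclusive scan of the flags gives each kept element its output index.
    const std::vector<int> index = exclusive_scan(flags);

    // The number of kept elements is the last scan value plus the last flag.
    const std::size_t kept = static_cast<std::size_t>(index[n - 1] + flags[n - 1]);
    std::vector<int> out(kept);

    // Step 3: scatter. The iterations do not depend on each other, so on a GPU
    // each one could be handled by its own thread.
    for (std::size_t i = 0; i < n; ++i)
        if (flags[i])
            out[static_cast<std::size_t>(index[i])] = in[i];
    return out;
}
```

For the assignment's example `[ 0 0 3 4 0 6 6 7 0 1 ]`, the flag scan is `[ 0 0 0 1 2 2 3 4 5 5 ]` and the compacted result is `[ 3 4 6 6 7 1 ]`. Unlike `scatter_CPU`, this version also reports how many elements survived, through the size of the returned vector.
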
diff --git a/StreamCompaction/StreamCompaction/Debug/CL.read.1.tlog b/StreamCompaction/StreamCompaction/Debug/CL.read.1.tlog
diff --git a/StreamCompaction/StreamCompaction/Debug/CL.write.1.tlog b/StreamCompaction/StreamCompaction/Debug/CL.write.1.tlog
diff --git a/StreamCompaction/StreamCompaction/Debug/StreamCompaction.exe.embed.manifest b/StreamCompaction/StreamCompaction/Debug/StreamCompaction.exe.embed.manifest
@@ -0,0 +1,10 @@
+<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
+<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
+  <trustInfo xmlns="urn:schemas-microsoft-com:asm.v3">
+    <security>
+      <requestedPrivileges>
+        <requestedExecutionLevel level="asInvoker" uiAccess="false"></requestedExecutionLevel>
+      </requestedPrivileges>
+    </security>
+  </trustInfo>
+</assembly>
diff --git a/StreamCompaction/StreamCompaction/Debug/StreamCompaction.exe.embed.manifest.res b/StreamCompaction/StreamCompaction/Debug/StreamCompaction.exe.embed.manifest.res
diff --git a/StreamCompaction/StreamCompaction/Debug/StreamCompaction.exe.intermediate.manifest b/StreamCompaction/StreamCompaction/Debug/StreamCompaction.exe.intermediate.manifest
@@ -0,0 +1,10 @@
+<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
+<assembly xmlns='urn:schemas-microsoft-com:asm.v1' manifestVersion='1.0'>
+  <trustInfo xmlns="urn:schemas-microsoft-com:asm.v3">
+    <security>
+      <requestedPrivileges>
+        <requestedExecutionLevel level='asInvoker' uiAccess='false' />
+      </requestedPrivileges>
+    </security>
+  </trustInfo>
+</assembly>
diff --git a/StreamCompaction/StreamCompaction/Debug/StreamCompaction.lastbuildstate b/StreamCompaction/StreamCompaction/Debug/StreamCompaction.lastbuildstate
@@ -0,0 +1,2 @@
+#v4.0:v100:false
+Debug|Win32|S:\CIS565\Project2-StreamCompaction\StreamCompaction\|
diff --git a/StreamCompaction/StreamCompaction/Debug/StreamCompaction.log b/StreamCompaction/StreamCompaction/Debug/StreamCompaction.log
@@ -0,0 +1,45 @@
+Build started 9/28/2014 5:22:02 PM.
+     1>Project "S:\CIS565\Project2-StreamCompaction\StreamCompaction\StreamCompaction\StreamCompaction.vcxproj" on node 2 (build target(s)).
+     1>InitializeBuildStatus:
+         Creating "Debug\StreamCompaction.unsuccessfulbuild" because "AlwaysCreate" was specified.
+       AddCudaCompileDeps:
+         c:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin\cl.exe /E /nologo /showIncludes /TP /D__CUDACC__ /DWIN32 /D_DEBUG /D_CONSOLE /D_MBCS /I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\include" /I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\bin" /I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\include" /I. /FIcuda_runtime.h /c S:\CIS565\Project2-StreamCompaction\StreamCompaction\StreamCompaction\kernel.cu 
+       AddCudaCompilePropsDeps:
+       Skipping target "AddCudaCompilePropsDeps" because all output files are up-to-date with respect to the input files.
+       CudaBuild:
+         Compiling CUDA source file kernel.cu...
+         cmd.exe /C "C:\Users\jianqiao\AppData\Local\Temp\tmp940f4c716ea0415687005cb1195dc6fd.cmd"
+         "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\bin\nvcc.exe" -gencode=arch=compute_10,code=\"sm_10,compute_10\" --use-local-env --cl-version 2010 -ccbin "c:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin"  -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\include"  -G   --keep-dir Debug -maxrregcount=0  --machine 32 --compile -cudart static  -g   -DWIN32 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd  " -o S:\CIS565\Project2-StreamCompaction\StreamCompaction\StreamCompaction\kernel.cu.obj "S:\CIS565\Project2-StreamCompaction\StreamCompaction\StreamCompaction\kernel.cu"
+
+         C:\user>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\bin\nvcc.exe" -gencode=arch=compute_10,code=\"sm_10,compute_10\" --use-local-env --cl-version 2010 -ccbin "c:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin"  -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\include"  -G   --keep-dir Debug -maxrregcount=0  --machine 32 --compile -cudart static  -g   -DWIN32 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd  " -o S:\CIS565\Project2-StreamCompaction\StreamCompaction\StreamCompaction\kernel.cu.obj "S:\CIS565\Project2-StreamCompaction\StreamCompaction\StreamCompaction\kernel.cu" 
+     1>S:/CIS565/Project2-StreamCompaction/StreamCompaction/StreamCompaction/kernel.cu(77): warning C4244: 'initializing' : conversion from 'clock_t' to 'float', possible loss of data
+       ClCompile:
+         All outputs are up-to-date.
+       ManifestResourceCompile:
+         All outputs are up-to-date.
+       Link:
+         c:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin\link.exe /ERRORREPORT:PROMPT /OUT:"S:\CIS565\Project2-StreamCompaction\StreamCompaction\Debug\StreamCompaction.exe" /INCREMENTAL /NOLOGO /LIBPATH:"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\lib\Win32" cudart.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib /MANIFEST /ManifestFile:"Debug\StreamCompaction.exe.intermediate.manifest" /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /DEBUG /PDB:"S:\CIS565\Project2-StreamCompaction\StreamCompaction\Debug\StreamCompaction.pdb" /SUBSYSTEM:CONSOLE /TLBID:1 /DYNAMICBASE /NXCOMPAT /IMPLIB:"S:\CIS565\Project2-StreamCompaction\StreamCompaction\Debug\StreamCompaction.lib" /MACHINE:X86 "S:\CIS565\Project2-StreamCompaction\StreamCompaction\StreamCompaction\kernel.cu.obj"
+         Debug\StreamCompaction.exe.embed.manifest.res
+         Debug\main.obj
+       Manifest:
+         C:\Program Files (x86)\Microsoft SDKs\Windows\v7.0A\bin\mt.exe /nologo /verbose /out:"Debug\StreamCompaction.exe.embed.manifest" /manifest Debug\StreamCompaction.exe.intermediate.manifest
+         All outputs are up-to-date.
+       LinkEmbedManifest:
+         All outputs are up-to-date.
+         StreamCompaction.vcxproj -> S:\CIS565\Project2-StreamCompaction\StreamCompaction\Debug\StreamCompaction.exe
+       PostBuildEvent:
+         echo copy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\bin\cudart*.dll" "S:\CIS565\Project2-StreamCompaction\StreamCompaction\Debug\"
+         copy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\bin\cudart*.dll" "S:\CIS565\Project2-StreamCompaction\StreamCompaction\Debug\"
+         :VCEnd
+         copy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\bin\cudart*.dll" "S:\CIS565\Project2-StreamCompaction\StreamCompaction\Debug\"
+         C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\bin\cudart32_55.dll
+         C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\bin\cudart64_55.dll
+                 2 file(s) copied.
+       FinalizeBuildStatus:
+         Deleting file "Debug\StreamCompaction.unsuccessfulbuild".
+         Touching "Debug\StreamCompaction.lastbuildstate".
+     1>Done Building Project "S:\CIS565\Project2-StreamCompaction\StreamCompaction\StreamCompaction\StreamCompaction.vcxproj" (build target(s)).
+
+Build succeeded.
+
+Time Elapsed 00:00:09.17
diff --git a/...mCompaction/StreamCompaction/Debug/StreamCompaction.vcxprojResolveAssemblyReference.cache b/...mCompaction/StreamCompaction/Debug/StreamCompaction.vcxprojResolveAssemblyReference.cache
diff --git a/StreamCompaction/StreamCompaction/Debug/StreamCompaction.write.1.tlog b/StreamCompaction/StreamCompaction/Debug/StreamCompaction.write.1.tlog
diff --git a/StreamCompaction/StreamCompaction/Debug/StreamCompaction_manifest.rc b/StreamCompaction/StreamCompaction/Debug/StreamCompaction_manifest.rc
diff --git a/StreamCompaction/StreamCompaction/Debug/cl.command.1.tlog b/StreamCompaction/StreamCompaction/Debug/cl.command.1.tlog
diff --git a/StreamCompaction/StreamCompaction/Debug/kernel.cu.cache b/StreamCompaction/StreamCompaction/Debug/kernel.cu.cache
@@ -0,0 +1,49 @@
+Identity=kernel.cu
+AdditionalCompilerOptions=
+AdditionalCompilerOptions=
+AdditionalDependencies=
+AdditionalDeps=
+AdditionalLibraryDirectories=
+AdditionalOptions=
+AdditionalOptions=
+CInterleavedPTX=false
+CodeGeneration=compute_10,sm_10
+CodeGeneration=compute_10,sm_10
+CompileOut=S:\CIS565\Project2-StreamCompaction\StreamCompaction\StreamCompaction\kernel.cu.obj
+CudaRuntime=Static
+CudaToolkitCustomDir=
+Defines=;WIN32;_DEBUG;_CONSOLE;_MBCS;
+Emulation=false
+FastMath=false
+GenerateLineInfo=false
+GenerateRelocatableDeviceCode=false
+GPUDebugInfo=true
+GPUDebugInfo=true
+HostDebugInfo=true
+Include=;;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\include
+Inputs=
+Keep=false
+KeepDir=Debug
+LinkOut=
+MaxRegCount=0
+NvccCompilation=compile
+NvccPath=
+Optimization=Od
+Optimization=Od
+PerformDeviceLink=
+PtxAsOptionV=false
+RequiredIncludes=
+Runtime=MDd
+Runtime=MDd
+RuntimeChecks=RTC1
+RuntimeChecks=RTC1
+TargetMachinePlatform=32
+TargetMachinePlatform=32
+TypeInfo=
+TypeInfo=
+UseHostDefines=true
+UseHostInclude=true
+UseHostLibraryDependencies=
+UseHostLibraryDirectories=
+Warning=W3
+Warning=W3