Skip to content
Open
171 changes: 165 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,12 +3,171 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* SPENCER WEBSTER-BASS
* [LinkedIn](https://www.linkedin.com/in/spencer-webster-bass/)
* Tested on: Windows 10, i7-6700 @ 3.40GHz 16GB, Quadro P1000 222MB (MOR100B-19)

### (TODO: Your README)
### DESCRIPTION

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
This project is an implementation of the stream compaction parallel algorithm on the GPU using CUDA and C++.

Features:
* Serial implementation of scan and stream compaction algorithms on the CPU
* Naive, parallel implementation of scan and stream compaction algorithms on the GPU
* Atepted work-efficient, parallel implementation of scan and stream compaction algorithms on the GPU
* Comparison between my implementations' efficiency and thrust's implementation of exclusive scan algorithm

TODOs:
Include Analysis

****************
** SCAN TESTS **
****************
[ 49 8 2 27 20 44 21 27 49 3 20 3 16 3 31 0 ]

==== cpu scan, power-of-two ====

elapsed time: 0ms (std::chrono Measured)

[ 0 49 57 59 86 106 150 171 198 247 250 270 273 289 292 323 ]

==== cpu scan, non-power-of-two ====

elapsed time: 0ms (std::chrono Measured)

[ 0 49 57 59 86 106 150 171 198 247 250 270 273 ]

passed

==== naive scan, power-of-two ====

[ 49 8 2 27 20 44 21 27 49 3 20 3 16 3 31 0 ]

[ 49 8 2 27 20 44 21 27 49 3 20 3 16 3 31 0 ]

[ 0 49 57 10 29 47 64 65 48 76 52 23 23 19 19 34 ]

[ 0 49 57 59 86 57 93 112 112 141 100 99 75 42 42 53 ]

[ 0 49 57 59 86 106 150 171 198 198 193 211 187 183 142 152 ]

[ 0 49 57 59 86 106 150 171 198 247 250 270 273 289 292 323 ]

elapsed time: 7.58922ms (CUDA Measured)

[ 0 49 57 59 86 106 150 171 198 247 250 270 273 289 292 323 ]

passed

==== 1s array for finding bugs ====

[ 49 8 2 27 20 44 21 27 49 3 20 3 16 3 31 0 ]

[ 49 8 2 27 20 44 21 27 49 3 20 3 16 3 31 0 ]

[ 0 49 57 10 29 47 64 65 48 76 52 23 23 19 19 34 ]

[ 0 49 57 59 86 57 93 112 112 141 100 99 75 42 42 53 ]

[ 0 49 57 59 86 106 150 171 198 198 193 211 187 183 142 152 ]

[ 0 49 57 59 86 106 150 171 198 247 250 270 273 289 292 323 ]

[ 0 49 57 59 86 106 150 171 198 247 250 270 273 289 292 323 ]

==== naive scan, non-power-of-two ====

[ 49 8 2 27 20 44 21 27 49 3 20 3 16 3 31 0 ]

[ 49 8 2 27 20 44 21 27 49 3 20 3 16 ]

[ 0 49 57 10 29 47 64 65 48 76 52 23 23 3 0 0 ]

[ 0 49 57 59 86 57 93 112 112 141 100 99 75 26 23 3 ]

[ 0 49 57 59 86 106 150 171 198 198 193 211 187 167 123 102 ]

[ 0 49 57 59 86 106 150 171 198 247 250 270 273 273 273 273 ]

elapsed time: 15.0825ms (CUDA Measured)

[ 0 49 57 59 86 106 150 171 198 247 250 270 273 0 0 0 ]

passed

==== work-efficient scan, power-of-two ====

elapsed time: 0ms (CUDA Measured)

a[1] = 49, b[1] = 0

FAIL VALUE

==== work-efficient scan, non-power-of-two ====

elapsed time: 0ms (CUDA Measured)

a[1] = 49, b[1] = 0

FAIL VALUE

==== thrust scan, power-of-two ====

elapsed time: 0.083008ms (CUDA Measured)

passed

==== thrust scan, non-power-of-two ====

elapsed time: 0.069632ms (CUDA Measured)

passed

*****************************
** STREAM COMPACTION TESTS **
*****************************

[ 1 2 2 1 2 0 1 3 1 3 0 3 0 3 3 0 ]

==== cpu compact without scan, power-of-two ====

elapsed time: 0.0034ms (std::chrono Measured)

[ 0 0 0 0 0 0 0 0 0 0 0 0 ]

passed

==== cpu compact without scan, non-power-of-two ====

elapsed time: 0.004ms (std::chrono Measured)

[ 0 0 0 0 0 0 0 0 0 0 ]

passed

==== cpu compact with scan ====

elapsed time: 0.0023ms (std::chrono Measured)

[ ]

expected 12 elements, got -1

FAIL COUNT

==== work-efficient compact, power-of-two ====

elapsed time: 0ms (CUDA Measured)

expected 12 elements, got -1

FAIL COUNT

==== work-efficient compact, non-power-of-two ====

elapsed time: 0ms (CUDA Measured)

expected 10 elements, got -1

FAIL COUNT

34 changes: 34 additions & 0 deletions notes.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
How does dim3 work? and dimensions in CUDA. Does dim3 set unspecified arguments to 1 or 0?
Does CUDA expect unused dimensions to be 1 or 0?



How do these checkCUDAError work? They sometimes say that an entire is occurring at an incorrect location if
I dont have one at every CUDA function call.



Do we need to include new functions in header files?



Inside of CUDA files are we using c or C++

Dont have classes. The coding style might be closer to C instead of C++. You can pass structs to CUDA kernels.

When using the memory window is it showing you gpu or cpu memory when you copy and paste an address
from the locals or autos window?

Can memcpy from the device back to the host.
And start with a smaller sized buffer so that you can check the values in the buffers by hand.


dev_data1, dev_data2;
// cudamalloc, memcpy, etc
// for eah iteration, launch kernels on dev_data1 and dev_data2
int* temp = dev_data1
dev_dta1 = dev_data2
dev_data2 = temp


The weird alternating thing where every other value was zero was due to the book's funky way of ping ponging.
13 changes: 8 additions & 5 deletions src/main.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 4; //8; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
Expand Down Expand Up @@ -49,22 +49,25 @@ int main(int argc, char* argv[]) {

zeroArray(SIZE, c);
printDesc("naive scan, power-of-two");
printArray(SIZE, a, true);
StreamCompaction::Naive::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

/* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
/* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan */
onesArray(SIZE, c);
printDesc("1s array for finding bugs");
printArray(SIZE, a, true);
StreamCompaction::Naive::scan(SIZE, c, a);
printArray(SIZE, c, true); */
printArray(SIZE, c, true);

zeroArray(SIZE, c);
printDesc("naive scan, non-power-of-two");
printArray(SIZE, a, true);
StreamCompaction::Naive::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
Expand Down
12 changes: 12 additions & 0 deletions stream_compaction/common.cu
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,13 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n)
return;
if (idata[index])
bools[index] = 1;
else
bools[index] = 0;
}

/**
Expand All @@ -33,6 +40,11 @@ namespace StreamCompaction {
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n)
return;
if (bools[index])
odata[indices[index]] = idata[index];
}

}
Expand Down
45 changes: 42 additions & 3 deletions stream_compaction/cpu.cu
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
#include <cstdio>
#include <vector>
#include "cpu.h"

#include "common.h"
Expand All @@ -18,9 +19,12 @@ namespace StreamCompaction {
* (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// timer().startCpuTimer();
// TODO
timer().endCpuTimer();
odata[0] = 0;
for (int i = 1; i < n; i++)
odata[i] = odata[i - 1] + idata[i - 1];
// timer().endCpuTimer();
}

/**
Expand All @@ -31,8 +35,13 @@ namespace StreamCompaction {
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
std::vector<int> o = std::vector<int>();
for (int i = 0; i < n; i++)
if (idata[i])
o.push_back(idata[i]);
odata = o.data();
timer().endCpuTimer();
return -1;
return o.size();
}

/**
Expand All @@ -43,8 +52,38 @@ namespace StreamCompaction {
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
// step 1: compute bit mask
std::vector<int> mask(n);
for (int i = 0; i < n; i++) {
if (idata[i] == 0) {
mask.at(i) = 0;
}
else {
mask.at(i) = 1;
}
}

// step 2: exclusive scan
scan(n, odata, mask.data());

timer().endCpuTimer();
return -1;

// step 3: scatter
int m = odata[n - 1];
std::vector<int> ovec(m);
m = 0;
for (int i = 0; i < n; i++) {
if (mask[i]) {
ovec[odata[i]] = idata[i];
m++;
}
}

odata = ovec.data();

timer().endCpuTimer();
return m;
}
}
}
Loading