133 changes: 127 additions & 6 deletions README.md
@@ -3,12 +3,133 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Jilin Liu
* [LinkedIn](https://www.linkedin.com/in/jilin-liu97/), [twitter](https://twitter.com/Jilin18043110).
* Tested on: Windows 10, i7-8750H @ 2.20GHz, 16GB, GTX 1050Ti 4096MB (personal)

### (TODO: Your README)
## Features and Results

This project implements several parallel algorithms, including exclusive scan and stream compaction. These algorithms are useful as independent building blocks for later projects such as a GPU-based path tracer. A serial CPU version of each algorithm is also implemented and used as a baseline for performance comparison.

Features:
1. CPU Scan
2. CPU Stream Compact
3. Naive GPU Scan
4. Work-Efficient Scan
5. Work-Efficient Stream Compact

Extra Credit:
1. Faster GPU Implementation
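
The core of the naive GPU scan is a single kernel launched log2(n) times, doubling the add offset each pass. The sketch below is illustrative only; the kernel and buffer names (`kernNaiveScanStep`, `odata`, `idata`) are stand-ins rather than the exact identifiers used in this repository.

```
// One pass of the naive (Hillis-Steele) scan, launched once per offset
// d = 1, 2, 4, ..., ping-ponging between two device buffers on the host side.
// (Illustrative sketch; names are not taken from this repository.)
__global__ void kernNaiveScanStep(int n, int offset, int *odata, const int *idata) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index >= n) {
        return;
    }
    if (index >= offset) {
        odata[index] = idata[index - offset] + idata[index];
    } else {
        odata[index] = idata[index];
    }
}
```

This produces an inclusive scan; shifting the result right by one and inserting the identity (0) yields the exclusive scan that the tests below expect.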

```
****************
** SCAN TESTS **
****************
[ 30 44 27 31 9 34 15 27 24 39 44 33 38 ... 18 0 ]
==== cpu scan, power-of-two ====
elapsed time: 17.3105ms (std::chrono Measured)
[ 0 30 74 101 132 141 175 190 217 241 280 324 357 ... 102764441 102764459 ]
==== cpu scan, non-power-of-two ====
elapsed time: 2.102ms (std::chrono Measured)
[ 0 30 74 101 132 141 175 190 217 241 280 324 357 ... 102764362 102764372 ]
passed
==== naive scan, power-of-two ====
elapsed time: 8.62189ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 8.62202ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 3.79942ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 3.77037ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.560736ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.639808ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 0 0 1 3 3 0 1 1 2 3 0 3 2 ... 0 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 9.2661ms (std::chrono Measured)
[ 1 3 3 1 1 2 3 3 2 3 2 3 3 ... 2 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 9.253ms (std::chrono Measured)
[ 1 3 3 1 1 2 3 3 2 3 2 3 3 ... 2 2 ]
passed
==== cpu compact with scan ====
elapsed time: 24.8881ms (std::chrono Measured)
[ 1 3 3 1 1 2 3 3 2 3 2 3 3 ... 2 3 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 5.13638ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 5.15882ms (CUDA Measured)
passed
```

## Performance Analysis

A more detailed performance comparison of exclusive scan with respect to the array size is shown below.

![](./images/comp1.JPG)

![](./images/comp2.JPG)

The Thrust implementation scales best and runs about 10 times faster than my work-efficient scan, although its execution time still grows linearly with the array size. My hypothesis for its efficiency is that it uses shared memory.
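
For context, the Thrust baseline can be produced with `thrust::exclusive_scan`; the wrapper below is a minimal sketch, and the buffer and function names are mine rather than the repository's.

```
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Minimal Thrust baseline: copy to device, exclusive-scan, copy back.
// (Sketch only; thrustScan and the buffer names are illustrative.)
void thrustScan(int n, int *odata, const int *idata) {
    thrust::device_vector<int> dev_in(idata, idata + n);      // host -> device
    thrust::device_vector<int> dev_out(n);
    thrust::exclusive_scan(dev_in.begin(), dev_in.end(), dev_out.begin());
    thrust::copy(dev_out.begin(), dev_out.end(), odata);      // device -> host
}
```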

In the timeline below, we can easily spot the bottleneck of my implementation. Since the time measurement excludes device-to-host and host-to-device memory copies, the majority of the execution time comes from the up-sweep and down-sweep kernels, which together take about 5 times longer than scatter and 10 times longer than map-to-boolean.
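
The kernel-only timing follows the standard CUDA-event pattern: record events around the kernel launches and leave the `cudaMemcpy` calls outside the bracket. A sketch, assuming placeholder device buffer names:

```
// dev_data is a placeholder device buffer name, not the repository's.
cudaMemcpy(dev_data, idata, n * sizeof(int), cudaMemcpyHostToDevice);  // not timed

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
// ... launch the up-sweep and down-sweep kernels here ...
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // kernel time only, in milliseconds

cudaMemcpy(odata, dev_data, n * sizeof(int), cudaMemcpyDeviceToHost);  // not timed
```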

![](./images/t.JPG)

![](./images/t2.JPG)

The Thrust scan appears to have two stages as well, as you can see below, but there is a noticeable gap between the two kernel launches. My guess is that it stages data in shared memory so that subsequent accesses are much faster.
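
If that guess is right, each block would first scan its chunk entirely in shared memory, and a second kernel would then combine the per-block sums, which would explain both the two stages and the gap between them. The kernel below is only a sketch of that idea, not Thrust's actual implementation; `BLOCK_SIZE` and the kernel name are assumptions.

```
#define BLOCK_SIZE 256

// Per-block Hillis-Steele scan staged in shared memory (illustrative only;
// this is a guess at the kind of optimization Thrust applies).
__global__ void kernSharedScanBlock(int n, int *odata, const int *idata) {
    __shared__ int buf[2][BLOCK_SIZE];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // Load one element per thread; out-of-range threads load 0 so that
    // every thread in the block still reaches __syncthreads().
    buf[0][tid] = (gid < n) ? idata[gid] : 0;
    __syncthreads();

    int in = 0, out = 1;
    for (int offset = 1; offset < blockDim.x; offset <<= 1) {
        if (tid >= offset) {
            buf[out][tid] = buf[in][tid - offset] + buf[in][tid];
        } else {
            buf[out][tid] = buf[in][tid];
        }
        __syncthreads();
        in = out;
        out = 1 - in;
    }

    // Per-block inclusive result; a second kernel still has to add the
    // preceding blocks' sums, hence the two stages seen in the timeline.
    if (gid < n) {
        odata[gid] = buf[in][tid];
    }
}
```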

![](./images/thrust.JPG)

In each step of the up-sweep and down-sweep, I only launch the threads that the algorithm actually needs. Since each step accesses and modifies only a subset of the array elements, the number of threads required is proportional to the size of that subset. This trick eliminates a large number of idle threads and makes the program run noticeably faster.
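
Concretely, at the up-sweep level with stride `2^(d+1)` only `n / 2^(d+1)` threads do any work, so the kernel can be launched with exactly that many threads and each thread computes the tree node it owns. The kernel name and launch math below are illustrative, not copied from this repository.

```
// One up-sweep level: numActive = n / stride threads, each owning one node.
// (Illustrative kernel, not necessarily the repository's exact code.)
__global__ void kernUpSweepStep(int numActive, int stride, int *data) {
    int k = threadIdx.x + blockIdx.x * blockDim.x;
    if (k >= numActive) {
        return;
    }
    int index = (k + 1) * stride - 1;            // right child of this node
    data[index] += data[index - (stride >> 1)];  // add the left child's partial sum
}

// Host side: for stride = 2, 4, ..., n, launch ceil(numActive / blockSize)
// blocks instead of covering all n elements at every level.
```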

When the array is small, e.g. 256 elements, the CPU version runs faster (even faster than the Thrust implementation). The reason is that the GPU pays an overhead for launching and scheduling threads, which outweighs the benefit of parallelism when the array size is small, as you can see below.

```
****************
** SCAN TESTS **
****************
[ 38 13 31 10 45 3 41 44 26 12 44 28 12 ... 10 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.0005ms (std::chrono Measured)
[ 0 38 51 82 92 137 140 181 225 251 263 307 335 ... 6247 6257 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0.0004ms (std::chrono Measured)
[ 0 38 51 82 92 137 140 181 225 251 263 307 335 ... 6169 6198 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.017408ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.017408ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0.050176ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.044032ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.04448ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.041984ms (CUDA Measured)
passed
```

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)

Binary file added images/comp1.JPG
Binary file added images/comp2.JPG
Binary file added images/t.JPG
Binary file added images/t2.JPG
Binary file added images/thrust.JPG
3 changes: 2 additions & 1 deletion src/main.cpp
@@ -13,7 +13,7 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 22; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
@@ -34,6 +34,7 @@ int main(int argc, char* argv[]) {
    // initialize b using StreamCompaction::CPU::scan you implement
    // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
    // At first all cases passed because b && c are all zeroes.

    zeroArray(SIZE, b);
    printDesc("cpu scan, power-of-two");
    StreamCompaction::CPU::scan(SIZE, b, a);
12 changes: 12 additions & 0 deletions stream_compaction/common.cu
@@ -24,6 +24,11 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
    // Map each element to a boolean: 1 if it should be kept (non-zero), 0 otherwise.
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index >= n) {
        return;
    }
    bools[index] = idata[index] == 0 ? 0 : 1;
}

/**
@@ -33,6 +38,13 @@
__global__ void kernScatter(int n, int *odata,
        const int *idata, const int *bools, const int *indices) {
    // Scatter: each kept element (bools[index] == 1) is written to the
    // compacted position given by the exclusive scan of the boolean array.
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index >= n) {
        return;
    }
    if (bools[index] == 1) {
        odata[indices[index]] = idata[index];
    }
}

}
44 changes: 42 additions & 2 deletions stream_compaction/cpu.cu
@@ -20,6 +20,11 @@ namespace StreamCompaction {
void scan(int n, int *odata, const int *idata) {
    timer().startCpuTimer();
    // Exclusive prefix sum: odata[i] holds the sum of idata[0..i-1].
    int sum = 0; // identity
    for (int i = 0; i < n; i++) {
        odata[i] = sum;
        sum += idata[i];
    }
    timer().endCpuTimer();
}

@@ -31,8 +36,14 @@
int compactWithoutScan(int n, int *odata, const int *idata) {
    timer().startCpuTimer();
    // Keep only non-zero elements, writing them compactly to odata.
    int p = 0;
    for (int i = 0; i < n; i++) {
        if (idata[i] != 0) { // remove 0
            odata[p++] = idata[i];
        }
    }
    timer().endCpuTimer();
    return -1;
    return p;
}

/**
@@ -43,8 +54,37 @@
int compactWithScan(int n, int *odata, const int *idata) {
    timer().startCpuTimer();
    if (n < 1) {
        timer().endCpuTimer(); // stop the timer before the early exit
        return 0;
    }

    // Map to booleans: 1 for elements to keep, 0 for elements to discard.
    int* e = new int[n];
    for (int i = 0; i < n; i++) {
        e[i] = idata[i] != 0;
    }

    // Exclusive scan of the booleans gives each kept element's output index.
    int* prefix = new int[n];
    int sum = 0;
    for (int i = 0; i < n; i++) {
        prefix[i] = sum;
        sum += e[i];
    }

    // Scatter: a kept element is detected where the prefix sum increases.
    for (int i = 0; i < n - 1; i++) {
        if (prefix[i] != prefix[i + 1]) {
            odata[prefix[i]] = idata[i];
        }
    }
    int len = prefix[n - 1];
    if (e[n - 1] == 1) {
        len++;
        odata[prefix[n - 1]] = idata[n - 1];
    }

    delete[] e;
    delete[] prefix;
    timer().endCpuTimer();
    return -1;
    return len;
}
}
}