Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
145 changes: 139 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,12 +3,145 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Qiaosen Chen
* [LinkedIn](https://www.linkedin.com/in/qiaosen-chen-725699141/), etc.
* Tested on: Windows 10, i5-9400 @ 2.90GHz 16GB, GeForce RTX 2060 6GB (personal computer).

### (TODO: Your README)
## Implemented Features

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
This project includes several scan and stream compaction algorithms, and most of them are implemented with parallelism in CUDA. With the serial version of algorithms run on CPU as comparison, we can have a better view on the performance of parallel algorithm run on GPU.

- CPU Scan & Stream Compaction

- Naive GPU Scan Algorithm

- Work-Efficient GPU Scan & Stream Compaction

- Thrust Scan

- Radix Sort (Extra Credit)

Please see the details in the last part of the report.

## Performance Analysis

### Performances under Different Block Size

![Different Block Size](https://github.com/giaosame/Project2-Stream-Compaction/blob/master/img/different_blocksize_perf.png)

In general, when ```blockSize = 128```, the parallel version of scan algorithms and compact algorithms could achieve a relative optimized performance.

### Scan Algorithms Performances

![Scan Algorithms Performances](https://github.com/giaosame/Project2-Stream-Compaction/blob/master/img/different_arraysize_scan_perf.png)

This performance analysis tested when ```blockSize = 256```, and all algorithms taken into accounts are given the input with an power-of-two array size.

When the input array size is small (```SIZE < 2^15```), the difference of performances is small and not obvious for all the scan algorithms, but the serial version of algorithm run on CPU performs better than those parallel version of algorithms run on GPU.

When the input array size is large enough (```SIZE > 2^17```), the difference of performances becomes larger and larger, and apparently at this time, ```Thrust::Scan``` performs best among all algorithms. As expected, the ```CPU::Scan``` algorithm performs much worse than ```Thrust::Scan```, it is even worse than ```Naive::Scan```algorithm. However, it is quite weird that, the naive scan algorithm always runs faster than the work-efficient scan algorithm, because the so-called "efficient" work-efficient scan algorithm can still get optimized.

### Compact Algorithm Performances

![Compact Algorithm Performances](https://github.com/giaosame/Project2-Stream-Compaction/blob/master/img/different_arraysize_compact_perf.png)

When the input array size is large enough (```SIZE > 2^17```), the difference of performances become more and more obvious, and both of the compact algorithms perform much worse as the input size increases, as expected. The serial version of compact algorithm run on CPU performs better than the parallel version run in GPU when the input size is small, however, when the input size is quite huge, such as ```SIZE = 2^20```, there is no doubt that the work-efficient compact algorithm run in parallel on GPU perform much better than the CPU version.

### Output

This output tests were based on an array ```SIZE = 1024``` and ```blockSize = 128```:

```bash
****************
** SCAN TESTS **
****************
[ 46 46 37 6 29 0 28 22 25 23 3 11 29 ... 20 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.0013ms (std::chrono Measured)
[ 0 46 92 129 135 164 164 192 214 239 262 265 276 ... 24775 24795 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0.0007ms (std::chrono Measured)
[ 0 46 92 129 135 164 164 192 214 239 262 265 276 ... 24725 24751 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.02192ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.021952ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0.046304ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.045856ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.038912ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.038176ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 2 0 1 2 3 2 0 0 1 1 3 1 1 ... 2 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 0.0026ms (std::chrono Measured)
[ 2 1 2 3 2 1 1 3 1 1 1 2 1 ... 2 2 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 0.0024ms (std::chrono Measured)
[ 2 1 2 3 2 1 1 3 1 1 1 2 1 ... 1 2 ]
passed
==== cpu compact with scan ====
elapsed time: 0.0054ms (std::chrono Measured)
[ 2 1 2 3 2 1 1 3 1 1 1 2 1 ... 2 2 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 0.062976ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 0.052992ms (CUDA Measured)
passed
```

## Extra Credit

- **Radix sort**

I defined the several function used by Radix Sort in [radix.h](https://github.com/giaosame/Project2-Stream-Compaction/blob/master/stream_compaction/radix.h) and implemented it in [radix.cu](https://github.com/giaosame/Project2-Stream-Compaction/blob/master/stream_compaction/radix.cu) under the directory [/stream_compaction](https://github.com/giaosame/Project2-Stream-Compaction/tree/master/stream_compaction). In [main.cpp](https://github.com/giaosame/Project2-Stream-Compaction/blob/master/src/main.cpp), I called this function ```StreamCompaction::Radix::sort``` in the last of the ```main``` function.

```c++
zeroArray(SIZE, c);
printDesc("radix sort, power-of-two");
StreamCompaction::CPU::sort(SIZE, b, a);
StreamCompaction::Radix::sort(SIZE, c, a);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("radix sort, non-power-of-two");
StreamCompaction::CPU::sort(NPOT, b, a);
StreamCompaction::Radix::sort(NPOT, c, a);
printCmpResult(NPOT, b, c);
```

Examples of output of Radix Sort:

```bash
**********************
** RADIX SORT TESTS **
**********************
[ 10 31 19 93 79 96 60 46 46 85 44 56 52 53 85 39 ]
==== radix sort, power-of-two ====
[ 10 19 31 39 44 46 46 52 53 56 60 79 85 85 93 96 ]
passed
==== radix sort, non-power-of-two ====
[ 10 19 31 44 46 46 52 56 60 79 85 93 96 ]
passed
```





Binary file added img/different_arraysize_compact_perf.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/different_arraysize_scan_perf.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/different_blocksize_perf.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
27 changes: 24 additions & 3 deletions src/main.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -10,10 +10,11 @@
#include <stream_compaction/cpu.h>
#include <stream_compaction/naive.h>
#include <stream_compaction/efficient.h>
#include <stream_compaction/radix.h>
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 5; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
Expand Down Expand Up @@ -137,16 +138,36 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

printf("\n");
printf("**********************\n");
printf("** RADIX SORT TESTS **\n");
printf("**********************\n");

genArray(SIZE, a, 100); // Leave a 0 at the end to test that edge case
printArray(SIZE, a, true);

zeroArray(SIZE, c);
printDesc("radix sort, power-of-two");
StreamCompaction::CPU::sort(SIZE, b, a);
StreamCompaction::Radix::sort(SIZE, c, a);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("radix sort, non-power-of-two");
StreamCompaction::CPU::sort(NPOT, b, a);
StreamCompaction::Radix::sort(NPOT, c, a);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

system("pause"); // stop Win32 console from closing on exit
delete[] a;
delete[] b;
Expand Down
2 changes: 2 additions & 0 deletions stream_compaction/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ set(headers
"naive.h"
"efficient.h"
"thrust.h"
"radix.h"
)

set(sources
Expand All @@ -12,6 +13,7 @@ set(sources
"naive.cu"
"efficient.cu"
"thrust.cu"
"radix.cu"
)

list(SORT headers)
Expand Down
36 changes: 29 additions & 7 deletions stream_compaction/common.cu
Original file line number Diff line number Diff line change
Expand Up @@ -14,26 +14,48 @@ void checkCUDAErrorFn(const char *msg, const char *file, int line) {
exit(EXIT_FAILURE);
}


namespace StreamCompaction {
namespace Common {

// Initialize array in gpu
__global__ void kernInitializeArray(int n, int* a, int value)
{
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index < n)
{
a[index] = value;
}
}

/**
* Maps an array to an array of 0s and 1s for stream compaction. Elements
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n)
{
return;
}

bools[index] = idata[index] != 0 ? 1 : 0;
}

/**
* Performs scatter on an array. That is, for each element in idata,
* if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
* if bools[index] == 1, it copies idata[index] to odata[indices[index]].
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
}
__global__ void kernScatter(int n, int* odata, const int* idata, const int* bools, const int* indices) {
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n)
{
return;
}

if (bools[index])
{
odata[indices[index]] = idata[index];
}
}
}
}
6 changes: 5 additions & 1 deletion stream_compaction/common.h
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,6 @@

#include <cuda.h>
#include <cuda_runtime.h>

#include <cstdio>
#include <cstring>
#include <cmath>
Expand All @@ -13,6 +12,9 @@
#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)

// Block size used for CUDA kernel launch
#define blockSize 256

/**
* Check for CUDA errors; print and exit if there was a problem.
*/
Expand All @@ -32,6 +34,8 @@ inline int ilog2ceil(int x) {

namespace StreamCompaction {
namespace Common {
__global__ void kernInitializeArray(int n, int* a, int value);

__global__ void kernMapToBoolean(int n, int *bools, const int *idata);

__global__ void kernScatter(int n, int *odata,
Expand Down
54 changes: 49 additions & 5 deletions stream_compaction/cpu.cu
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,14 @@ namespace StreamCompaction {
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO

int prefixSum = 0;
for (int i = 0; i < n; i++)
{
odata[i] = prefixSum;
prefixSum += idata[i];
}

timer().endCpuTimer();
}

Expand All @@ -30,9 +37,16 @@ namespace StreamCompaction {
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO

int cnt = 0;
for (int i = 0; i < n; i++)
{
if (idata[i] != 0)
odata[cnt++] = idata[i];
}

timer().endCpuTimer();
return -1;
return cnt;
}

/**
Expand All @@ -41,10 +55,40 @@ namespace StreamCompaction {
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
// Compute temporary array and run exclusive scan on temporary array
int prefixSum = 0;
int* tdata = new int[n];
int* sdata = new int[n];

timer().startCpuTimer();
// TODO
for (int i = 0; i < n; i++)
{
tdata[i] = idata[i] != 0 ? 1 : 0;
sdata[i] = prefixSum;
prefixSum += tdata[i];
}

// Scatter
int idx = 0;
for (int i = 0; i < n; i++)
{
if (tdata[i])
{
idx = sdata[i];
odata[idx] = idata[i];
}
}
timer().endCpuTimer();
return -1;
return idx + 1;
}

void sort(int n, int* odata, const int* idata)
{
for (int i = 0; i < n; i++)
{
odata[i] = idata[i];
}
std::sort(odata, odata + n);
}
}
}
2 changes: 2 additions & 0 deletions stream_compaction/cpu.h
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,8 @@ namespace StreamCompaction {

void scan(int n, int *odata, const int *idata);

void sort(int n, int* odata, const int* idata);

int compactWithoutScan(int n, int *odata, const int *idata);

int compactWithScan(int n, int *odata, const int *idata);
Expand Down
Loading