Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
121 changes: 114 additions & 7 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,14 +1,121 @@
CUDA Stream Compaction
======================

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
**University of Pennsylvania, CIS 565: GPU Programming and Architecture**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Jiarui Yan
* [LinkedIn](https://www.linkedin.com/in/jiarui-yan-a06bb5197?lipi=urn%3Ali%3Apage%3Ad_flagship3_profile_view_base_contact_details%3BvRlITiOMSt%2B9Mgg6SZFKDQ%3D%3D), [personal website](https://jiaruiyan.pb.online/), [twitter](https://twitter.com/JerryYan1997), etc.
* Tested on: Windows 10 Home, i7-9700K @ 3.60GHz 16GB DDR4 RAM, RTX 2070 SUPER 8GB Dedicated GPU memory (Personal desktop)

### (TODO: Your README)
## Features

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
* CPU Scan & Stream Compaction

* Naive GPU Scan Algorithm

* Work-Efficient GPU Scan & Stream Compaction

* Thrust scan

* Compact threads (Part 5 Extra Credit -- +5)

* Parallel Radix Sort (Part 6 Extra Credit -- +10)

## Roughly optimize for each implementations

In this part, in order to determine the suitable blocksize for each implementations, I run different scenarios under several discrete blocksizes: 16, 32, 64, 128, 256, 512. Then, I collect their average running time and plot them respectively. They are shown below:

![Experiment 1](./img/Naive_GPU_scan_blocksize.PNG)

![Experiment 2](./img/work_efficient_GPU_scan_blocksize.PNG)

![Experiment 3](./img/efficient_compaction_blocksize.PNG)

![Experiment 4](./img/parallel_radix_sort_blocksize.PNG)

As you can see from graphs above, a blocksize that is 32 is suitable for all the scenarios. I think this maybe caused by the fact that the input array size is not long enough. I control the input size for power-of-two scenarios to be 256. Besides, 32 is also the number of threads in a warp. As a result, 32 is a preferable size for all of them. Subsequently, I will fix their blocksize to 32 and do analysis under this blocksize.

## Performance comparsion

![Experiment 5](./img/scan_time_for_different_array_size.PNG)

As we can see from the graph above, we can figure out that GPU's advantages can be shown when the length of input array is long enough.

As for Thrust implementation, I guess it implement shared memory and all sorts of tricks to improve the proformence and try their best to exclude operations like memory copy.

## Brief explanation of phenomena

From my observation of the NSight timeline, all of my code are bounded by memory I/O, because the gap between each computation is relatively bigger than computation parts. Besides, each implementation is different. For instance, Naive scan would hand in lots of CUDA kernels, while other implementation hands in less kernels.

### Output of the test program

Here is the output by inputing an 2^20 array. As for the 'parallel radix sort', please take a look at the thrid part of the ouput below.

```
****************
** SCAN TESTS **
****************
[ 3 16 44 30 6 8 15 16 11 48 38 32 12 ... 46 0 ]
==== cpu scan, power-of-two ====
elapsed time: 3.6953ms (std::chrono Measured)
[ 0 3 19 63 93 99 107 122 138 149 197 235 267 ... 25674322 25674368 ]
==== cpu scan, non-power-of-two ====
elapsed time: 1.3371ms (std::chrono Measured)
[ 0 3 19 63 93 99 107 122 138 149 197 235 267 ... 25674263 25674285 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.999616ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.999424ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0.227328ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.223232ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.245824ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.309088ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 1 2 0 0 2 0 3 0 3 0 0 2 0 ... 2 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 2.0501ms (std::chrono Measured)
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 2.0319ms (std::chrono Measured)
passed
==== cpu compact with scan ====
elapsed time: 7.6485ms (std::chrono Measured)
passed
==== work-efficient compact, power-of-two ====
elapsed time: 0.7328ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 0.733216ms (CUDA Measured)
passed

*****************************
** PARALLEL RADIX SORT TESTS **
*****************************
[ 3 16 44 30 6 8 15 16 11 48 38 32 12 ... 46 0 ]
==== parallel radix sort, power-of-two correct result ====
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 49 49 ]
==== parallel radix sort, power-of-two correct result ====
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 49 49 ]
==== parallel radix sort, power-of-two ====
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 49 49 ]
elapsed time: 0.733216ms (CUDA Measured)
passed
==== parallel radix sort, non-power-of-two ====
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 49 49 ]
elapsed time: 0.733216ms (CUDA Measured)
passed
```
Binary file added img/Naive_GPU_scan_blocksize.PNG
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/data.xlsx
Binary file not shown.
Binary file added img/efficient_compaction_blocksize.PNG
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file removed img/example-1.png
Binary file not shown.
Binary file removed img/example-2.jpg
Binary file not shown.
Binary file removed img/figure-39-2.jpg
Binary file not shown.
Binary file removed img/figure-39-4.jpg
Binary file not shown.
Binary file added img/parallel_radix_sort_blocksize.PNG
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/scan trend.xlsx
Binary file not shown.
Binary file added img/scan_time_for_different_array_size.PNG
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/work_efficient_GPU_scan_blocksize.PNG
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
67 changes: 54 additions & 13 deletions src/main.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -11,13 +11,17 @@
#include <stream_compaction/naive.h>
#include <stream_compaction/efficient.h>
#include <stream_compaction/thrust.h>
#include <stream_compaction/radix.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 20; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
int *c = new int[SIZE];
int *d = new int[SIZE];

int main(int argc, char* argv[]) {
// Scan tests
Expand Down Expand Up @@ -47,19 +51,19 @@ int main(int argc, char* argv[]) {
printArray(NPOT, b, true);
printCmpResult(NPOT, b, c);

/* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
onesArray(SIZE, c);
printDesc("1s array for finding bugs");
StreamCompaction::Naive::scan(SIZE, c, c);
printArray(SIZE, c, true);*/

zeroArray(SIZE, c);
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
// printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

/* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
onesArray(SIZE, c);
printDesc("1s array for finding bugs");
StreamCompaction::Naive::scan(SIZE, c, a);
printArray(SIZE, c, true); */

zeroArray(SIZE, c);
printDesc("naive scan, non-power-of-two");
StreamCompaction::Naive::scan(NPOT, c, a);
Expand All @@ -71,14 +75,14 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
// printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
// printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
Expand All @@ -94,6 +98,7 @@ int main(int argc, char* argv[]) {
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);


printf("\n");
printf("*****************************\n");
Expand All @@ -115,22 +120,22 @@ int main(int argc, char* argv[]) {
count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
expectedCount = count;
printArray(count, b, true);
// printArray(count, b, true);
printCmpLenResult(count, expectedCount, b, b);

zeroArray(SIZE, c);
printDesc("cpu compact without scan, non-power-of-two");
count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
expectedNPOT = count;
printArray(count, c, true);
// printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

zeroArray(SIZE, c);
printDesc("cpu compact with scan");
count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(count, c, true);
// printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
Expand All @@ -147,6 +152,42 @@ int main(int argc, char* argv[]) {
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);


printf("\n");
printf("*****************************\n");
printf("** PARALLEL RADIX SORT TESTS **\n");
printf("*****************************\n");

// Radix Sort Tests
genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);

zeroArray(SIZE, b);
printDesc("parallel radix sort, power-of-two correct result");
memcpy(b, a, SIZE * sizeof(int));
thrust::sort(thrust::host, b, b + SIZE);
printArray(SIZE, b, true);

printDesc("parallel radix sort, power-of-two correct result");
memcpy(d, a, NPOT * sizeof(int));
thrust::sort(thrust::host, d, d + NPOT);
printArray(NPOT, d, true);

zeroArray(SIZE, c);
printDesc("parallel radix sort, power-of-two");
StreamCompaction::RadixSort::radix_sort(SIZE, c, a);
printArray(SIZE, c, true);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("parallel radix sort, non-power-of-two");
StreamCompaction::RadixSort::radix_sort(NPOT, c, a);
printArray(NPOT, c, true);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printCmpResult(NPOT, d, c);

system("pause"); // stop Win32 console from closing on exit
delete[] a;
delete[] b;
Expand Down
2 changes: 2 additions & 0 deletions stream_compaction/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ set(headers
"naive.h"
"efficient.h"
"thrust.h"
"radix.h"
)

set(sources
Expand All @@ -12,6 +13,7 @@ set(sources
"naive.cu"
"efficient.cu"
"thrust.cu"
"radix.cu"
)

list(SORT headers)
Expand Down
18 changes: 18 additions & 0 deletions stream_compaction/common.cu
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,16 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) {
return;
}
if (idata[index] == 0) {
bools[index] = 0;
}
else {
bools[index] = 1;
}
}

/**
Expand All @@ -33,6 +43,14 @@ namespace StreamCompaction {
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) {
return;
}
if (bools[index] == 1) {
int final_index = indices[index];
odata[final_index] = idata[index];
}
}

}
Expand Down
43 changes: 41 additions & 2 deletions stream_compaction/cpu.cu
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,11 @@ namespace StreamCompaction {
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
odata[0] = 0;
for (int k = 1; k < n; ++k)
{
odata[k] = odata[k - 1] + idata[k - 1];
}
timer().endCpuTimer();
}

Expand All @@ -31,8 +36,15 @@ namespace StreamCompaction {
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int o_idx = 0;
for (int i_idx = 0; i_idx < n; ++i_idx) {
if (idata[i_idx] != 0) {
odata[o_idx] = idata[i_idx];
++o_idx;
}
}
timer().endCpuTimer();
return -1;
return o_idx;
}

/**
Expand All @@ -43,8 +55,35 @@ namespace StreamCompaction {
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int* temp_array = new int[n];
int* scan_array = new int[n];
// Compute temporary array:
for (int i_idx = 0; i_idx < n; ++i_idx) {
if (idata[i_idx] != 0) {
temp_array[i_idx] = 1;
}
else {
temp_array[i_idx] = 0;
}
}
// Exclusive scan:
scan_array[0] = 0;
for (int k = 1; k < n; ++k)
{
scan_array[k] = scan_array[k - 1] + temp_array[k - 1];
}
// Scatter:
int o_counter = 0;
for (int i_idx = 0; i_idx < n; ++i_idx)
{
if (temp_array[i_idx] == 1) {
int o_idx = scan_array[i_idx];
odata[o_idx] = idata[i_idx];
++o_counter;
}
}
timer().endCpuTimer();
return -1;
return o_counter;
}
}
}
Loading