34 commits
47ae045
cpu
Sep 12, 2019
e5db042
Naive, Efficient and Thrust scans
Sep 13, 2019
8cde5f6
Stream Compaction
Sep 14, 2019
4090c29
NN Implementation
Sep 15, 2019
8d9da6e
NN: bug fixes, cleanup, refactoring
Sep 15, 2019
32ec5ed
Stream Compact Metrics Collection and main script cleanup
Sep 16, 2019
c0e6905
Normalization
Sep 16, 2019
558df6c
Modularizing, Comments,Bug Fixes
Sep 17, 2019
ab2c194
Stream Compaction Performance Plots
DishaJindal Sep 17, 2019
d6857af
Outputs and Original stream compaction main
Sep 17, 2019
486a851
Performance Analysis
Sep 17, 2019
3a51e6e
README v1
DishaJindal Sep 17, 2019
838d3cc
Learning Curves and Predictions
Sep 17, 2019
d40711f
Merge branch 'master' of https://github.com/DishaJindal/Project2-Numb…
Sep 17, 2019
f4736f0
MLP
Sep 17, 2019
40743f3
Activation_Functions
Sep 17, 2019
a721e78
XOR Loss
Sep 17, 2019
6156b48
Model Weights
Sep 17, 2019
78846d1
README v1
DishaJindal Sep 17, 2019
b0bfabf
README
DishaJindal Sep 17, 2019
a2bb02a
README
DishaJindal Sep 17, 2019
9a9fcb5
Fixing Links
DishaJindal Sep 17, 2019
e496533
Screenshot to triple tick output
DishaJindal Sep 18, 2019
61cfa1c
Updating README with Radix Sort
DishaJindal Sep 18, 2019
f08230a
Radix Sort
Sep 18, 2019
5090463
Merge branch 'master' of https://github.com/DishaJindal/Project2-Numb…
Sep 18, 2019
d486c93
Refined Description
DishaJindal Sep 18, 2019
9380257
Updated README
DishaJindal Sep 18, 2019
3fd3559
added timer
Sep 18, 2019
8b26df9
Merge branch 'master' of https://github.com/DishaJindal/Project2-Numb…
Sep 18, 2019
5b3c0e1
. to :
DishaJindal Sep 18, 2019
02c8360
Readme
DishaJindal Sep 27, 2019
6dc3c02
readme
DishaJindal Sep 27, 2019
807497c
readme
DishaJindal Sep 27, 2019
3 changes: 3 additions & 0 deletions Project2-Character-Recognition/CMakeLists.txt
@@ -22,6 +22,7 @@ if(${CMAKE_SYSTEM_NAME} MATCHES "Darwin")
endif()

include_directories(.)
link_directories(${CUDA_TOOLKIT_ROOT_DIR}/lib/x64)
add_subdirectory(character_recognition)

cuda_add_executable(${CMAKE_PROJECT_NAME}
@@ -32,4 +33,6 @@ cuda_add_executable(${CMAKE_PROJECT_NAME}
target_link_libraries(${CMAKE_PROJECT_NAME}
character_recognition
${CORELIBS}
cublas
curand
)
74 changes: 68 additions & 6 deletions Project2-Character-Recognition/README.md
@@ -3,12 +3,74 @@ CUDA Character Recognition

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* Disha Jindal: [Linkedin](https://www.linkedin.com/in/disha-jindal/)
* Tested on: Windows 10 Education, Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz 16GB, NVIDIA Quadro P1000 @ 4GB (Moore 100B Lab)

## Description
This code implements a simple neural network (multi-layer perceptron) in CUDA. It is tested on XOR and on a character recognition dataset. The following figure depicts a neural network with an input layer, a hidden layer and an output layer.
<p align="center"><img src="https://github.com/DishaJindal/Project2-Number-Algorithms/blob/master/Project2-Character-Recognition/img/MLP_Architecture.png" width="600"/></p>

### Architecture
**Input Layer**: The number of nodes in the input layer is equal to the number of features in an image, which is 10201 in the given dataset.

**Hidden Layer**: I have analyzed the convergence of the model with different numbers of hidden nodes. It converges quickly for any number greater than 10; refer to the Performance Analysis section for more details.

**Output Layer**: The number of nodes in the output layer is equal to the number of classes, which is 52 for character recognition and 2 for XOR.

**Activation Functions**: I tried two activation functions for the hidden layer, Sigmoid and ReLU, and the model converges with either of them.
<p align="center"><img src="https://github.com/DishaJindal/Project2-Number-Algorithms/blob/master/Project2-Character-Recognition/img/Sigmoid_ReLU.png" width="600"/></p>
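
As a rough illustration of the two activation functions mentioned above, here is a minimal host-side C++ sketch. The function names (`sigmoid`, `sigmoidGradFromOutput`, `relu`, `reluGrad`) are hypothetical and are not taken from `functions.cu`; the actual kernels in this project may be organized differently.

```cpp
#include <algorithm>
#include <cmath>

// Logistic sigmoid: squashes any real input into the open interval (0, 1).
inline float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Derivative of the sigmoid, written in terms of its output y = sigmoid(x).
inline float sigmoidGradFromOutput(float y) { return y * (1.0f - y); }

// ReLU: passes positive inputs through unchanged and maps the rest to zero.
inline float relu(float x) { return std::max(0.0f, x); }

// Subgradient of ReLU (0 is chosen at x == 0, a common convention).
inline float reluGrad(float x) { return x > 0.0f ? 1.0f : 0.0f; }
```

In a CUDA version, each of these would typically be applied elementwise, one thread per buffer entry.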

**Loss Function**: I have used cross-entropy loss for this classification problem.
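
The cross-entropy loss for a single sample can be sketched as follows. This is a minimal host-side example and it assumes that the output scores are turned into probabilities with a softmax, which the README does not state explicitly; `softmax` and `crossEntropy` are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable softmax: subtracting the largest score avoids overflow in exp.
// (Assumption: the model converts output scores to probabilities this way.)
std::vector<float> softmax(const std::vector<float>& scores) {
    assert(!scores.empty());
    const float maxScore = *std::max_element(scores.begin(), scores.end());
    std::vector<float> probs(scores.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        probs[i] = std::exp(scores[i] - maxScore);
        sum += probs[i];
    }
    for (float& p : probs) p /= sum;
    return probs;
}

// Cross-entropy for a one-hot target: -log(probability of the true class).
// The small epsilon guards against log(0).
float crossEntropy(const std::vector<float>& probs, std::size_t trueClass) {
    assert(trueClass < probs.size());
    const float eps = 1e-7f;
    return -std::log(std::max(probs[trueClass], eps));
}
```

With two equally likely classes, for example, the loss is `log(2)`, and it approaches 0 as the probability of the true class approaches 1.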

### Implementation
* `main.cpp` is the orchestrator. I have written two different functions, one for XOR and the other for Character Recognition. The required function can be called from the main function.
* `mlp.cu` contains all MLP-specific functions, with `init`, `train`, `test` and `free` being the ones called from the main script.
* `functions.cu` contains the required kernels for elementwise and other operations.

**Training**: The `train` function in `mlp.cu` orchestrates this. The entire training process runs for the configured number of epochs. The main components of the training phase are the forward pass, the backward pass and the loss calculation. The forward pass performs the matrix multiplications that populate the buffers for the hidden and output layers. The backward pass computes the gradients and updates the weight matrices. The loss calculation step uses the true and predicted outputs to compute the cross-entropy loss, which is plotted in the learning curves in the Performance Analysis section. The last step is to save the final model weights to a file.
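
The forward pass described above can be sketched on the CPU as two matrix products with an activation in between. This is only an illustration under several assumptions: row-major weight storage, a sigmoid hidden layer, no bias terms, and the hypothetical helper names `matVec` and `forward`. The project itself runs this step on the GPU.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// out[j] = sum_i in[i] * w[i * cols + j], with w stored row-major as rows x cols.
std::vector<float> matVec(const std::vector<float>& in, const std::vector<float>& w,
                          std::size_t rows, std::size_t cols) {
    assert(in.size() == rows && w.size() == rows * cols);
    std::vector<float> out(cols, 0.0f);
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            out[j] += in[i] * w[i * cols + j];
    return out;
}

// Forward pass: hidden = sigmoid(x * W1), output = hidden * W2 (raw scores).
std::vector<float> forward(const std::vector<float>& x,
                           const std::vector<float>& w1,
                           std::size_t inputSize, std::size_t hiddenSize,
                           const std::vector<float>& w2, std::size_t outputSize) {
    std::vector<float> hidden = matVec(x, w1, inputSize, hiddenSize);
    for (float& h : hidden) h = 1.0f / (1.0f + std::exp(-h));  // sigmoid activation
    return matVec(hidden, w2, hiddenSize, outputSize);
}
```

The backward pass would reuse the same buffers in reverse order to compute gradients for `W2` and then `W1`.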

**Testing**: The `test` function in `mlp.cu` orchestrates this. It is used in the inference phase, after the model has been trained. In this phase, I run the input through the forward pass and take the argmax to find the output node with the maximum probability. This gives the predicted class along with the model's confidence in that prediction.
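
The argmax step can be illustrated with a short host-side sketch. The `Prediction` struct and the `predict` function are hypothetical names used only for this example; they are not part of `mlp.cu`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Predicted class index together with the probability assigned to it.
struct Prediction {
    std::size_t label;
    float confidence;
};

// Scans the output probabilities and returns the most likely class.
// Ties are resolved in favor of the lowest index.
Prediction predict(const std::vector<float>& probs) {
    assert(!probs.empty());
    Prediction best{0, probs[0]};
    for (std::size_t i = 1; i < probs.size(); ++i) {
        if (probs[i] > best.confidence) {
            best.label = i;
            best.confidence = probs[i];
        }
    }
    return best;
}
```

For the character recognition model, `probs` would have 52 entries, one per class.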

**Hyperparameter Tuning**: The two main hyperparameters in this model are the number of nodes in the hidden layer and the learning rate. I have tuned both of them, and the Performance Analysis section below discusses this in detail.

## Performance Analysis
Learning curves show the model's learning process by plotting the loss at each epoch of the training phase. Following are the learning curves of various models with different parameters.
### XOR
Following is the learning curve for XOR.
<p align="center"><img src="https://github.com/DishaJindal/Project2-Number-Algorithms/blob/master/Project2-Character-Recognition/img/XOR_Loss.PNG" width="600"/></p>

### Character Recognition
#### Hidden Nodes
Following are the learning curves of models trained with different numbers of nodes in the hidden layer, ranging from 5 to 100. Since the loss dropped to 0 in 3 iterations with 100 nodes, there is no point in increasing the number of nodes any further. This clearly shows that the model converges faster as the number of hidden nodes increases. It takes more than 500 epochs with 5 or 10 hidden units. With 20 or more hidden nodes, the loss goes to 0 within 100 epochs.
<p align="center"><img src="https://github.com/DishaJindal/Project2-Number-Algorithms/blob/master/Project2-Character-Recognition/img/LC_HiddenNodes.PNG" width="700"/></p>

#### Learning Rate
Following are the learning curves of models trained with different learning rates. The plot covers values from 0.01 to 1. The curves for 0.01, 0.05 and 0.1 show that the model learns faster as the learning rate increases, because each step makes a bigger update. Beyond that, as the curves for 0.5 and 1 show, the model starts to learn faster but then does not converge. This is because each step makes very large changes to the weights, which causes the optimization to get stuck at various local minima (corresponding to individual characters rather than learning the actual pattern).
<p align="center"><img src="https://github.com/DishaJindal/Project2-Number-Algorithms/blob/master/Project2-Character-Recognition/img/LC_LearningRate.PNG" width="700"/></p>
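
The effect of the step size can be seen in a toy example that is much simpler than the character model: plain gradient descent on `f(w) = w * w`. This sketch is an assumption-laden illustration only (the function `descend` is hypothetical and the thresholds for this toy problem differ from those of the MLP), but it shows the same qualitative pattern of faster progress with a moderately larger rate and no convergence once the rate is too large.

```cpp
#include <cassert>

// Runs `steps` gradient-descent updates on f(w) = w * w, whose gradient is 2 * w.
// Each update is w <- w - learningRate * 2 * w, i.e. w is scaled by (1 - 2 * learningRate).
float descend(float w, float learningRate, int steps) {
    assert(steps >= 0);
    for (int s = 0; s < steps; ++s) {
        w -= learningRate * 2.0f * w;
    }
    return w;
}
```

Starting from `w = 1`, a rate of 0.1 shrinks `w` toward the minimum at 0, a rate of 1.0 makes `w` jump between 1 and -1 without ever settling, and a rate of 1.5 doubles the magnitude of `w` at every step.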

## Model Weights
There are two weight matrices in this model.

**W1**: Weights between the input and hidden layers. The dimensions are INPUT_LAYER_SIZE * HIDDEN_LAYER_SIZE.

**W2**: Weights between the hidden and output layers. The dimensions are HIDDEN_LAYER_SIZE * OUTPUT_LAYER_SIZE.
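
To make the sizes concrete, the following sketch computes the number of elements in each matrix and the flat index of one weight. It assumes row-major storage, which is an assumption made for illustration only (cuBLAS, for instance, uses column-major layout), and the names `WeightShapes`, `weightShapes` and `flatIndex` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Element counts of the two weight matrices.
struct WeightShapes {
    std::size_t w1Elements;  // INPUT_LAYER_SIZE * HIDDEN_LAYER_SIZE
    std::size_t w2Elements;  // HIDDEN_LAYER_SIZE * OUTPUT_LAYER_SIZE
};

inline WeightShapes weightShapes(std::size_t inputSize, std::size_t hiddenSize,
                                 std::size_t outputSize) {
    return WeightShapes{inputSize * hiddenSize, hiddenSize * outputSize};
}

// Flat offset of element (row, col) in a row-major matrix with `cols` columns.
// (Assumption: row-major layout; a column-major layout would use col * rows + row.)
inline std::size_t flatIndex(std::size_t row, std::size_t col, std::size_t cols) {
    assert(col < cols);
    return row * cols + col;
}
```

For example, with 10201 input features, 100 hidden nodes and 52 classes, W1 holds 1,020,100 values and W2 holds 5,200.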

Following are the links to the weight files for both the XOR and Character Recognition models:

**XOR**
[W1](https://github.com/DishaJindal/Project2-Number-Algorithms/blob/master/Project2-Character-Recognition/weights/xor_model_w1.txt) ,
[W2](https://github.com/DishaJindal/Project2-Number-Algorithms/blob/master/Project2-Character-Recognition/weights/xor_model_w2.txt)

**Character Recognition**
[W1](https://github.com/DishaJindal/Project2-Number-Algorithms/blob/master/Project2-Character-Recognition/weights/cr_model_w1.txt) ,
[W2](https://github.com/DishaJindal/Project2-Number-Algorithms/blob/master/Project2-Character-Recognition/weights/cr_model_w2.txt)
## Predictions
Following are the outputs for Character Recognition and XOR from the corresponding models. This is implemented in the `test` function. The inputs are shuffled before testing, using the shuffle method from `#include <algorithm>`, to make sure that the model is not just learning the order of the inputs. Then, using the forward pass and argmax logic, the target predictions and the model predictions are printed along with the probabilities.
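
One way to shuffle the test inputs is to permute a list of sample indices with `std::shuffle`. The sketch below is a minimal example under that assumption; the helper `shuffledOrder`, the use of `std::mt19937` and the seed parameter are illustrative choices, not necessarily what the `test` function does.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// Returns a random permutation of the sample indices 0 .. n-1.
// A fixed seed makes the order reproducible between runs.
std::vector<int> shuffledOrder(int n, unsigned seed) {
    assert(n >= 0);
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);       // fill with 0, 1, ..., n-1
    std::mt19937 rng(seed);                         // pseudo-random generator
    std::shuffle(order.begin(), order.end(), rng);  // permute in place
    return order;
}
```

The samples can then be visited in the order `order[0], order[1], ...`, so that inputs and their target labels stay paired.
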
### Character Recognition

<p align="center"><img src="https://github.com/DishaJindal/Project2-Number-Algorithms/blob/master/Project2-Character-Recognition/img/Predictions_CharacterRecognition.PNG" width="600"/></p>

### XOR

<p align="center"><img src="https://github.com/DishaJindal/Project2-Number-Algorithms/blob/master/Project2-Character-Recognition/img/Predictions_XOR.PNG" width="600"/></p>

@@ -1,11 +1,13 @@
set(SOURCE_FILES
"common.h"
"common.cu"
"functions.cu"
"functions.h"
"mlp.h"
"mlp.cu"
)

cuda_add_library(character_recognition
${SOURCE_FILES}
OPTIONS -arch=sm_20
OPTIONS -arch=sm_60
)
48 changes: 37 additions & 11 deletions Project2-Character-Recognition/character_recognition/common.cu
@@ -1,15 +1,41 @@
#include "common.h"
#include "device_launch_parameters.h"

void checkCUDAErrorFn(const char *msg, const char *file, int line) {
cudaError_t err = cudaGetLastError();
if (cudaSuccess == err) {
return;
}

fprintf(stderr, "CUDA error");
if (file) {
fprintf(stderr, " (%s:%d)", file, line);
}
fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err));
exit(EXIT_FAILURE);
}

void printArray(int n, float *a, bool abridged = false) {
printf(" [ ");
for (int i = 0; i < n; i++) {
if (abridged && i + 2 == 15 && n > 16) {
i = n - 2;
printf("... ");
}
printf("%f ", a[i]);
}
printf("]\n");
}

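void printCudaArray(int size, float* data) {
// Copy the device array into a temporary host buffer before printing.
float *h_data = new float[size];
cudaMemcpy(h_data, data, size * sizeof(float), cudaMemcpyDeviceToHost);
printArray(size, h_data, true);
delete[] h_data; // release the temporary host buffer
}

void printCuda2DArray(int height, int width, float* data) {
// Copy the height x width device matrix to the host, then print one row per line.
float *h_data = new float[width * height];
cudaMemcpy(h_data, data, width * height * sizeof(float), cudaMemcpyDeviceToHost);
for (int i = 0; i < height; i++)
printArray(width, h_data + i * width, true);
delete[] h_data; // release the temporary host buffer
}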
195 changes: 97 additions & 98 deletions Project2-Character-Recognition/character_recognition/common.h
@@ -19,108 +19,107 @@
void checkCUDAErrorFn(const char *msg, const char *file = NULL, int line = -1);

inline int ilog2(int x) {
int lg = 0;
while (x >>= 1) {
++lg;
}
return lg;
}

inline int ilog2ceil(int x) {
return x == 1 ? 0 : ilog2(x - 1) + 1;
}


namespace Common {
/**
* This class is used for timing the performance
* Uncopyable and unmovable
*
* Adapted from WindyDarian(https://github.com/WindyDarian)
*/
class PerformanceTimer
{
public:
PerformanceTimer()
{
cudaEventCreate(&event_start);
cudaEventCreate(&event_end);
}

~PerformanceTimer()
{
cudaEventDestroy(event_start);
cudaEventDestroy(event_end);
}

void startCpuTimer()
{
if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); }
cpu_timer_started = true;

time_start_cpu = std::chrono::high_resolution_clock::now();
}

void endCpuTimer()
{
time_end_cpu = std::chrono::high_resolution_clock::now();

if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); }

std::chrono::duration<double, std::milli> duro = time_end_cpu - time_start_cpu;
prev_elapsed_time_cpu_milliseconds =
static_cast<decltype(prev_elapsed_time_cpu_milliseconds)>(duro.count());

cpu_timer_started = false;
}

void startGpuTimer()
{
if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); }
gpu_timer_started = true;

cudaEventRecord(event_start);
}

void endGpuTimer()
{
cudaEventRecord(event_end);
cudaEventSynchronize(event_end);

if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); }

cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end);
gpu_timer_started = false;
}

float getCpuElapsedTimeForPreviousOperation() //noexcept //(damn I need VS 2015
{
return prev_elapsed_time_cpu_milliseconds;
}

float getGpuElapsedTimeForPreviousOperation() //noexcept
{
return prev_elapsed_time_gpu_milliseconds;
}

// remove copy and move functions
PerformanceTimer(const PerformanceTimer&) = delete;
PerformanceTimer(PerformanceTimer&&) = delete;
PerformanceTimer& operator=(const PerformanceTimer&) = delete;
PerformanceTimer& operator=(PerformanceTimer&&) = delete;

private:
cudaEvent_t event_start = nullptr;
cudaEvent_t event_end = nullptr;

using time_point_t = std::chrono::high_resolution_clock::time_point;
time_point_t time_start_cpu;
time_point_t time_end_cpu;

bool cpu_timer_started = false;
bool gpu_timer_started = false;

float prev_elapsed_time_cpu_milliseconds = 0.f;
float prev_elapsed_time_gpu_milliseconds = 0.f;
};
}