This project focuses on accelerating a neural network implementation for the MNIST classification task using GPU programming with CUDA. We begin with a sequential CPU implementation (V1) and progressively optimize it on the GPU (V2 through V4), with an additional directive-based OpenACC port (V5). The key goal is to gain hands-on experience in parallel computing, high-performance computing (HPC), and CUDA optimization.
├── src
│   ├── V1  # Baseline sequential implementation
│   ├── V2  # Naive GPU implementation
│   ├── V3  # Optimized GPU implementation with performance improvements
│   ├── V4  # Optimized GPU implementation utilizing tensor cores
│   ├── V5  # Optimized GPU implementation using OpenACC
├── data    # Contains the MNIST dataset
├── report  # Project report
├── slides  # Presentation slides
├── README.md  # Project documentation and instructions
- NVIDIA GPU with CUDA support
 - CUDA Toolkit installed
 - `nvcc` compiler available
 - `make` utility installed
Navigate to the `src` directory and run `make`. This compiles the project and generates an executable located in `build/nn.exe`.
To execute the program, run `make run`. This executes the compiled neural network and moves profiling data if available.
For profiling and performance analysis, run one of:
- `make prof-run`
- `make nsight-analyze`
- `make speedup`

These targets generate profiling data for performance analysis.
To remove all compiled files and reset the build directory, run `make clean`.

The main source files are:
- `main.cu`: Entry point for the neural network execution.
- `neural_net.cu`: Core implementation of the neural network.
- `utils.cu`: Utility functions for matrix operations and timers.
- `mnist.cu`: MNIST dataset handling functions.
- `nn.h`: Header file defining neural network parameters.
- `utils.h`: Header file defining helper functions for matrix operations and timing.
- `speedup_analysis.c`: Compares all versions and gives a speedup analysis.
Each version of the project applies different optimization techniques:
V1: Baseline sequential implementation
- Sequential execution on the CPU.
 - No parallelism or GPU acceleration.
 
V2: Naive GPU implementation
- Converts matrix operations to CUDA kernels.
 - Parallel execution, but without further optimizations (a naive-kernel sketch follows below).
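
As a rough illustration of what "converts matrix operations to CUDA kernels" means, the sketch below shows a naive one-thread-per-output-element matrix multiply. The kernel name, signature, and launch configuration are hypothetical and do not necessarily match the code in `neural_net.cu`.

```cuda
// Hypothetical naive kernel: one thread per output element, all operands read
// from global memory, no tiling or occupancy tuning.
__global__ void matmul_naive(const float *A, const float *B, float *C,
                             int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < K; ++k)
            sum += A[row * K + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

// Example launch: fixed 16x16 blocks, one kernel call per layer.
// dim3 block(16, 16);
// dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
// matmul_naive<<<grid, block>>>(d_A, d_B, d_C, M, N, K);
```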
 
V3: Optimized GPU implementation
- Optimized kernel launch configuration.
 - Improved occupancy and memory usage.
 - Reduced communication overhead.
 - Efficient use of the memory hierarchy.
 - Used CUDA streams.
 - Used pinned (page-locked) host memory.
 - Shifted initialization to the kernel (device) side.
 - Combined multiple small kernels.
 - Used shared memory (see the sketch after this list, which also shows streams and pinned memory).
 - Used optimized compiler flags.
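
The sketch below is a simplified, self-contained illustration of two of the V3 techniques: shared-memory tiling and pinned host memory combined with a CUDA stream. The tile size, kernel, and function names are assumptions and may differ from the actual V3 code.

```cuda
#include <cstring>
#include <cuda_runtime.h>

#define TILE 16

// Tiled matrix multiply: each block stages TILE x TILE sub-tiles of A and B in
// shared memory, so each global-memory element is read once per tile instead
// of once per multiply-add.
__global__ void matmul_tiled(const float *A, const float *B, float *C,
                             int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N)
        C[row * N + col] = sum;
}

// Pinned (page-locked) host memory plus an asynchronous copy on a stream, so
// the transfer can overlap with work queued on other streams.
void copy_batch_async(const float *h_batch, float *d_batch, size_t bytes) {
    float *h_pinned;
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMallocHost((void **)&h_pinned, bytes);        // page-locked allocation
    std::memcpy(h_pinned, h_batch, bytes);
    cudaMemcpyAsync(d_batch, h_pinned, bytes,
                    cudaMemcpyHostToDevice, stream);  // async H2D copy
    // ... launch kernels on `stream` here ...
    cudaStreamSynchronize(stream);
    cudaFreeHost(h_pinned);
    cudaStreamDestroy(stream);
}
```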
 
V4: Optimized GPU implementation using Tensor Cores
- Utilizes Tensor Cores for matrix multiplications.
 - Further speedup through specialized CUDA libraries (see the Tensor Core sketch below).
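
This README does not say whether V4 drives the Tensor Cores through a library such as cuBLAS or through the WMMA API directly. As one possible illustration, the fragment below multiplies a single 16x16 half-precision tile on the Tensor Cores via CUDA's WMMA API (requires compute capability 7.0+ and building with `-arch=sm_70` or newer).

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp multiplies one 16x16x16 tile on the Tensor Cores:
// C (float) = A (half) * B (half). Illustrative only; a full GEMM would tile
// over the whole matrices with many warps.
__global__ void wmma_tile_gemm(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);   // leading dimension = 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}

// Launch with a single warp, e.g.: wmma_tile_gemm<<<1, 32>>>(d_A, d_B, d_C);
```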
 
V5: OpenACC implementation
- Directive-based parallelism.
 - Quick porting and hardware abstraction (see the sketch below).
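
For contrast with the hand-written CUDA versions, the sketch below shows what a directive-based dense-layer forward pass might look like in V5. The function name and data clauses are assumptions, not the project's actual code; it compiles with an OpenACC compiler such as `nvc -acc`.

```c
// Hypothetical OpenACC dense-layer forward pass: y = W * x + b.
// The pragmas describe the parallelism and data movement; the compiler
// generates the GPU code.
void dense_forward(const float *W, const float *x, const float *b,
                   float *y, int out_dim, int in_dim) {
    #pragma acc parallel loop \
        copyin(W[0:out_dim*in_dim], x[0:in_dim], b[0:out_dim]) \
        copyout(y[0:out_dim])
    for (int i = 0; i < out_dim; ++i) {
        float sum = b[i];
        #pragma acc loop reduction(+:sum)
        for (int j = 0; j < in_dim; ++j)
            sum += W[i * in_dim + j] * x[j];
        y[i] = sum;
    }
}
```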
 
- Umer Farooq
 - Muhammad Irtaza Khan