This project focuses on accelerating a neural network implementation for the MNIST classification task using GPU programming with CUDA. We begin with a sequential CPU implementation (V1) and progressively optimize it on the GPU (V2 through V4), with an additional directive-based OpenACC port (V5). The key goal is to gain hands-on experience in parallel computing, high-performance computing (HPC), and CUDA optimization.
├── src
│   ├── V1  # Baseline sequential implementation
│   ├── V2  # Naive GPU implementation
│   ├── V3  # Optimized GPU implementation with performance improvements
│   ├── V4  # Optimized GPU implementation utilizing tensor cores
│   ├── V5  # Optimized GPU implementation using OpenACC
├── data    # Contains the MNIST dataset
├── report  # Project report
├── slides  # Presentation slides
├── README.md  # Project documentation and instructions
- NVIDIA GPU with CUDA support
 - CUDA Toolkit installed
 - `nvcc` compiler available
 - `make` utility installed
Navigate to the `src` directory and run `make`. This compiles the project and generates an executable located in `build/nn.exe`.
To execute the program, run `make run`. This executes the compiled neural network and moves profiling data if available.
For profiling and performance analysis, run one of:
- `make prof-run`
- `make nsight-analyze`
- `make speedup`

These targets generate profiling data for performance analysis.
To remove all compiled files and reset the build directory, run `make clean`.

The main source files are:
- `main.cu`: Entry point for the neural network execution.
- `neural_net.cu`: Core implementation of the neural network.
- `utils.cu`: Utility functions for matrix operations and timers.
- `mnist.cu`: MNIST dataset handling functions.
- `nn.h`: Header file defining neural network parameters.
- `utils.h`: Header file defining helper functions for matrix operations and timing.
- `speedup_analysis.c`: Compares all versions and gives a speedup analysis.
Each version of the project applies different optimization techniques:
V1: Baseline sequential implementation
- Sequential execution on the CPU.
 - No parallelism or GPU acceleration.
 
V2: Naive GPU implementation
- Converts matrix operations to CUDA kernels.
 - Parallel execution, but without further optimizations (a naive-kernel sketch follows below).
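
As a rough illustration of what "converts matrix operations to CUDA kernels" means, the sketch below shows a naive one-thread-per-output-element matrix multiply. The kernel name, signature, and launch configuration are hypothetical and do not necessarily match the code in `neural_net.cu`.

```cuda
// Hypothetical naive kernel: one thread per output element, all operands read
// from global memory, no tiling or occupancy tuning.
__global__ void matmul_naive(const float *A, const float *B, float *C,
                             int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < K; ++k)
            sum += A[row * K + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

// Example launch: fixed 16x16 blocks, one kernel call per layer.
// dim3 block(16, 16);
// dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
// matmul_naive<<<grid, block>>>(d_A, d_B, d_C, M, N, K);
```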
 
V3: Optimized GPU implementation
- Optimized kernel launch configuration.
 - Improved occupancy and memory usage.
 - Reduced communication overhead.
 - Efficient use of the memory hierarchy.
 - Used CUDA streams.
 - Used pinned (page-locked) host memory.
 - Shifted initialization to the kernel (device) side.
 - Combined multiple small kernels.
 - Used shared memory (see the sketch after this list, which also shows streams and pinned memory).
 - Used optimized compiler flags.
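
The sketch below is a simplified, self-contained illustration of two of the V3 techniques: shared-memory tiling and pinned host memory combined with a CUDA stream. The tile size, kernel, and function names are assumptions and may differ from the actual V3 code.

```cuda
#include <cstring>
#include <cuda_runtime.h>

#define TILE 16

// Tiled matrix multiply: each block stages TILE x TILE sub-tiles of A and B in
// shared memory, so each global-memory element is read once per tile instead
// of once per multiply-add.
__global__ void matmul_tiled(const float *A, const float *B, float *C,
                             int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N)
        C[row * N + col] = sum;
}

// Pinned (page-locked) host memory plus an asynchronous copy on a stream, so
// the transfer can overlap with work queued on other streams.
void copy_batch_async(const float *h_batch, float *d_batch, size_t bytes) {
    float *h_pinned;
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMallocHost((void **)&h_pinned, bytes);        // page-locked allocation
    std::memcpy(h_pinned, h_batch, bytes);
    cudaMemcpyAsync(d_batch, h_pinned, bytes,
                    cudaMemcpyHostToDevice, stream);  // async H2D copy
    // ... launch kernels on `stream` here ...
    cudaStreamSynchronize(stream);
    cudaFreeHost(h_pinned);
    cudaStreamDestroy(stream);
}
```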
 
V4: Optimized GPU implementation using Tensor Cores
- Utilizes Tensor Cores for matrix multiplications.
 - Further speedup through specialized CUDA libraries (see the Tensor Core sketch below).
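
This README does not say whether V4 drives the Tensor Cores through a library such as cuBLAS or through the WMMA API directly. As one possible illustration, the fragment below multiplies a single 16x16 half-precision tile on the Tensor Cores via CUDA's WMMA API (requires compute capability 7.0+ and building with `-arch=sm_70` or newer).

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp multiplies one 16x16x16 tile on the Tensor Cores:
// C (float) = A (half) * B (half). Illustrative only; a full GEMM would tile
// over the whole matrices with many warps.
__global__ void wmma_tile_gemm(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);   // leading dimension = 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}

// Launch with a single warp, e.g.: wmma_tile_gemm<<<1, 32>>>(d_A, d_B, d_C);
```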
 
V5: OpenACC implementation
- Directive-based parallelism.
 - Quick porting and hardware abstraction (see the sketch below).
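
For contrast with the hand-written CUDA versions, the sketch below shows what a directive-based dense-layer forward pass might look like in V5. The function name and data clauses are assumptions, not the project's actual code; it compiles with an OpenACC compiler such as `nvc -acc`.

```c
// Hypothetical OpenACC dense-layer forward pass: y = W * x + b.
// The pragmas describe the parallelism and data movement; the compiler
// generates the GPU code.
void dense_forward(const float *W, const float *x, const float *b,
                   float *y, int out_dim, int in_dim) {
    #pragma acc parallel loop \
        copyin(W[0:out_dim*in_dim], x[0:in_dim], b[0:out_dim]) \
        copyout(y[0:out_dim])
    for (int i = 0; i < out_dim; ++i) {
        float sum = b[i];
        #pragma acc loop reduction(+:sum)
        for (int j = 0; j < in_dim; ++j)
            sum += W[i * in_dim + j] * x[j];
        y[i] = sum;
    }
}
```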
 
- Umer Farooq
 - Muhammad Irtaza Khan