We find that many types of computing resources (such as CUDA GPUs and FPGAs) suffer from a parallel-waiting problem, which hurts deep learning inference applications that are both computationally intensive and delay-sensitive. One way to address this is to intercept API calls at the hardware driver layer, as GPU virtualization does, but that greatly reduces generality and over-couples the system. We therefore start from the model instead: we build a generic allocator and mask the driver-layer scheduling, aiming to obtain better service latency for each request.
We test our program with:
- GTX-2080Ti with 10-core GPU
- gcc/g++ v8.4.0
- grpc at github commit 8f6ae3599f247c3e0de604b5321538b99f3d68a3
- protobuf 3.22.2 (installed from the grpc source code)
- onnxruntime-gpu v1.12.1
- C++ compiler param support:
  - `-std=c++17 -lstdc++fs -lonnxruntime -lprotobuf -lpthread`
  - add `-DPARALLER_MODE` if you only want to mask (disable) our allocator mechanism.
- nlohmann::json library installed.
You can build multiple versions by passing a compiler flag; `DLIR_MODE` (default), `BNST_MODE`, `FIFO_MODE`, `OYST_MODE`, and `PARALLER_MODE` are available.
- `DLIR_MODE`: DLIR mode; auto-split and sort are enabled.
- `BNST_MODE`: similar to `DLIR_MODE`, but splitting is not allowed.
- `OYST_MODE`: similar to `DLIR_MODE`, but splitting is forced.
- `PARALLER_MODE`: run all kinds of tasks in multiple processes.
- `FIFO_MODE`: run tasks in FIFO order.
- Compile (`DLIR_MODE` as an example):

```shell
git clone [email protected]:EdgeScheduler/DLIR-Allocator.git
cd DLIR-Allocator
mkdir -p build && cd build
cmake ../ -DCOMPILE_MODE="DLIR_MODE"
make
# you can get the binary in DLIR-Allocator/bin/release/DLIR-Allocator
```

- Also, you can compile all versions by running the script directly:
```shell
git clone [email protected]:EdgeScheduler/DLIR-Allocator.git
cd DLIR-Allocator
./scripts/build.sh
# you can get all binaries in DLIR-Allocator/bin/release/*-Allocator
```

- Operating System
  - Linux (tested on Ubuntu)
- Hardware Support
  - CUDA-GPU (tested on GTX 2080Ti and Tesla T4)
- to-do
- Mali-GPU
- FPGA
- DSP
Relationship with OnnxSplitRunner
To eliminate the negative effects of Python's pseudo-multi-threading caused by the GIL, we eventually decided to refactor the code in C++. The original Python project can still be found at: https://github.com/EdgeScheduler/OnnxSplitRunner