A comprehensive platform for comparing performance across various inference servers including Triton Inference Server, TorchServe, and PyTorch Direct with a modern UI and advanced analytics capabilities.
- Real-time comparison of three distinct inference platforms
- Visual performance metric analysis (latency, throughput, resource utilization)
- Modern UI inspired by Apple and Zoox design principles
- Effortless deployment via Docker containerization
- Support for various model architectures (ResNet50, BERT, YOLOv5)
- Comprehensive benchmarking capabilities with load testing
- Dynamic dark/light mode with system preference detection
- Responsive design for mobile and desktop interfaces
- Detailed profiling of inference operations
```mermaid
flowchart TD
    A[User Browser] -->|HTTP/WebSocket| B[Flask UI + API]

    subgraph "Application Layer"
        B -->|Metrics Collection| C[Prometheus/Grafana]
        B -->|Performance Logging| D[Elasticsearch]
    end

    subgraph "Inference Layer"
        B -->|REST API| E[Triton Inference Server]
        B -->|gRPC| E
        B -->|HTTP REST| F[TorchServe]
        B -->|Direct Function Call| G[PyTorch Model]
    end

    subgraph "Model Storage"
        H[(Model Repository)] --> E
        I[(Model Store)] --> F
        J[(Local Models)] --> G
    end

    subgraph "Hardware Layer"
        E --> K[NVIDIA GPU]
        F --> K
        G --> K
        E --> L[CPU Fallback]
        F --> L
        G --> L
    end

    C -->|Visualization| B
    D -->|Analysis| B
    E -->|Results + Metrics| B
    F -->|Results + Metrics| B
    G -->|Results + Metrics| B

    classDef primary fill:#4b6bfb,stroke:#3355cc,color:white;
    classDef secondary fill:#28a745,stroke:#22863a,color:white;
    classDef hardware fill:#d73a49,stroke:#cb2431,color:white;
    classDef storage fill:#6f42c1,stroke:#5a32a3,color:white;
    class B primary;
    class E,F,G secondary;
    class K,L hardware;
    class H,I,J storage;
```
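To make the request paths in the diagram concrete, the sketch below shows how the application layer might call each backend. It is a minimal illustration, not the project's actual dispatch code: the model name, the Triton tensor names (`input__0`/`output__0`), and the input format are assumptions that depend on each model's configuration.

```python
import numpy as np
import requests
import torch
import tritonclient.http as triton_http


def infer_triton(batch: np.ndarray, model_name: str = "resnet50") -> np.ndarray:
    # Triton path: HTTP client from the official tritonclient package.
    # Tensor names depend on the model's config.pbtxt (assumed here).
    client = triton_http.InferenceServerClient(url="localhost:8000")
    inp = triton_http.InferInput("input__0", list(batch.shape), "FP32")
    inp.set_data_from_numpy(batch)
    result = client.infer(model_name, inputs=[inp])
    return result.as_numpy("output__0")


def infer_torchserve(payload: bytes, model_name: str = "resnet50") -> dict:
    # TorchServe path: standard inference endpoint /predictions/<model>.
    resp = requests.post(f"http://localhost:8080/predictions/{model_name}",
                         data=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()


def infer_pytorch(model: torch.nn.Module, batch: torch.Tensor) -> torch.Tensor:
    # Direct PyTorch path: plain eager-mode forward pass.
    model.eval()
    with torch.no_grad():
        return model(batch)
```

A real dispatcher would also time each call (for example with `time.perf_counter()`) to feed the latency metrics shown in the dashboard.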
- Docker 20.10+ and Docker Compose 2.0+
- Python 3.9+ with pip and venv
- NVIDIA GPU with CUDA 11.4+ (optional - for GPU acceleration)
- 8GB free disk space for models and containers
- 16GB RAM recommended for full stack deployment
- NVIDIA Container Toolkit (for GPU passthrough)
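If you plan to use GPU acceleration, a quick way to confirm that PyTorch can see the GPU before starting any of the servers is:

```python
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("CUDA runtime built into PyTorch:", torch.version.cuda)
```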
After downloading and extracting the code, you can use Docker Compose to set up the entire environment:
```bash
# Clone or download the code
git clone https://github.com/awrsha/inference-comparison.git
cd inference-comparison

# Set up with Docker
docker-compose up -d

# Check service status
docker-compose ps

# View logs
docker-compose logs -f
```
After setup, you can access the application at: http://localhost:5000
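You can also confirm the stack is up programmatically through the platform's status endpoint (documented in the API section below); the exact response body depends on the implementation, so treat the printed JSON as informational.

```python
import requests

resp = requests.get("http://localhost:5000/api/v1/status", timeout=5)
resp.raise_for_status()
print(resp.json())  # response schema is implementation-defined
```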
If you don't want to use Docker, you'll need to install and configure each service separately.

For Triton Inference Server:

Option 1: Follow the official Triton Inference Server guide

Option 2: Use the Docker version:

```bash
docker run --gpus=all -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
  tritonserver --model-repository=/models
```
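Whichever option you choose, you can verify that Triton is live and has loaded your models with the official `tritonclient` package (`pip install tritonclient[http]`); the sketch below assumes the default HTTP port 8000.

```python
import tritonclient.http as triton_http

client = triton_http.InferenceServerClient(url="localhost:8000")
print("live:", client.is_server_live())
print("ready:", client.is_server_ready())
# List the models Triton found in the repository
for model in client.get_model_repository_index():
    print(model)
```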
For TorchServe:

Option 1: Follow the official TorchServe guide

Option 2: Use the Docker version:

```bash
docker run -p 8080:8080 -p 8081:8081 -p 8082:8082 \
  -v ${PWD}/model_store:/home/model-server/model-store pytorch/torchserve
```
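TorchServe exposes a health-check endpoint on the inference port and a model listing on the management port, so a quick readiness check looks like this:

```python
import requests

# Inference port: health check
print(requests.get("http://localhost:8080/ping", timeout=5).json())    # {"status": "Healthy"}
# Management port: registered models
print(requests.get("http://localhost:8081/models", timeout=5).json())
```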
To set up the Flask server, follow these steps:
```bash
pip install -r requirements.txt
python scripts/download_models.py
python run.py
```

Now you can go to http://localhost:5000 and use the service.
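As a quick programmatic sanity check, you can list the models the platform exposes through its own API (see the API section below); the response schema depends on the implementation.

```python
import requests

models = requests.get("http://localhost:5000/api/v1/models", timeout=5).json()
print(models)
```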
1. Clone the repository:

   ```bash
   git clone https://github.com/awrsha/inference-comparison.git
   cd inference-comparison
   ```

2. Start with Docker Compose:

   ```bash
   docker-compose up -d
   ```

3. Check service status:

   ```bash
   docker-compose ps
   ```

   The application will be available at: http://localhost:5000

4. Stop services:

   ```bash
   docker-compose down
   ```
1. Install prerequisites:

   ```bash
   pip install -r requirements.txt
   ```

2. Download models:

   ```bash
   python scripts/download_models.py
   ```

3. Start Triton Inference Server (in a separate terminal):

   Option 1: Local execution

   ```bash
   tritonserver --model-repository=./model_repository
   ```

   Option 2: Using Docker

   ```bash
   docker run --gpus=all -p 8000:8000 -p 8001:8001 -p 8002:8002 \
     -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
     tritonserver --model-repository=/models
   ```

4. Start TorchServe (in a separate terminal):

   Option 1: Local execution

   ```bash
   torchserve --start --model-store ./model_store --ncs
   ```

   Option 2: Using Docker

   ```bash
   docker run -p 8080:8080 -p 8081:8081 -p 8082:8082 \
     -v ${PWD}/model_store:/home/model-server/model-store pytorch/torchserve
   ```

5. Start the Flask server:

   ```bash
   python run.py
   ```

   The application will be available at: http://localhost:5000
| Metric | Triton | TorchServe | PyTorch |
|---|---|---|---|
| Average Latency (ms) | 23.5 | 29.2 | 35.8 |
| Throughput (inferences/sec) | 42.5 | 34.2 | 27.9 |
| Memory Usage (MB) | 1240 | 1520 | 1850 |
| GPU Utilization (%) | 72 | 78 | 85 |
| P95 Latency (ms) | 28.7 | 35.1 | 42.3 |
| P99 Latency (ms) | 32.1 | 38.6 | 48.9 |
| Startup Time (s) | 4.2 | 6.8 | 1.2 |
| Model Loading Time (s) | 2.1 | 3.5 | 0.8 |
| Batch Size | Triton (ms) | TorchServe (ms) | PyTorch (ms) |
|---|---|---|---|
| 1 | 22 | 28 | 33 |
| 4 | 45 | 58 | 67 |
| 8 | 68 | 92 | 118 |
| 16 | 120 | 165 | 195 |
| 32 | 205 | 301 | 380 |
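Batching amortizes fixed per-request overhead, which is why total latency grows more slowly than the batch size. The short script below reproduces the per-item latency implied by the table above (total batch latency divided by batch size):

```python
# Values taken directly from the batch-size latency table above (milliseconds).
batch_latency_ms = {
    "Triton":     {1: 22, 4: 45, 8: 68,  16: 120, 32: 205},
    "TorchServe": {1: 28, 4: 58, 8: 92,  16: 165, 32: 301},
    "PyTorch":    {1: 33, 4: 67, 8: 118, 16: 195, 32: 380},
}

for server, latencies in batch_latency_ms.items():
    per_item = {bs: round(ms / bs, 1) for bs, ms in latencies.items()}
    print(f"{server:11s} per-item latency (ms): {per_item}")
```

For example, Triton drops from 22 ms per image at batch size 1 to roughly 6.4 ms per image at batch size 32.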
- Real-time performance monitoring dashboard
- Interactive model selection and configuration
- Batch size adjustment with immediate feedback
- Side-by-side comparison of inference results
- Detailed resource utilization graphs
- Export results in CSV/JSON/PDF formats
- Custom test creation and scheduling
- Hardware utilization timeline
- Comparative analysis across different hardware
- Profile sharing and collaboration tools
The platform supports extensive configuration through environment variables or config files:
```yaml
# config.yml example
server:
  host: 0.0.0.0
  port: 5000
  debug: false
  workers: 4

triton:
  url: localhost:8000
  http_port: 8000
  grpc_port: 8001
  metrics_port: 8002
  protocol: grpc   # or http

torchserve:
  url: localhost:8080
  inference_port: 8080
  management_port: 8081
  metrics_port: 8082

pytorch:
  device: cuda      # or cpu
  precision: fp16   # or fp32

benchmarking:
  iterations: 1000
  warmup_iterations: 100
  concurrent_clients: 4
  report_interval: 10
```
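As a sketch of how this configuration could be consumed (assuming the application reads `config.yml` from the project root; the actual loading code may differ):

```python
import yaml

with open("config.yml") as f:
    cfg = yaml.safe_load(f)

print(cfg["triton"]["protocol"])          # grpc
print(cfg["benchmarking"]["iterations"])  # 1000
```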
- HTTPS support with self-signed or custom certificates
- Basic authentication for UI access
- Role-based access control
- API key authentication for programmatic access
- JWT token support for stateless authentication
- Connection encryption for all API communication
- Audit logging for all operations
- Secure model storage with encryption
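For example, a client using API key authentication might attach the key as a request header; the header name shown here (`X-API-Key`) is an assumption for illustration, so check the actual implementation or /api/docs for the scheme the platform uses.

```python
import requests

# Header name is an assumption, not a documented contract of this project.
headers = {"X-API-Key": "YOUR_API_KEY"}
resp = requests.get("http://localhost:5000/api/v1/metrics", headers=headers, timeout=5)
print(resp.status_code)
```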
- Run unit tests: `pytest tests/unit/`
- Run integration tests: `pytest tests/integration/`
- Generate a test coverage report: `pytest --cov=app tests/`
- Lint and format code: `flake8 app/` and `black app/`
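A minimal unit test might look like the sketch below; the `create_app` application factory and the endpoint it exercises are assumptions about the project layout, not verified names.

```python
# tests/unit/test_status.py (illustrative sketch)
import pytest

from app import create_app  # assumed Flask application factory


@pytest.fixture
def client():
    app = create_app()
    app.config["TESTING"] = True
    return app.test_client()


def test_status_endpoint_returns_ok(client):
    resp = client.get("/api/v1/status")
    assert resp.status_code == 200
```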
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
The platform exposes a RESTful API for programmatic access:
```
GET  /api/v1/models     - List available models
POST /api/v1/inference  - Run inference with specified server
GET  /api/v1/metrics    - Get performance metrics
POST /api/v1/benchmark  - Run benchmark test
GET  /api/v1/status     - Get server status
```

For detailed API documentation, visit /api/docs after starting the application.
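For example, an inference request from Python might look like the following; the payload field names are illustrative assumptions, so consult /api/docs for the actual request schema.

```python
import requests

# Field names (server, model, batch_size) are assumptions for illustration.
payload = {"server": "triton", "model": "resnet50", "batch_size": 4}
resp = requests.post("http://localhost:5000/api/v1/inference", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())
```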
Common issues and solutions:
- GPU not detected: Ensure NVIDIA drivers and Docker GPU support are properly configured
- Model loading fails: Check model format compatibility with server
- High latency: Reduce batch size or increase hardware resources
- Connection refused: Verify server ports are accessible and not blocked by firewall
- Out of memory: Reduce model precision or batch size
For detailed logs, check:
```bash
docker logs triton-server
docker logs torchserve
docker logs ai-inference-flask
```
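For the "GPU not detected" case, a quick diagnostic from the host (or inside the Flask container) is:

```python
import shutil
import subprocess

import torch

print("PyTorch sees CUDA:", torch.cuda.is_available())
if shutil.which("nvidia-smi"):
    # List GPUs visible to the NVIDIA driver
    print(subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout)
else:
    print("nvidia-smi not found on PATH")
```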
- Triton Inference Server Documentation
- TorchServe Documentation
- PyTorch Documentation
- NVIDIA GPU Container Toolkit
This project is licensed under the MIT License - see the LICENSE file for details.