🚀 AI Inference Server Comparison: Triton vs TorchServe vs PyTorch Direct

A comprehensive platform for comparing performance across inference servers, including Triton Inference Server, TorchServe, and PyTorch Direct, with a modern UI and advanced analytics capabilities.

🌟 Key Features

  • 🚦 Real-time comparison of three distinct inference platforms
  • 📊 Visual performance metric analysis (latency, throughput, resource utilization)
  • 🖼️ Modern UI inspired by Apple and Zoox design principles
  • 🐳 Effortless deployment via Docker containerization
  • 🔄 Support for various model architectures (ResNet50, BERT, YOLOv5)
  • 📈 Comprehensive benchmarking capabilities with load testing
  • 🌙 Dynamic dark/light mode with system preference detection
  • 📱 Responsive design for mobile and desktop interfaces
  • 🔬 Detailed profiling of inference operations

πŸ“ System Architecture

flowchart TD
    A[User Browser] -->|HTTP/WebSocket| B[Flask UI + API]
    
    subgraph "Application Layer"
    B -->|Metrics Collection| C[Prometheus/Grafana]
    B -->|Performance Logging| D[Elasticsearch]
    end
    
    subgraph "Inference Layer"
    B -->|REST API| E[Triton Inference Server]
    B -->|gRPC| E
    B -->|HTTP REST| F[TorchServe]
    B -->|Direct Function Call| G[PyTorch Model]
    end
    
    subgraph "Model Storage"
    H[(Model Repository)] --> E
    I[(Model Store)] --> F
    J[(Local Models)] --> G
    end
    
    subgraph "Hardware Layer"
    E --> K[NVIDIA GPU]
    F --> K
    G --> K
    E --> L[CPU Fallback]
    F --> L
    G --> L
    end
    
    C -->|Visualization| B
    D -->|Analysis| B
    
    E -->|Results + Metrics| B
    F -->|Results + Metrics| B
    G -->|Results + Metrics| B
    
    classDef primary fill:#4b6bfb,stroke:#3355cc,color:white;
    classDef secondary fill:#28a745,stroke:#22863a,color:white;
    classDef hardware fill:#d73a49,stroke:#cb2431,color:white;
    classDef storage fill:#6f42c1,stroke:#5a32a3,color:white;
    
    class B primary;
    class E,F,G secondary;
    class K,L hardware;
    class H,I,J storage;


🛠 Prerequisites

  • Docker 20.10+ and Docker Compose 2.0+
  • Python 3.9+ with pip and venv
  • NVIDIA GPU with CUDA 11.4+ (optional - for GPU acceleration)
  • 8GB free disk space for models and containers
  • 16GB RAM recommended for full stack deployment
  • NVIDIA Container Toolkit (for GPU passthrough)
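
A quick way to verify these prerequisites from a terminal (the nvidia-smi check only matters if you plan to use GPU acceleration):

# Check Docker and Docker Compose
docker --version
docker compose version

# Check Python
python3 --version

# Check the NVIDIA driver and visible GPUs (GPU setups only)
nvidia-smi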

User Guide and Setup Instructions

Setup with Docker (Recommended)

After downloading and extracting the code, you can use Docker Compose to set up the entire environment:

# Clone or download the code
git clone https://github.com/awrsha/inference-comparison.git
cd inference-comparison

# Set up with Docker
docker-compose up -d

# Check service status
docker-compose ps

# View logs
docker-compose logs -f

After setup, you can access the application at: http://localhost:5000
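
Since docker-compose up is run from the repository root, a docker-compose.yml is already provided there. If you need to adapt or recreate it, a minimal sketch of the three services could look like the following (service names, container names, and the Flask build context are assumptions; the ports match those used throughout this README):

# docker-compose.yml (sketch)
services:
  flask-ui:
    build: .
    container_name: ai-inference-flask
    ports:
      - "5000:5000"
    depends_on:
      - triton-server
      - torchserve

  triton-server:
    image: nvcr.io/nvidia/tritonserver:22.07-py3
    container_name: triton-server
    command: tritonserver --model-repository=/models
    volumes:
      - ./model_repository:/models
    ports:
      - "8000:8000"   # HTTP
      - "8001:8001"   # gRPC
      - "8002:8002"   # metrics

  torchserve:
    image: pytorch/torchserve
    container_name: torchserve
    volumes:
      - ./model_store:/home/model-server/model-store
    ports:
      - "8080:8080"   # inference
      - "8081:8081"   # management
      - "8082:8082"   # metrics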

Setup without Docker

If you don't want to use Docker, you'll need to install and configure each service separately.

1. Install Triton Inference Server

✅ Option 1: Follow the official Triton Inference Server guide
✅ Option 2: Use the Docker version:

docker run --gpus=all -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
  tritonserver --model-repository=/models
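
Triton expects the mounted model_repository to follow its standard layout: one directory per model containing a config.pbtxt and numbered version subdirectories. A rough sketch for the ResNet50 model used in this project (file names and the TorchScript platform are assumptions; adjust to whatever scripts/download_models.py actually produces):

model_repository/
└── resnet50/
    ├── config.pbtxt
    └── 1/
        └── model.pt        # TorchScript export (assumed)

# config.pbtxt (sketch)
name: "resnet50"
platform: "pytorch_libtorch"
max_batch_size: 32
input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]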

2. Install TorchServe

✅ Option 1: Follow the official TorchServe guide
✅ Option 2: Use the Docker version:

docker run -p 8080:8080 -p 8081:8081 -p 8082:8082 \
  -v ${PWD}/model_store:/home/model-server/model-store pytorch/torchserve
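
TorchServe serves .mar archives from the mounted model store. If the archives are not already produced by scripts/download_models.py, the standard torch-model-archiver workflow for a serialized ResNet50 looks roughly like this (archive and file names are assumptions):

pip install torch-model-archiver

# Package a TorchScript ResNet50 into model_store/resnet50.mar
torch-model-archiver --model-name resnet50 \
  --version 1.0 \
  --serialized-file resnet50.pt \
  --handler image_classifier \
  --export-path model_store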

3. Set Up Flask Server

To set up the Flask server, follow these steps:

pip install -r requirements.txt
python scripts/download_models.py
python run.py

✅ Now you can go to http://localhost:5000 and use the service.
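
If the comparison page cannot reach one of the backends, their built-in health endpoints are a quick sanity check:

# Triton readiness (returns HTTP 200 when models are loaded)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready

# TorchServe health (returns {"status": "Healthy"})
curl -s http://localhost:8080/ping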

Detailed Setup Instructions

Docker Setup (Recommended)

1. Clone the repository:

git clone https://github.com/awrsha/inference-comparison.git
cd inference-comparison

2. Start with Docker Compose:

docker-compose up -d

3. Check service status:

docker-compose ps

✅ The application will be available at: http://localhost:5000

4. Stop services:

docker-compose down

Manual Installation (Without Docker)

1. Install prerequisites:

pip install -r requirements.txt

2. Download models:

python scripts/download_models.py

3. Start Triton Inference Server (in a separate terminal):

✅ Option 1: Local execution

tritonserver --model-repository=./model_repository

✅ Option 2: Using Docker

docker run --gpus=all -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
  tritonserver --model-repository=/models

4. Start TorchServe (in a separate terminal):

✅ Option 1: Local execution

torchserve --start --model-store ./model_store --ncs

✅ Option 2: Using Docker

docker run -p 8080:8080 -p 8081:8081 -p 8082:8082 \
  -v ${PWD}/model_store:/home/model-server/model-store pytorch/torchserve
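
Whichever option you use, models placed in ./model_store can then be registered and scaled through TorchServe's management API on port 8081 (the resnet50.mar archive name is an assumption):

# Register the archive and start one worker
curl -X POST "http://localhost:8081/models?url=resnet50.mar&initial_workers=1&synchronous=true"

# List registered models
curl http://localhost:8081/models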

5. Start Flask server:

python run.py

✅ The application will be available at: http://localhost:5000
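
For reference, the PyTorch Direct backend in the comparison runs the model in-process rather than behind a server. A minimal sketch of that pattern, assuming ResNet50 and the device/precision defaults shown under Configuration Options (the real implementation lives in the app code):

import torch
import torchvision

# Load ResNet50 once at startup (the weights argument requires torchvision >= 0.13;
# older versions use pretrained=True instead)
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()
if device == "cuda":
    model = model.half()  # fp16, mirroring the pytorch.precision option

def infer(batch: torch.Tensor) -> torch.Tensor:
    # batch: [N, 3, 224, 224] float tensor of preprocessed images
    batch = batch.to(device)
    if device == "cuda":
        batch = batch.half()
    with torch.no_grad():
        return model(batch)

# Example call with one random 224x224 image
print(infer(torch.rand(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])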

📊 Performance Metrics

| Metric | Triton | TorchServe | PyTorch |
|---|---|---|---|
| Average Latency (ms) | 23.5 | 29.2 | 35.8 |
| Throughput (inferences/sec) | 42.5 | 34.2 | 27.9 |
| Memory Usage (MB) | 1240 | 1520 | 1850 |
| GPU Utilization (%) | 72 | 78 | 85 |
| P95 Latency (ms) | 28.7 | 35.1 | 42.3 |
| P99 Latency (ms) | 32.1 | 38.6 | 48.9 |
| Startup Time (s) | 4.2 | 6.8 | 1.2 |
| Model Loading Time (s) | 2.1 | 3.5 | 0.8 |

📈 Batch Size Performance Impact

| Batch Size | Triton (ms) | TorchServe (ms) | PyTorch (ms) |
|---|---|---|---|
| 1 | 22 | 28 | 33 |
| 4 | 45 | 58 | 67 |
| 8 | 68 | 92 | 118 |
| 16 | 120 | 165 | 195 |
| 32 | 205 | 301 | 380 |

🖥 UI Features

  • Real-time performance monitoring dashboard
  • Interactive model selection and configuration
  • Batch size adjustment with immediate feedback
  • Side-by-side comparison of inference results
  • Detailed resource utilization graphs
  • Export results in CSV/JSON/PDF formats
  • Custom test creation and scheduling
  • Hardware utilization timeline
  • Comparative analysis across different hardware
  • Profile sharing and collaboration tools

βš™οΈ Configuration Options

The platform supports extensive configuration through environment variables or config files:

# config.yml example
server:
  host: 0.0.0.0
  port: 5000
  debug: false
  workers: 4

triton:
  url: localhost:8000
  http_port: 8000
  grpc_port: 8001
  metrics_port: 8002
  protocol: grpc  # or http

torchserve:
  url: localhost:8080
  inference_port: 8080
  management_port: 8081
  metrics_port: 8082

pytorch:
  device: cuda  # or cpu
  precision: fp16  # or fp32

benchmarking:
  iterations: 1000
  warmup_iterations: 100
  concurrent_clients: 4
  report_interval: 10
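
How the application consumes this file is an internal detail; as a minimal sketch, loading config.yml with environment-variable overrides (the variable names here are assumptions) might look like:

import os
import yaml  # pip install pyyaml

# Load the YAML config, then let selected environment variables override it
with open("config.yml") as f:
    config = yaml.safe_load(f)

config["server"]["port"] = int(os.environ.get("SERVER_PORT", config["server"]["port"]))
config["triton"]["url"] = os.environ.get("TRITON_URL", config["triton"]["url"])
config["torchserve"]["url"] = os.environ.get("TORCHSERVE_URL", config["torchserve"]["url"])

print(config["triton"]["url"])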

🔒 Security Features

  • HTTPS support with self-signed or custom certificates
  • Basic authentication for UI access
  • Role-based access control
  • API key authentication for programmatic access (see the example after this list)
  • JWT token support for stateless authentication
  • Connection encryption for all API communication
  • Audit logging for all operations
  • Secure model storage with encryption
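
As an illustration of the API-key and JWT options above (the X-API-Key header name and the token issuance flow are assumptions; check the project's auth configuration for the exact scheme):

# API key access (header name is a hypothetical example)
curl -H "X-API-Key: <your-api-key>" http://localhost:5000/api/v1/models

# JWT bearer token access (token acquisition not shown)
curl -H "Authorization: Bearer <jwt-token>" http://localhost:5000/api/v1/metrics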

🧪 Testing and Development

  1. Run unit tests:
pytest tests/unit/
  2. Run integration tests:
pytest tests/integration/
  3. Generate a test coverage report:
pytest --cov=app tests/
  4. Lint and format code:
flake8 app/
black app/

🤝 Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 API Documentation

The platform exposes a RESTful API for programmatic access:

Endpoints

GET /api/v1/models - List available models
POST /api/v1/inference - Run inference with specified server
GET /api/v1/metrics - Get performance metrics
POST /api/v1/benchmark - Run benchmark test
GET /api/v1/status - Get server status

For detailed API documentation, visit /api/docs after starting the application.
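
For example, a simple inference request from the command line might look like this (the JSON field names are assumptions; consult /api/docs for the exact schema):

# List available models
curl http://localhost:5000/api/v1/models

# Run an inference against a chosen backend (field names are hypothetical)
curl -X POST http://localhost:5000/api/v1/inference \
  -H "Content-Type: application/json" \
  -d '{"server": "triton", "model": "resnet50", "batch_size": 1}'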

🔧 Troubleshooting

Common issues and solutions:

  • GPU not detected: Ensure NVIDIA drivers and Docker GPU support are properly configured (see the GPU check below)
  • Model loading fails: Check model format compatibility with server
  • High latency: Reduce batch size or increase hardware resources
  • Connection refused: Verify server ports are accessible and not blocked by firewall
  • Out of memory: Reduce model precision or batch size

For detailed logs, check:

docker logs triton-server
docker logs torchserve
docker logs ai-inference-flask
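
For the "GPU not detected" case specifically, first confirm that the container runtime can see the GPU at all:

# Host-level driver check
nvidia-smi

# Container-level check, reusing the Triton image referenced above
docker run --rm --gpus all nvcr.io/nvidia/tritonserver:22.07-py3 nvidia-smi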

📚 Additional Resources

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.
