
Commit

get rid of like 80% of the code
ef0xa committed Feb 10, 2025
1 parent 9b901b9 commit d1c92f2
Showing 10 changed files with 394 additions and 487 deletions.
69 changes: 31 additions & 38 deletions Dockerfile
@@ -1,49 +1,42 @@
FROM nvidia/cuda:12.1.0-base-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update -y \
&& apt-get dist-upgrade -y \
&& apt-get install -y python3-pip

RUN ldconfig /usr/local/cuda-12.1/compat/

# Install Python dependencies
COPY builder/requirements.txt /requirements.txt
# install sglang's dependencies

# EFRON:
# these dependencies are unbelievably huge - >80GiB. They took well over ten minutes to install on my machine and used 28GiB(!) of RAM.
# we should consider having a base image with them pre-installed or seeing if we can knock it down a little bit.
RUN python3 -m pip install "sglang[all]"
RUN python3 -m pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3


# install _our_ dependencies
RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install --upgrade pip && \
python3 -m pip install --upgrade -r /requirements.txt

RUN python3 -m pip install "sglang[all]" && \
python3 -m pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3

# Setup for Option 2: Building the Image with the Model included
ARG MODEL_NAME=""
ARG TOKENIZER_NAME=""
ARG BASE_PATH="/runpod-volume"
ARG QUANTIZATION=""
ARG MODEL_REVISION=""
ARG TOKENIZER_REVISION=""

ENV MODEL_NAME=$MODEL_NAME \
MODEL_REVISION=$MODEL_REVISION \
TOKENIZER_NAME=$TOKENIZER_NAME \
TOKENIZER_REVISION=$TOKENIZER_REVISION \
BASE_PATH=$BASE_PATH \
QUANTIZATION=$QUANTIZATION \
HF_DATASETS_CACHE="${BASE_PATH}/huggingface-cache/datasets" \
HUGGINGFACE_HUB_CACHE="${BASE_PATH}/huggingface-cache/hub" \
HF_HOME="${BASE_PATH}/huggingface-cache/hub" \
HF_HUB_ENABLE_HF_TRANSFER=1

ENV PYTHONPATH="/:/vllm-workspace"


COPY src /src
RUN --mount=type=secret,id=HF_TOKEN,required=false \
if [ -f /run/secrets/HF_TOKEN ]; then \
export HF_TOKEN=$(cat /run/secrets/HF_TOKEN); \
fi && \
if [ -n "$MODEL_NAME" ]; then \
python3 /src/download_model.py; \
fi

# Start the handler
CMD ["python3", "/src/handler.py"]
RUN mkdir app
COPY requirements.txt ./app/requirements.txt

# EFRON: no idea what this is doing; leaving it in, in case it's important
ENV BASE_PATH=$BASE_PATH
ENV HF_DATASETS_CACHE="${BASE_PATH}/huggingface-cache/datasets"
ENV HF_HOME="${BASE_PATH}/huggingface-cache/hub"
ENV HF_HUB_ENABLE_HF_TRANSFER=1
ENV HUGGINGFACE_HUB_CACHE="${BASE_PATH}/huggingface-cache/hub"
ENV MODEL_NAME=$MODEL_NAME
ENV MODEL_REVISION=$MODEL_REVISION
ENV QUANTIZATION=$QUANTIZATION
ENV TOKENIZER_NAME=$TOKENIZER_NAME
ENV TOKENIZER_REVISION=$TOKENIZER_REVISION

# not sure why this is here: is a vllm-workspace even in our image?
ENV PYTHONPATH="/:/vllm-workspace"
COPY ./src/handler.py ./app/handler.py
# actually run the handler
CMD ["python3", "./app/handler.py"]
138 changes: 70 additions & 68 deletions README.md
@@ -3,25 +3,24 @@
<h1>SGLang Worker</h1>

🚀 | SGLang is a fast serving framework for large language models and vision language models.

</div>

## RunPod Worker Images

Below is a summary of the available RunPod Worker images, categorized by image stability.

| Stable Image Tag | Development Image Tag |
| ----------------------------------- | -------------------------------- |
| `runpod/worker-sglang:v0.4.1stable` | `runpod/worker-sglang:v0.4.1dev` |

## 📖 | Getting Started

1. Clone this repository.
2. Build a docker image - `docker build -t <your_username>/worker-sglang:v1 .`
3. `docker push <your_username>/worker-sglang:v1`


**_Once you have built the Docker image and deployed the endpoint, you can use the code below to interact with the endpoint_**:

```
import runpod
@@ -38,10 +37,11 @@
run_request = endpoint.run({"your_model_input_key": "your_model_input_value"})
print(run_request.status())
# Get the output of the endpoint run request, blocking until the run is complete
print(run_request.output())
```

### OpenAI compatible API

```python
from openai import OpenAI
import os
@@ -54,34 +54,35 @@
client = OpenAI(
```
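
The client's constructor arguments are collapsed in the diff above. As a purely illustrative sketch (the base-URL pattern and variable names here are assumptions, not taken from this repository), the client is typically pointed at the endpoint's OpenAI-compatible route:

```python
from openai import OpenAI
import os

# Hypothetical setup: replace <ENDPOINT_ID> with your RunPod endpoint ID,
# and export RUNPOD_API_KEY in your environment (both names are illustrative).
client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
)
```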

`Chat Completions (Non-Streaming)`

```python
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3-8B-Instruct",
messages=[{"role": "user", "content": "Give two lines on Planet Earth?"}],
temperature=0,
max_tokens=100,

)
print(f"Response: {response}")
```

`Chat Completions (Streaming)`

```python
response_stream = client.chat.completions.create(
model="meta-llama/Meta-Llama-3-8B-Instruct",
messages=[{"role": "user", "content": "Give two lines on Planet Earth?"}],
temperature=0,
max_tokens=100,
stream=True

)
for response in response_stream:
print(response.choices[0].delta.content or "", end="", flush=True)
```



## SGLang Server Configuration

When launching an endpoint, you can configure the SGLang server using environment variables. These variables allow you to customize various aspects of the server's behavior without modifying the code.

### How to Use
@@ -91,63 +92,64 @@
The SGLang server will read these variables at startup and configure itself accordingly.
If a variable is not set, the server will use its default value.
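
As a rough illustration (a minimal sketch, not the worker's actual code: the helper name and the exact server flag spellings are assumptions), a handler might translate a few of the documented variables into SGLang server arguments like this, falling back to the defaults listed in the table below:

```python
import os

def build_server_args() -> list[str]:
    """Map selected environment variables to CLI-style server arguments."""
    args = [
        "--model-path", os.getenv("MODEL_PATH", "meta-llama/Meta-Llama-3-8B-Instruct"),
        "--host", os.getenv("HOST", "0.0.0.0"),
        "--port", os.getenv("PORT", "30000"),
    ]
    # Optional settings are forwarded only when explicitly set.
    if os.getenv("CONTEXT_LENGTH"):
        args += ["--context-length", os.environ["CONTEXT_LENGTH"]]
    if os.getenv("QUANTIZATION"):
        args += ["--quantization", os.environ["QUANTIZATION"]]
    return args
```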

### Available Environment Variables

The following table lists all available environment variables for configuring the SGLang server:

| Environment Variable | Description | Default | Options |
| --------------------------- | ---------------------------------------- | ------------------------------------- | ----------------------------------------------------------------------------------------- |
| `MODEL_PATH` | Path of the model weights | "meta-llama/Meta-Llama-3-8B-Instruct" | Local folder or Hugging Face repo ID |
| `HOST` | Host of the server | "0.0.0.0" | |
| `PORT` | Port of the server | 30000 | |
| `TOKENIZER_PATH` | Path of the tokenizer | | |
| `ADDITIONAL_PORTS` | Additional ports for the server | | |
| `TOKENIZER_MODE` | Tokenizer mode | "auto" | "auto", "slow" |
| `LOAD_FORMAT` | Format of model weights to load | "auto" | "auto", "pt", "safetensors", "npcache", "dummy" |
| `DTYPE` | Data type for weights and activations | "auto" | "auto", "half", "float16", "bfloat16", "float", "float32" |
| `CONTEXT_LENGTH` | Model's maximum context length | | |
| `QUANTIZATION` | Quantization method | | "awq", "fp8", "gptq", "marlin", "gptq_marlin", "awq_marlin", "squeezellm", "bitsandbytes" |
| `SERVED_MODEL_NAME` | Override model name in API | | |
| `CHAT_TEMPLATE` | Chat template name or path | | |
| `MEM_FRACTION_STATIC` | Fraction of memory for static allocation | | |
| `MAX_RUNNING_REQUESTS` | Maximum number of running requests | | |
| `MAX_NUM_REQS` | Maximum requests in memory pool | | |
| `MAX_TOTAL_TOKENS` | Maximum tokens in memory pool | | |
| `CHUNKED_PREFILL_SIZE` | Max tokens in chunk for chunked prefill | | |
| `MAX_PREFILL_TOKENS` | Max tokens in prefill batch | | |
| `SCHEDULE_POLICY` | Request scheduling policy | | "lpm", "random", "fcfs", "dfs-weight" |
| `SCHEDULE_CONSERVATIVENESS` | Conservativeness of schedule policy | | |
| `TENSOR_PARALLEL_SIZE` | Tensor parallelism size | | |
| `STREAM_INTERVAL` | Streaming interval in token length | | |
| `RANDOM_SEED` | Random seed | | |
| `LOG_LEVEL` | Logging level for all loggers | | |
| `LOG_LEVEL_HTTP` | Logging level for HTTP server | | |
| `API_KEY` | API key for the server | | |
| `FILE_STORAGE_PTH` | Path of file storage in backend | | |
| `DATA_PARALLEL_SIZE` | Data parallelism size | | |
| `LOAD_BALANCE_METHOD` | Load balancing strategy | | "round_robin", "shortest_queue" |
| `NCCL_INIT_ADDR` | NCCL init address for multi-node | | |
| `NNODES` | Number of nodes | | |
| `NODE_RANK` | Node rank | | |

**Boolean Flags** (set to "true", "1", or "yes" to enable; a parsing sketch follows the table):

| Flag | Description |
| ----------------------------- | ----------------------------------------- |
| `SKIP_TOKENIZER_INIT` | Skip tokenizer init |
| `TRUST_REMOTE_CODE` | Allow custom models from Hub |
| `LOG_REQUESTS` | Log inputs and outputs of requests |
| `SHOW_TIME_COST` | Show time cost of custom marks |
| `DISABLE_FLASHINFER` | Disable flashinfer attention kernels |
| `DISABLE_FLASHINFER_SAMPLING` | Disable flashinfer sampling kernels |
| `DISABLE_RADIX_CACHE` | Disable RadixAttention for prefix caching |
| `DISABLE_REGEX_JUMP_FORWARD` | Disable regex jump-forward |
| `DISABLE_CUDA_GRAPH` | Disable cuda graph |
| `DISABLE_DISK_CACHE` | Disable disk cache |
| `ENABLE_TORCH_COMPILE` | Optimize model with torch.compile |
| `ENABLE_P2P_CHECK` | Enable P2P check for GPU access |
| `ENABLE_MLA` | Enable Multi-head Latent Attention |
| `ATTENTION_REDUCE_IN_FP32` | Cast attention results to fp32 |
| `EFFICIENT_WEIGHT_LOAD` | Enable memory efficient weight loading |
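
A minimal parsing sketch for these flags, assuming case-insensitive matching on the three accepted values (`env_flag` is a hypothetical helper, not part of the worker's code):

```python
import os

def env_flag(name: str) -> bool:
    """True when the variable is set to 'true', '1', or 'yes' (case-insensitive)."""
    return os.getenv(name, "").strip().lower() in ("true", "1", "yes")

# Example: forward --disable-cuda-graph only when the flag is enabled
# (the spelling of the server argument is an assumption).
extra_args = ["--disable-cuda-graph"] if env_flag("DISABLE_CUDA_GRAPH") else []
```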

## 💡 | Note:

This worker is in an early preview phase of development.
23 changes: 0 additions & 23 deletions builder/setup.sh

This file was deleted.

32 changes: 0 additions & 32 deletions docker-bake.hcl

This file was deleted.

2 changes: 1 addition & 1 deletion builder/requirements.txt → requirements.txt
@@ -1,7 +1,7 @@
ray
pandas
pyarrow
runpod~=1.7.0
runpod>=1.7.7
huggingface-hub
packaging
typing-extensions==4.7.1