
Commit

get rid of like 80% of the code
ef0xa committed Feb 10, 2025
1 parent 9b901b9 commit d1c92f2
Showing 10 changed files with 394 additions and 487 deletions.
69 changes: 31 additions & 38 deletions Dockerfile
@@ -1,49 +1,42 @@
FROM nvidia/cuda:12.1.0-base-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update -y \
&& apt-get dist-upgrade -y \
&& apt-get install -y python3-pip

RUN ldconfig /usr/local/cuda-12.1/compat/

# Install Python dependencies
COPY builder/requirements.txt /requirements.txt
# install sglang's dependencies

# EFRON:
# these dependencies are unbelievably huge - >80GiB. They took well over ten minutes to install on my machine and used 28GiB(!) of RAM.
# we should consider having a base image with them pre-installed or seeing if we can knock it down a little bit.
RUN python3 -m pip install "sglang[all]"
RUN python3 -m pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3


# install _our_ dependencies
RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install --upgrade pip && \
python3 -m pip install --upgrade -r /requirements.txt

RUN python3 -m pip install "sglang[all]" && \
python3 -m pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3

# Setup for Option 2: Building the Image with the Model included
ARG MODEL_NAME=""
ARG TOKENIZER_NAME=""
ARG BASE_PATH="/runpod-volume"
ARG QUANTIZATION=""
ARG MODEL_REVISION=""
ARG TOKENIZER_REVISION=""

ENV MODEL_NAME=$MODEL_NAME \
MODEL_REVISION=$MODEL_REVISION \
TOKENIZER_NAME=$TOKENIZER_NAME \
TOKENIZER_REVISION=$TOKENIZER_REVISION \
BASE_PATH=$BASE_PATH \
QUANTIZATION=$QUANTIZATION \
HF_DATASETS_CACHE="${BASE_PATH}/huggingface-cache/datasets" \
HUGGINGFACE_HUB_CACHE="${BASE_PATH}/huggingface-cache/hub" \
HF_HOME="${BASE_PATH}/huggingface-cache/hub" \
HF_HUB_ENABLE_HF_TRANSFER=1

ENV PYTHONPATH="/:/vllm-workspace"


COPY src /src
RUN --mount=type=secret,id=HF_TOKEN,required=false \
if [ -f /run/secrets/HF_TOKEN ]; then \
export HF_TOKEN=$(cat /run/secrets/HF_TOKEN); \
fi && \
if [ -n "$MODEL_NAME" ]; then \
python3 /src/download_model.py; \
fi

# Start the handler
CMD ["python3", "/src/handler.py"]
RUN mkdir app
COPY requirements.txt ./app/requirements.txt

# EFRON: no idea what this is doing; leaving it in, in case it's important
ENV BASE_PATH=$BASE_PATH
ENV HF_DATASETS_CACHE="${BASE_PATH}/huggingface-cache/datasets"
ENV HF_HOME="${BASE_PATH}/huggingface-cache/hub"
ENV HF_HUB_ENABLE_HF_TRANSFER=1
ENV HUGGINGFACE_HUB_CACHE="${BASE_PATH}/huggingface-cache/hub"
ENV MODEL_NAME=$MODEL_NAME
ENV MODEL_REVISION=$MODEL_REVISION
ENV QUANTIZATION=$QUANTIZATION
ENV TOKENIZER_NAME=$TOKENIZER_NAME
ENV TOKENIZER_REVISION=$TOKENIZER_REVISION

# not sure why this is here: is a vllm-workspace even in our image?
ENV PYTHONPATH="/:/vllm-workspace"
COPY ./src/handler.py ./app/handler.py
# actually run the handler
CMD ["python3", "./app/handler.py"]
138 changes: 70 additions & 68 deletions README.md
@@ -3,25 +3,24 @@
<h1>SGLang Worker</h1>

🚀 | SGLang is a fast serving framework for large language models and vision language models.

</div>

## RunPod Worker Images

Below is a summary of the available RunPod Worker images, categorized by image stability.

| Stable Image Tag | Development Image Tag |
| ----------------------------------- | -------------------------------- |
| `runpod/worker-sglang:v0.4.1stable` | `runpod/worker-sglang:v0.4.1dev` |

## 📖 | Getting Started

1. Clone this repository.
2. Build a docker image - `docker build -t <your_username>/worker-sglang:v1 .`
3. `docker push <your_username>/worker-sglang:v1`


**_Once you have built the Docker image and deployed the endpoint, you can use the code below to interact with the endpoint_**:

```
import runpod
@@ -38,10 +37,11 @@
run_request = endpoint.run({"your_model_input_key": "your_model_input_value"})
print(run_request.status())
# Get the output of the endpoint run request, blocking until the run is complete
print(run_request.output())
```

### OpenAI compatible API

```python
from openai import OpenAI
import os
@@ -54,34 +54,35 @@
client = OpenAI(
```
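
The client's constructor arguments are collapsed in the diff above. As a purely illustrative sketch (the base-URL pattern and variable names here are assumptions, not taken from this repository), the client is typically pointed at the endpoint's OpenAI-compatible route:

```python
from openai import OpenAI
import os

# Hypothetical setup: replace <ENDPOINT_ID> with your RunPod endpoint ID,
# and export RUNPOD_API_KEY in your environment (both names are illustrative).
client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
)
```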

`Chat Completions (Non-Streaming)`

```python
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3-8B-Instruct",
messages=[{"role": "user", "content": "Give two lines on Planet Earth?"}],
temperature=0,
max_tokens=100,

)
print(f"Response: {response}")
```

`Chat Completions (Streaming)`

```python
response_stream = client.chat.completions.create(
model="meta-llama/Meta-Llama-3-8B-Instruct",
messages=[{"role": "user", "content": "Give two lines on Planet Earth?"}],
temperature=0,
max_tokens=100,
stream=True

)
for response in response_stream:
print(response.choices[0].delta.content or "", end="", flush=True)
```



## SGLang Server Configuration

When launching an endpoint, you can configure the SGLang server using environment variables. These variables allow you to customize various aspects of the server's behavior without modifying the code.

### How to Use
@@ -91,63 +92,64 @@
The SGLang server will read these variables at startup and configure itself accordingly.
If a variable is not set, the server will use its default value.
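
As a rough illustration (a minimal sketch, not the worker's actual code: the helper name and the exact server flag spellings are assumptions), a handler might translate a few of the documented variables into SGLang server arguments like this, falling back to the defaults listed in the table below:

```python
import os

def build_server_args() -> list[str]:
    """Map selected environment variables to CLI-style server arguments."""
    args = [
        "--model-path", os.getenv("MODEL_PATH", "meta-llama/Meta-Llama-3-8B-Instruct"),
        "--host", os.getenv("HOST", "0.0.0.0"),
        "--port", os.getenv("PORT", "30000"),
    ]
    # Optional settings are forwarded only when explicitly set.
    if os.getenv("CONTEXT_LENGTH"):
        args += ["--context-length", os.environ["CONTEXT_LENGTH"]]
    if os.getenv("QUANTIZATION"):
        args += ["--quantization", os.environ["QUANTIZATION"]]
    return args
```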

### Available Environment Variables

The following table lists all available environment variables for configuring the SGLang server:

| Environment Variable | Description | Default | Options |
| --------------------------- | ---------------------------------------- | ------------------------------------- | ----------------------------------------------------------------------------------------- |
| `MODEL_PATH` | Path of the model weights | "meta-llama/Meta-Llama-3-8B-Instruct" | Local folder or Hugging Face repo ID |
| `HOST` | Host of the server | "0.0.0.0" | |
| `PORT` | Port of the server | 30000 | |
| `TOKENIZER_PATH` | Path of the tokenizer | | |
| `ADDITIONAL_PORTS` | Additional ports for the server | | |
| `TOKENIZER_MODE` | Tokenizer mode | "auto" | "auto", "slow" |
| `LOAD_FORMAT` | Format of model weights to load | "auto" | "auto", "pt", "safetensors", "npcache", "dummy" |
| `DTYPE` | Data type for weights and activations | "auto" | "auto", "half", "float16", "bfloat16", "float", "float32" |
| `CONTEXT_LENGTH` | Model's maximum context length | | |
| `QUANTIZATION` | Quantization method | | "awq", "fp8", "gptq", "marlin", "gptq_marlin", "awq_marlin", "squeezellm", "bitsandbytes" |
| `SERVED_MODEL_NAME` | Override model name in API | | |
| `CHAT_TEMPLATE` | Chat template name or path | | |
| `MEM_FRACTION_STATIC` | Fraction of memory for static allocation | | |
| `MAX_RUNNING_REQUESTS` | Maximum number of running requests | | |
| `MAX_NUM_REQS` | Maximum requests in memory pool | | |
| `MAX_TOTAL_TOKENS` | Maximum tokens in memory pool | | |
| `CHUNKED_PREFILL_SIZE` | Max tokens in chunk for chunked prefill | | |
| `MAX_PREFILL_TOKENS` | Max tokens in prefill batch | | |
| `SCHEDULE_POLICY` | Request scheduling policy | | "lpm", "random", "fcfs", "dfs-weight" |
| `SCHEDULE_CONSERVATIVENESS` | Conservativeness of schedule policy | | |
| `TENSOR_PARALLEL_SIZE` | Tensor parallelism size | | |
| `STREAM_INTERVAL` | Streaming interval in token length | | |
| `RANDOM_SEED` | Random seed | | |
| `LOG_LEVEL` | Logging level for all loggers | | |
| `LOG_LEVEL_HTTP` | Logging level for HTTP server | | |
| `API_KEY` | API key for the server | | |
| `FILE_STORAGE_PTH` | Path of file storage in backend | | |
| `DATA_PARALLEL_SIZE` | Data parallelism size | | |
| `LOAD_BALANCE_METHOD` | Load balancing strategy | | "round_robin", "shortest_queue" |
| `NCCL_INIT_ADDR` | NCCL init address for multi-node | | |
| `NNODES` | Number of nodes | | |
| `NODE_RANK` | Node rank | | |

**Boolean Flags** (set to "true", "1", or "yes" to enable; a parsing sketch follows the table):

| Flag | Description |
| ----------------------------- | ----------------------------------------- |
| `SKIP_TOKENIZER_INIT` | Skip tokenizer init |
| `TRUST_REMOTE_CODE` | Allow custom models from Hub |
| `LOG_REQUESTS` | Log inputs and outputs of requests |
| `SHOW_TIME_COST` | Show time cost of custom marks |
| `DISABLE_FLASHINFER` | Disable flashinfer attention kernels |
| `DISABLE_FLASHINFER_SAMPLING` | Disable flashinfer sampling kernels |
| `DISABLE_RADIX_CACHE` | Disable RadixAttention for prefix caching |
| `DISABLE_REGEX_JUMP_FORWARD` | Disable regex jump-forward |
| `DISABLE_CUDA_GRAPH` | Disable cuda graph |
| `DISABLE_DISK_CACHE` | Disable disk cache |
| `ENABLE_TORCH_COMPILE` | Optimize model with torch.compile |
| `ENABLE_P2P_CHECK` | Enable P2P check for GPU access |
| `ENABLE_MLA` | Enable Multi-head Latent Attention |
| `ATTENTION_REDUCE_IN_FP32` | Cast attention results to fp32 |
| `EFFICIENT_WEIGHT_LOAD` | Enable memory efficient weight loading |
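
A minimal parsing sketch for these flags, assuming case-insensitive matching on the three accepted values (`env_flag` is a hypothetical helper, not part of the worker's code):

```python
import os

def env_flag(name: str) -> bool:
    """True when the variable is set to 'true', '1', or 'yes' (case-insensitive)."""
    return os.getenv(name, "").strip().lower() in ("true", "1", "yes")

# Example: forward --disable-cuda-graph only when the flag is enabled
# (the spelling of the server argument is an assumption).
extra_args = ["--disable-cuda-graph"] if env_flag("DISABLE_CUDA_GRAPH") else []
```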

## 💡 | Note:

This worker is in an early preview phase of development.
23 changes: 0 additions & 23 deletions builder/setup.sh

This file was deleted.

32 changes: 0 additions & 32 deletions docker-bake.hcl

This file was deleted.

2 changes: 1 addition & 1 deletion builder/requirements.txt → requirements.txt
@@ -1,7 +1,7 @@
ray
pandas
pyarrow
runpod~=1.7.0
runpod>=1.7.7
huggingface-hub
packaging
typing-extensions==4.7.1