How to reduce latency of litellm proxy #4298
Unanswered
ManivannanGuru asked this question in Q&A
Replies: 3 comments 2 replies
-
Hi @ManivannanGuru - when running our load tests we see a latency of
-
I'm investigating HF TGI endpoints right now
-
This is what I see with litellm
This takes 1.273 seconds
-
Hi,
We have a chatbot where customers can ask questions; behind the scenes we use LLMs (Llama 3 in this case) hosted via TGI inference, with a Java backend.
As of now the Java code talks directly to the LLM (via the TGI inference endpoints), but we are planning to add LLM observability, so we were thinking of using the LiteLLM proxy since it integrates with many observability tools such as Langfuse.
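For reference, this is roughly the integration we are after; a minimal sketch using the litellm Python SDK with its Langfuse success callback (the proxy wires up the same callback through its config file). The model string, endpoint, and keys below are placeholders, not working values:

```python
# Sketch only: Langfuse logging via the litellm Python SDK
# (requires `pip install litellm langfuse`). The LiteLLM proxy enables the
# same callback from its config; this just illustrates the mechanism.
import os
import litellm

os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-..."  # placeholder
os.environ["LANGFUSE_SECRET_KEY"] = "sk-..."  # placeholder

litellm.success_callback = ["langfuse"]  # log every successful completion to Langfuse

response = litellm.completion(
    model="huggingface/llama3",                # same model string as in the proxy config
    api_base="https://models.yourdomain.com",  # the TGI endpoint
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```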
But the catch is that the moment we introduce LiteLLM into the picture we see a huge delay in the response time; for example, a direct API call without LiteLLM has a response time of ~2 seconds, but with LiteLLM it is ~4.5 seconds.
We have tested the LiteLLM proxy on its own (with and without the Postgres DB); in both cases the issue remains the same.
FYI:
litellm.proxy.com -> where we have hosted LiteLLM
models.yourdomain.com -> where we have hosted the LLMs
client(java) --> litellm --> llm : ~4.3s
The above call is sent from the Java client to LiteLLM, and we get the response in ~4.3s.
client --> llm : ~1.6s
The above call is sent from the Java client directly to the LLM, and we get the response in ~1.6s, so you can see the proxy adds a delay of roughly 3 seconds.
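For anyone who wants to reproduce the comparison, a rough timing sketch in Python (assuming the hostnames above, the classic TGI /generate route on the model server, and the OpenAI-compatible /v1/chat/completions route on the proxy; the model name and API key are placeholders that have to match the proxy config):

```python
# Rough latency comparison: direct TGI call vs. the same prompt through the
# LiteLLM proxy. Hostnames, model name, and API key are placeholders.
import time
import requests

PROMPT = "What is the capital of France?"

def timed_post(label, url, headers, payload):
    start = time.perf_counter()
    resp = requests.post(url, headers=headers, json=payload, timeout=60)
    print(f"{label}: {time.perf_counter() - start:.2f}s (HTTP {resp.status_code})")

# 1) Direct call to the TGI server (classic /generate API).
timed_post(
    "direct TGI",
    "https://models.yourdomain.com/generate",
    {"Content-Type": "application/json"},
    {"inputs": PROMPT, "parameters": {"max_new_tokens": 128}},
)

# 2) Same prompt through the LiteLLM proxy (OpenAI-compatible route).
timed_post(
    "via LiteLLM proxy",
    "https://litellm.proxy.com/v1/chat/completions",
    {"Content-Type": "application/json", "Authorization": "Bearer sk-1234"},  # key only needed if a master_key is set
    {
        "model": "llama3",  # must match the model_name registered in the proxy config
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 128,
    },
)
```

Running each call a few times and averaging gives a cleaner picture of the per-request overhead than a single request.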
We have also configured LiteLLM on a local machine, and below are the results:
local client --> local LiteLLM (but the models are hosted on the server, not locally) : ~4s
```yaml
model_list:
  - litellm_params:
      model: huggingface/llama3
      api_base: https://models.yourdomain.com
```
LiteLLM: Version = 1.40.17
It would be really great if anyone could help us with this.