Philip's blog #26

Open

p208p2002 opened this issue Sep 12, 2023 · 0 comments

p208p2002 commented Sep 12, 2023

https://blog.philip-huang.tech/?page=fast-api-gpu-inference

- tags: gpu fast-api inference optimize
- date: 2023/09/12

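GPU inference is a compute-bound task, and a single inference often takes a second or more. FastAPI handles concurrent requests mainly through async, which schedules coroutines cooperatively on a single event-loop thread rather than running them in parallel. Running an inference task directly inside a FastAPI handler therefore blocks the event loop, and every other request stalls until the inference finishes.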
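To make the blocking behaviour concrete, here is a minimal sketch that uses only `asyncio`, with no FastAPI or GPU involved. `blocking_inference` is a hypothetical stand-in for a slow synchronous model call, and `count_ticks` simply counts how often the event loop gets a chance to run while that call is in progress. The offloaded variant uses a thread purely to keep the demo small; it is not the remedy this post recommends.

```python
import asyncio
import time


def blocking_inference() -> str:
    """Hypothetical stand-in for a slow, synchronous model call."""
    time.sleep(0.3)
    return "done"


async def count_ticks(stop: asyncio.Event) -> int:
    # Counts how many times the event loop manages to schedule this coroutine.
    ticks = 0
    while not stop.is_set():
        ticks += 1
        await asyncio.sleep(0.01)
    return ticks


async def measure(offload: bool) -> int:
    stop = asyncio.Event()
    ticker = asyncio.create_task(count_ticks(stop))
    await asyncio.sleep(0)  # yield once so the ticker starts
    if offload:
        # The call runs in a worker thread; the loop keeps serving the ticker.
        await asyncio.to_thread(blocking_inference)
    else:
        # The call runs on the loop's own thread; nothing else can run.
        blocking_inference()
    stop.set()
    return await ticker


if __name__ == "__main__":
    print("ticks while blocked:  ", asyncio.run(measure(offload=False)))
    print("ticks while offloaded:", asyncio.run(measure(offload=True)))
```

In the blocked run the ticker records a single tick, because it never gets scheduled again until the call returns. In the offloaded run it keeps ticking throughout, which is the behaviour other requests need.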

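In this situation you can use multiprocessing to hand the inference task off to a child process. There are two points to watch out for:

- Because of a PyTorch limitation, child processes cannot be created with the default `fork` start method; `spawn` must be used instead.
- To avoid GPU OOM (out of memory), use a single global process pool to cap the maximum number of inference tasks running at the same time.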
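The two points above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the post's own `main.py`: `fake_generate` is a hypothetical placeholder for the real GPU call, `MAX_WORKERS = 2` is an arbitrary cap, and `make_spawn_pool` and `run_inference` are names invented for this example.

```python
import asyncio
import concurrent.futures
import multiprocessing
import time

MAX_WORKERS = 2  # assumed cap; tune to how many concurrent inferences fit in GPU memory


def fake_generate(prompt: str) -> str:
    """Hypothetical placeholder for a blocking GPU call such as model.generate()."""
    time.sleep(0.1)  # simulate compute-bound work
    return prompt.upper()


def make_spawn_pool(
    max_workers: int = MAX_WORKERS,
) -> concurrent.futures.ProcessPoolExecutor:
    # "spawn" starts each worker as a fresh interpreter instead of forking the parent.
    ctx = multiprocessing.get_context("spawn")
    return concurrent.futures.ProcessPoolExecutor(
        max_workers=max_workers, mp_context=ctx
    )


async def run_inference(executor: concurrent.futures.Executor, prompt: str) -> str:
    # Hand the blocking call to the pool; the event loop stays free while it runs.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, fake_generate, prompt)


async def main() -> None:
    # One shared pool: at most MAX_WORKERS calls run at once, the rest queue up.
    with make_spawn_pool() as pool:
        prompts = ["a", "b", "c"]
        results = await asyncio.gather(*(run_inference(pool, p) for p in prompts))
    print(results)


if __name__ == "__main__":  # required with spawn: workers re-import this module
    asyncio.run(main())
```

In a web service the pool would typically be created once at startup and shared by every request handler, so that the worker count, rather than the number of incoming requests, bounds GPU memory use.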

Implementation

Below is a minimal example of running inference with FastAPI and GPT-2.

  main.py
import asyncio
import concurrent.futures
from fastapi import
