use fastsafetensors #99
base: main
Conversation
Signed-off-by: Takeshi Yoshimura <[email protected]>
Thanks @takeshi-yoshimura for the great work! I have a few initial thoughts/questions: I feel it would be better for the fastsafetensors library to be fully independent and not contain any inference server specific code, i.e. include the code from https://github.com/foundation-model-stack/fastsafetensors/blob/main/fastsafetensors/connectors/tgis_weights.py as a file in this PR.
Does this mean it won't work for quantized models? Is this particularly difficult or just something that you haven't implemented yet?
This optimization is only beneficial in some cases (for example, I don't think it's helpful when using paged attention). But in the cases where it is effective, we probably wouldn't want to use the faster loading, because runtime performance is more important.
I think this is more of a problem and we'd need to understand and resolve it. Is the GPU memory imbalance you mentioned present only during the model loading phase, or does it remain at runtime (I don't really understand whether it's the latter)?
Could these be auto-configured? Ideally this would not be something that users have to tune on a case-by-case basis.
Signed-off-by: Takeshi Yoshimura <[email protected]>
Thanks @njhill.
I have just pushed the change, including an upgrade of fastsafetensors to 0.1.1 for minor bug fixes. I noticed that the Weight class has changed slightly compared to the one fastsafetensors has.
I just haven't implemented it yet, and I believe there are no technical difficulties. For now, the Weight class just raises a NotImplementedError for those calls.
In that case, users can just turn off GDS by setting FST_NOGDS=1. Fastsafetensors still provides optimized file I/O and GPU offloading without GDS.
If I remember correctly, the failure happened after the model loading phase, at https://github.com/IBM/text-generation-inference/blob/main/server/text_generation_server/utils/memory_characterizer.py#L360. fastsafetensors returns unused GPU memory to torch's CUDACachingAllocator, so it should not cause the imbalance or memory leaks, but it may cause fragmentation and raise an OOM at model loading with TP. I encountered the problem only when I tried bloom with TP=8; I could not reproduce the issue with llama-13b and falcon-40b on TP=2. I will investigate a memory trace later to confirm whether this happens with the current TGIS and fastsafetensors.
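(A rough sketch of such a per-rank check, using only standard torch.cuda allocator statistics rather than anything from this PR, might look like the following; a large reserved-minus-allocated gap spread over many segments points at fragmentation rather than a leak.)

```python
import torch

def report_allocator_state(rank: int) -> None:
    # Standard CUDACachingAllocator statistics for the current device.
    stats = torch.cuda.memory_stats()
    allocated = stats["allocated_bytes.all.current"]
    reserved = stats["reserved_bytes.all.current"]
    segments = stats["segment.all.current"]
    print(f"rank {rank}: allocated={allocated / 2**20:.1f} MiB, "
          f"reserved={reserved / 2**20:.1f} MiB, segments={segments}")
```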
Yes, we can provide an auto-tuning configuration via the environment variable FST_CONFIG=auto, which sets things according to the number of CPUs and the cache sizes. I will add the feature later.
Signed-off-by: Takeshi Yoshimura <[email protected]>
I have added an auto-configuration. Thanks.
Motivation
Fastsafetensors (https://github.com/foundation-model-stack/fastsafetensors/) significantly reduces the model loading time of .safetensors files on NVMe SSDs by parallelizing file copies for maximum throughput and offloading tensor sharding and type conversions to GPUs. Additionally, Fastsafetensors supports GPU Direct Storage (GDS) to further optimize cold startups from NVMe SSDs.
For instance, in a comparison against the current loader, the cold startup time of bigscience/bloom with 4 NVMe SSDs and 8 A100 GPUs was reduced from 95 seconds to 28.4 seconds with Fastsafetensors. Similarly, the hot startup time (loading files from the page cache) of llama-2-13b-chat on a single GPU was reduced from 8.7 seconds to 4.7 seconds with Fastsafetensors.
Modifications
Fastsafetensors is provided as a PyPI package. It includes a compatible Weight class to load weights (fastsafetensors.connectors.tgis_weights.Weight). I modified tgis_native.py to switch to this Weight class via environment variables, since I did not want to modify much code just to add fastsafetensors configuration. Currently, the class has get_tensor, get_sharded, and the other methods, except for quantization.
I added two tuning knobs for the number of copy threads and the bounce buffer size. They should be adjusted according to the available CPUs in a NUMA node and the size of the CPU cache, especially for hot startups, i.e., when files are already in DRAM. My experience with Icelake servers suggests that the bounce buffer size should be the L2 cache size (160MB in this case).
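As an illustration, here is a minimal sketch of how such knobs could be resolved. Only FST_CONFIG=auto and FST_NOGDS are mentioned in this PR; FST_NUM_THREADS and FST_BBUF_SIZE_KB are hypothetical placeholder names for the two knobs, and the auto-tuning heuristic is a guess rather than the actual fastsafetensors logic.

```python
import os

def resolve_loader_config() -> dict:
    # Hypothetical resolution of the two tuning knobs described above.
    if os.getenv("FST_CONFIG", "") == "auto":
        # Scale copy threads with the visible CPUs (a real implementation would
        # consider NUMA locality) and default the bounce buffer to 160 MB, the
        # L2-cache-sized value that worked well on Icelake in these experiments.
        num_threads = max(1, (os.cpu_count() or 1) // 2)
        bbuf_size_kb = 160 * 1024
    else:
        num_threads = int(os.getenv("FST_NUM_THREADS", "16"))
        bbuf_size_kb = int(os.getenv("FST_BBUF_SIZE_KB", str(160 * 1024)))
    use_gds = os.getenv("FST_NOGDS", "0") != "1"  # FST_NOGDS=1 disables GDS
    return {"num_threads": num_threads, "bbuf_size_kb": bbuf_size_kb, "use_gds": use_gds}
```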
The package depends on libcufile.so and libnuma.so, so we need to modify the Dockerfile to add them to the final image.
For tensor parallel inference, Fastsafetensors first loads multiple files in a round-robin fashion across the GPUs. It then uses torch.distributed methods to obtain or shard weights from one of the ranks. As a result, every Weight method, including get_tensor, may load tensors from a remote rank. This requires ensuring that both a sender rank and a getter rank are invoked to get a tensor, unlike before. While most cases do not require modifying existing classes, TensorParallelRowLinear has an if-else statement around rank 0's get_tensor. To address this, I changed get_tensor to push_tensor, which sets a sender and a receiver rank (while other ranks do nothing).
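As a conceptual sketch of the push_tensor semantics described above (illustrative torch.distributed point-to-point code, not the actual fastsafetensors implementation):

```python
import torch
import torch.distributed as dist

def push_tensor(local, src_rank, dst_rank, shape, dtype, device):
    # One rank sends the tensor it owns, one rank receives it, and every
    # other rank returns immediately instead of joining a collective.
    rank = dist.get_rank()
    if rank == src_rank == dst_rank:
        return local                      # already local, nothing to transfer
    if rank == src_rank:
        dist.send(local.contiguous(), dst=dst_rank)
        return None
    if rank == dst_rank:
        buf = torch.empty(shape, dtype=dtype, device=device)
        dist.recv(buf, src=src_rank)
        return buf
    return None                           # uninvolved ranks do nothing
```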
Result
As described in the Motivation section, we observed speedups in both cold and hot startups of TGIS with various models. So far, I have tested with falcon-40b, bloom, and llama.
However, GDS limits the optimizations available through PYTORCH_CUDA_ALLOC_CONF. Currently, we cannot set it to "expandable_segments:True" to use cuMemMap, which is not available with peer-to-peer transfers.
Additionally, using tensor parallelism with Fastsafetensors can cause deadlocks at memory estimation. In my experiments, I always set ESTIMATE_MEMORY=off to avoid this issue. This is because Fastsafetensors can have different GPU memory consumption among ranks, which can confuse the out-of-memory detection logic: some ranks may continue running forward() while others do not, so they fail to call get_tensor and the other methods in the same order on every rank. The memory estimation seems to assume that every GPU consumes GPU memory evenly, which is not always the case with Fastsafetensors.
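For reference, the settings discussed above, gathered into one place as a sketch (all of the variable names appear in this PR; how they are injected into the serving container is deployment-specific):

```python
# Environment used in the experiments described in this PR.
FASTSAFETENSORS_ENV = {
    "FST_CONFIG": "auto",      # auto-tune copy threads and bounce buffers
    "FST_NOGDS": "0",          # set to "1" to fall back to non-GDS optimized I/O
    "ESTIMATE_MEMORY": "off",  # avoid the memory-estimation deadlock with TP
    # Leave PYTORCH_CUDA_ALLOC_CONF without expandable_segments:True while GDS
    # is enabled, since cuMemMap is incompatible with peer-to-peer transfers.
}
```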
Related Issues
NA