Error when using kv_cache_dtype = fp8_e5m2 #4106
Answered by Fridge003
shensimeteor asked this question in Q&A
Hi everyone, I'm using tag 0.4.3 and trying to use KV cache quantization to speed up my batch inference. However, I got an error after adding kv_cache_dtype = fp8_e5m2.
The model architecture is similar to Llama 2 (GQA).
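
Roughly, the setup looks like this (a minimal sketch, assuming SGLang's offline Engine API since the thread mentions FlashInfer, SGLang's default attention backend; the model path below is a placeholder, not the actual model):

```python
# Minimal sketch, assuming SGLang's offline Engine API; the model path is a
# placeholder, not the GQA model from the original report.
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-2-7b-hf",  # placeholder model path
    kv_cache_dtype="fp8_e5m2",              # the setting that triggered the error on 0.4.3
)

outputs = llm.generate(
    ["Hello, my name is"],
    {"max_new_tokens": 32},
)
print(outputs)
llm.shutdown()
```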
Answered by Fridge003 on Mar 6, 2025

Replies: 1 comment · 1 reply
I got help on the Slack channel: the reason is that FlashInfer doesn't support KV cache quantization yet, so I changed the setting as a workaround.
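
For reference, a sketch of the kind of change this implies (the exact setting switched to isn't shown above, so both options here are assumptions): either turn KV cache quantization back off, or keep fp8_e5m2 and use the Triton attention backend instead of FlashInfer.

```python
# Sketch of two possible workarounds; the exact change made in the thread is not
# shown, so treat these as assumptions rather than the reported fix.
import sglang as sgl

# Option 1: drop KV cache quantization until the FlashInfer path supports it.
llm = sgl.Engine(
    model_path="meta-llama/Llama-2-7b-hf",  # placeholder model path
    kv_cache_dtype="auto",
)

# Option 2: keep fp8_e5m2 but avoid the FlashInfer path via the Triton backend.
# llm = sgl.Engine(
#     model_path="meta-llama/Llama-2-7b-hf",
#     kv_cache_dtype="fp8_e5m2",
#     attention_backend="triton",
# )
```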
1 reply
Hi, #4147 has fixed the flashinfer bug. Please pull the latest main branch and try again.
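
A minimal sketch of the retry after updating (the model path, port, and endpoints below are defaults and assumptions, not taken from this thread): launch the server with the same kv_cache_dtype flag and send a test request.

```python
# Sketch of a retry after pulling the latest main: relaunch the server with the
# same fp8_e5m2 setting, wait for it to come up, then send a test request.
import subprocess
import time
import requests

server = subprocess.Popen([
    "python", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Llama-2-7b-hf",  # placeholder model path
    "--kv-cache-dtype", "fp8_e5m2",
    "--port", "30000",
])

# Poll the health endpoint until the server is ready (assumes the default /health route).
for _ in range(120):
    try:
        if requests.get("http://127.0.0.1:30000/health").status_code == 200:
            break
    except requests.ConnectionError:
        time.sleep(5)

resp = requests.post(
    "http://127.0.0.1:30000/generate",
    json={"text": "Hello, my name is", "sampling_params": {"max_new_tokens": 32}},
)
print(resp.json())
server.terminate()
```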