40 changes: 38 additions & 2 deletions vllm_gaudi/ops/hpu_compressed_tensors.py
@@ -14,8 +14,13 @@
     PackedvLLMParameter, RowvLLMParameter)
 from vllm.model_executor.layers.quantization.compressed_tensors import (compressed_tensors)
 from vllm.model_executor.layers.quantization.compressed_tensors.compressed_tensors import ( # noqa: E501
-    CompressedTensorsLinearMethod as OrigCompressedTensorsLinearMethod, CompressedTensorsConfig,
-    CompressedTensorsMoEMethod, CompressedTensorsKVCacheMethod)
+    CompressedTensorsLinearMethod as OrigCompressedTensorsLinearMethod)
+from vllm.model_executor.layers.quantization.compressed_tensors.compressed_tensors import (
+    CompressedTensorsConfig,
+    CompressedTensorsMoEMethod,
+    CompressedTensorsKVCacheMethod,
+    SparsityCompressionConfig,
+)
 from vllm.model_executor.layers.quantization.compressed_tensors import (compressed_tensors_moe)
 from vllm.model_executor.layers.quantization.compressed_tensors.schemes import ( # noqa: E501
     CompressedTensorsScheme, CompressedTensorsWNA16)
@@ -807,6 +812,37 @@ def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
 
 class HPUCompressedTensorsConfig(CompressedTensorsConfig):
 
+    def __init__(
+        self,
+        target_scheme_map: dict[str, Any],
+        ignore: list[str],
+        quant_format: str,
+        sparsity_scheme_map: dict[str, SparsityCompressionConfig],
+        sparsity_ignore_list: list[str],
+        kv_cache_scheme: dict[str, Any] | None = None,
+        config: dict[str, Any] | None = None,
+        transform_config: dict[str, Any] | None = None,
+        total_num_heads: int | None = None,
+        total_num_kv_heads: int | None = None,
+    ):
+        super().__init__(
+            target_scheme_map,
+            ignore,
+            quant_format,
+            sparsity_scheme_map,
+            sparsity_ignore_list,
+            kv_cache_scheme,
+            config,
+            transform_config,
+            total_num_heads,
+            total_num_kv_heads,
+        )
+        # Fix https://github.com/vllm-project/vllm/pull/30141
+        # LLMC overrides the `kv_cache_dtype` to 'fp8', while HPU uses 'fp8_inc'.
+        if getattr(self, "kv_cache_scheme", None) is not None:
+            self.kv_cache_dtype = "fp8_inc"
+            self.kv_cache_scheme = None
+
     def get_quant_method(
         self,
         layer: torch.nn.Module,
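For context on the added override: the new `__init__` remaps an fp8 KV-cache scheme coming from LLMC checkpoints to the HPU-specific `fp8_inc` dtype and then clears `kv_cache_scheme` so the fp8 setting is not applied downstream. Below is a minimal standalone sketch of just that remapping; the toy base class is a stand-in for `CompressedTensorsConfig`, not the real vLLM class, and the example `kv_cache_scheme` dict is only illustrative.

```python
from typing import Any


class ToyBaseConfig:
    """Stand-in for CompressedTensorsConfig: records the KV-cache scheme."""

    def __init__(self, kv_cache_scheme: dict[str, Any] | None = None):
        self.kv_cache_scheme = kv_cache_scheme
        self.kv_cache_dtype = "auto"


class ToyHPUConfig(ToyBaseConfig):
    """Mirrors the kv_cache_scheme handling added in HPUCompressedTensorsConfig."""

    def __init__(self, kv_cache_scheme: dict[str, Any] | None = None):
        super().__init__(kv_cache_scheme)
        # LLMC checkpoints mark the KV cache as fp8; HPU expects 'fp8_inc',
        # so translate the scheme into the dtype and drop the scheme itself.
        if getattr(self, "kv_cache_scheme", None) is not None:
            self.kv_cache_dtype = "fp8_inc"
            self.kv_cache_scheme = None


if __name__ == "__main__":
    cfg = ToyHPUConfig(kv_cache_scheme={"type": "float", "num_bits": 8})
    assert cfg.kv_cache_dtype == "fp8_inc"
    assert cfg.kv_cache_scheme is None
```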