Conversation

vermavis

No description provided.

block_list = attn_metadata.block_list
block_groups = attn_metadata.block_groups
block_mapping = attn_metadata.block_mapping
attn_bias = attn_metadata.attn_bias
Collaborator

I don't think we should rename the attributes here. You can do

if not self.sliding_window or attn_metadata.window_block_list is None:
    block_list = attn_metadata.block_list
    block_groups = attn_metadata.block_groups
    block_mapping = attn_metadata.block_mapping
    attn_bias = attn_metadata.attn_bias
else:
    block_list = attn_metadata.window_block_list
    block_groups = attn_metadata.window_block_groups
    block_mapping = attn_metadata.window_block_mapping
    attn_bias = attn_metadata.window_attn_bias

Author

Yeah, the problem was this:

(Worker_TP7 pid=102175) ERROR 09-17 19:58:46 [multiproc_executor.py:671] AttributeError: 'TrimmedAttentionMetadata' object has no attribute 'window_block_list'
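For reference, a defensive variant of the selection suggested above would sidestep this error when the trimmed metadata was built without the window_* fields. This is only a sketch; the getattr fallback is an assumption, not the fix that was eventually applied:

# Hypothetical fallback: use the regular block tables whenever the trimmed
# metadata does not carry the sliding-window attributes.
window_block_list = getattr(attn_metadata, 'window_block_list', None)
if not self.sliding_window or window_block_list is None:
    block_list = attn_metadata.block_list
    block_groups = attn_metadata.block_groups
    block_mapping = attn_metadata.block_mapping
    attn_bias = attn_metadata.attn_bias
else:
    block_list = window_block_list
    block_groups = attn_metadata.window_block_groups
    block_mapping = attn_metadata.window_block_mapping
    attn_bias = attn_metadata.window_attn_bias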

Author

OK, never mind. I see the code in vllm-fork does have this missing attribute. Let me fix it here.
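For illustration, the shape of that fix: whatever builds the trimmed metadata has to copy the window_* attributes alongside the existing ones. The dataclass below is only a sketch with assumed field types, not the actual TrimmedAttentionMetadata definition in vllm-fork:

from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class TrimmedAttentionMetadataSketch:
    # Fields already consumed by the non-sliding-window path (see snippet above).
    block_list: Optional[torch.Tensor] = None
    block_groups: Optional[torch.Tensor] = None
    block_mapping: Optional[torch.Tensor] = None
    attn_bias: Optional[torch.Tensor] = None
    # Sliding-window counterparts whose absence raised the AttributeError.
    window_block_list: Optional[torch.Tensor] = None
    window_block_groups: Optional[torch.Tensor] = None
    window_block_mapping: Optional[torch.Tensor] = None
    window_attn_bias: Optional[torch.Tensor] = None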

Author

@vermavis, Sep 18, 2025

@xuechendi I tried PR #150, which has sliding window support, with the following change on top of it, but unfortunately the accuracy still looks off.

diff --git a/vllm_gaudi/attention/backends/hpu_attn.py b/vllm_gaudi/attention/backends/hpu_attn.py
index a558079..1206cbe 100644
--- a/vllm_gaudi/attention/backends/hpu_attn.py
+++ b/vllm_gaudi/attention/backends/hpu_attn.py
@@ -351,8 +351,10 @@ class HPUAttentionImpl(AttentionImpl, torch.nn.Module):
         attn_type: str = AttentionType.DECODER,
         kv_sharing_target_layer_name: Optional[str] = None,
         use_irope: bool = False,
+        sinks: Optional[int] = None,
     ) -> None:
         super(AttentionImpl, self).__init__()
+        self._sinks = sinks
         if kv_sharing_target_layer_name is not None:
             raise NotImplementedError("KV sharing is not currently supported on HPU.")
         if use_irope:
diff --git a/vllm_gaudi/extension/ops.py b/vllm_gaudi/extension/ops.py
index 4e01ec8..905d93d 100644
--- a/vllm_gaudi/extension/ops.py
+++ b/vllm_gaudi/extension/ops.py
@@ -484,7 +484,7 @@ class VllmMixtureOfExpertsOp(torch.nn.Module):
                                                     w12=w1_list,
                                                     w3=w2_list,
                                                     permuted_weights=permuted_weights,
-                                                    activation=activation,
+                                                    activation="silu",
                                                     experts_min=self.experts_min,
                                                     experts_max=self.experts_max)
         for i in range(self.moe_n_slice):

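Side note on the first hunk: it only stores sinks without applying them anywhere. For reference, one common formulation of gpt-oss-style attention sinks folds a per-head sink logit into the softmax denominator, roughly as in the sketch below; the helper name and tensor shapes are illustrative assumptions, not the HPU backend's API:

import torch


def sdpa_with_sinks(q, k, v, sinks, scale):
    # q, k, v: [heads, q_len/k_len, head_dim]; sinks: [heads] learned logits.
    scores = torch.matmul(q, k.transpose(-1, -2)) * scale
    sink = sinks.view(-1, 1, 1).expand(scores.shape[0], scores.shape[1], 1)
    # The sink column absorbs probability mass but contributes no value.
    probs = torch.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    return torch.matmul(probs[..., :-1], v)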
Collaborator

If silu is necessary, pass it through config instead of hard-coding it.
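A sketch of that suggestion, assuming the activation can be taken at construction time; the class below is a stand-in for illustration, not the existing VllmMixtureOfExpertsOp API:

import torch


class MixtureOfExpertsOpSketch(torch.nn.Module):
    # Illustrative only: carry the activation name as config instead of a literal.
    def __init__(self, activation: str = "silu"):
        super().__init__()
        self.activation = activation

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The real op dispatches to a fused HPU kernel; this only shows the
        # config-driven selection the comment asks for.
        if self.activation == "silu":
            return torch.nn.functional.silu(hidden_states)
        raise NotImplementedError(f"unsupported activation: {self.activation}")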

Author

Looks like upstream vLLM has this hardcoded to swigluoai:
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/gpt_oss.py#L158
