Modify RobertaEmbedding forward as custom op method (HabanaAI#996)
This custom op change is a follow-up to PR HabanaAI#786.
The RobertaEmbedding class was removed from the model file and
reimplemented as a CustomOp class in a new file.
forward_cuda() is the original forward function, and forward_hpu()
contains our HPU-specific change.
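
The split between forward_cuda() and forward_hpu() can be pictured with a small, self-contained sketch. Everything here is hypothetical: `MiniCustomOp` and `ToyRobertaEmbedding` are toy stand-ins rather than the classes touched by this commit, the platform string is passed in by hand, and the position arithmetic is only a placeholder for whatever the real methods compute.

```python
class MiniCustomOp:
    """Toy stand-in for a CustomOp-style base class (hypothetical)."""

    def __init__(self, platform: str) -> None:
        # Pick the backend-specific method once, at construction time.
        self._forward = self._select_forward(platform)

    def _select_forward(self, platform: str):
        table = {"cuda": self.forward_cuda, "hpu": self.forward_hpu}
        # Unknown platforms fall back to the reference (CUDA) path.
        return table.get(platform, self.forward_cuda)

    def forward(self, *args, **kwargs):
        return self._forward(*args, **kwargs)

    def forward_cuda(self, *args, **kwargs):
        raise NotImplementedError

    def forward_hpu(self, *args, **kwargs):
        # Default: reuse the reference path unless a subclass overrides it.
        return self.forward_cuda(*args, **kwargs)


class ToyRobertaEmbedding(MiniCustomOp):
    """Hypothetical embedding op with one method per backend."""

    def __init__(self, platform: str, padding_idx: int = 1) -> None:
        super().__init__(platform)
        self.padding_idx = padding_idx

    def forward_cuda(self, positions):
        # Reference path (placeholder arithmetic: shift past the padding index).
        return "cuda", [p + self.padding_idx + 1 for p in positions]

    def forward_hpu(self, positions):
        # Placeholder for the HPU-specific variant; same result, different path.
        return "hpu", [p + self.padding_idx + 1 for p in positions]


backend, shifted = ToyRobertaEmbedding("hpu").forward([0, 1, 2])
print(backend, shifted)
```

Callers only ever invoke forward(), so the model file does not need to know which backend is active.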
<!--- pyml disable-next-line no-emphasis-as-heading -->
---------
Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai>
--- a/README_GAUDI.md
+++ b/README_GAUDI.md
@@ -386,7 +386,7 @@ Additionally, there are HPU PyTorch Bridge environment variables impacting vLLM
 
 - `PT_HPU_LAZY_MODE`: if `0`, PyTorch Eager backend for Gaudi will be used, if `1` PyTorch Lazy backend for Gaudi will be used. `1` is the default.
 - `PT_HPU_ENABLE_LAZY_COLLECTIVES` must be set to `true` for tensor parallel inference with HPU Graphs.
-- `PT_HPUGRAPH_DISABLE_TENSOR_CACHE` must be set to `false` for llava and qwen models.
+- `PT_HPUGRAPH_DISABLE_TENSOR_CACHE` must be set to `false` for llava, qwen and roberta models.
 - `VLLM_PROMPT_USE_FLEX_ATTENTION` is enabled only for llama model, and allows to use torch.nn.attention.flex_attention instead of FusedSDPA. Note, this requires `VLLM_PROMPT_USE_FUSEDSDPA=0`
 
 # Quantization, FP8 Inference and Model Calibration Process
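
For reference, the environment variable rules in the README hunk could be applied from Python roughly as follows. This is a minimal sketch under stated assumptions: `hpu_bridge_env` is a hypothetical helper, the substring match on the model name is an illustrative heuristic, and `my-org/roberta-example` is a made-up model name. Only the variable names and values come from the README text.

```python
import os

# Model families the README lists as needing the tensor cache enabled.
TENSOR_CACHE_FAMILIES = ("llava", "qwen", "roberta")


def hpu_bridge_env(model_name: str, tensor_parallel: bool = False) -> dict:
    """Hypothetical helper: collect the bridge variables the README calls for."""
    env = {}
    if tensor_parallel:
        # Required for tensor parallel inference with HPU Graphs.
        env["PT_HPU_ENABLE_LAZY_COLLECTIVES"] = "true"
    # Assumption: the family can be detected by substring in the model name.
    if any(family in model_name.lower() for family in TENSOR_CACHE_FAMILIES):
        env["PT_HPUGRAPH_DISABLE_TENSOR_CACHE"] = "false"
    return env


# Set the variables before the HPU bridge is imported so they take effect.
settings = hpu_bridge_env("my-org/roberta-example", tensor_parallel=True)
os.environ.update(settings)
print(settings)
```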