
Conversation


@CUHKSZzxy CUHKSZzxy commented Sep 29, 2025

Usage

  1. Quantize
lmdeploy lite blocked_fp8 ${model_path} --work-dir ${quantized_model_path} --quant-dtype fp8
  2. Test case

NOTE: Either the PyTorch or the TurboMind backend can be used for FP8 inference. Here we take the PyTorch backend as an example; a TurboMind variant is sketched after the snippet.

from lmdeploy import pipeline, PytorchEngineConfig

model_path = "OpenGVLab/InternVL3_5-8B-FP8"

if __name__ == '__main__':
    engine_config = PytorchEngineConfig(tp=1)
    pipe = pipeline(model_path, backend_config=engine_config)
    response = pipe(["Hi, pls intro yourself", "Shanghai is"])
    print(response)
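
Per the note above, the same check can also be run on the TurboMind backend. A minimal sketch, assuming only the engine config needs to change:

from lmdeploy import pipeline, TurbomindEngineConfig

model_path = "OpenGVLab/InternVL3_5-8B-FP8"

if __name__ == '__main__':
    # Same prompts as the PyTorch example; only the backend config differs.
    engine_config = TurbomindEngineConfig(tp=1)
    pipe = pipeline(model_path, backend_config=engine_config)
    response = pipe(["Hi, pls intro yourself", "Shanghai is"])
    print(response)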

Accuracy

Dataset: OCRBench
Models: InternVL3.5-8B (FP8), InternVL3_5-30B-A3B (FP8)

Backend   | InternVL3.5-8B | InternVL3.5-8B-FP8 | InternVL3_5-30B-A3B | InternVL3_5-30B-A3B-FP8
TurboMind | 84.3           | 84.1               | 88.8                | 88.4
PyTorch   | 84.3           | 84.2               | 88.7                | 88.1

Tested with VLMEvalKit.
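
The evaluation flow can be reproduced roughly as below. The api_server command is the standard lmdeploy CLI; how the served endpoint is registered inside VLMEvalKit is an assumption and depends on the VLMEvalKit version, so the model entry name is a hypothetical placeholder.

# Serve the FP8 checkpoint with the PyTorch backend (port is arbitrary).
lmdeploy serve api_server OpenGVLab/InternVL3_5-8B-FP8 --backend pytorch --server-port 23333

# Run OCRBench from the VLMEvalKit repo; <model_config_name> stands for whatever
# config entry your VLMEvalKit setup uses to point at the endpoint above.
python run.py --data OCRBench --model <model_config_name>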

Checklist

  • Align the quantization config with Qwen3 / InternS1 FP8 (see the config sketch after this list)
  • Add documentation for blocked FP8
  • Verify the FP8 model accuracy
  • Fix quantization for MoE models
  • Check whether the weight_scale_inv modification affects other quant methods / modules
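
For reference, the blocked FP8 quantization_config carried by public Qwen3 FP8 checkpoints looks roughly like the snippet below; whether this PR writes exactly these fields into config.json is an assumption based on the alignment item above.

"quantization_config": {
  "activation_scheme": "dynamic",
  "fmt": "e4m3",
  "quant_method": "fp8",
  "weight_block_size": [128, 128]
}

Here weight_block_size gives the block shape used for the per-block weight scales (stored as weight_scale_inv tensors), and activation_scheme "dynamic" means activations are scaled at runtime rather than with precomputed scales.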

@CUHKSZzxy CUHKSZzxy marked this pull request as ready for review September 30, 2025 04:35