
feat: add per-model FP8 layerwise casting for VRAM reduction#8945

Draft
Pfannkuchensack wants to merge 5 commits into invoke-ai:main from Pfannkuchensack:feature/fp8-layerwise-casting

Conversation

@Pfannkuchensack
Collaborator

FP8 Layerwise Casting - Implementation

Summary

Adds a per-model fp8_storage option to model default settings. When enabled, diffusers' enable_layerwise_casting() stores model weights in FP8 (float8_e4m3fn) and casts them to fp16/bf16 layer by layer during inference. This reduces VRAM usage by roughly 50% per model with minimal quality loss.

Supported: SD1/SD2/SDXL/SD3, Flux, Flux2, CogView4, Z-Image, VAE (diffusers-based), ControlNet, T2IAdapter.
Not applicable: Text Encoders, LoRA, GGUF, BnB, custom classes.

Related Issues / Discussions

QA Instructions

  1. Set fp8_storage: true in a model's default_settings (via API or Model Manager UI)
  2. Load the model and generate an image
  3. Verify VRAM usage is reduced compared to normal loading
  4. Verify image quality is acceptable (minimal degradation expected)
  5. Verify Text Encoders are NOT affected (excluded by submodel type filter)
  6. Verify non-CUDA devices gracefully ignore the setting
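For step 1, the default_settings change can be expressed as a payload fragment like the following (an illustrative sketch only — the exact Model Manager API endpoint and the surrounding field shape are not shown and may differ):

```json
{
  "default_settings": {
    "fp8_storage": true
  }
}
```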

Test Matrix

  • SD1.5 Diffusers with fp8_storage=true - load and generate
  • SDXL Diffusers with fp8_storage=true - load and generate
  • Flux Diffusers with fp8_storage=true - load and generate
  • Flux2 Diffusers with fp8_storage=true - load and generate
  • CogView4 with fp8_storage=true - load and generate
  • Z-Image Diffusers with fp8_storage=true - load and generate
  • VAE with fp8_storage=true - check quality
  • ControlNet with fp8_storage=true - load and generate
  • VRAM comparison: with vs. without fp8_storage
  • Image quality comparison: FP8 vs fp16/bf16
  • MPS/CPU: verify fp8_storage is silently ignored
  • Flux Checkpoint (custom class): verify FP8 is gracefully skipped (not a ModelMixin)
  • Text Encoder submodels: verify FP8 is NOT applied
  • GGUF/BnB models: verify FP8 is gracefully skipped

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Changes to a redux slice have a corresponding migration
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

Add per-model FP8 storage toggle in Model Manager default settings for
both main models and control adapter models. When enabled, model weights
are stored in FP8 format in VRAM (~50% savings) and cast layer-by-layer
to compute precision during inference via diffusers' enable_layerwise_casting().

Backend: add fp8_storage field to MainModelDefaultSettings and
ControlAdapterDefaultSettings, apply FP8 layerwise casting in all
relevant model loaders (SD, SDXL, FLUX, CogView4, Z-Image, ControlNet,
T2IAdapter, VAE). Gracefully skips non-ModelMixin models (custom
checkpoint loaders, GGUF, BnB).

Frontend: add FP8 Storage switch to model default settings panels with
InformationalPopover, translation keys, and proper form handling.
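The graceful-skip behavior described above can be sketched as follows (stand-in classes for illustration; the real loaders check against diffusers' ModelMixin and the actual quantized-model types, and the function name maybe_apply_fp8 is hypothetical):

```python
class ModelMixin:
    """Stand-in for diffusers.ModelMixin."""
    def enable_layerwise_casting(self, storage_dtype, compute_dtype):
        self.storage_dtype = storage_dtype
        self.compute_dtype = compute_dtype

class DiffusersUNet(ModelMixin):      # e.g. an SDXL or Flux transformer
    pass

class CustomCheckpointModel:          # e.g. a custom Flux checkpoint class
    pass

def maybe_apply_fp8(model, fp8_storage: bool, device: str = "cuda") -> bool:
    """Apply FP8 layerwise casting only when it is safe to do so."""
    if not fp8_storage or device != "cuda":
        return False   # setting disabled, or non-CUDA device: silently ignore
    if not isinstance(model, ModelMixin):
        return False   # custom checkpoint class / GGUF / BnB: gracefully skip
    model.enable_layerwise_casting("float8_e4m3fn", "bfloat16")
    return True

print(maybe_apply_fp8(DiffusersUNet(), True))           # True
print(maybe_apply_fp8(CustomCheckpointModel(), True))   # False
print(maybe_apply_fp8(DiffusersUNet(), True, "mps"))    # False
```

This mirrors the QA expectations: the toggle is a no-op on non-CUDA devices and on models that do not inherit from ModelMixin, rather than an error.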
@github-actions bot added the labels python (PRs that change python files), backend (PRs that change backend files), and frontend (PRs that change frontend files) on Mar 6, 2026.

Labels

backend, frontend, python, v6.13.x

Projects

Status: 6.13.x
