feat: add per-model FP8 layerwise casting for VRAM reduction #8945
Draft
Pfannkuchensack wants to merge 5 commits into invoke-ai:main
Conversation
Add a per-model FP8 storage toggle in Model Manager default settings for both main models and control adapter models. When enabled, model weights are stored in FP8 format in VRAM (~50% savings) and cast layer-by-layer to compute precision during inference via diffusers' enable_layerwise_casting().

Backend: add an fp8_storage field to MainModelDefaultSettings and ControlAdapterDefaultSettings, and apply FP8 layerwise casting in all relevant model loaders (SD, SDXL, FLUX, CogView4, Z-Image, ControlNet, T2IAdapter, VAE). Non-ModelMixin models (custom checkpoint loaders, GGUF, BnB) are skipped gracefully; a sketch of the loader-side guard follows below.

Frontend: add an FP8 Storage switch to the model default settings panels, with an InformationalPopover, translation keys, and proper form handling.
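The loader-side change amounts to a small guard around diffusers' enable_layerwise_casting(). A minimal sketch, assuming a hypothetical maybe_apply_fp8_storage() helper rather than the PR's literal code:

```python
# Sketch only: the helper name and wiring are assumptions, not this PR's
# literal code; enable_layerwise_casting() is the actual diffusers API used.
import torch
from diffusers import ModelMixin


def maybe_apply_fp8_storage(model, fp8_storage: bool, compute_dtype: torch.dtype) -> None:
    """Store weights as float8_e4m3fn; cast each layer to compute_dtype at inference."""
    if not fp8_storage:
        return
    if not isinstance(model, ModelMixin):
        # Custom checkpoint classes, GGUF, and BnB models are skipped silently.
        return
    model.enable_layerwise_casting(
        storage_dtype=torch.float8_e4m3fn,
        compute_dtype=compute_dtype,
    )
```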
FP8 Layerwise Casting - Implementation
Summary
Add a per-model fp8_storage option to model default settings that enables diffusers' enable_layerwise_casting() to store weights in FP8 (float8_e4m3fn) while casting to fp16/bf16 during inference. This reduces VRAM usage by ~50% per model with minimal quality loss.

Supported: SD1/SD2/SDXL/SD3, Flux, Flux2, CogView4, Z-Image, VAE (diffusers-based), ControlNet, T2IAdapter.
Not applicable: Text Encoders, LoRA, GGUF, BnB, custom classes.
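For reference, the underlying diffusers call works on any ModelMixin. An illustrative example (the model ID and dtypes are example choices, not pinned by this PR):

```python
import torch
from diffusers import FluxTransformer2DModel

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="transformer", torch_dtype=torch.bfloat16
)
# Weights are kept in float8_e4m3fn and cast to bfloat16 layer by layer during
# the forward pass, roughly halving the transformer's VRAM footprint.
transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)
```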
Related Issues / Discussions
enable_layerwise_casting() (available in diffusers 0.36.0)

QA Instructions
Set fp8_storage: true in a model's default_settings (via API or Model Manager UI); a request sketch follows.
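A minimal sketch of the API path, assuming InvokeAI's v2 model records endpoint (the host, port, and model key are placeholders):

```python
import requests

model_key = "<model-key>"  # placeholder: the installed model's key

# Patch the model record's default settings to enable FP8 storage.
resp = requests.patch(
    f"http://127.0.0.1:9090/api/v2/models/i/{model_key}",
    json={"default_settings": {"fp8_storage": True}},
)
resp.raise_for_status()
```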
Test Matrix
- fp8_storage=true - load and generate (one pass per supported model family)
- fp8_storage=true - check quality
- unsupported formats - fp8_storage is silently ignored
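One way to spot-check a run: with layerwise casting active, diffusers keeps parameters in the storage dtype between forward passes, so most weights should report float8_e4m3fn. A sketch using a standalone diffusers VAE (the model ID is illustrative):

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16)
vae.enable_layerwise_casting(storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.float16)

# Norm and other sensitive layers are skipped by default, so expect most
# (not all) parameters to be stored in FP8.
fp8 = sum(p.numel() for p in vae.parameters() if p.dtype == torch.float8_e4m3fn)
total = sum(p.numel() for p in vae.parameters())
print(f"FP8-stored parameters: {fp8 / total:.0%}")
```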
Checklist
What's New copy (if doing a release after this PR)