Hidden knobs: VAE, guidance_scale, and clip_skip

VAE, guidance_scale, and clip_skip are the three parameters most users leave at defaults — each one has a correct value per model family, and getting any wrong causes specific, diagnosable artifacts. This tip covers sdxl-vae-fp16-fix as the only safe SDXL VAE for fp16 inference, Flux guidance_scale split by subject type, and clip_skip behavior across SD 1.5 / SDXL / SD3 / Flux — with a unified cheat sheet.

AI Image Prompt Tip
2026/5/29 · 23:31
VAE, guidance_scale, and clip_skip each have a correct value per model family. Get any one of them wrong and you'll see washed-out color, plastic-looking skin, or degraded detail — without touching a single prompt word. Here's the per-family breakdown and a copy-paste cheat sheet at the end.

VAE selection for SDXL: one file fixes the NaN problem

VAE (Variational Autoencoder) is the component that decodes the latent image into actual pixels. The wrong one produces specific, diagnosable artifacts.
The original stabilityai/sdxl-vae generates NaN errors when running in fp16 precision — the network's internal activation values exceed what 16-bit floats can represent, so you get black regions, white blowout, or a fully corrupted image. 1 Most consumer GPUs run fp16 by default, which means this affects the majority of SDXL users.
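The overflow mechanism can be reproduced without a GPU. The sketch below uses Python's standard-library `struct` module, whose `e` format code is IEEE 754 half precision, to show that a value above 65504 has no finite fp16 representation, and that arithmetic on a saturated value yields NaN. The `simulate_fp16` helper and the sample activation value are illustrative assumptions, not measurements from the SDXL VAE.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_in_fp16(x: float) -> bool:
    """Return True if x is finite and can be stored as a finite fp16 number."""
    if not math.isfinite(x):
        return False
    try:
        struct.pack("<e", x)  # "e" is the half-precision format code
        return True
    except (OverflowError, struct.error):
        return False


def simulate_fp16(x: float) -> float:
    """Round x to fp16, saturating to +/-inf on overflow (hypothetical helper)."""
    if math.isnan(x):
        return x
    if fits_in_fp16(x):
        return struct.unpack("<e", struct.pack("<e", x))[0]
    return math.copysign(math.inf, x)


activation = 70000.0  # illustrative value, slightly above FP16_MAX
stored = simulate_fp16(activation)

print(fits_in_fp16(FP16_MAX))    # True
print(fits_in_fp16(activation))  # False
print(stored - stored)           # nan, because inf - inf is undefined
```

Once a single NaN appears in a decoder activation, it propagates through every later layer, which is consistent with the black or blown-out regions described above.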
The fix: madebyollin/sdxl-vae-fp16-fix. The author rescaled the network's internal weights to keep activations within fp16 range. An independent benchmark by Kubuxu (2023-07-30) puts the quality loss at effectively zero: LPIPS 0.056 vs 0.055 for the original fp32, SSIM 0.73 in both cases. 2 Compared to the --no-half-vae workaround, which forces the VAE to run in fp32, speed roughly doubles and VRAM use roughly halves.
[Figure: SDXL-VAE fp16 NaN output, a corrupted image with black and white blowout regions caused by floating-point overflow]
The 3 KB file size (vs ~1.5 MB for a normal decode) tells the story: the VAE tried to decode and produced almost nothing. 1
Symptom → diagnosis table:

| Visual symptom | Likely cause |
| --- | --- |
| Purple or washed-out tones | Missing VAE or wrong VAE for the model family |
| Black patches / white blowout | SDXL-VAE running in fp16 (NaN) |
| Blurry detail despite high step count | VAE decode precision too low |
| Oversaturated / burnt colors | SD 1.5 VAE (ft-mse-840000) used on SDXL |

"If your image looks purple or washed out, the VAE is your problem 99% of the time," writes Angry Shark Studio's ComfyUI troubleshooting guide. 3
Cross-family compatibility is absolute. SDXL-VAE was retrained from scratch; its latent space has nothing in common with the SD 1.x/2.x VAE. madebyollin is direct: "SDXL-VAE was retrained from scratch, and it's not compatible with SD-VAE." 4 Mixing them — SDXL encode + SD decode, or vice versa — produces garbled output, not a graceful degradation. The same applies in reverse: ft-mse-840000 belongs to SD 1.5 and should never be loaded into an SDXL workflow. 5 Flux has a built-in 16-channel VAE that is not user-replaceable. 6
Installation:
  • ComfyUI: drop sdxl.vae.safetensors into ComfyUI/models/vae/, add a Load VAE node, and connect it to VAE Decode. 7
  • A1111: place the file in stable-diffusion-webui/models/VAE/, select it under Settings → Stable Diffusion → VAE, and remove --no-half-vae if present.
  • Diffusers: AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16). 1
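For the Diffusers route, a fuller sketch of how the fixed VAE might be wired into an fp16 SDXL pipeline follows. The base checkpoint name, the `load_sdxl_with_fixed_vae` helper, and the `device` default are assumptions for illustration; only the `AutoencoderKL.from_pretrained` call comes from the text above.

```python
VAE_REPO = "madebyollin/sdxl-vae-fp16-fix"
BASE_REPO = "stabilityai/stable-diffusion-xl-base-1.0"  # assumed base checkpoint


def load_sdxl_with_fixed_vae(device: str = "cuda"):
    """Build an fp16 SDXL pipeline that decodes with the fp16-safe VAE."""
    # Imported lazily so the constants above can be read without a GPU stack.
    import torch
    from diffusers import AutoencoderKL, StableDiffusionXLPipeline

    vae = AutoencoderKL.from_pretrained(VAE_REPO, torch_dtype=torch.float16)
    pipe = StableDiffusionXLPipeline.from_pretrained(
        BASE_REPO,
        vae=vae,  # replaces the VAE bundled with the checkpoint
        torch_dtype=torch.float16,
        variant="fp16",
        use_safetensors=True,
    )
    return pipe.to(device)
```

Calling `load_sdxl_with_fixed_vae()` and then `pipe(prompt).images[0]` should produce a decoded image without needing the fp32 VAE fallback, assuming `torch` and `diffusers` are installed and a CUDA device is available.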
If a checkpoint already has a baked VAE and outputs look correct, leave it alone — loading an external VAE overrides what's baked in. 3

Flux guidance_scale per subject type: lower isn't always better

Before tuning, understand what this parameter actually does on Flux. Traditional CFG runs the denoising step twice — once with your prompt, once without — then amplifies the gap. Flux doesn't do that. It's guidance-distilled: the guidance behavior was baked into the weights during training, so guidance_scale is a numeric hint to the model rather than a real two-pass computation. 8
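To make the contrast concrete, here is a minimal sketch of the traditional two-pass CFG combination that Flux skips. The toy lists stand in for the noise predictions of a real denoiser, and `cfg_combine` is a hypothetical helper name.

```python
from typing import Sequence


def cfg_combine(
    uncond: Sequence[float], cond: Sequence[float], scale: float
) -> list[float]:
    """Classic classifier-free guidance: start from the unconditional
    prediction and move along the (cond - uncond) direction by `scale`."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Toy predictions standing in for two full denoiser passes.
uncond_pred = [0.0, 1.0]  # pass without the prompt
cond_pred = [1.0, 3.0]    # pass with the prompt

print(cfg_combine(uncond_pred, cond_pred, 1.0))  # [1.0, 3.0]
print(cfg_combine(uncond_pred, cond_pred, 7.0))  # [7.0, 15.0]
```

At scale 1 the result is just the conditional prediction; at scale 7 the gap between the two passes is amplified sevenfold. A guidance-distilled model runs a single pass and receives the scale as an input instead, which is why the same number behaves differently on Flux.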
Because of distillation, the effective range is narrow. Moving from 3.5 to 7 on Flux dev doesn't produce the dramatic over-sharpening you'd see on SD 1.5 at CFG 15, but it does meaningfully affect how much the model sticks to your exact prompt versus interpreting it. 8
The practical split by subject type:

| Subject type | Recommended guidance_scale | Rationale |
| --- | --- | --- |
| Portraits / realistic skin | 1.5–2.5 | Lower lets the model draw on its training priors for skin texture; the default 3.5 over-optimizes and produces the "Flux plastic" look |
| Artistic / painterly styles | 1.2–2.0 | Creative interpretation needs room to breathe; default is "way too high" for art |
| Strict prompt adherence (product, technical) | 5–8 | Forces close prompt following; trade some diversity for accuracy |

Community finding on portraits: r/FluxAI user u/AwakenedEyes reports "Flux dev in particular uses a distilled cfg scale and has more realistic skin around 2.5 than the default 3.5." 9 On the artistic side, r/StableDiffusion user u/JBulworth tested oil-painting and watercolor styles: "Every image here has been generated with a FluxGuidance between 1.2 and 2" — higher values push output back toward the model's photorealistic default. 10 These are community observations rather than controlled benchmarks.
The fal.ai Flux 2 Klein official guide formalizes the split: "Lower values grant the model more interpretive freedom for artistic concepts. Higher values enforce stricter prompt adherence for product photography or technical illustrations." (fal.ai, Flux 2 [klein] Prompt Guide, https://fal.ai/learn/devs/flux-2-klein-prompt-guide)
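The subject-type split can be captured in a small lookup. The sketch below is one possible encoding: the dictionary keys and the `pick_guidance` helper are hypothetical, and returning the midpoint of each range is an arbitrary starting point for a sweep, not a recommendation from the sources.

```python
# Ranges transcribed from the subject-type table; keys are hypothetical labels.
FLUX_DEV_GUIDANCE = {
    "portrait": (1.5, 2.5),
    "artistic": (1.2, 2.0),
    "strict": (5.0, 8.0),
}
FLUX_DEV_DEFAULT = 3.5


def pick_guidance(subject: str) -> float:
    """Return the midpoint of the suggested range, or the default if unknown."""
    if subject not in FLUX_DEV_GUIDANCE:
        return FLUX_DEV_DEFAULT
    low, high = FLUX_DEV_GUIDANCE[subject]
    return round((low + high) / 2, 2)


print(pick_guidance("portrait"))   # 2.0
print(pick_guidance("artistic"))   # 1.6
print(pick_guidance("strict"))     # 6.5
print(pick_guidance("landscape"))  # 3.5 (falls back to the default)
```

Since these ranges come from community observations, it is worth sweeping a few values around the returned number with a fixed seed before settling on one.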
Availability varies by endpoint:

| Flux variant | guidance_scale available? | Default |
| --- | --- | --- |
| Flux.1 [dev] | yes | 3.5 |
| Flux.1 [schnell] | no | — (1–4 steps only) |
| Flux Pro v1.1 / Ultra | | |
| Flux 2 Dev | yes | 2.5 (range 0–20) |
| Flux 2 Flex | yes | 3.5 (range 1.5–10) |

[Figure: multi-value Flux guidance comparison, a portrait photo showing skin quality at different guidance_scale settings, from lower (realistic) to higher (plastic-looking) renders]
Reddit user smb3d's guidance comparison grid — same prompt and seed, guidance_scale swept across multiple values. The skin quality difference between low and default values is visible without zooming. 9
One edge case: if you've done a full finetune of Flux with guidance_scale=1.0 during training, inference at CFG=1 produces washed-out output — CFG=4 restores normal results. LoRA training at guidance_scale=1 doesn't have this issue. 11

clip_skip across model families: one setting is almost always wrong

clip_skip controls which layer of the CLIP text encoder feeds into the diffusion process. SD 1.5 uses a 12-layer CLIP ViT-L/14. The default clip_skip=1 uses all 12 layers — the most precise semantic output, tightest prompt adherence. clip_skip=2 exits one layer early (layer 11), producing a slightly coarser but more stylized interpretation. 12
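The mapping from clip_skip to encoder layer is simple arithmetic. The sketch below assumes the A1111-style convention described above, where clip_skip=1 means the final layer; `layer_used` is a hypothetical helper.

```python
SD15_CLIP_LAYERS = 12  # CLIP ViT-L/14 text encoder depth, per the text above


def layer_used(clip_skip: int, num_layers: int = SD15_CLIP_LAYERS) -> int:
    """Map an A1111-style clip_skip value to the 1-based layer it reads from."""
    if not 1 <= clip_skip <= num_layers:
        raise ValueError(f"clip_skip must be between 1 and {num_layers}")
    return num_layers - (clip_skip - 1)


print(layer_used(1))  # 12, the final layer
print(layer_used(2))  # 11, one layer early
```

Conventions differ between tools: the Diffusers documentation describes its `clip_skip` argument as the number of layers skipped, so a value copied from one UI may need to be shifted by one in another.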
The clip_skip=2 convention traces back to a specific historical event: the 2022 NovelAI model leak. That model was trained with clip_skip=2, and every anime-style fine-tune derived from it inherited the same assumption. For those models, clip_skip=2 is correct. For everything else on SD 1.5, clip_skip=1 is the right default. 12
For SDXL, the correct value is always 1 — and the UI situation is confusing:
  • A1111 (original): doesn't apply clip_skip to SDXL at all; the setting is silently ignored
  • Forge: the SDXL clip_skip slider is what lllyasviel calls a "fake slider" — "No matter what value you set, it does not change anything." 13
  • SD.Next: actually applies clip_skip across all model families, so setting it to 2 on SDXL genuinely degrades output. Supports fractional values like clip_skip=1.5 for fine-grained control. 12
[Figure: side-by-side clip_skip=1 vs clip_skip=2 comparison with the same prompt. The left image shows more Western facial features with strict prompt following; the right shows more stylized East Asian features with creative interpretation.]
clip_skip=1 (left) vs clip_skip=2 (right) on an SD 1.5 anime-derived model — same prompt. The right result reflects the NAI training assumption; on a realistic SD 1.5 model, that same shift usually reads as softened detail rather than a stylization improvement. 14
For SD3 and SD3.5, clip_skip exists in the Diffusers API and applies to the two CLIP encoders. In practice its effect is minimal, because T5-XXL carries the dominant semantic load in SD3's triple-encoder setup and CLIP is a secondary signal. 15 For Flux, clip_skip can technically be applied to the CLIP portion, but the impact is negligible given T5-XXL's weight. Keep clip_skip at 1 on both architectures and don't use it as a tuning lever. 12
SD 2.x uses OpenCLIP, not the original CLIP — clip_skip doesn't apply at all.
One more interaction worth flagging: if you're using a LoRA that was trained at clip_skip=2, running inference at clip_skip=1 may underperform. The LoRA's learned associations are tied to a specific layer cutoff. Check the LoRA model card for the training config, and test both values if the output looks off. 16

Cross-tool cheat sheet

| Parameter | SD 1.5 | SD 1.5 anime (NAI-derived) | SDXL | SD3/SD3.5 | Flux dev | Flux 2 Dev |
| --- | --- | --- | --- | --- | --- | --- |
| VAE | vae-ft-mse-840000 | kl-f8-anime2 | sdxl-vae-fp16-fix | built-in (no replace) | built-in (no replace) | built-in (no replace) |
| guidance_scale / CFG | 7 (typical 5–9) | 7 | 7 (typical 5–8) | SD3: 7.0; SD3.5: 3.5 | 3.5 default; lower for skin (1.5–2.5) or art (1.2–2) | 2.5 default; 5–8 for strict |
| clip_skip | 1 | 2 | 1 (enforced) | 1 (T5 dominates) | 1 (negligible effect) | 1 (negligible effect) |

Midjourney: none of these parameters are user-accessible. VAE and text encoding are internal; guidance is handled via the --stylize and --chaos flags, not guidance_scale. clip_skip has no equivalent.
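For scripted workflows, the cheat sheet can be kept as a small lookup table. This is a sketch only: the dictionary keys, the `Defaults` type, and the `defaults_for` helper are hypothetical names, and the values are the table's starting defaults rather than tuned settings.

```python
from typing import NamedTuple


class Defaults(NamedTuple):
    vae: str
    guidance: float
    clip_skip: int


# Values transcribed from the cheat sheet; keys are hypothetical labels.
CHEAT_SHEET = {
    "sd15": Defaults("vae-ft-mse-840000", 7.0, 1),
    "sd15-anime": Defaults("kl-f8-anime2", 7.0, 2),
    "sdxl": Defaults("sdxl-vae-fp16-fix", 7.0, 1),
    "sd3": Defaults("built-in", 7.0, 1),
    "sd3.5": Defaults("built-in", 3.5, 1),
    "flux-dev": Defaults("built-in", 3.5, 1),
    "flux2-dev": Defaults("built-in", 2.5, 1),
}


def defaults_for(family: str) -> Defaults:
    """Look up the starting defaults for a model family label."""
    try:
        return CHEAT_SHEET[family]
    except KeyError:
        raise ValueError(f"unknown model family: {family!r}") from None


print(defaults_for("sdxl").vae)              # sdxl-vae-fp16-fix
print(defaults_for("sd15-anime").clip_skip)  # 2
print(defaults_for("flux2-dev").guidance)    # 2.5
```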

Cover image: AI-generated illustration
