Hidden knobs: VAE, guidance_scale, and clip_skip

VAE, guidance_scale, and clip_skip each have a correct value per model family. Get any one of them wrong and you'll see washed-out color, plastic-looking skin, or degraded detail, all without touching a single prompt word. Here's the per-family breakdown, with a copy-paste cheat sheet at the end.

VAE selection for SDXL: one file fixes the NaN problem

VAE (Variational Autoencoder) is the component that decodes the latent image into actual pixels. The wrong one produces specific, diagnosable artifacts.

The original stabilityai/sdxl-vae generates NaN errors when running in fp16 precision — the network's internal activation values exceed what 16-bit floats can represent, so you get black regions, white blowout, or a fully corrupted image. 1 Most consumer GPUs run fp16 by default, which means this affects the majority of SDXL users.
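To see why an overflow ends up as NaN rather than a merely clipped value, here is a minimal sketch in plain Python. It is not the VAE's actual computation: `to_fp16` is a hypothetical helper that rounds a number to half precision with the standard library's `struct` module, and the activation values are made up for illustration.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(x: float) -> float:
    """Round x to the nearest fp16 value; overflow becomes +/-inf (hypothetical helper)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack out-of-range values; fp16 arithmetic would yield inf here
        return math.copysign(math.inf, x)


# Made-up activations: representable in fp32, too large for fp16.
a = to_fp16(70000.0)  # overflows to inf
b = to_fp16(90000.0)  # overflows to inf

# Once two infinities meet in a subtraction, the result is NaN,
# and NaN then propagates through every later operation.
diff = a - b

print(a, b, diff)
```

This is why rescaling the weights helps: if activations stay below `FP16_MAX`, the infinities never appear and nothing downstream can turn into NaN.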

The fix: madebyollin/sdxl-vae-fp16-fix. The author rescaled the network's internal weights to keep activations within fp16 range. An independent benchmark by Kubuxu (2023-07-30) puts the quality loss at effectively zero: LPIPS 0.056 vs 0.055 for the original fp32, SSIM 0.73 in both cases. 2 Compared to the --no-half-vae workaround, which forces the VAE to run in fp32, speed roughly doubles and VRAM use roughly halves.

[Image: SDXL-VAE fp16 NaN output, a corrupted image with black and white blowout regions caused by floating-point overflow.] The 3 KB file size (vs ~1.5 MB for a normal decode) tells the story: the VAE tried to decode and produced almost nothing. 1

Symptom → diagnosis table:

Visual symptom	Likely cause
Purple or washed-out tones	Missing VAE or wrong VAE for the model family
Black patches / white blowout	SDXL-VAE running in fp16 (NaN)
Blurry detail despite high step count	VAE decode precision too low
Oversaturated / burnt colors	SD 1.5 VAE (`ft-mse-840000`) used on SDXL

"If your image looks purple or washed out, the VAE is your problem 99% of the time," writes Angry Shark Studio's ComfyUI troubleshooting guide. 3

Cross-family incompatibility is absolute. SDXL-VAE was retrained from scratch; its latent space has nothing in common with the SD 1.x/2.x VAE. madebyollin is direct: "SDXL-VAE was retrained from scratch, and it's not compatible with SD-VAE." 4 Mixing them (SDXL encode + SD decode, or vice versa) produces garbled output, not a graceful degradation. The same applies in reverse: ft-mse-840000 belongs to SD 1.5 and should never be loaded into an SDXL workflow. 5 Flux has a built-in 16-channel VAE that is not user-replaceable. 6

Installation:

ComfyUI: drop sdxl.vae.safetensors into ComfyUI/models/vae/, add a Load VAE node, and connect it to VAE Decode. 7
A1111: place the file in stable-diffusion-webui/models/VAE/, select it under Settings → Stable Diffusion → VAE, and remove --no-half-vae if present.
Diffusers: AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16). 1
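The Diffusers call only loads the VAE; it still has to be handed to a pipeline. A minimal sketch of that wiring follows, assuming the stock `stabilityai/stable-diffusion-xl-base-1.0` checkpoint and a CUDA device. The function name `build_sdxl_pipeline` and the prompt in the usage comment are illustrative, not taken from any guide.

```python
VAE_REPO = "madebyollin/sdxl-vae-fp16-fix"
BASE_REPO = "stabilityai/stable-diffusion-xl-base-1.0"  # assumption: swap in your own SDXL checkpoint


def build_sdxl_pipeline(device: str = "cuda"):
    """Load SDXL in fp16 with the fp16-fix VAE swapped in."""
    # Imported lazily so the file can be imported without torch/diffusers installed.
    import torch
    from diffusers import AutoencoderKL, StableDiffusionXLPipeline

    vae = AutoencoderKL.from_pretrained(VAE_REPO, torch_dtype=torch.float16)
    pipe = StableDiffusionXLPipeline.from_pretrained(
        BASE_REPO,
        vae=vae,  # replaces the VAE that ships with the checkpoint
        torch_dtype=torch.float16,
        variant="fp16",
        use_safetensors=True,
    )
    return pipe.to(device)


# Usage (downloads several GB on first run):
# pipe = build_sdxl_pipeline()
# image = pipe("a lighthouse at dusk", num_inference_steps=30).images[0]
# image.save("lighthouse.png")
```

Passing `vae=` at load time keeps the whole pipeline in fp16, so no fp32 fallback for the decode step is needed.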

If a checkpoint already has a baked VAE and outputs look correct, leave it alone — loading an external VAE overrides what's baked in. 3

Flux `guidance_scale` per subject type: lower isn't always better

Before tuning, understand what this parameter actually does on Flux. Traditional CFG runs the denoising step twice — once with your prompt, once without — then amplifies the gap. Flux doesn't do that. It's guidance-distilled: the guidance behavior was baked into the weights during training, so guidance_scale is a numeric hint to the model rather than a real two-pass computation. 8
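For contrast, the traditional two-pass computation can be written in a few lines. This is a toy sketch of the standard classifier-free guidance formula on plain lists; the numbers are made up, and real pipelines apply the same arithmetic to noise-prediction tensors.

```python
from typing import List


def cfg_combine(uncond: List[float], cond: List[float], scale: float) -> List[float]:
    """Classic CFG: start from the unconditional prediction and move
    `scale` times the gap toward the prompt-conditioned prediction."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Toy noise predictions from the two passes (made-up numbers).
uncond = [0.0, 1.0, -1.0]
cond = [0.5, 1.0, -2.0]

print(cfg_combine(uncond, cond, 1.0))  # scale 1: the conditional prediction, unchanged
print(cfg_combine(uncond, cond, 7.0))  # scale 7: the gap is amplified sevenfold
```

Flux dev never evaluates this formula at inference. There is no unconditional pass to subtract, which is why the same number behaves so differently from CFG on SD 1.5.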

Because of distillation, the effective range is narrow. Moving from 3.5 to 7 on Flux dev doesn't produce the dramatic over-sharpening you'd see on SD 1.5 at CFG 15, but it does meaningfully affect how much the model sticks to your exact prompt versus interpreting it. 8

The practical split by subject type:

Subject type	Recommended `guidance_scale`	Rationale
Portraits / realistic skin	1.5–2.5	Lower lets the model draw on its training priors for skin texture; the default 3.5 over-optimizes and produces the "Flux plastic" look
Artistic / painterly styles	1.2–2.0	Creative interpretation needs room to breathe; default is "way too high" for art
Strict prompt adherence (product, technical)	5–8	Forces close prompt following; trade some diversity for accuracy

Community finding on portraits: r/FluxAI user u/AwakenedEyes reports "Flux dev in particular uses a distilled cfg scale and has more realistic skin around 2.5 than the default 3.5." 9 On the artistic side, r/StableDiffusion user u/JBulworth tested oil-painting and watercolor styles: "Every image here has been generated with a FluxGuidance between 1.2 and 2" — higher values push output back toward the model's photorealistic default. 10 These are community observations rather than controlled benchmarks.

fal.ai's official Flux 2 [klein] prompt guide formalizes the split: "Lower values grant the model more interpretive freedom for artistic concepts. Higher values enforce stricter prompt adherence for product photography or technical illustrations." (fal.ai, Flux 2 [klein] Prompt Guide, https://fal.ai/learn/devs/flux-2-klein-prompt-guide)
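One way to keep the subject-type ranges close at hand is a small lookup. The sketch below copies the ranges from the subject-type table; the dictionary keys, the function name, and the choice of the range midpoint as a starting value are all assumptions made for illustration, not tested optima.

```python
# Ranges copied from the subject-type table; the keys are hypothetical labels.
GUIDANCE_RANGES = {
    "portrait": (1.5, 2.5),
    "artistic": (1.2, 2.0),
    "strict": (5.0, 8.0),
}
FLUX_DEV_DEFAULT = 3.5


def starting_guidance(subject: str) -> float:
    """Return the midpoint of the recommended range, or the Flux dev default
    when the subject type is not in the table."""
    if subject not in GUIDANCE_RANGES:
        return FLUX_DEV_DEFAULT
    low, high = GUIDANCE_RANGES[subject]
    return (low + high) / 2


print(starting_guidance("portrait"))
print(starting_guidance("landscape"))

# With a Diffusers FluxPipeline already loaded as `pipe`, the value is passed directly:
# image = pipe(prompt, guidance_scale=starting_guidance("portrait")).images[0]
```

Treat the returned value as the first point of a sweep: generate the same seed at two or three values inside the range and pick by eye.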

Availability varies by endpoint:

Flux variant	`guidance_scale` available?	Default
Flux.1 [dev]	✅	3.5
Flux.1 [schnell]	❌	— (1–4 steps only)
Flux Pro v1.1 / Ultra	❌	—
Flux 2 Dev	✅	2.5 (range 0–20)
Flux 2 Flex	✅	3.5 (range 1.5–10)

[Image: multi-value Flux guidance comparison, a portrait photo showing skin quality at different guidance_scale settings, from lower (realistic) to higher (plastic-looking) renders.] Reddit user smb3d's guidance comparison grid: same prompt and seed, guidance_scale swept across multiple values. The skin quality difference between low and default values is visible without zooming. 9

One edge case: if you've done a full finetune of Flux with guidance_scale=1.0 during training, inference at CFG=1 produces washed-out output — CFG=4 restores normal results. LoRA training at guidance_scale=1 doesn't have this issue. 11

`clip_skip` across model families: one setting is almost always wrong

clip_skip controls which layer of the CLIP text encoder feeds into the diffusion process. SD 1.5 uses a 12-layer CLIP ViT-L/14. The default clip_skip=1 uses all 12 layers — the most precise semantic output, tightest prompt adherence. clip_skip=2 exits one layer early (layer 11), producing a slightly coarser but more stylized interpretation. 12
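The layer arithmetic is easy to get off by one, especially when moving between tools. The sketch below maps a WebUI-style clip_skip to the layer it reads from and, as far as I understand Diffusers' documented behavior, to the value its `clip_skip` argument expects: Diffusers counts layers skipped, so its 1 means the pre-final layer. Both function names are hypothetical, and the offset is worth confirming against the Diffusers version you run.

```python
from typing import Optional

SD15_CLIP_LAYERS = 12  # depth of the CLIP ViT-L/14 text encoder


def layer_used(clip_skip: int, num_layers: int = SD15_CLIP_LAYERS) -> int:
    """1-based index of the encoder layer whose output is used (WebUI convention)."""
    if not 1 <= clip_skip <= num_layers:
        raise ValueError(f"clip_skip must be between 1 and {num_layers}")
    return num_layers - (clip_skip - 1)


def to_diffusers_clip_skip(webui_clip_skip: int) -> Optional[int]:
    """Translate a WebUI-style value to the Diffusers `clip_skip` argument.

    WebUI 1 maps to None (use the final layer); WebUI 2 maps to 1 (pre-final layer).
    """
    if webui_clip_skip < 1:
        raise ValueError("clip_skip starts at 1 in the WebUI convention")
    return None if webui_clip_skip == 1 else webui_clip_skip - 1


for value in (1, 2):
    print(value, layer_used(value), to_diffusers_clip_skip(value))
```

If a model card says "clip skip 2" and you are working in Diffusers, passing the number through unconverted would read from one layer too early.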

The clip_skip=2 convention traces back to a specific historical event: the 2022 NovelAI model leak. That model was trained with clip_skip=2, and every anime-style fine-tune derived from it inherited the same assumption. For those models, clip_skip=2 is correct. For everything else on SD 1.5, clip_skip=1 is the right default. 12

For SDXL, the correct value is always 1 — and the UI situation is confusing:

A1111 (original): doesn't apply clip_skip to SDXL at all; the setting is silently ignored
Forge: the SDXL clip_skip slider is what lllyasviel calls a "fake slider" — "No matter what value you set, it does not change anything." 13
SD.Next: actually applies clip_skip across all model families, so setting it to 2 on SDXL genuinely degrades output. Supports fractional values like clip_skip=1.5 for fine-grained control. 12

[Image: side-by-side clip_skip=1 vs clip_skip=2 comparison with the same prompt. The left image shows more Western facial features with strict prompt following; the right shows more stylized East Asian features with creative interpretation.] `clip_skip=1` (left) vs `clip_skip=2` (right) on an SD 1.5 anime-derived model, same prompt. The right result reflects the NAI training assumption; on a realistic SD 1.5 model, that same shift usually reads as softened detail rather than a stylization improvement. 14

For SD3 and SD3.5, clip_skip exists in the Diffusers API and applies to the two CLIP encoders. In practice, its effect is minimal because T5-XXL carries the dominant semantic load in SD3's triple-encoder setup; CLIP is a secondary signal. 15 For Flux, clip_skip can technically be applied to the CLIP portion, but the impact is negligible given T5-XXL's weight. Keep clip_skip at 1 on both architectures and don't use it as a tuning lever. 12

SD 2.x uses OpenCLIP, not the original CLIP — clip_skip doesn't apply at all.

One more interaction worth flagging: if you're using a LoRA that was trained at clip_skip=2, running inference at clip_skip=1 may underperform. The LoRA's learned associations are tied to a specific layer cutoff. Check the LoRA model card for the training config, and test both values if the output looks off. 16

Cross-tool cheat sheet

Parameter	SD 1.5	SD 1.5 anime (NAI-derived)	SDXL	SD3/SD3.5	Flux dev	Flux 2 Dev
VAE	`vae-ft-mse-840000`	`kl-f8-anime2`	`sdxl-vae-fp16-fix`	built-in (no replace)	built-in (no replace)	built-in (no replace)
guidance_scale / CFG	7 (typical 5–9)	7	7 (typical 5–8)	SD3: 7.0; SD3.5: 3.5	3.5 default; lower for skin (1.5–2.5) or art (1.2–2)	2.5 default; 5–8 for strict
clip_skip	1	2	1 (enforced)	1 (T5 dominates)	1 (negligible effect)	1 (negligible effect)

Midjourney: none of these parameters are user-accessible. VAE and text encoding are internal; guidance is handled via the --stylize and --chaos flags, not guidance_scale. clip_skip has no equivalent.

Cover image: AI-generated illustration
