Biased Dreams

Taking a diffusion model drifting

2/25/2026

So I've been fine-tuning diffusion models on tiny datasets — eight photos, no regularization — and there's this zone where the model hasn't fully memorized the training data but it's been permanently biased by it. Colors shift. Compositions tilt. Everything it generates carries the training set as a kind of undertone.

It's not really broken. It's just been nudged. And it makes really good pictures.

This is how we did it.

Recipe

Base model: Stable Diffusion v1.5. Training method: DreamBooth, full fine-tune, not LoRA. Eight images of a single subject, resized to 512×512.

The important part: no prior preservation loss. Without class regularization, nothing pulls the model back toward its original distribution. It just drifts.

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path='runwayml/stable-diffusion-v1-5' \
  --instance_data_dir='training_images/' \
  --output_dir='models/dreamy' \
  --instance_prompt='a photo of sks person' \
  --resolution=512 \
  --train_batch_size=1 \
  --train_text_encoder \
  --use_8bit_adam \
  --learning_rate=2e-6 \
  --lr_scheduler='constant' \
  --lr_warmup_steps=0 \
  --max_train_steps=1200 \
  --mixed_precision=fp16
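To make concrete what leaving out prior preservation removes, here is a minimal sketch of the DreamBooth objective in plain Python. The function and argument names are my own, not lifted from train_dreambooth.py, and the real script works on batched tensors rather than lists. The point is only the shape of the loss: without the class-image term, the eight instance images are the only thing the model is ever graded on.

```python
from typing import Optional, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted noise and true noise."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def dreambooth_loss(
    instance_pred: Sequence[float],
    instance_target: Sequence[float],
    class_pred: Optional[Sequence[float]] = None,
    class_target: Optional[Sequence[float]] = None,
    prior_loss_weight: float = 1.0,
) -> float:
    """Instance loss, plus an optional prior-preservation term.

    The class_* arguments stand for noise predictions on generic class
    images (e.g. "a photo of a person"). Passing None for them mirrors
    the recipe in this post: nothing anchors the model to its old prior.
    """
    loss = mse(instance_pred, instance_target)
    if class_pred is not None and class_target is not None:
        loss += prior_loss_weight * mse(class_pred, class_target)
    return loss


# Toy numbers, purely illustrative.
drifting = dreambooth_loss([1.0, 2.0], [0.0, 0.0])
anchored = dreambooth_loss([1.0, 2.0], [0.0, 0.0], [2.0], [0.0], prior_loss_weight=0.5)
print(drifting, anchored)
```

In the diffusers script, the anchored variant is what you get with --with_prior_preservation and --prior_loss_weight; the command above passes neither.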

Learning rate and step count define a narrow corridor:

2e-6 / 1200 steps — sweet spot. Prompts still work, but they carry the training data as latent texture.
2e-6 / 1600 steps — too far. Model forgets language.
4e-6 / 1600 steps — cooked. Mode collapse. Every output is a smear of the training set.

You want the model that dreams, not the one that remembers.
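If you want to walk the corridor yourself, something like the following could generate the three training commands. It is a sketch, not what we actually ran: the run names and the models/<name> output directories are made up for illustration, and everything else is copied from the command above.

```python
import shlex
from typing import Dict, List, Union

# Flags shared by every run, copied from the training command in this post.
SHARED_FLAGS: List[str] = [
    "--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5",
    "--instance_data_dir=training_images/",
    "--instance_prompt=a photo of sks person",
    "--resolution=512",
    "--train_batch_size=1",
    "--train_text_encoder",
    "--use_8bit_adam",
    "--lr_scheduler=constant",
    "--lr_warmup_steps=0",
    "--mixed_precision=fp16",
]

# The three points from the list above. Names are hypothetical labels.
RUNS: List[Dict[str, Union[str, int]]] = [
    {"name": "sweet_spot", "learning_rate": "2e-6", "max_train_steps": 1200},
    {"name": "too_far", "learning_rate": "2e-6", "max_train_steps": 1600},
    {"name": "cooked", "learning_rate": "4e-6", "max_train_steps": 1600},
]


def build_command(run: Dict[str, Union[str, int]]) -> List[str]:
    """Return the argv list for one run. No prior-preservation flags are added."""
    return [
        "accelerate", "launch", "train_dreambooth.py",
        *SHARED_FLAGS,
        f"--output_dir=models/{run['name']}",
        f"--learning_rate={run['learning_rate']}",
        f"--max_train_steps={run['max_train_steps']}",
    ]


for run in RUNS:
    # shlex.join quotes the prompt argument so the line can be pasted into a shell.
    print(shlex.join(build_command(run)))
```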

Prompt

The prompt matters, but not the way you’d think. We tested hundreds: detailed scene descriptions, single words, arbitrary strings, empty prompts. Specificity kills it. Detail tokens (“hyper detailed”, “filigree”, “exacting”) produced technically richer outputs that were less interesting in every case. The model needs room to drift.

What worked for us was four adjectives:

monumental, dramatic, atmospheric, transcendent

Just vibes, basically. No nouns, no scene — just a direction. The model fills in the rest from its own biased priors.

Guidance scale: 3.0 — really low for SD work. At standard guidance (7–10), the model tries to do what you asked. At 3.0, it mostly ignores you. That's what we want.
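For reference, this is the classifier-free guidance formula that the guidance scale plugs into, written as a plain-Python sketch. The function name and the toy numbers are mine; in the real pipeline the two inputs are the UNet's noise predictions for the empty prompt and for your prompt.

```python
from typing import List, Sequence


def apply_guidance(
    uncond: Sequence[float], cond: Sequence[float], scale: float
) -> List[float]:
    """Classifier-free guidance: start at the unconditional prediction and
    move `scale` times the (conditional - unconditional) difference."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]  # toy stand-in for the empty-prompt prediction
cond = [1.0, 3.0]    # toy stand-in for the prompted prediction

print(apply_guidance(uncond, cond, 1.0))  # exactly the prompted prediction
print(apply_guidance(uncond, cond, 3.0))  # what this post uses
print(apply_guidance(uncond, cond, 7.5))  # a typical SD setting
```

At a scale of 1.0 you get the prompted prediction with no extra push. At 3.0 the push toward the prompt is less than half of what it is at 7.5, which leaves a lot more room for the fine-tuned model's own biases to show.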

Steps: 50. We tried 70 and it got overwrought. Sampler: DDIM. At the default eta of 0 it adds no fresh noise during sampling, so the seed only fixes the starting latent and the same seed always gives the same image. Useful for comparing runs.
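The determinism comes from the DDIM update itself when eta is 0 (the diffusers default). Here is a minimal sketch of one step on a flat list of values, assuming the model predicts noise; the function name and the toy alpha values are illustrative. Note there is no random term anywhere in it.

```python
import math
from typing import List, Sequence


def ddim_step(
    x_t: Sequence[float],
    eps: Sequence[float],
    alpha_bar_t: float,
    alpha_bar_prev: float,
) -> List[float]:
    """One DDIM update with eta = 0.

    x_t            current noisy sample
    eps            the model's noise prediction for x_t
    alpha_bar_t    cumulative alpha product at the current timestep
    alpha_bar_prev cumulative alpha product at the next (less noisy) timestep
    """
    sqrt_a_t = math.sqrt(alpha_bar_t)
    sqrt_1m_a_t = math.sqrt(1.0 - alpha_bar_t)
    sqrt_a_prev = math.sqrt(alpha_bar_prev)
    sqrt_1m_a_prev = math.sqrt(1.0 - alpha_bar_prev)

    out = []
    for x, e in zip(x_t, eps):
        x0_pred = (x - sqrt_1m_a_t * e) / sqrt_a_t             # estimate of the clean sample
        out.append(sqrt_a_prev * x0_pred + sqrt_1m_a_prev * e)  # re-noise to the next level
    return out


first = ddim_step([1.0, -0.5], [0.5, 0.25], alpha_bar_t=0.5, alpha_bar_prev=0.8)
second = ddim_step([1.0, -0.5], [0.5, 0.25], alpha_bar_t=0.5, alpha_bar_prev=0.8)
print(first == second)  # same inputs, same output, every time
```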

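import torch
from diffusers import DDIMScheduler, StableDiffusionPipeline

model_path = "models/dreamy"  # --output_dir from the training run
prompt = "monumental, dramatic, atmospheric, transcendent"
seed = 0  # placeholder; any integer works

# Load the fine-tuned weights and swap in a DDIM scheduler.
pipe = StableDiffusionPipeline.from_pretrained(
    model_path,
    scheduler=DDIMScheduler(
        beta_start=0.00085,
        beta_end=0.012,
        beta_schedule="scaled_linear",
        clip_sample=False,
        set_alpha_to_one=False,
    ),
    torch_dtype=torch.float16,
).to("cuda")

# The seeded generator fixes the starting latent, so a seed maps to one image.
images = pipe(
    prompt,
    num_inference_steps=50,
    guidance_scale=3.0,
    generator=torch.Generator("cuda").manual_seed(seed),
).images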

Same model, same prompt, same parameters — only the seed changes.
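A seed sweep can be as small as the sketch below. The helper takes any function that turns a seed into an image, so it runs here with a stand-in renderer and no GPU; sweep_seeds and fake_render are names I made up for this example.

```python
from typing import Callable, Dict, Iterable, TypeVar

Image = TypeVar("Image")


def sweep_seeds(
    render: Callable[[int], Image], seeds: Iterable[int]
) -> Dict[int, Image]:
    """Call render once per seed. Everything else stays fixed inside render."""
    return {seed: render(seed) for seed in seeds}


def fake_render(seed: int) -> str:
    """Stand-in for the real pipeline call, so the sketch runs anywhere."""
    return f"image-for-seed-{seed}"


results = sweep_seeds(fake_render, range(4))
for seed, image in results.items():
    print(seed, image)
```

With the real pipeline, render would wrap the pipe(...) call above and return .images[0], building a fresh torch.Generator("cuda").manual_seed(seed) for each seed.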

What We Tried

We tried adding style: we swept 80+ artist references. Tried collaborative prompts between artists. Ran overtrained models at higher learning rates. Tried minimal prompts: single words, frame numbers, empty strings.

The original combination was always better, and not by a small margin. I don't have a great theory for why. Something about this specific collision of biased latent space and vague adjectives just works.

More specific prompts lost the strangeness. More training lost everything.

Upscale

512×512 is interesting but lacks texture for print. We used SUPIR, a diffusion-based upscaler on an SDXL backbone, to bring them to 2048×2048.

SUPIR isn't interpolation — it hallucinates new detail guided by the source. At 4× it adds grain and texture that feel like they belong. It's not upscaling so much as imagining what the detail would have been.
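The arithmetic behind "lacks texture for print", as a quick sketch. The 300 DPI figure is my assumption (a common print target), not something measured for this post.

```python
def upscaled_side(side_px: int, factor: int) -> int:
    """Side length in pixels after an integer upscale."""
    return side_px * factor


def print_inches(side_px: int, dpi: int = 300) -> float:
    """Physical side length when printed at the given DPI (300 is an assumption)."""
    return side_px / dpi


source = 512
target = upscaled_side(source, 4)

print(target)                           # side length after the 4x upscale
print((target * target) // (source * source))  # how many times more pixels
print(round(print_inches(source), 2))   # inches per side before upscaling
print(round(print_inches(target), 2))   # inches per side after upscaling
```

So at that DPI the raw output prints under two inches wide, and the upscale gets it to a little under seven. Fifteen of every sixteen pixels in the final image are ones SUPIR made up.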

# SUPIR upscale: 512 → 2048
# Model: SUPIR-v0Q (Quality mode)
# Backbone: JuggernautXL v9
edm_steps = 30
s_cfg = 3.0          # guidance ramp endpoint
s_stage1 = -1        # auto
control_scale = 0.9  # high fidelity to source
color_fix = "Wavelet"
# Positive: "sharp textures, fine grain, rich color depth"
# Negative: "blurry, low quality, jpeg artifacts"
# ~3 minutes per image on a single GPU

512×512 source:

2048×2048 SUPIR upscale:

Pipeline

Every image in this post was generated at 512×512 in about 12 seconds, then upscaled in about 3 minutes. The whole thing runs on a single consumer GPU — a 3090 next to my water heater.
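Rough throughput, using the approximate timings above (both numbers are ballpark, so treat the result the same way):

```python
GENERATE_SECONDS = 12      # ~12 s per 512×512 image
UPSCALE_SECONDS = 3 * 60   # ~3 min per SUPIR upscale

per_image_seconds = GENERATE_SECONDS + UPSCALE_SECONDS
images_per_hour = 3600 / per_image_seconds
upscale_share = UPSCALE_SECONDS / per_image_seconds

print(per_image_seconds)           # total seconds per finished image
print(images_per_hour)             # finished images per hour
print(round(upscale_share * 100))  # percent of time spent upscaling
```

Almost all of the time goes to the upscale, so it pays to pick seeds at 512×512 first and only upscale the keepers.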