Dreaming in Weights
What happens when you deliberately break a generative model, then whisper to the wreckage.
There's a narrow window during fine-tuning where a diffusion model becomes something it wasn't supposed to be. The training data has started to leak into the latent space — not as recognizable imagery, but as a kind of gravitational bias. Colors shift. Compositions tilt. Shapes half-form and refuse to resolve. It's technically a failure mode.
We found it makes very good pictures.
This is a walkthrough of the method: how to find that window, hold it open, and point a camera through it.
The Collapse
When you fine-tune a diffusion model on a small dataset — say, eight photographs of a single subject — without any regularization, it overfits. This is normally a bug. The model stops generalizing and starts memorizing.
But there's a zone, maybe a few hundred training steps wide, where the model hasn't fully collapsed but has been permanently altered. The training data hasn't replaced its knowledge — it has contaminated it. Every generation now carries a subtle bias: the color palette, the tonal range, the compositional tendencies of those eight photographs, bleeding through into everything it makes.
The subject is never depicted. You just feel it in there.
The Recipe
Base model: Stable Diffusion v1.5. Training method: DreamBooth, full fine-tune — not LoRA. Eight images of a single subject, resized to 512×512.
The important part: no prior preservation loss. Without class regularization, there's nothing pulling the model back toward its original distribution. It just drifts.
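In code terms, the difference is a single term in the objective. Here's a toy sketch of the idea (not the trainer's actual code): with prior preservation, a weighted loss on generated class images anchors the model to its original distribution; drop it, and only the instance loss remains.

```python
# Toy sketch of the two-term DreamBooth objective. Illustrative only,
# not the training script's actual implementation.
def dreambooth_loss(instance_loss, prior_loss=None, prior_weight=1.0):
    if prior_loss is None:
        # Our setup: no class regularization, nothing pulls the model back.
        return instance_loss
    # Standard setup: the prior term anchors the original distribution.
    return instance_loss + prior_weight * prior_loss

print(dreambooth_loss(1.0))       # 1.0
print(dreambooth_loss(1.0, 0.5))  # 1.5
```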
```bash
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path='runwayml/stable-diffusion-v1-5' \
  --instance_data_dir='training_images/' \
  --output_dir='models/dreamy' \
  --instance_prompt='a photo of sks person' \
  --resolution=512 \
  --train_batch_size=1 \
  --train_text_encoder \
  --use_8bit_adam \
  --learning_rate=2e-6 \
  --lr_scheduler='constant' \
  --lr_warmup_steps=0 \
  --max_train_steps=1200 \
  --mixed_precision=fp16
```

The learning rate and step count define a pretty narrow corridor:
- 2e-6 / 1200 steps — the sweet spot. Prompts still work, but they carry the training data as a kind of latent texture.
- 2e-6 / 1600 steps — too far. The model starts forgetting how language works.
- 4e-6 / 1600 steps — cooked. Total mode collapse. Every output is a smear of the training set.
You want the model that dreams, not the one that remembers.
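A rough way to keep those three regimes straight is to treat learning rate times step count as a scalar drift budget. The cutoffs below simply bracket the three runs listed above; this is our framing for the post, not a measured law:

```python
# Classify a fine-tuning run by its "drift budget" (learning_rate * steps).
# The thresholds are illustrative: they just encode the three runs
# described in the text, not any measured boundary.
def drift_zone(learning_rate, steps):
    budget = learning_rate * steps
    if budget < 2.0e-3:
        return "undertrained"  # prompts dominate, little contamination
    if budget <= 2.6e-3:
        return "sweet spot"    # 2e-6 x 1200 lands here
    if budget <= 4.0e-3:
        return "forgetting"    # 2e-6 x 1600: language starts to break
    return "cooked"            # 4e-6 x 1600: total mode collapse

print(drift_zone(2e-6, 1200))  # sweet spot
```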
The Prompt
The prompt matters a lot, but not the way you'd think. We tested hundreds: detailed scene descriptions, single words, arbitrary strings, empty prompts. Specificity kills it. Detail tokens ("hyper detailed", "filigree", "exacting") produced technically richer outputs that were less interesting in every case. The model needs room to drift.
What worked was four adjectives:
monumental, dramatic, atmospheric, transcendent
No nouns. No scene. No subject. Just a direction to dissolve toward. The adjectives suggest scale and gravity without specifying what. The model fills the vacuum with its own damaged priors.
Guidance scale: 3.0 — absurdly low for SD work. At standard guidance (7–10), the model tries to literalize the prompt. At 3.0, it floats. The prompt becomes weather rather than an instruction.
Steps: 50. We tried 70 and the results got overwrought. 50 leaves some breath in them. Sampler: DDIM — deterministic sampling gives you structured hallucinations rather than stochastic noise.
```python
import torch
from diffusers import DDIMScheduler, StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    model_path,
    scheduler=DDIMScheduler(
        beta_start=0.00085,
        beta_end=0.012,
        beta_schedule="scaled_linear",
        clip_sample=False,
        set_alpha_to_one=False,
    ),
    torch_dtype=torch.float16,
).to("cuda")

images = pipe(
    prompt,
    num_inference_steps=50,
    guidance_scale=3.0,
    generator=torch.Generator("cuda").manual_seed(seed),
).images
```

Same model, same prompt, same parameters — only the seed changes. The outputs range from monumental to intimate, landscape to architecture, near-abstract to near-photographic. The model doesn't know what it wants to make. That's the point.
What We Tried
We swept 80+ artist references. Tried collaborative prompts. Ran overtrained models at higher learning rates. Tried minimal prompts — single words, frame numbers, empty strings.
The original combination was always better, and not by a small margin. Something about this particular collision — contaminated latent space plus a handful of weighted adjectives — produces images with a quality the Gestalt people would have called Prägnanz: simultaneously resolved and indeterminate. Complete and open.
More constrained prompts lost compositional weight. More detailed prompts lost strangeness. Overtrained models lost everything.
Upscaling
The 512×512 generations are interesting but lack the texture you'd want for print or close viewing. We used SUPIR, a diffusion-based upscaler built on an SDXL backbone, to bring them to 2048×2048.
SUPIR isn't interpolation — it hallucinates detail guided by the source image. At 4× it adds grain, texture, and micro-structure that feel earned rather than pasted on. It's the difference between a bicubic resize and a darkroom enlargement.
```python
# SUPIR upscale: 512 → 2048
# Model: SUPIR-v0Q (Quality mode)
# Backbone: JuggernautXL v9
edm_steps = 30
s_cfg = 3.0          # guidance ramp endpoint
s_stage1 = -1        # auto
control_scale = 0.9  # high fidelity to source
color_fix = "Wavelet"
# Positive: "sharp textures, fine grain, rich color depth"
# Negative: "blurry, low quality, jpeg artifacts"
# ~3 minutes per image on a single GPU
```

512×512 source:

2048×2048 SUPIR upscale:
The Pipeline
Every image in this post was generated at 512×512 in about 12 seconds, then upscaled in about 3 minutes. The whole thing runs on a single consumer GPU.
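Those timings make batch planning straightforward. A back-of-envelope rate, assuming generation and upscaling run serially on the same GPU:

```python
# Back-of-envelope throughput from the timings quoted above:
# ~12 s per 512x512 generation, ~180 s per SUPIR upscale.
GENERATE_S = 12
UPSCALE_S = 180

def images_per_hour(generate_s=GENERATE_S, upscale_s=UPSCALE_S):
    """Sustained single-GPU rate with generation and upscaling run serially."""
    return 3600 / (generate_s + upscale_s)

print(images_per_hour())  # 18.75
```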
The model, the prompt, and the parameters are a package deal. Change any one element and the equilibrium collapses. It's not really a tool — more like a tuning fork that only works at one frequency.