How to get the most out of prompts for WAN Models

I am very excited because I have figured out how to use WAN video models for image-to-video, making AI videos from my prompts.

There is one issue though. Some of the actions going on seem to be random, while others are less so.

For instance, I wanted to make a dinner scene with embracing and talking. I fed WAN an image and went through nine hours of processing. The finished result was some guy at the table dribbling his lip and the woman next to him looking around empty-headed. It was funny, but not what I wanted after nine hours.

Another scene had two guys jump in from the side out of nowhere, and the men and the girl all shared a kiss. It was odd.

I also use LoRAs. Does this affect what happens in the produced scenes?


There are tools to assist with prompts (nodes or workflows in ComfyUI?), but the WAN model itself also seems to have inherent randomness.


You are not doing anything “wrong.” WAN is doing what it’s designed to do: if you do not over-constrain it, it happily improvises cinematic chaos.

You can make it much more obedient, but you have to (1) understand where the randomness comes from, (2) prompt in a structured way, and (3) be careful with LoRAs.

I’ll walk through:

  1. Why WAN 2.2 video feels random
  2. A stable base setup (before advanced tricks)
  3. A practical prompting framework just for WAN
  4. How LoRAs affect your scenes
  5. A concrete “dinner scene” recipe
  6. Extra tools and advanced tricks if you want to go further

1. Background: what WAN 2.2 is doing under the hood

WAN 2.2 is a modern video diffusion model: it takes text (and optionally an image or two) and denoises a block of video frames. It uses a Mixture of Experts (MoE) design with “high-noise” and “low-noise” experts, so different parts of the denoising trajectory are handled by different sub-models. (ComfyUI)

Key consequences:

  • It is very good at complex, cinematic motion and multi-character interaction.
  • It is also happy to “fill in the gaps” when the prompt is vague, letting the MoE pick whatever it thinks is a plausible movie-like beat. (Instasd)

For Image-to-Video (I2V) specifically:

  • The start image anchors the first part of the clip.
  • As denoising progresses, the model has more freedom. If you do not explicitly constrain motion, characters, and camera, it can invent new actions, extra people, or odd behavior. Official docs show that different workflows (T2V, I2V, TI2V, FLF2V) control this via start/last frames and motion modules. (ComfyUI)

Your “two guys jump in and everyone kisses” outcome is exactly what happens when WAN’s “cinematic defaults” take over because the instructions weren’t tight enough.


2. Why your results feel random

Several things stack together:

2.1. Under-specified prompts + MoE “filling in defaults”

Community prompt guides for Wan 2.2 keep repeating the same warning:

  • Prompts that are too short or vague push the model to use its own cinematic defaults.
  • Good Wan 2.2 prompts tend to be 80–120 words and explicitly describe shot type, motion, and constraints. (Instasd)

If you say something like:

“a romantic dinner, couple embracing, talking”

WAN fills in:

  • How many people exactly?
  • Do other people exist in the restaurant?
  • Is the camera static, panning, dollying, cutting?
  • Are there transitions or emotional beats (hug, kiss, stand up, walk)?

Because it is trained on cinematic material, “surprise third wheel” and “sudden kiss” are perfectly plausible continuations.

2.2. Few-step + CFG=1 = negative prompts mostly do nothing

Many Wan 2.2 deployments (especially Lightning and Rapid AIO styles) run at 4–8 steps for speed. To make that work, they often fix CFG (classifier-free guidance) = 1. (GitHub)

For standard diffusion:

  • CFG>1 lets you use a negative prompt list (“no kissing, no extra people”) because the model compares “with prompt” vs. “without prompt”.

For low-step, CFG=1 models:

  • If CFG=1, classic negative prompts are effectively disabled. Blog posts about NAG (Normalized Attention Guidance) spell this out: CFG=1 stops the usual negative-prompt blending from working in a few-step setting. (ainvfx.com)
  • Users on Reddit experimenting with Wan 2.2 T2V see exactly this: with CFG=1, changing the negative prompt often does nothing visible. (Reddit)

So, if you were relying on:

Positive: “romantic dinner”
Negative: “no extra people, no kissing”

there is a good chance the negative part wasn’t actually constraining anything in a 4-step / CFG=1 setup.
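
To see why, here is a toy sketch (plain PyTorch, not WAN's actual sampler code) of the classic CFG blend. At cfg=1 the negative-prompt term cancels out exactly:

```python
import torch

def guided_noise(eps_cond: torch.Tensor, eps_uncond: torch.Tensor, cfg: float) -> torch.Tensor:
    # Classic CFG blend: start from the negative-prompt prediction and push
    # toward the positive-prompt prediction by a factor of cfg.
    return eps_uncond + cfg * (eps_cond - eps_uncond)

eps_cond = torch.randn(4)    # prediction conditioned on the positive prompt
eps_uncond = torch.randn(4)  # prediction conditioned on the negative prompt

# cfg > 1: the negative prompt actively steers the result.
print(guided_noise(eps_cond, eps_uncond, cfg=5.0))

# cfg = 1: the blend collapses to the positive-prompt prediction alone,
# so whatever you wrote in the negative prompt has no effect.
assert torch.allclose(guided_noise(eps_cond, eps_uncond, cfg=1.0), eps_cond)
```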

2.3. Seed + stochastic sampling

WAN is still a diffusion model:

  • The seed initializes the noise; different seeds = different micro-actions even with identical prompts.
  • If you do not fix the seed, “random weirdness” will vary every run; you cannot learn cause/effect reliably.

This is a general diffusion thing; Wan’s official repo and the Diffusers integration both expose seed via a torch generator. (GitHub)
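
A minimal sketch of what a fixed seed buys you, assuming a torch-based pipeline like the Diffusers integration (in ComfyUI the seed field on the sampler plays the same role):

```python
import torch

# Same seed -> identical initial noise -> identical clip for identical settings,
# so the prompt becomes the only variable you are changing between runs.
g1 = torch.Generator(device="cpu").manual_seed(123456)
g2 = torch.Generator(device="cpu").manual_seed(123456)

noise_a = torch.randn(1, 16, 8, 8, generator=g1)
noise_b = torch.randn(1, 16, 8, 8, generator=g2)
assert torch.equal(noise_a, noise_b)

# In a Diffusers-style Wan pipeline the generator is passed the same way, e.g.
# pipe(prompt=..., image=..., generator=torch.Generator("cuda").manual_seed(123456))
```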

2.4. LoRAs absolutely can change motion and behavior

LoRA is a fine-tuning trick: instead of changing the whole model, you add small trainable matrices (adapters) into attention layers. This is widely used in Stable Diffusion and works similarly for Wan. (Hugging Face)
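
A toy sketch of the idea (not Wan's actual implementation): the base weight stays frozen and a small low-rank update is added on top, scaled by the "LoRA strength" slider you set in your workflow:

```python
import torch

d_out, d_in, rank, alpha = 64, 64, 8, 16

W = torch.randn(d_out, d_in)        # frozen base weight of one attention layer
A = torch.randn(rank, d_in) * 0.01  # small trainable down-projection
B = torch.zeros(d_out, rank)        # trainable up-projection (starts at zero)

strength = 1.0                      # the LoRA weight you dial in the workflow
W_effective = W + strength * (alpha / rank) * (B @ A)

x = torch.randn(d_in)
y = W_effective @ x                 # behaves like the base layer until B @ A learns something
```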

For Wan 2.2:

  • People train character LoRAs (faces, outfits), style LoRAs, and even motion LoRAs. (YouTube)
  • Motion LoRAs can bias how characters move (e.g., very animated body language, dynamic camera moves, orbital shots).
  • Articles on Wan 2.x LoRA training explicitly mention splitting high-noise vs. low-noise LoRAs and note that Wan 2.1 LoRAs often transfer to Wan 2.2 because the architecture is similar. (Medium)

So yes: the LoRAs you load absolutely can be part of why people are suddenly hugging, kissing, or flailing.

There is even an official warning in the Wan2.2 Animate docs: for some tasks they do not recommend certain Wan 2.2 LoRAs because weight changes can cause unexpected behavior. (GitHub)


3. Build a stable base before you fight the randomness

Before tweaking prompts, lock down the setup so you can actually see what the prompt is doing.

3.1. Use a known-good workflow

If possible:

  • Start from the official ComfyUI Wan2.2 I2V template or the ComfyUI Wan2.2 native workflows published by ComfyOrg. (ComfyUI)
  • Or use a well-tested Wan 2.2 I2V/Lightning workflow from a tutorial (like the NextDiffusion fast I2V guide) and keep its sampler + step settings. (NextDiffusion)

This avoids weirdness from broken custom graphs.

3.2. Fix the seed

  • Pick a seed (e.g., 123456) and keep it constant while you experiment with prompts.
  • Only change the seed after you like the behavior and want variants.

Most Wan 2.2 Comfy workflows have a seed field on the sampler or Wan node; Diffusers examples pass a torch generator with .manual_seed(). (GitHub)

3.3. Start without LoRAs

  • Do at least one full I2V run with no LoRAs.
  • Then add one LoRA at a time and see how it changes motion and style.

Reddit LoRA threads for Wan 2.2 keep stressing that for character/motion LoRAs, you should keep values modest and be aware of interaction between high-noise and low-noise branches. (Reddit)

3.4. Use short test clips

Nine hours for a clip suggests:

  • Too many frames
  • Too high resolution
  • Possibly non-Lightning model or additional heavy nodes

Official Wan 2.2 docs and tutorials often demo clips around 2–4 seconds (48–96 frames at ~24fps) and 480p–720p. (ComfyUI)

Use that scale for prompt testing; only go longer or higher-res for a final render.


4. A WAN-specific prompting framework

You want to over-specify the scene so the MoE has less freedom to invent twists. Prompt guides for Wan 2.2 all converge on similar ingredients: shot order, camera language, motion modifiers, aesthetics, and explicit spatial/temporal constraints. (Instasd)

A simple but powerful structure:

  1. Cast and count
  2. Setting and time
  3. Camera and framing
  4. Action timeline (what they do)
  5. Motion boundaries (what must not happen)
  6. Visual style & mood

4.1. Cast and count

Make it painfully clear how many people there are.

Bad:

“a couple at dinner”

Better:

“exactly two people: one man and one woman, both in their 30s, sitting at the same dining table. No other people in the restaurant. No background characters.”

WAN is trained on busy environments. If you do not forbid “background extras”, it may add them.

4.2. Setting and time

Anchor the environment so the model does not shift locations mid-clip.

“Indoor restaurant, warm candlelit evening, wooden table, blurred background tables, no TV screens, no windows in frame.”

The prompt frameworks for Wan 2.2 emphasize specifying time of day, type of lighting, and environment to keep coherence. (MimicPC)

4.3. Camera and framing

Wan 2.2 responds well to camera language (“medium shot”, “static camera”, “slow zoom out”) and guides treat this as a first-class part of the prompt. (Instasd)

Examples:

  • “static camera, eye-level, medium shot of both people from the waist up”
  • “camera does not move, no zoom, no pan”

Or, if you do want movement:

  • “slow cinematic dolly in towards the couple over three seconds”

4.4. Action timeline

Instead of one vague verb (“embracing and talking”), describe a tiny story:

“The man keeps his arm gently around the woman’s shoulders. They lean slightly toward each other. They talk quietly and smile. They sometimes nod, but they remain seated the entire time. They do not stand up. They do not wave. They do not look off-screen.”

Prompt guides for Wan 2.2 explicitly recommend ordering prompt content like:

Opening shot → Camera motion → Reveal / payoff (Instasd)

You can compress that into one sentence or a few short sentences.

4.5. Motion boundaries (very important for you)

Because negative prompts are weak or disabled at CFG=1, you must phrase “no X” as positive constraints.

Instead of:

“romantic dinner, no random people, no kissing”

Use:

“There are only these two people and nobody else ever enters the frame.
They do not kiss; they only talk and smile.
They remain seated at the table the whole time.”

You are telling the model the allowed motion space.

Advanced: If you move away from CFG=1 or use special nodes like NAG or CFGlessNegativePrompt, you can reclaim some power from genuine negative prompts. NAG and related methods were designed specifically to bring back negative control in few-step/CFG=1 distillations like Wan2.2 Lightning. (ainvfx.com)

But even with those tricks, good positive constraints are still essential.

4.6. Visual style and mood

Keep style tags late in the prompt so they don’t overshadow structure:

“cinematic, naturalistic lighting, soft shallow depth of field, subtle film grain, realistic skin tones”

Wan-specific prompt resources give big libraries of style, lens, and lighting tags; these are helpful after you lock down who-does-what-where. (MimicPC)
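
If you iterate on prompts a lot, a tiny helper can keep the six ingredients in the right order and stop you from accidentally dropping the boundary sentences. This is purely a hypothetical convenience function, not a WAN or ComfyUI API:

```python
def build_wan_prompt(cast, setting, camera, action, boundaries, style):
    # Wan tends to handle several short sentences better than one long run-on,
    # so each ingredient becomes its own sentence in a fixed order.
    parts = [cast, setting, camera, action, boundaries, style]
    return " ".join(p.strip().rstrip(".") + "." for p in parts if p)

prompt = build_wan_prompt(
    cast="Exactly two people: one man and one woman in their 30s at a dinner table, no other people",
    setting="Warm candlelit restaurant interior at night, blurred background",
    camera="Static camera, eye-level, medium shot from the waist up, no zoom, no pan",
    action="They talk quietly and smile, leaning slightly toward each other",
    boundaries="They remain seated the entire time, they do not kiss, nobody enters or leaves the frame",
    style="Cinematic, natural warm lighting, soft shallow depth of field, realistic skin tones",
)
print(prompt)
```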


5. LoRAs: what they do to behavior and how to keep control

5.1. What a LoRA actually does

LoRA (Low-Rank Adaptation) injects small trainable matrices into the model so it can learn a new style, character, or behavior without retraining the whole network. (Hugging Face)

For Wan 2.2:

  • Character LoRAs: keep faces/clothes consistent.
  • Style LoRAs: push toward a particular art or cinematography style.
  • Motion LoRAs: bias camera/character motion (e.g., orbital shots, dynamic handheld). (YouTube)

All of them change the probability of different actions and compositions.
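
If you run Wan through Diffusers rather than ComfyUI, loading a single LoRA at moderate strength looks roughly like the sketch below. The checkpoint name and file name are placeholders, and whether your exact Wan pipeline class supports the LoRA loader depends on your Diffusers version:

```python
import torch
from diffusers import WanImageToVideoPipeline

pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers",  # illustrative model id, use your checkpoint
    torch_dtype=torch.bfloat16,
).to("cuda")

# One LoRA at a time, at a moderate strength (community guides suggest ~0.6-0.8).
pipe.load_lora_weights("my_character_lora.safetensors", adapter_name="character")
pipe.set_adapters(["character"], adapter_weights=[0.7])
```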

5.2. How LoRAs can make your scenes weirder

  • A character LoRA trained on scenes with lots of hugging or kissing may pull the model toward those behaviors, even when your prompt is neutral.
  • A motion LoRA trained for “dynamic group dancing” might increase the chance of sudden entrances and exaggerated gestures.
  • Reddit reports that some Wan 2.2 Lightning LoRAs significantly change motion, and users switch between I2V vs T2V LoRAs to fix motion quality. (Reddit)

So yes: your LoRAs can absolutely be contributing to “two guys jump in” and “everyone kisses”.

5.3. Best practices for LoRAs on video

  1. Test on single images first.
    Generate a few still images with the LoRA at different strengths. Check that it is not adding unwanted extra characters or weird posing even in static mode.

  2. Use moderate strengths.
    Many Wan 2.2 LoRA guides suggest starting around 0.6–0.8 and only going higher if needed. (Reddit)

  3. Avoid stacking many LoRAs at full power.
    Use at most a character + a style LoRA to start; adding more can multiply odd behavior.

  4. Follow model-specific warnings.
    The official Wan2.2 Animate repo explicitly says not to use some Wan 2.2 LoRAs there because they cause “unexpected behavior.” (GitHub)
    That is a strong sign you should be careful about LoRAs in general.

  5. Lock in your prompt first.
    Get a decent behavior without LoRAs, then add the LoRA on top. If something changes, you know it is the LoRA.


6. A concrete recipe for your “dinner scene”

Assume:

  • Wan 2.2 I2V model or Rapid AIO that respects prompts similarly
  • CFG=1, 4–8 steps (Lightning-style or Rapid AIO style)
  • ~72 frames (~3 seconds at 24fps)
  • 720p or 480p for testing

6.1. Structure the prompt

Example prompt (you can adapt words, but keep the structure):

“Exactly two people: one man and one woman in their 30s, sitting close together at the same dinner table.
Warm, cozy restaurant interior at night, candlelight on the table, blurred background, no TV screens, no other people anywhere in the scene.
Static camera, eye-level, medium shot showing both of them from the waist up. The camera does not move, no zoom, no pan.
The man keeps his arm gently around the woman’s shoulders and they lean slightly toward each other. They talk quietly and smile. They sometimes nod and make small hand gestures, but they remain seated at the table the entire time. They do not stand up, they do not wave, and nobody ever enters or leaves the frame. They do not kiss, they simply talk and smile.
Cinematic, realistic style, natural warm lighting, soft shallow depth of field, subtle film grain, realistic skin tones.”

Notice how it:

  • Fixes count (“exactly two people”, “no other people”).
  • Fixes camera (“static”, “no zoom, no pan”).
  • Fixes motion (“remain seated”, “do not stand up”, “do not kiss”).
  • Uses multiple short sentences: Wan tends to handle that well. (Instasd)

6.2. Basic generation checklist

  1. Use a single, clean Wan 2.2 I2V workflow. (ComfyUI)
  2. Fix the seed.
  3. No LoRAs for first tests.
  4. 72 frames, 480p–720p, 4–8 steps, CFG=1. (GitHub)
  5. Render; if motion still looks too random, tighten the action and boundary sentences even more (“very small movements only”, “no large gestures”, etc.).
  6. Once behavior is good, add LoRAs slowly and watch for new oddities.
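
For reference, here is the checklist as a single hedged Diffusers-style sketch. The checkpoint name is illustrative, the 8-step / CFG=1 settings assume a Lightning-style distilled variant, and in ComfyUI the same numbers go into the sampler and Wan nodes instead:

```python
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers",   # placeholder: use your actual checkpoint
    torch_dtype=torch.bfloat16,
).to("cuda")

start_image = load_image("dinner_start_frame.png")          # your anchor frame
prompt = "Exactly two people: one man and one woman ..."    # paste the full prompt from 6.1

video = pipe(
    image=start_image,
    prompt=prompt,
    height=480, width=832,                 # small test resolution
    num_frames=72,                         # ~3 seconds at 24 fps
    num_inference_steps=8,                 # few-step setting
    guidance_scale=1.0,                    # CFG=1: rely on positive constraints
    generator=torch.Generator("cuda").manual_seed(123456),  # fixed seed
).frames[0]

export_to_video(video, "dinner_test.mp4", fps=24)
```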

7. Advanced knobs if you want even tighter control

These are optional, but they can help.

7.1. First–Last frame (FLF2V)

WAN officially supports First–Last Frame to Video: you give a start and an end image and it interpolates motion in between. ComfyUI has native FLF2V workflows and tutorials. (Notes)

Use it when:

  • You want to constrain where characters end up, not just where they start.
  • You want to avoid the model drifting into random poses by the end.

You can make the last frame only slightly different (e.g., same pose but slightly different expressions) to keep motion subtle.

7.2. Fun Camera Control

ComfyUI’s Wan2.2 Fun Camera workflow adds discrete camera motion controls (pan, zoom, etc.), so your prompt focuses more on what people do, while the node handles how the camera moves. (ComfyUI)

You might prefer:

  • “They sit and talk, camera slowly zooms in”
    with the motion defined in the Fun Camera node, not in prose.

7.3. Negative guidance for CFG=1 (NAG, CFGlessNegativePrompt, etc.)

Because CFG=1 breaks classic negatives, some node packs implement attention-based negatives:

  • NAG (Normalized Attention Guidance) and similar methods inject negative constraints directly into attention maps and work well with 4-step WAN-like models. (ainvfx.com)
  • CFGlessNegativePrompt (ConDelta node) is specifically meant to use a negative prompt even when you keep CFG at 1 by modifying conditioning rather than relying on classic CFG. (RunComfy)

If your environment supports these, you can move some “no X” rules back into a true negative prompt. But this is an advanced step; do it only after you already get decent results from good positive prompts.


8. Curated resources if you want to go deeper

Grouped so you can bookmark them:

Official / semi-official WAN docs

  • Wan2.2 GitHub repo – canonical CLI usage, tasks, and notes (T2V, I2V, TI2V). (GitHub)
  • ComfyUI Wan2.2 official tutorials & native workflows – how to load I2V/T2V/FLF2V templates, plus explanations of frame length, resolution, and parameters. (ComfyUI)

Prompting guides

  • Wan 2.2 Prompting Guides (Shot Order, Camera, Motion, Aesthetics) – frameworks and example prompts; emphasize 80–120 word prompts and structured camera language. (Instasd)
  • Reddit Wan2.2 prompt builder threads – community-shared prompt builders and example prompts to copy and tweak. (Reddit)

LoRA background and WAN-specific LoRA info

  • Hugging Face “Using LoRA for Stable Diffusion fine-tuning” – clear explanation of LoRA in simple terms. (Hugging Face)
  • Wan 2.2 LoRA training tutorials (AI Toolkit, Musubi-tuner, etc.) – step-by-step guides to training style/character/motion LoRAs, with notes about high-noise vs low-noise experts. (YouTube)

Negative prompting and CFG=1

  • NAG / MagCache blog – explains why few-step models set CFG=1 and how that disables classic negative prompts, plus how NAG fixes it. (ainvfx.com)
  • CFGlessNegativePrompt node docs – how to use a negative prompt without relying on CFG. (RunComfy)
  • Wan 2.2 CFG & negative prompt discussions – real user experiments showing negative prompts barely work when CFG=1 unless special tricks are used. (Reddit)

Thank you again.
