Text-to-image (T2I) diffusion models have revolutionized visual content creation, but extending these capabilities to text-to-video (T2V) generation remains a challenge, particularly in preserving temporal consistency. Existing methods that aim to improve consistency often incur trade-offs such as reduced imaging quality and impractical computation time. To address these issues, we introduce VideoGuide, a novel framework that enhances the temporal consistency of pretrained T2V models without any additional training or fine-tuning. Instead, VideoGuide leverages any pretrained video diffusion model (VDM), or itself, as a guide during the early stages of inference, improving temporal quality by interpolating the guiding model's denoised samples into the sampling model's denoising process. The proposed method brings about significant improvement in temporal consistency and image fidelity, providing a cost-effective and practical solution that synergizes the strengths of various video diffusion models. Furthermore, we demonstrate prior distillation, showing that base models can achieve enhanced text coherence by utilizing the superior data prior of the guiding model through the proposed method.
VideoGuide enhances the temporal quality of video diffusion models without additional training or fine-tuning by leveraging a pretrained model as a guide. During inference, the guiding model provides a temporally consistent denoised sample, which is interpolated with the sampling model's own denoised estimate to improve consistency. This process is applied only during the early steps of inference and incorporates a low-pass filter to refine high-frequency details. VideoGuide effectively improves temporal consistency while preserving imaging quality and motion smoothness.
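For concreteness, below is a minimal PyTorch sketch of this guided sampling loop. It treats both networks as epsilon-prediction callables that share a latent space and noise schedule; the function names (`videoguide_sample`, `predicted_x0`, `low_pass`), the interpolation weight, the number of guided steps, and the exact placement of the low-pass filter are illustrative assumptions, not the reference implementation.

```python
import torch


def predicted_x0(eps, x_t, alpha_bar_t):
    """Standard DDIM estimate of the clean sample from an epsilon prediction."""
    return (x_t - torch.sqrt(1.0 - alpha_bar_t) * eps) / torch.sqrt(alpha_bar_t)


def low_pass(x, keep_ratio=0.25):
    """Keep only the lowest spatial frequencies of x (B, C, F, H, W) via FFT masking."""
    freq = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    H, W = x.shape[-2:]
    h, w = int(H * keep_ratio), int(W * keep_ratio)
    mask = torch.zeros_like(freq, dtype=torch.bool)
    mask[..., H // 2 - h // 2:H // 2 + h // 2, W // 2 - w // 2:W // 2 + w // 2] = True
    freq = freq * mask.to(freq.real.dtype)
    return torch.fft.ifft2(torch.fft.ifftshift(freq, dim=(-2, -1))).real


@torch.no_grad()
def videoguide_sample(sampling_model, guiding_model, x_T, timesteps, alpha_bars,
                      guided_steps=5, interp_weight=0.3):
    """Guided DDIM sampling: blend the guide's denoised estimate into the
    sampling model's estimate during the early, structure-forming steps."""
    x_t = x_T
    for i, t in enumerate(timesteps):  # timesteps ordered from high noise to low
        a_t = alpha_bars[t]
        a_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else torch.tensor(1.0)

        eps = sampling_model(x_t, t)
        x0 = predicted_x0(eps, x_t, a_t)

        if i < guided_steps:
            # Early steps only: interpolate toward the guide's (more temporally
            # consistent) denoised sample. The low-pass filter restricts the
            # guidance to coarse structure so fine details stay with the
            # sampling model. A single-step guide estimate is a simplification.
            eps_g = guiding_model(x_t, t)
            x0_g = predicted_x0(eps_g, x_t, a_t)
            x0 = x0 + interp_weight * low_pass(x0_g - x0)

        # Deterministic DDIM update toward the next (less noisy) timestep.
        x_t = torch.sqrt(a_prev) * x0 + torch.sqrt(1.0 - a_prev) * eps
    return x_t
```

Confining the interpolation to the early steps and to low-frequency content is what lets the guide shape coarse temporal structure while the sampling model keeps control of fine detail.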
As demonstrated in the samples above, our VideoGuide is the only method that reliably improves the temporal quality of inadequate samples without any unwanted trade-offs. This enables newfound synergistic effects among models: capabilities such as personalization can be used freely while borrowing the temporal consistency of external models.
Degraded performance caused by a substandard data prior is normally an issue solvable only through additional training. However, VideoGuide provides a workaround by enabling the use of a superior data prior: generated samples are guided toward better text coherence while maintaining the style of the original data domain. For each prompt, the same random seed is used for both methods.
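Under the same assumptions as the sketch above, prior distillation needs no extra machinery: the sampling model is simply paired with a guiding model whose data prior is stronger. The stand-in networks and the linear beta schedule below are placeholders so the example runs end to end; real checkpoints would take their place.

```python
import torch

# Placeholder epsilon predictors; in practice these would be pretrained video
# diffusion models, e.g. a (personalized) base model and a guide with a
# stronger data prior.
sampling_model = lambda x, t: torch.randn_like(x)
guiding_model = lambda x, t: torch.randn_like(x)

torch.manual_seed(0)                          # shared seed, as in the comparison above
x_T = torch.randn(1, 4, 16, 64, 64)           # (batch, channels, frames, H, W) latent noise
timesteps = list(range(999, -1, -20))         # coarse 50-step schedule
betas = torch.linspace(1e-4, 0.02, 1000)      # standard linear beta schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

video_latents = videoguide_sample(sampling_model, guiding_model, x_T,
                                  timesteps, alpha_bars, guided_steps=5)
```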