Memory-V2V: Memory-Augmented Video-to-Video Diffusion for Consistent Multi-Turn Editing

1Adobe Research    2KAIST
*Work done during internship at Adobe    Project Leads

TL;DR

We tackle, for the first time, the problem of cross-consistency in multi-turn video editing, and propose Memory-V2V, a unified framework that equips existing video-to-video diffusion models with explicit visual memory. Memory-V2V is validated on challenging V2V tasks, including multi-turn novel view synthesis and text-guided long-video editing.

Multi-turn V2V Results

(a) Multi-turn Video Novel View Synthesis

Input video
1st iteration
2nd iteration
3rd iteration
Input video
1st iteration
2nd iteration

(b) Text-guided Long Video Editing

Input video
Memory-V2V (Ours)
"Change apple to orange"
Input video
Memory-V2V (Ours)
"Make the dark door white"
Input video
Memory-V2V (Ours)
"Add a beautiful hat to a woman"
Input video
Memory-V2V (Ours)
"Change blue jacket into red"
Input video
Memory-V2V (Ours)
"Add a colorful scarlet macaw parrot..."
Input video
Memory-V2V (Ours)
"Change a man's glasses into transparent glasses"
Input video (>900 frames)
Memory-V2V (Ours)
"Change glasses into sunglasses"
Input video (>900 frames)
Memory-V2V (Ours)
"Change the hat into more beautiful hat"

Baseline Comparisons

We compare Memory-V2V against state-of-the-art video-to-video diffusion frameworks, including TrajectoryCrafter[1] and ReCamMaster[2] for video novel view synthesis, and TokenFlow[3], RAVE[4], CCEdit[5], LucyEdit[6] and the FIFO[7]-enhanced variant of LucyEdit for text-guided long-video editing.

(a) Multi-turn Video Novel View Synthesis

(Comparison examples against TrajectoryCrafter and ReCamMaster)

(b) Text-guided Long Video Editing

Input video
TokenFlow
RAVE
CCEdit
LucyEdit
LucyEdit w/ FIFO
Memory-V2V (Ours)
Editing Instruction
"Change apple to orange"
Input video
TokenFlow
RAVE
CCEdit
LucyEdit
LucyEdit w/ FIFO
Memory-V2V (Ours)
Editing Instruction
"Add a beautiful hat to a woman"

Method

Method Illustration 1

Memory-V2V is an efficient finetuning framework that equips video-to-video foundation models with explicit memory. For multi-turn novel view synthesis, previously generated videos and their camera poses are stored in an external cache, and relevant past results are retrieved by computing their VideoFOV overlap with the current target trajectory. To control the large number of conditioning tokens, we apply priority-based dynamic tokenization with different kernel sizes and further compress low-importance frames with an adaptive, learnable convolutional compressor inside the model, yielding compact yet informative memory tokens that preserve cross-consistency across sequential edits.
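The cache-and-retrieve step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact VideoFOV overlap metric is not specified here, so we substitute a simple proxy (mean cosine similarity between per-frame camera viewing directions), and `MemoryCache` and its methods are hypothetical names.

```python
import numpy as np

class MemoryCache:
    """Hypothetical external cache of previously generated videos and their
    camera trajectories, queried by trajectory overlap with the new target."""

    def __init__(self):
        self.entries = []  # list of (video_tokens, camera_dirs)

    def add(self, video_tokens, camera_dirs):
        self.entries.append((video_tokens, np.asarray(camera_dirs, dtype=float)))

    @staticmethod
    def _overlap(dirs_a, dirs_b):
        # Stand-in for VideoFOV overlap: average cosine similarity between
        # normalized per-frame viewing directions, frame by frame.
        a = dirs_a / np.linalg.norm(dirs_a, axis=-1, keepdims=True)
        b = dirs_b / np.linalg.norm(dirs_b, axis=-1, keepdims=True)
        n = min(len(a), len(b))
        return float(np.mean(np.sum(a[:n] * b[:n], axis=-1)))

    def retrieve(self, target_dirs, k=2):
        # Return the k cached videos whose trajectories best overlap the target.
        target = np.asarray(target_dirs, dtype=float)
        scored = sorted(self.entries,
                        key=lambda e: self._overlap(e[1], target),
                        reverse=True)
        return [tokens for tokens, _ in scored[:k]]
```

The retrieved videos would then be tokenized and compressed before being attached to the diffusion model as conditioning.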


Method Illustration 2

For long video editing, the input is divided into multiple segments, and each new segment is generated sequentially while retrieving previously produced segments based on source-video similarity. Since dedicated long-video editing datasets are scarce, we extend the target video with a generative model and use the extended sequence as memory during training. In this way, Memory-V2V incorporates memory easily and reliably by adopting a retrieval strategy tailored to each video-to-video task.
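A sketch of the segment-wise retrieval just described, under stated assumptions: the paper's actual similarity measure on the source video is not given here, so a crude mean-pixel descriptor with cosine similarity stands in, and all function names are illustrative.

```python
import numpy as np

def split_segments(video, seg_len):
    # Divide a (frames, H, W, C) video into fixed-length segments.
    return [video[i:i + seg_len] for i in range(0, len(video), seg_len)]

def segment_feature(segment):
    # Crude global descriptor: mean flattened-pixel vector of the segment.
    return segment.reshape(len(segment), -1).mean(axis=0)

def retrieve_memory(src_segments, edited_so_far, current_idx, k=1):
    """Pick the k previously *edited* segments whose *source* segments are
    most similar to the current source segment."""
    query = segment_feature(src_segments[current_idx])
    sims = []
    for j in range(current_idx):
        feat = segment_feature(src_segments[j])
        denom = np.linalg.norm(query) * np.linalg.norm(feat) + 1e-8
        sims.append((float(query @ feat / denom), j))
    sims.sort(reverse=True)
    return [edited_so_far[j] for _, j in sims[:k]]
```

Each new segment is then generated conditioned on the retrieved edited segments, so recurring content (the same person or object reappearing later in a long video) is edited consistently.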

Ablation Studies: VideoFOV Retrieval and Adaptive Token Merging

Using VideoFOV retrieval makes the results from later iterations (e.g., the 5th) much more consistent with those from the first iteration. Furthermore, with adaptive token merging, we reduce FLOPs and runtime by more than 30% without any degradation in generation quality.

Comparison Between Adaptive Token Discarding and Merging

Adaptive token merging preserves both semantic and motion information, whereas adaptive token discarding loses important motion and structural details. In addition, manipulating high-responsiveness tokens degrades cross-consistency across generated outputs.

Input video
1st video novel view synthesis result
Low-responsiveness token merging
High-responsiveness token merging
Low-responsiveness token discarding
High-responsiveness token discarding
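The merging-versus-discarding contrast in this ablation can be sketched as below. This is a simplified stand-in, not the paper's compressor: responsiveness scores and the nearest-token similarity are assumed inputs, and the averaging rule is illustrative.

```python
import numpy as np

def compress_tokens(tokens, scores, keep, mode="merge"):
    """tokens: (N, D) array; scores: (N,) responsiveness; keep: # tokens kept.
    'merge' folds each low-responsiveness token into its most similar kept
    token, preserving some of its content; 'discard' simply drops them."""
    order = np.argsort(-scores)          # most responsive first
    kept, dropped = order[:keep], order[keep:]
    out = tokens[kept].copy()
    if mode == "merge":
        for i in dropped:
            tok = tokens[i]
            # Cosine similarity of the dropped token to each kept token.
            denom = np.linalg.norm(out, axis=1) * np.linalg.norm(tok) + 1e-8
            j = int(np.argmax(out @ tok / denom))
            out[j] = 0.5 * (out[j] + tok)  # average into the nearest kept token
    return out  # mode == "discard" returns just the top-`keep` tokens
```

Under this sketch, merging keeps the token count at `keep` while retaining information from the dropped tokens, which matches the ablation's finding that merging preserves motion and structure better than discarding.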

(Supplementary Material) Proof-of-Concept Experiments: Ideal Context Encoder

We investigate which type of memory encoder best preserves and transfers information by conditioning the model on the states of CUT3R[8], LVSM[9], and a VAE. While CUT3R and LVSM fail to provide sufficiently informative guidance to the video diffusion model, the VAE state effectively conveys the necessary information, leading to consistent generation of the same region across different views.

References

[1] Yu, Mark, et al., TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models, ICCV 2025.

[2] Bai, Jianhong, et al., ReCamMaster: Camera-Controlled Generative Rendering from A Single Video, ICCV 2025.

[3] Geyer, Michal, et al., TokenFlow: Consistent Diffusion Features for Consistent Video Editing, ICLR 2024.

[4] Kara, Ozgur, et al., RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models, CVPR 2024.

[5] Feng, Ruoyu, et al., CCEdit: Creative and Controllable Video Editing via Diffusion Models, CVPR 2024.

[6] Decart AI Team, Lucy Edit: Open-Weight Text-Guided Video Editing, 2025.

[7] Kim, Jihwan, et al., FIFO-Diffusion: Generating Infinite Videos from Text without Training, NeurIPS 2024.

[8] Wang, Qianqian, et al., Continuous 3D Perception Model with Persistent State, CVPR 2025.

[9] Jin, Haian, et al., LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias, ICLR 2025.