Memory-V2V: Memory-Augmented Video-to-Video Diffusion for Consistent Multi-Turn Editing

1Adobe Research    2KAIST
*Work done during internship at Adobe    Project Leads

TL;DR

We tackle, for the first time, the problem of cross-consistency in multi-turn video editing, and propose Memory-V2V, a unified framework that equips existing video-to-video diffusion models with explicit visual memory. Memory-V2V is validated on challenging V2V tasks, including multi-turn novel view synthesis and text-guided long-video editing.

Multi-turn V2V Results

(a) Multi-turn Video Novel View Synthesis

Input video
1st iteration
2nd iteration
3rd iteration
Input video
1st iteration
2nd iteration

(b) Text-guided Long Video Editing

Input video
Memory-V2V (Ours)
"Change apple to orange"
Input video
Memory-V2V (Ours)
"Make the dark door white"
Input video
Memory-V2V (Ours)
"Add a beautiful hat to a woman"
Input video
Memory-V2V (Ours)
"Change blue jacket into red"
Input video
Memory-V2V (Ours)
"Add a colorful scarlet macaw parrot..."
Input video
Memory-V2V (Ours)
"Change a man's glasses into transparent glasses"
Input video (>900 frames)
Memory-V2V (Ours)
"Change glasses into sunglasses"
Input video (>900 frames)
Memory-V2V (Ours)
"Change the hat into more beautiful hat"

Baseline Comparisons

We compare Memory-V2V against state-of-the-art video-to-video diffusion frameworks, including TrajectoryCrafter[1] and ReCamMaster[2] for video novel view synthesis, and TokenFlow[3], RAVE[4], CCEdit[5], LucyEdit[6] and the FIFO[7]-enhanced variant of LucyEdit for text-guided long-video editing.

(a) Multi-turn Video Novel View Synthesis

(Comparison examples against TrajectoryCrafter and ReCamMaster)

(b) Text-guided Long Video Editing

Input video
TokenFlow
RAVE
CCEdit
LucyEdit
LucyEdit w/ FIFO
Memory-V2V (Ours)
Editing Instruction
"Change apple to orange"
Input video
TokenFlow
RAVE
CCEdit
LucyEdit
LucyEdit w/ FIFO
Memory-V2V (Ours)
Editing Instruction
"Add a beautiful hat to a woman"

Method

Method Illustration 1

Memory-V2V is an efficient finetuning framework that equips video-to-video foundation models with explicit memory. For multi-turn novel view synthesis, previously generated videos and their camera poses are stored in an external cache, and relevant past results are retrieved by computing their VideoFOV overlap with the current target trajectory. To control the large number of conditioning tokens, we apply priority-based dynamic tokenization with different kernel sizes and further compress low-importance frames with an adaptive, learnable convolutional compressor inside the model, yielding compact yet informative memory tokens that preserve cross-consistency across sequential edits.
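The cache-and-retrieve step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact VideoFOV overlap metric is not specified here, so we substitute a simple proxy (mean cosine similarity between per-frame camera viewing directions), and `MemoryCache` and its methods are hypothetical names.

```python
import numpy as np

class MemoryCache:
    """Hypothetical external cache of previously generated videos and their
    camera trajectories, queried by trajectory overlap with the new target."""

    def __init__(self):
        self.entries = []  # list of (video_tokens, camera_dirs)

    def add(self, video_tokens, camera_dirs):
        self.entries.append((video_tokens, np.asarray(camera_dirs, dtype=float)))

    @staticmethod
    def _overlap(dirs_a, dirs_b):
        # Stand-in for VideoFOV overlap: average cosine similarity between
        # normalized per-frame viewing directions, frame by frame.
        a = dirs_a / np.linalg.norm(dirs_a, axis=-1, keepdims=True)
        b = dirs_b / np.linalg.norm(dirs_b, axis=-1, keepdims=True)
        n = min(len(a), len(b))
        return float(np.mean(np.sum(a[:n] * b[:n], axis=-1)))

    def retrieve(self, target_dirs, k=2):
        # Return the k cached videos whose trajectories best overlap the target.
        target = np.asarray(target_dirs, dtype=float)
        scored = sorted(self.entries,
                        key=lambda e: self._overlap(e[1], target),
                        reverse=True)
        return [tokens for tokens, _ in scored[:k]]
```

The retrieved videos would then be tokenized and compressed before being attached to the diffusion model as conditioning.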


Method Illustration 2

For long video editing, the input is divided into multiple segments, and each new segment is generated sequentially while retrieving previously produced segments based on source-video similarity. Since dedicated long-video editing datasets are scarce, we extend the target video with a generative model and use the extended sequence as memory during training. In this way, Memory-V2V incorporates memory easily and reliably by adopting a retrieval strategy tailored to each video-to-video task.
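A sketch of the segment-wise retrieval just described, under stated assumptions: the paper's actual similarity measure on the source video is not given here, so a crude mean-pixel descriptor with cosine similarity stands in, and all function names are illustrative.

```python
import numpy as np

def split_segments(video, seg_len):
    # Divide a (frames, H, W, C) video into fixed-length segments.
    return [video[i:i + seg_len] for i in range(0, len(video), seg_len)]

def segment_feature(segment):
    # Crude global descriptor: mean flattened-pixel vector of the segment.
    return segment.reshape(len(segment), -1).mean(axis=0)

def retrieve_memory(src_segments, edited_so_far, current_idx, k=1):
    """Pick the k previously *edited* segments whose *source* segments are
    most similar to the current source segment."""
    query = segment_feature(src_segments[current_idx])
    sims = []
    for j in range(current_idx):
        feat = segment_feature(src_segments[j])
        denom = np.linalg.norm(query) * np.linalg.norm(feat) + 1e-8
        sims.append((float(query @ feat / denom), j))
    sims.sort(reverse=True)
    return [edited_so_far[j] for _, j in sims[:k]]
```

Each new segment is then generated conditioned on the retrieved edited segments, so recurring content (the same person or object reappearing later in a long video) is edited consistently.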

Ablation Studies: VideoFOV Retrieval and Adaptive Token Merging

Using VideoFOV retrieval makes the results from later iterations (e.g., the 5th) much more consistent with those from the first iteration. Furthermore, with adaptive token merging, we reduce FLOPs and runtime by more than 30% without any degradation in generation quality.

Comparison Between Adaptive Token Discarding and Merging

Adaptive token merging preserves both semantic and motion information, whereas adaptive token discarding loses important motion and structural details. In addition, manipulating high-responsiveness tokens degrades cross-consistency across generated outputs.

Input video
1st video novel view synthesis result
Low-responsiveness token merging
High-responsiveness token merging
Low-responsiveness token discarding
High-responsiveness token discarding
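The merging-versus-discarding contrast in this ablation can be sketched as below. This is a simplified stand-in, not the paper's compressor: responsiveness scores and the nearest-token similarity are assumed inputs, and the averaging rule is illustrative.

```python
import numpy as np

def compress_tokens(tokens, scores, keep, mode="merge"):
    """tokens: (N, D) array; scores: (N,) responsiveness; keep: # tokens kept.
    'merge' folds each low-responsiveness token into its most similar kept
    token, preserving some of its content; 'discard' simply drops them."""
    order = np.argsort(-scores)          # most responsive first
    kept, dropped = order[:keep], order[keep:]
    out = tokens[kept].copy()
    if mode == "merge":
        for i in dropped:
            tok = tokens[i]
            # Cosine similarity of the dropped token to each kept token.
            denom = np.linalg.norm(out, axis=1) * np.linalg.norm(tok) + 1e-8
            j = int(np.argmax(out @ tok / denom))
            out[j] = 0.5 * (out[j] + tok)  # average into the nearest kept token
    return out  # mode == "discard" returns just the top-`keep` tokens
```

Under this sketch, merging keeps the token count at `keep` while retaining information from the dropped tokens, which matches the ablation's finding that merging preserves motion and structure better than discarding.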

(Supplementary Material) Proof-of-Concept Experiments: Ideal Context Encoder

We investigate which type of memory encoder best preserves and transfers information by conditioning the model on the states of CUT3R[8], LVSM[9], and a VAE. While CUT3R and LVSM fail to provide sufficiently informative guidance to the video diffusion model, the VAE state effectively conveys the necessary information, leading to consistent generation of the same region across different views.

References

[1] Yu, Mark, et al., TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models, ICCV 2025.

[2] Bai, Jianhong, et al., ReCamMaster: Camera-Controlled Generative Rendering from A Single Video, ICCV 2025.

[3] Geyer, Michal, et al., TokenFlow: Consistent Diffusion Features for Consistent Video Editing, ICLR 2024.

[4] Kara, Ozgur, et al., RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models, CVPR 2024.

[5] Feng, Ruoyu, et al., CCEdit: Creative and Controllable Video Editing via Diffusion Models, CVPR 2024.

[6] Decart AI Team, Lucy Edit: Open-Weight Text-Guided Video Editing, 2025.

[7] Kim, Jihwan, et al., FIFO-Diffusion: Generating Infinite Videos from Text without Training, NeurIPS 2024.

[8] Wang, Qianqian, et al., Continuous 3D Perception Model with Persistent State, CVPR 2025.

[9] Jin, Haian, et al., LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias, ICLR 2025.