Memory-V2V: Augmenting Video-to-Video Diffusion Models with Memory

1Adobe Research    2KAIST

TL;DR

We tackle, for the first time, the problem of cross-consistency in multi-turn video editing, and propose Memory-V2V, a unified framework that equips existing video-to-video diffusion models with explicit visual memory. Memory-V2V is validated on challenging V2V tasks, including multi-turn novel view synthesis and text-guided long-video editing.

Multi-turn V2V Results

(a) Multi-turn Video Novel View Synthesis

[Videos: two input videos with Memory-V2V novel-view results after successive iterations (1st to 3rd for the first example, 1st and 2nd for the second)]

(b) Text-guided Long Video Editing

[Videos: input videos paired with Memory-V2V (Ours) results for the following editing instructions]
"Change apple to orange"
"Make the dark door white"
"Add a beautiful hat to a woman"
"Change blue jacket into red"
"Add a colorful scarlet macaw parrot..."
"Change a man's glasses into transparent glasses"
"Change glasses into sunglasses" (input video >900 frames)
"Change the hat into more beautiful hat" (input video >900 frames)

Baseline Comparisons

We compare Memory-V2V against state-of-the-art video-to-video diffusion frameworks, including TrajectoryCrafter[1] and ReCamMaster[2] for video novel view synthesis, and LucyEdit[3] and the FIFO[4]-enhanced variant of LucyEdit for text-guided long-video editing.

(a) Multi-turn Video Novel View Synthesis


(b) Text-guided Long Video Editing

[Videos: input video, LucyEdit, LucyEdit w/ FIFO, and Memory-V2V (Ours) results for the following editing instructions]
"Change apple to orange"
"Make the dark door white"
"Add a beautiful hat to a woman"

Method

Method Illustration 1

Memory-V2V is an efficient finetuning framework that equips video-to-video foundation models with explicit memory. For multi-turn novel view synthesis, previously generated videos and their camera poses are stored in an external cache, and relevant past results are retrieved by computing VideoFOV overlap with the current target trajectory. To control the large number of conditioning tokens, we apply priority-based dynamic tokenization with different kernel sizes and further compress low-importance frames with an adaptive, learnable convolutional compressor inside the model, yielding compact yet informative memory tokens that preserve cross-consistency across sequential edits.
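To make the retrieval step concrete, the following is a minimal sketch, under assumed interfaces, of a pose-indexed memory cache queried by overlap with the target camera trajectory. The overlap score here (mean cosine similarity between camera viewing directions) is only a crude proxy for VideoFOV, and all names (`VideoMemoryCache`, `retrieve`, etc.) are illustrative rather than the released implementation.

```python
# Sketch: pose-indexed memory cache for multi-turn novel view synthesis.
# Each cached clip stores its latent tokens and per-frame camera-to-world poses.
import torch
import torch.nn.functional as F

class VideoMemoryCache:
    def __init__(self):
        self.entries = []  # list of {"latents": Tensor, "poses": Tensor[F, 4, 4]}

    def add(self, latents: torch.Tensor, poses: torch.Tensor) -> None:
        self.entries.append({"latents": latents, "poses": poses})

    @staticmethod
    def _view_dirs(poses: torch.Tensor) -> torch.Tensor:
        # Camera forward axis (third column of the rotation block), one per frame.
        return F.normalize(poses[:, :3, 2], dim=-1)

    def retrieve(self, target_poses: torch.Tensor, top_k: int = 2):
        # Score each cached clip by how much its viewing directions overlap with
        # the target trajectory, then return the latents of the top-k clips.
        if not self.entries:
            return []
        tgt = self._view_dirs(target_poses)                       # [Ft, 3]
        scores = []
        for e in self.entries:
            src = self._view_dirs(e["poses"])                     # [Fs, 3]
            scores.append((tgt @ src.T).clamp(min=0).mean().item())  # crude FOV-overlap proxy
        order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
        return [self.entries[i]["latents"] for i in order[:top_k]]
```

In the actual framework, the retrieved clips would then be tokenized at different granularities according to their priority before being passed to the diffusion model.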


Method Illustration 2

For long video editing, the input is divided into multiple segments, and each new segment is generated sequentially by retrieving previously produced segments based on source-video similarity. Since dedicated long-video editing datasets are scarce, we extend the target video with a generative model and use the extended sequence as memory during training. In this way, Memory-V2V can easily and reliably incorporate memory by adopting retrieval strategies tailored to each video-to-video task.
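As a rough illustration of the segment-wise procedure, the sketch below edits a long video chunk by chunk and conditions each new segment on previously edited segments retrieved by source-video similarity. Here `embed_segment` (any frozen visual encoder) and `edit_segment` (the underlying V2V model with memory conditioning) are hypothetical placeholders, not the paper's exact interfaces.

```python
# Sketch: sequential long-video editing with similarity-based memory retrieval.
import torch
import torch.nn.functional as F

def edit_long_video(segments, embed_segment, edit_segment, top_k=1):
    """segments: list of source-video clips, processed in temporal order."""
    memory = []    # (source embedding, edited clip) pairs from earlier segments
    outputs = []
    for seg in segments:
        query = embed_segment(seg)                               # [D] source feature
        if memory:
            sims = torch.stack([F.cosine_similarity(query, emb, dim=0)
                                for emb, _ in memory])
            idx = sims.topk(min(top_k, len(memory))).indices
            context = [memory[i][1] for i in idx]                # retrieved past edits
        else:
            context = []
        edited = edit_segment(seg, context)  # condition the new segment on memory
        memory.append((query, edited))
        outputs.append(edited)
    return outputs
```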

Ablation Studies: VideoFOV Retrieval and Adaptive Token Merging

Using VideoFOV makes the results from later iterations (e.g., the 5th) much more consistent with those from the first iteration. Furthermore, with adaptive token merging, we can reduce FLOPs and runtime by more than 30% without any degradation in generation quality.

Proof-of-Concept Experiments: Ideal Context Encoder

We investigate which type of memory encoder best preserves and transfers information by conditioning the model on the states from CUT3R[5], LVSM[6], and a VAE. While CUT3R and LVSM fail to provide sufficiently informative guidance to the video diffusion model, the VAE state effectively conveys the necessary information, leading to consistent generation of the same region in different views.

Comparison Between Adaptive Token Discarding and Merging

Adaptive token merging preserves both semantic and motion information, whereas adaptive token discarding leads to the loss of important motion and structural details.

[Videos: input video, results with adaptive token discarding, and results with adaptive token merging]
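For intuition, the sketch below contrasts the two strategies on a generic [N, D] token matrix: discarding simply drops low-importance tokens, while merging folds each dropped token into its most similar kept token so its content survives compression. The importance scores and the averaging rule are illustrative assumptions, not the learnable convolutional compressor used in Memory-V2V.

```python
# Sketch: token discarding vs. token merging for a [N, D] token matrix with
# precomputed per-token importance scores.
import torch
import torch.nn.functional as F

def discard_tokens(tokens: torch.Tensor, importance: torch.Tensor, keep: int):
    idx = importance.topk(keep).indices
    return tokens[idx]                      # low-importance tokens are lost entirely

def merge_tokens(tokens: torch.Tensor, importance: torch.Tensor, keep: int):
    keep_idx = importance.topk(keep).indices
    drop_mask = torch.ones(tokens.shape[0], dtype=torch.bool)
    drop_mask[keep_idx] = False
    kept, dropped = tokens[keep_idx], tokens[drop_mask]
    if dropped.numel() == 0:
        return kept
    # Assign each dropped token to its most similar kept token and average it in,
    # so its content is folded into the compressed sequence rather than removed.
    sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).T   # [Nd, keep]
    assign = sim.argmax(dim=-1)
    counts = torch.ones(keep)
    for j, a in enumerate(assign):
        kept[a] += dropped[j]
        counts[a] += 1
    return kept / counts.unsqueeze(-1)
```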

References

[1] Yu, Mark, et al. TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models. ICCV 2025.

[2] Bai, Jianhong, et al. ReCamMaster: Camera-Controlled Generative Rendering from A Single Video. ICCV 2025.

[3] Decart AI Team. Lucy Edit: Open-Weight Text-Guided Video Editing. 2025.

[4] Kim, Jihwan, et al. FIFO-Diffusion: Generating Infinite Videos from Text without Training. NeurIPS 2024.

[5] Wang, Qianqian, et al. Continuous 3D Perception Model with Persistent State. CVPR 2025.

[6] Jin, Haian, et al. LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias. ICLR 2025.