Video Interpolation With Diffusion Models

Google Research, Google DeepMind

Teaser: input frames alongside the corresponding VIDIM interpolation.

Abstract

We present VIDIM, a generative model for video interpolation that creates short videos given a start and end frame. To achieve high fidelity and generate motion unseen in the input data, VIDIM uses cascaded diffusion models: it first generates the target video at low resolution, and then generates the high-resolution video conditioned on the low-resolution result. We compare VIDIM to previous state-of-the-art methods on video interpolation and show that these methods fail in most settings where the underlying motion is complex, nonlinear, or ambiguous, while VIDIM handles such cases easily. We additionally show how classifier-free guidance on the start and end frames, and conditioning the super-resolution model on the original high-resolution frames without additional parameters, unlock high-fidelity results. VIDIM is fast to sample from because it jointly denoises all frames to be generated, requires fewer than a billion parameters per diffusion model to produce compelling results, and still enjoys scalability and improved quality at larger parameter counts.
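To make the classifier-free guidance on the endpoint frames concrete, here is a minimal sketch of how a sampler might combine conditional and unconditional noise predictions at each joint denoising step. The function names (`cfg_epsilon`, `denoise_fn`), the `guidance_weight` value, and the toy shapes are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def cfg_epsilon(denoise_fn, z_t, t, start_frame, end_frame, guidance_weight=2.0):
    """Classifier-free guidance over the endpoint frames (illustrative sketch).

    denoise_fn(z_t, t, cond) is assumed to predict the noise for the whole stack
    of intermediate frames z_t; cond=None corresponds to the model trained with
    the frame conditioning dropped out.
    """
    eps_cond = denoise_fn(z_t, t, cond=(start_frame, end_frame))
    eps_uncond = denoise_fn(z_t, t, cond=None)
    # Standard CFG: extrapolate from the unconditional toward the conditional prediction.
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)

if __name__ == "__main__":
    # Toy stand-in for a trained denoiser: 7 intermediate frames, 16x16 RGB.
    def toy_denoise_fn(z_t, t, cond):
        scale = 1.0 if cond is not None else 0.5
        return scale * z_t

    z_t = np.random.randn(7, 16, 16, 3)
    start = np.random.randn(16, 16, 3)
    end = np.random.randn(16, 16, 3)
    eps = cfg_epsilon(toy_denoise_fn, z_t, t=0.5, start_frame=start, end_frame=end)
    print(eps.shape)  # (7, 16, 16, 3)
```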

Figure: high-level overview of the method, showing the cascaded diffusion model.
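The cascade in the figure corresponds to a two-stage sampling pipeline: the base model generates all intermediate frames at low resolution, and the super-resolution model then samples at the target resolution conditioned on the upsampled low-resolution video plus the original high-resolution endpoint frames. The sketch below illustrates this wiring; the sampler interfaces, the 64 and 256 pixel resolutions, and the nearest-neighbor upsampling are assumptions for illustration, not the actual VIDIM implementation.

```python
import numpy as np

def upsample_nearest(video, factor):
    """Nearest-neighbor spatial upsampling, (T, h, w, C) -> (T, h*f, w*f, C)."""
    return video.repeat(factor, axis=1).repeat(factor, axis=2)

def sample_cascade(base_sampler, sr_sampler, start_hi, end_hi, num_frames=7,
                   low_res=64, high_res=256):
    """Two-stage cascaded sampling (illustrative sketch).

    base_sampler(start_lo, end_lo, n) -> low-res video, shape (n, h, w, 3)
    sr_sampler(video_lo_up, start_hi, end_hi) -> high-res video, shape (n, H, W, 3)
    """
    factor = high_res // low_res
    # Downsample the endpoint frames for the base (low-resolution) model.
    start_lo = start_hi[::factor, ::factor]
    end_lo = end_hi[::factor, ::factor]
    video_lo = base_sampler(start_lo, end_lo, num_frames)
    # The super-resolution stage is conditioned on the upsampled low-res video
    # and on the original high-resolution endpoint frames.
    return sr_sampler(upsample_nearest(video_lo, factor), start_hi, end_hi)

if __name__ == "__main__":
    def toy_base(start_lo, end_lo, n):
        return np.stack([start_lo] * n)   # placeholder low-res video

    def toy_sr(video_lo_up, start_hi, end_hi):
        return video_lo_up                # placeholder "super-resolved" video

    start_hi = np.random.rand(256, 256, 3)
    end_hi = np.random.rand(256, 256, 3)
    out = sample_cascade(toy_base, toy_sr, start_hi, end_hi)
    print(out.shape)  # (7, 256, 256, 3)
```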

Comparisons to Prior Work

Below we compare VIDIM to other methods from the literature. For each example, the two input frames are shown on the left, and the output of the selected method is shown on the right. Click a model name below to switch between methods.

RIFE
AMT
FILM
LDMVFI
VIDIM

Evaluation Benchmarks

We created evaluation subsets of the DAVIS and UCF101 datasets, consisting of 350-400 videos with a mixture of large and small motion. Each video consists of 9 frames: the first and last frames are used as input, and the middle 7 frames are treated as the ground truth for the interpolation task. To regenerate the datasets, please follow the instructions here.
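As a concrete illustration of this protocol, the snippet below splits a 9-frame clip into the two conditioning frames and the 7 ground-truth middle frames. The frame counts follow the description above; the array layout and function name are assumptions for illustration.

```python
import numpy as np

def split_eval_clip(clip):
    """Split a 9-frame clip into (start, end) inputs and 7 ground-truth targets.

    `clip` is assumed to be an array of shape (9, H, W, C).
    """
    assert clip.shape[0] == 9, "evaluation clips contain exactly 9 frames"
    start, end = clip[0], clip[-1]
    ground_truth = clip[1:-1]  # the 7 middle frames to be interpolated
    return (start, end), ground_truth

if __name__ == "__main__":
    clip = np.random.rand(9, 128, 128, 3)
    (start, end), gt = split_eval_clip(clip)
    print(gt.shape)  # (7, 128, 128, 3)
```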