SinFusion: Training Diffusion Models on a Single Image or Video

Weizmann Institute of Science, Rehovot, Israel
* Equal contribution

Abstract

Diffusion models have made tremendous progress in image and video generation, exceeding GANs in quality and diversity. However, they are usually trained on very large datasets and are not naturally adapted to manipulating a given input image or video. In this paper we show how this can be resolved by training a diffusion model on a single input image or video. Our image/video-specific diffusion model (SinFusion) learns the appearance and dynamics of the single input image or video, while utilizing the conditioning capabilities of diffusion models. It can solve a wide array of image/video-specific manipulation tasks. In particular, our model can learn, from a few frames, the motion and dynamics of a single input video. It can then generate diverse new video samples of the same dynamic scene, extrapolate short videos into long ones (both forward and backward in time), and perform temporal video upsampling. When trained on a single image, our model performs comparably to previous single-image models on various image manipulation tasks.
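To make the idea concrete, here is a minimal sketch of one DDPM-style training step on crops of a single image. This is an illustration under common assumptions (standard linear beta schedule, noise-prediction loss); the actual SinFusion architecture and hyperparameters are described in the paper, and the denoiser network here is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single training image (H x W x C) -- in SinFusion this would
# be the one input image the model is trained on.
image = rng.random((128, 128, 3)).astype(np.float32)

# Standard DDPM linear beta schedule (an assumption; the paper's exact
# schedule and hyperparameters may differ).
T = 1000
betas = np.linspace(1e-4, 0.02, T, dtype=np.float32)
alphas_bar = np.cumprod(1.0 - betas)

def random_crop(img, size, rng):
    """Sample a random crop -- single-image models train on crops of one image."""
    h, w, _ = img.shape
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    return img[y:y + size, x:x + size]

def noisy_sample(x0, t, rng):
    """q(x_t | x_0): closed-form forward diffusion at timestep t."""
    eps = rng.standard_normal(x0.shape).astype(np.float32)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return xt, eps

# One training step with a placeholder denoiser (a real model would be a
# small convolutional network predicting the added noise).
x0 = random_crop(image, 64, rng) * 2.0 - 1.0    # scale crop to [-1, 1]
t = int(rng.integers(0, T))
xt, eps = noisy_sample(x0, t, rng)
predicted_eps = np.zeros_like(eps)              # stand-in for the network output
loss = float(np.mean((predicted_eps - eps) ** 2))  # DDPM noise-prediction loss
```

Training repeats this step many times: sample a crop, a timestep, and noise, then update the denoiser to minimize the noise-prediction loss.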

Diverse Generation from a Single Video

Input video (top-left)
All others are generated.

Video Extrapolation - Forward in time

Input Frames (RED)
Extrapolated Frames (GREEN)
(The input video is shown first, followed by an extrapolation from its last frame. Note that the generated video continues seamlessly and realistically from where the input ends.)

Video Extrapolation - Backward in time

Input Frames (RED)
Extrapolated Frames (GREEN)
The input video plays in reverse, from its last frame to its first. It is followed by a backward extrapolation from the first frame of the input video, further back into the past.
Note how backward extrapolations of the same video yield different outputs, due to the inherent stochasticity of diffusion models.

Spatial Video Retargeting

Input video (top-left)
The rest are retargeted to different spatial shapes.


Temporal Video Upsampling

First Row: Input video
Second Row: Linear interpolation
Third Row: SinFusion (Ours) interpolation
Fourth Row: RIFE interpolation
Note that:

  • Our (SinFusion) DDPM frame-interpolation framework produces results comparable to RIFE (a recent dedicated state-of-the-art frame-upsampling method).
  • On the fan video (left), SinFusion correctly undoes the motion aliasing, whereas RIFE does not. The fan rotates faster and faster clockwise, but at some point appears to rotate counter-clockwise, due to the low framerate of the input video. By training on this specific video, SinFusion automatically learns (from the slower parts of the video) how to correctly undo this motion aliasing when increasing the framerate.
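A common way to condition a DDPM on known frames, which is sufficient to illustrate the interpolation setting above, is to concatenate the conditioning frames channel-wise with the noisy frame being denoised. The sketch below shows only this input construction; it is an assumption for illustration, and the paper describes SinFusion's actual conditioning scheme.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical consecutive frames from the input video (H x W x C).
frame_a = rng.random((64, 64, 3)).astype(np.float32)
frame_b = rng.random((64, 64, 3)).astype(np.float32)

def denoiser_input(noisy_mid, cond_frames):
    """Concatenate the noisy intermediate frame with its conditioning frames
    along the channel axis, forming the input to a conditional denoiser."""
    return np.concatenate([noisy_mid] + cond_frames, axis=-1)

# At inference, the intermediate frame starts as pure noise and is denoised
# step by step while conditioned on the two neighboring input frames.
noisy_mid = rng.standard_normal(frame_a.shape).astype(np.float32)
x = denoiser_input(noisy_mid, [frame_a, frame_b])
# x has 9 channels: 3 (noisy target) + 3 + 3 (the two conditioning frames)
```

The same conditioning idea supports extrapolation: conditioning on the last known frame(s) lets the model generate the next frame, which can then be fed back in autoregressively.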

Diverse Image Generation

Single Input Image (left), Diverse Generated Images (right)
Each pair shows a single input image alongside diverse images generated by SinFusion.

BibTeX

@article{nikankin2022sinfusion,
  title={SinFusion: Training Diffusion Models on a Single Image or Video},
  author={Nikankin, Yaniv and Haim, Niv and Irani, Michal},
  journal={arXiv preprint arXiv:2211.11743},
  year={2022}
}