SinFusion: Training Diffusion Models on a Single Image or Video

Weizmann Institute of Science, Rehovot, Israel
* Equal contribution

Abstract

Diffusion models have made tremendous progress in image and video generation, exceeding GANs in quality and diversity. However, they are usually trained on very large datasets and are not naturally adapted to manipulating a given input image or video. In this paper we show how this can be resolved by training a diffusion model on a single input image or video. Our image/video-specific diffusion model (SinFusion) learns the appearance and dynamics of the single input image or video, while utilizing the conditioning capabilities of diffusion models. It can solve a wide array of image/video-specific manipulation tasks. In particular, our model can learn, from a few frames, the motion and dynamics of a single input video. It can then generate diverse new video samples of the same dynamic scene, extrapolate short videos into long ones (both forward and backward in time), and perform temporal video upsampling. When trained on a single image, our model performs comparably to previous single-image models on various image manipulation tasks.
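To make the idea concrete, here is a minimal sketch of one DDPM-style training step on crops of a single image. This is an illustration under common assumptions (standard linear beta schedule, noise-prediction loss); the actual SinFusion architecture and hyperparameters are described in the paper, and the denoiser network here is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single training image (H x W x C) -- in SinFusion this would
# be the one input image the model is trained on.
image = rng.random((128, 128, 3)).astype(np.float32)

# Standard DDPM linear beta schedule (an assumption; the paper's exact
# schedule and hyperparameters may differ).
T = 1000
betas = np.linspace(1e-4, 0.02, T, dtype=np.float32)
alphas_bar = np.cumprod(1.0 - betas)

def random_crop(img, size, rng):
    """Sample a random crop -- single-image models train on crops of one image."""
    h, w, _ = img.shape
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    return img[y:y + size, x:x + size]

def noisy_sample(x0, t, rng):
    """q(x_t | x_0): closed-form forward diffusion at timestep t."""
    eps = rng.standard_normal(x0.shape).astype(np.float32)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return xt, eps

# One training step with a placeholder denoiser (a real model would be a
# small convolutional network predicting the added noise).
x0 = random_crop(image, 64, rng) * 2.0 - 1.0    # scale crop to [-1, 1]
t = int(rng.integers(0, T))
xt, eps = noisy_sample(x0, t, rng)
predicted_eps = np.zeros_like(eps)              # stand-in for the network output
loss = float(np.mean((predicted_eps - eps) ** 2))  # DDPM noise-prediction loss
```

Training repeats this step many times: sample a crop, a timestep, and noise, then update the denoiser to minimize the noise-prediction loss.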

Diverse Generation from a Single Video

Input video (top-left)
All others are generated.

Video Extrapolation - Forward in time

Input Frames (RED)
Extrapolated Frames (GREEN)
(The input video is shown first, followed by an extrapolation from its last frame. Note that the generated video continues seamlessly and realistically from where the input ends.)

Video Extrapolation - Backward in time

Input Frames (RED)
Extrapolated Frames (GREEN)
The input video plays in reverse, from its last frame to its first. It is followed by a backward extrapolation from the first frame of the input video, further back into the past.
Note how backward extrapolations of the same video yield different outputs, due to the inherent stochasticity of diffusion models.

Spatial Video Retargeting

Input video (top-left)
The rest are retargeted to different spatial shapes.


Temporal Video Upsampling

First Row: Input video
Second Row: Linear interpolation
Third Row: SinFusion (Ours) interpolation
Fourth Row: RIFE interpolation
Note that:

  • Our (SinFusion) DDPM frame-interpolation framework produces results comparable to RIFE (a recent dedicated state-of-the-art frame-upsampling method).
  • On the fan video (left), SinFusion correctly undoes the motion aliasing, whereas RIFE does not. The fan rotates faster and faster clockwise, but at some point appears to rotate counter-clockwise, due to the low framerate of the input video. By training on this specific video, SinFusion automatically learns (from the slower parts of the video) how to correctly undo this motion aliasing when increasing the framerate.
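A common way to condition a DDPM on known frames, which is sufficient to illustrate the interpolation setting above, is to concatenate the conditioning frames channel-wise with the noisy frame being denoised. The sketch below shows only this input construction; it is an assumption for illustration, and the paper describes SinFusion's actual conditioning scheme.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical consecutive frames from the input video (H x W x C).
frame_a = rng.random((64, 64, 3)).astype(np.float32)
frame_b = rng.random((64, 64, 3)).astype(np.float32)

def denoiser_input(noisy_mid, cond_frames):
    """Concatenate the noisy intermediate frame with its conditioning frames
    along the channel axis, forming the input to a conditional denoiser."""
    return np.concatenate([noisy_mid] + cond_frames, axis=-1)

# At inference, the intermediate frame starts as pure noise and is denoised
# step by step while conditioned on the two neighboring input frames.
noisy_mid = rng.standard_normal(frame_a.shape).astype(np.float32)
x = denoiser_input(noisy_mid, [frame_a, frame_b])
# x has 9 channels: 3 (noisy target) + 3 + 3 (the two conditioning frames)
```

The same conditioning idea supports extrapolation: conditioning on the last known frame(s) lets the model generate the next frame, which can then be fed back in autoregressively.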

Diverse Image Generation

Single Input Image (left), Diverse Generated Images (right)
Each pair shows a single input image alongside diverse images generated by SinFusion.

BibTeX

@article{nikankin2022sinfusion,
  title={SinFusion: Training Diffusion Models on a Single Image or Video},
  author={Nikankin, Yaniv and Haim, Niv and Irani, Michal},
  journal={arXiv preprint arXiv:2211.11743},
  year={2022}
}