Stability AI Releases Stable Video Diffusion

The future of synthetic video just got more open. With the launch of Stable Video Diffusion, Stability AI has brought its generative lineup to video, letting anyone turn a text prompt or a single still image into a short, high-resolution clip. The new model extends the image generation of Stable Diffusion into motion, with openly released code and weights. But how was it built, and what can you actually do with it? Read on as we explore the genesis of Stable Video Diffusion and the new era of AI video synthesis it opens up.

How Stable Video Diffusion Works

Stable Video Diffusion is a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. It builds on latent diffusion models (LDMs) trained for 2D image synthesis, which are turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. Training proceeds in three stages: text-to-image pretraining, video pretraining, and high-quality video finetuning.
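To make the "temporal layers" idea concrete, here is a minimal PyTorch sketch (not the actual Stable Video Diffusion architecture) of a temporal self-attention block. It mixes information across frames at each spatial position, which is the kind of layer a video LDM interleaves with the spatial layers of a pretrained image UNet:

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the time axis, applied independently at each
    spatial location. A simplified stand-in for the temporal mixing layers
    that video LDMs insert between pretrained spatial (2D) layers."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        # Fold spatial positions into the batch so attention runs across frames only.
        seq = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        normed = self.norm(seq)
        seq = seq + self.attn(normed, normed, normed)[0]  # residual connection
        return seq.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)

# Example: mix information across 14 frames of 64-channel latent features.
x = torch.randn(1, 14, 64, 32, 32)
out = TemporalAttention(64)(x)
print(out.shape)  # torch.Size([1, 14, 64, 32, 32])
```

The key point of the design is that the spatial layers keep doing what they learned during image pretraining, while the new temporal layers learn how content should evolve from frame to frame.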

The authors emphasize that a well-curated pretraining dataset is necessary for generating high-quality videos, and they present a systematic curation process, including captioning and filtering strategies, to train a strong base model. They then examine the impact of finetuning that base model on high-quality data and train a text-to-video model competitive with closed-source video generation. The resulting model provides a powerful motion representation for downstream tasks such as image-to-video generation, and it adapts to camera motion-specific LoRA modules. It also offers a strong multi-view 3D prior and can serve as the base for finetuning a multi-view diffusion model that generates multiple views of an object in a feedforward fashion, outperforming image-based methods at a fraction of their compute budget. Code and model weights are released on Stability AI's GitHub.


The Challenges of Training a Video LDM

In the paper “Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets”, the authors identify and evaluate three stages for successfully training video LDMs (text-to-image pretraining, video pretraining, and high-quality video finetuning) and show that each depends on getting the data right: pretraining demands a large but well-curated video dataset, while final quality hinges on finetuning with a smaller set of high-quality clips.

The challenges of training a video LDM include:

  1. Assembling a well-curated pretraining dataset, without which the model cannot learn to generate high-quality videos.
  2. Deciding how to finetune the base model on a smaller, high-quality video dataset, which strongly shapes the final results.
  3. Designing a systematic curation process, including captioning and filtering strategies, to train a strong base model.

These challenges are addressed in the paper “Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets” by Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach.
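To illustrate what a filtering pass in such a curation pipeline might look like, here is a small, hypothetical Python sketch. The field names, helper scores, and thresholds are illustrative assumptions, not the paper's actual pipeline; they stand in for the kinds of captioning and filtering signals the authors describe:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ClipStats:
    """Hypothetical per-clip annotations produced by an upstream
    annotation step (captioning plus quality/motion scoring)."""
    path: str
    caption: str
    motion_score: float      # e.g. how much movement the clip contains
    aesthetic_score: float   # e.g. output of an image-quality predictor
    text_coverage: float     # fraction of frames dominated by overlaid text

def keep_clip(clip: ClipStats,
              min_motion: float = 0.5,
              min_aesthetic: float = 4.5,
              max_text: float = 0.2) -> bool:
    """Drop near-static clips, low-quality footage, and text-heavy clips."""
    return (clip.motion_score >= min_motion
            and clip.aesthetic_score >= min_aesthetic
            and clip.text_coverage <= max_text)

def curate(clips: List[ClipStats]) -> List[ClipStats]:
    """Reduce an uncurated collection to the subset used for training."""
    return [c for c in clips if keep_clip(c)]
```

The thresholds would in practice be tuned empirically: filter too little and the base model learns from junk, filter too much and the pretraining set loses the scale that makes video pretraining work.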

Unleashing Creativity: Applications of Stable Video Diffusion

The open release invites experimentation. Because the base model learns a general-purpose motion representation, it can animate a single still image into a short clip, generate video directly from a text prompt, and adapt to camera motion-specific LoRA modules for controllable camera moves. Its multi-view 3D prior also makes it a starting point for generating consistent views of an object, a useful primitive for 3D content workflows. And since the code and weights are open, developers can finetune the model for their own domains and build it into creative tools.
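As a quick taste of the image-to-video use case, here is a minimal sketch assuming the Hugging Face diffusers integration (StableVideoDiffusionPipeline) and the publicly hosted stable-video-diffusion-img2vid-xt checkpoint; the file names and tuning values are illustrative:

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the released image-to-video checkpoint in half precision.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()  # or pipe.to("cuda") if VRAM allows

# Condition the video on a single still image, resized to the expected shape.
image = load_image("input.jpg").resize((1024, 576))

generator = torch.manual_seed(42)
result = pipe(image, decode_chunk_size=8, generator=generator)
frames = result.frames[0]  # list of PIL frames

export_to_video(frames, "generated.mp4", fps=7)
```

Lowering decode_chunk_size reduces peak memory at the cost of speed, which makes the model usable on more modest GPUs.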

The Future of AI Video

Where does this leave AI video? The paper demonstrates that latent video diffusion models can reach high-resolution, state-of-the-art text-to-video and image-to-video generation when the three training stages (text-to-image pretraining, video pretraining, and high-quality video finetuning) are paired with systematic data curation. Just as importantly, the finetuned model is more than a clip generator: its motion representation transfers to downstream tasks such as image-to-video generation and camera motion-specific LoRA modules, and its multi-view 3D prior can be finetuned into a multi-view diffusion model that jointly generates several views of an object in a feedforward fashion, outperforming image-based methods at a fraction of their compute budget.

The paper presents a systematic data curation workflow to turn a large uncurated video collection into a quality dataset for generative video modeling. Using this workflow, the authors train state-of-the-art text-to-video and image-to-video models, outperforming all prior models. They also probe the strong prior of motion and 3D understanding in their models by conducting domain-specific experiments. Specifically, they provide evidence that pretrained video diffusion models can be turned into strong multi-view generators, which may help overcome the data scarcity typically observed in the 3D domain.

In short, Stable Video Diffusion advances latent video diffusion models, shows that a systematic data curation workflow is central to the performance of generative video modeling, and demonstrates that pretrained video diffusion models can serve as strong multi-view generators. Better models, better data, and transferable priors look set to shape the next wave of AI video.
