
Stable Diffusion 3 is Here

Stable Diffusion 3 is a new text-to-image model from Stability AI, currently in an early preview phase. It delivers improved performance on multi-subject prompts, image quality, and spelling, and the model suite ranges from 800M to 8B parameters, giving users options that trade off scale and quality. The release comes shortly after Stability AI shipped Stable Cascade. The model combines a diffusion transformer architecture with flow matching. Stability AI has implemented safety measures to prevent misuse and is collaborating with experts and the community on responsible AI practices, with the stated aim of making generative AI open, safe, and universally accessible. Users interested in commercial use of Stability AI's other image models can visit its Membership page or Developer Platform, and updates on Stable Diffusion 3 are posted on the company's social media channels.

Stable Diffusion 3 Examples

Prompt Coherence

Prompt: “Photo of a red sphere on top of a blue cube. Behind them is a green triangle, on the right is a dog, on the left is a cat”

For comparison, SDXL and DALL-E outputs for the same prompt are shown below.

Stable Diffusion 3 appears to have strong prompt coherence. If SD3 can follow spatial and compositional instructions this reliably, that is a significant step forward.

Stable Diffusion 3 can handle text

Some images shared by Emad Mostaque, CEO of Stability AI.


The model family ranges from 800M to 8B parameters and is built on a diffusion transformer backbone, the same class of architecture behind OpenAI's Sora. Stability AI describes it as their most capable text-to-image model yet: it pairs a new type of diffusion transformer with flow matching, allowing it to scale efficiently and generate high-quality images from text descriptions called "prompts", with greatly improved multi-subject handling, image quality, and spelling.
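To make the flow-matching idea concrete, here is a minimal NumPy sketch of the objective in its simplest (rectified-flow) form: the model is trained to predict the constant velocity along a straight path from noise to data. Function names and details are illustrative assumptions, not Stability AI's implementation.

```python
import numpy as np

def flow_matching_target(x0, x1, t):
    """Interpolate between noise x0 and data x1 at time t in [0, 1],
    returning the point on the path and the velocity to regress onto."""
    x_t = (1.0 - t) * x0 + t * x1   # point on the straight noise->data path
    v_target = x1 - x0              # constant velocity along that path
    return x_t, v_target

def flow_matching_loss(model, x1, rng):
    """One training example's loss: mean squared error between the
    model's predicted velocity and the straight-line velocity."""
    x0 = rng.standard_normal(x1.shape)   # sample Gaussian noise
    t = rng.uniform()                    # sample a random time
    x_t, v_target = flow_matching_target(x0, x1, t)
    v_pred = model(x_t, t)               # model predicts a velocity field
    return np.mean((v_pred - v_target) ** 2)
```

At sampling time, one would integrate the learned velocity field from noise toward data (e.g. with a few Euler steps), which is what lets flow-matching models generate in relatively few steps.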

Diffusion Transformers (DiTs) leverage the power of transformer architecture, which has proven to be highly effective in various natural language processing tasks, and adapt it for image generation.

The use of transformers in DiTs allows for better scalability, robustness, and efficiency compared to traditional U-Net backbones. By replacing the U-Net architecture with transformers, DiTs can process images more effectively and generate higher-quality results. This is evident in the DiT research findings, which show that higher forward-pass compute (measured in Gflops) correlates with lower Fréchet Inception Distance (FID) scores, i.e., better image quality.
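The key input-side step that lets a transformer operate on images is patchification: the image (or its latent) is cut into small patches, each flattened into a token, so the transformer sees a sequence of patch tokens the way a language model sees words. A minimal NumPy sketch, with hypothetical names rather than any real DiT codebase:

```python
import numpy as np

def patchify(image, patch=2):
    """Turn a (C, H, W) image into a (num_patches, C*patch*patch)
    token matrix, one flattened patch per row."""
    c, h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "dims must divide evenly"
    tokens = (image
              .reshape(c, h // patch, patch, w // patch, patch)
              .transpose(1, 3, 0, 2, 4)    # group axes as (H/p, W/p, C, p, p)
              .reshape((h // patch) * (w // patch), c * patch * patch))
    return tokens
```

For a 2-channel 4x4 image with 2x2 patches this yields 4 tokens of dimension 8; a real DiT would then linearly project each token, add positional embeddings, and run standard transformer blocks over the sequence.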

Is Stable Diffusion 3 Open Source?

According to Emad Mostaque, like prior Stable Diffusion models it will be open source, with weights released after the feedback and improvement phase. He has also noted that Stability AI open-sources training data for its language models but not for other modalities.

This model is not yet widely available but is being offered for early preview through a waitlist to gather insights for further improvements before an open release. Stability AI emphasizes safety practices by implementing safeguards throughout the training, testing, evaluation, and deployment phases to prevent misuse of Stable Diffusion 3.


