OpenAI Releases Sora Text-to-Video

Unless you're living under a rock, you have probably heard the big news in the world of AI: OpenAI has released its first text-to-video model, and it is impressive.

Sora is an AI model developed by OpenAI that can create realistic and imaginative scenes from text instructions. It is a text-to-video model capable of generating videos up to a minute long while maintaining visual quality and adherence to the user’s prompt. Sora is designed to understand and simulate the physical world in motion, with the goal of training models that help people solve problems requiring real-world interaction. The model can generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background.

Examples

Unmute. Sora now gets synthetic audio @elevenlabsio. It's prompted by text, but the right conditioning should be on both text and video pixels. Learning an accurate video->audio mapping would also require modeling some *implicit* physics in the latent space.

Here's what an… pic.twitter.com/HjCXh7zb30
— Jim Fan (@DrJimFan) February 18, 2024

Sora can generate multiple videos side-by-side at the same time.

This is a single video sample from Sora. We didn't stitch this together; Sora decided it wanted to have five different viewpoints all at once! pic.twitter.com/hGae80tM5a
— Bill Peebles (@billpeeb) February 17, 2024

Sora can combine videos.

Sora can combine videos pic.twitter.com/SSZnEcXOlR
— Tsarathustra (@tsarnick) February 16, 2024

Sora can follow up and edit videos.

As you know, my explorations of the Gen AI space is ultimately all about creative control. You should be able to shape the generative matter using all your artistic sensibilities and your aesthetic sense.

OpenAI's Sora is a huge technological leap, but what excites me the most… pic.twitter.com/NQGfLRiq75
— Martin Nebelong (@MartinNebelong) February 16, 2024

The Architecture

According to Sora’s technical report, the architecture starts by training a network that reduces the dimensionality of visual data, compressing videos into a lower-dimensional latent space. That compressed representation is then split into a sequence of spacetime patches, which act as transformer tokens. Sora is a diffusion model that scales effectively as a video model and can generate videos with variable durations, resolutions, and aspect ratios. It can also be prompted with other inputs, such as pre-existing images or video, enabling a wide range of image and video editing tasks. Additionally, Sora exhibits emergent simulation capabilities, such as 3D consistency, long-range coherence, object permanence, interaction with the world, and simulation of digital worlds. However, it also has limitations: it does not always model the physics of basic interactions accurately, and it shows other failure modes.
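To make the idea of spacetime patches concrete, here is a minimal NumPy sketch of how a compressed video latent could be cut into patch tokens. It is an illustration only, not OpenAI's implementation: the function name `extract_spacetime_patches`, the `(T, H, W, C)` layout, and the patch sizes are all assumptions chosen for the example.

```python
import numpy as np

def extract_spacetime_patches(latent, pt, ph, pw):
    """Split a latent video of shape (T, H, W, C) into flattened spacetime patches.

    Hypothetical helper for illustration. Returns an array of shape
    (num_patches, pt * ph * pw * C), one row per transformer token.
    """
    T, H, W, C = latent.shape
    if T % pt or H % ph or W % pw:
        raise ValueError("patch sizes must evenly divide the latent dimensions")
    nt, nh, nw = T // pt, H // ph, W // pw
    # Split each axis into (number of patches, size of one patch).
    x = latent.reshape(nt, pt, nh, ph, nw, pw, C)
    # Move the patch-index axes to the front, the within-patch axes to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # Flatten into a sequence of tokens.
    return x.reshape(nt * nh * nw, pt * ph * pw * C)

# Assumed toy latent: 8 frames, 16x16 spatial grid, 4 channels.
latent = np.arange(8 * 16 * 16 * 4, dtype=np.float32).reshape(8, 16, 16, 4)
tokens = extract_spacetime_patches(latent, pt=2, ph=4, pw=4)
print(tokens.shape)  # (64, 128)
```

Because the number of tokens simply follows from the latent's shape, the same routine handles clips of different lengths, resolutions, and aspect ratios, which is one plausible reading of why a patch-based representation suits variable-size video.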

Sora is an end-to-end diffusion transformer model that takes text or images as input and outputs video pixels. Trained by gradient descent on vast volumes of video data, Sora acquires an implicit internal understanding of physical dynamics, essentially forming a learned simulator, or “world model.” Sora doesn’t call Unreal Engine 5 (UE5) in its processing loop, but text and video pairs created with UE5 could be included in its training data as synthetic examples.
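For readers unfamiliar with diffusion training, the sketch below shows the generic idea in NumPy: noise is mixed into clean patch tokens, and a model is scored on how well it recovers the clean version. This is a standard DDPM-style formulation used here as an assumption; Sora's actual noise schedule, loss, and network are not described in this article, and the names `add_noise` and `denoising_loss` are hypothetical.

```python
import numpy as np

def add_noise(clean, alpha_bar, rng):
    """Mix Gaussian noise into clean tokens (generic DDPM-style forward step).

    noisy = sqrt(alpha_bar) * clean + sqrt(1 - alpha_bar) * noise
    alpha_bar = 1.0 keeps the signal intact; alpha_bar = 0.0 is pure noise.
    """
    noise = rng.standard_normal(clean.shape)
    noisy = np.sqrt(alpha_bar) * clean + np.sqrt(1.0 - alpha_bar) * noise
    return noisy, noise

def denoising_loss(predicted_clean, clean):
    """Mean squared error between the predicted and the true clean tokens."""
    return float(np.mean((predicted_clean - clean) ** 2))

rng = np.random.default_rng(0)
clean_patches = rng.standard_normal((64, 128))  # assumed toy patch tokens
noisy_patches, _ = add_noise(clean_patches, alpha_bar=0.5, rng=rng)

# Stand-in "models": one returns its noisy input unchanged, one is a perfect oracle.
loss_identity = denoising_loss(noisy_patches, clean_patches)
loss_perfect = denoising_loss(clean_patches, clean_patches)
print(loss_identity > loss_perfect)  # True
```

In a real system the stand-in models would be replaced by a transformer conditioned on the text prompt, and gradient descent would push its loss toward that of the perfect oracle.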

Limitations

Sora’s emergent physics understanding is still fragile and imperfect.

Sora dream physics. pic.twitter.com/CxbnQgxMo2
— Andrew Curran (@AndrewCurran_) February 16, 2024

Despite extensive research and testing, OpenAI acknowledges that it cannot predict all the beneficial ways people will use the technology, nor all the ways people will abuse it.

Sora is a diffusion model built on a transformer architecture similar to that of GPT models. It builds on past research in DALL·E and GPT models, using the recaptioning technique from DALL·E 3 to follow the user’s text instructions in the generated video more faithfully. According to OpenAI, Sora serves as a foundation for models that can understand and simulate the real world, a capability the company believes will be an important milestone for achieving AGI.