Note: Prof. Hung-yi Lee's lecture notes on the fundamentals of diffusion models have been added as background; they are helpful for following this AI by Hand (aibyhand) walkthrough of Sora's diffusion transformer.
Sora Diffusion Transformer: AI by Hand, Step by Step
Objective: Generate a video from a text prompt via iterative diffusion steps.
[1] Inputs (a toy setup is sketched after this list):
- Original Video (training data)
- Text Prompt: "sora is sky"
- Diffusion Step: t = 3 (noise level)
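To make the steps concrete, here is a minimal NumPy setup for these inputs; the shapes and values are illustrative assumptions, not Sora's actual dimensions.

```python
import numpy as np

# Toy stand-ins for the three inputs above (shapes are illustrative assumptions)
video = np.random.rand(4, 4, 4)   # (frames, height, width): a tiny "video"
prompt = "sora is sky"            # text prompt
t = 3                             # diffusion step (noise level)
```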
[2] Video Patching:
- Divide video frames into spacetime patches (e.g., 2x2 patches).
- Patching divides each frame into small, fixed-size units that together capture the video's spatial and temporal structure. This turns the video into a sequence of tokens a transformer can process and lets the model attend to local features. Patch size (e.g., 2x2, 4x4) is a design choice that trades sequence length against per-patch detail; see the sketch after this list.
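A minimal sketch of patching, continuing the toy setup above. It splits a single frame into non-overlapping 2x2 patches; Sora's real spacetime patchification also spans the time axis, which is omitted here for simplicity.

```python
import numpy as np

def patchify(frame: np.ndarray, p: int = 2) -> np.ndarray:
    """Split an (H, W) frame into non-overlapping p x p patches,
    returned as a (num_patches, p*p) matrix of flattened patches."""
    h, w = frame.shape
    assert h % p == 0 and w % p == 0, "frame must divide evenly into patches"
    return (frame.reshape(h // p, p, w // p, p)
                 .transpose(0, 2, 1, 3)   # group the p x p blocks together
                 .reshape(-1, p * p))     # one row per flattened patch

patches = patchify(video[0])   # first 4x4 frame -> four flattened 2x2 patches
```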
[3] Visual Encoder:
- Encode each patch into a lower-dimensional latent space with a learned linear projection (weights and biases) followed by a ReLU activation.
- This reduces data complexity (e.g., from a 4-dimensional patch vector to a 2-dimensional latent); see the sketch below.
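A minimal sketch of the encoder under these assumptions: a single linear layer plus ReLU mapping each 4-dimensional patch to a 2-dimensional latent. The weights here are random stand-ins; in practice the encoder is learned (e.g., a VAE), so the parameters are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "learned" parameters: project 4-dim patches to 2-dim latents
W_enc = rng.standard_normal((4, 2))   # weights (learned during training)
b_enc = np.zeros(2)                   # biases

def encode(patches: np.ndarray) -> np.ndarray:
    """Linear projection followed by ReLU, as described above."""
    return np.maximum(patches @ W_enc + b_enc, 0.0)

latents = encode(patches)             # (4, 2): one 2-dim latent per patch
```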
[4] Noise Addition:
- Add noise to the latent representation according to the diffusion step t. In the forward diffusion process, a higher t means more accumulated noise; see the sketch below.
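One common way to realize this step is the DDPM forward process, where noise accumulates with t through the cumulative schedule alpha-bar. The sketch below assumes that formulation and a toy linear beta schedule, not Sora's (unpublished) exact noising scheme; it continues the running example.

```python
import numpy as np

rng = np.random.default_rng(1)

# DDPM-style forward process (an assumption, not Sora's published schedule):
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I)
betas = np.linspace(1e-4, 0.02, num=10)   # toy linear noise schedule, T = 10
alpha_bar = np.cumprod(1.0 - betas)       # cumulative product: noise grows with t

def add_noise(x0: np.ndarray, t: int) -> np.ndarray:
    """Jump directly to step t of the forward diffusion process."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

noisy_latents = add_noise(latents, t)     # t = 3: moderately noised latents
```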