Academic curiosity crystallised when Temporal GAN (TGAN) showed that a pair of generators, one for time and one for pixels, could learn to synthesize small moving scenes from scratch, proving that GANs were not limited to still images.
MoCoGAN introduced the idea of splitting “what” from “how”, decomposing content from motion so that a fixed subject could perform endlessly varied actions. The same year, the term “deepfake” entered public vocabulary as hobbyists swapped faces with open-source autoencoders, foreshadowing ethical turbulence.
Research pivoted to human-action clips: a two-stage framework first generated skeletal motion, then rendered photorealistic frames, improving both realism and clip length. These papers seeded later video-to-video editors and showed the value of disentangling structure from appearance.
China's face-swap app ZAO hit the top of the App Store within 48 hours, letting users splice their faces into film and TV clips and triggering both viral memes and a privacy backlash. For the first time, AI-generated video became a mainstream consumer phenomenon rather than a research curiosity.
Two lines of work redefined the field. NeRF recast scenes as differentiable 3-D radiance fields, enabling free-viewpoint renders, while Wav2Lip delivered frame-accurate audio-driven mouth motion, making talking avatars practical on everyday hardware.
PlenOctrees distilled NeRFs into sparse octrees, achieving 150 FPS playback and proving that neural video could hit interactive rates—an essential step for games, VR and live cameras.
Meta's Make-A-Video leapt from text to silent five-second clips, marrying large-scale image-language pre-training with unlabeled video. The announcement signalled that generative video was moving from labs to corporate roadmaps.
Runway released Gen-1 (video-to-video stylisation) and Gen-2 (full text-to-video), wrapping powerful models in a web UI and API. Creators could now storyboard, grade and remix clips without specialist skills or GPUs.
OpenAI previewed Sora, generating minute-long 1080p shots with broadly coherent physics, then accelerated it with Sora Turbo. Google DeepMind unveiled Veo, aiming at cinematic quality and audio co-generation. These “world simulators” blurred the line between synthesis and reality.
Three commercial releases mark the present frontier: Luma's Dream Machine brings 4K clips and video-editing modes to phones; Google's Veo 3 ships worldwide with native sound; and Runway's Gen-3 Alpha adds director-grade camera and motion controls. The race is now about control, length and ethics, not mere plausibility.