Season 2
EP02 - Video AI Generation
How AI animates photos into videos. Learn about pose estimation, motion diffusion models, pixel warping, frame-by-frame generation, and audio-conditioned lip sync for video AI like Grok Imagine.
No. NO. That's my 9th-grade goth phase. I'm dying from cringe!
That's a static JPEG from a 10-year-old hard drive. It can't be MOVING.
Technically, it's not. You are. The JPEG is just the seed crystal for a generative cascade. The motion is entirely new.
Kurumi! Someone deepfaked my teenage cringe! Now I'll be single forever!
It's not a deepfake in the traditional sense. A deepfake swaps a face onto another video. This is far more profound. It built a universe from a single photograph.
You don't get it! I was about to move on from my dark history! My honor shall never be reclaimed...
Calm down, it's just a meme.
Don't you want to know how your picture moved on its own?
Well, I guess that could be interesting...
First, Grok segments you from the background. Then, it runs a pose estimation algorithm to create a **Keypose Skeleton**.
It finds your joints, your facial landmarks, everything. It turns your photo into a rigged, 3D puppet.
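In code, a keypose skeleton is little more than a set of named 2D joints plus the bones that connect them. A minimal sketch of what a pose estimator's output might look like (the keypoint names follow the common COCO convention; `KeyposeSkeleton` and `skeleton_from_detector` are illustrative names, not Grok's internals):

```python
from dataclasses import dataclass

# COCO-style keypoint names, as output by common 2D pose estimators.
KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow", "left_wrist", "right_wrist",
    "left_hip", "right_hip", "left_knee", "right_knee",
    "left_ankle", "right_ankle",
]

# Bones connect keypoint indices; animating the puppet means rotating
# these bones while keeping their lengths fixed.
BONES = [(3, 5), (5, 7), (4, 6), (6, 8), (3, 4),
         (9, 11), (11, 13), (10, 12), (12, 14)]

@dataclass
class KeyposeSkeleton:
    """The rigged 'puppet': one (x, y, confidence) triple per joint."""
    joints: dict

def skeleton_from_detector(raw: list) -> KeyposeSkeleton:
    """Wrap a detector's flat keypoint list into a named skeleton."""
    assert len(raw) == len(KEYPOINTS)
    return KeyposeSkeleton(joints=dict(zip(KEYPOINTS, raw)))

# Toy detector output: every joint at the same spot, high confidence.
raw = [(0.5, 0.1, 0.99)] * len(KEYPOINTS)
skel = skeleton_from_detector(raw)
print(skel.joints["nose"])  # (0.5, 0.1, 0.99)
```

Real systems also add dozens of facial landmarks to this skeleton so expressions can be animated, not just limbs.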
So there's a little voodoo doll of 9th-grade me inside a server somewhere?
Anatomically correct and ready for animation. Yes. Now, for the motion.
The prompt was 'sing a meme song'. The AI has a vast library of motion data from millions of videos. It uses a **Motion Diffusion Model** to generate a plausible sequence of movements for the skeleton.
It starts with pure noise and refines the whole motion sequence, step by step, until it statistically resembles a 'Rick Roll dance'.
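The refinement loop can be sketched in a few lines. This is a toy: in a real motion diffusion model a trained network predicts the noise to remove at each step, whereas here a fixed sinusoidal 'dance' stands in for the denoiser's estimate, just to show noise being iteratively refined into motion:

```python
import numpy as np

rng = np.random.default_rng(0)
T, J = 60, 15      # frames, joints (per-joint 2D positions)
steps = 50         # denoising steps

# Stand-in for the learned denoiser's clean-motion estimate:
# a sinusoidal sway applied to every joint.
target = np.sin(np.linspace(0, 4 * np.pi, T))[:, None, None] * np.ones((T, J, 2))

motion = rng.standard_normal((T, J, 2))   # step 0: pure noise
for t in range(steps):
    alpha = (t + 1) / steps               # how much we trust the estimate
    predicted_clean = target              # a real model would predict this
    motion = (1 - alpha) * motion + alpha * predicted_clean  # refine

print(float(np.abs(motion - target).max()))  # 0.0 — noise refined into motion
```

The key idea survives the simplification: the model never animates one frame in isolation; it denoises the entire (frames × joints) trajectory at once, which is what keeps the movement temporally coherent.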
Okay, so it has a puppet and it taught the puppet a dance. But the video looks like *me*.
This is the most computationally expensive and beautiful part.
The AI has the original photo and the new position of the skeleton for every single frame of the video. It then asks a billion-dollar question, 60 times per second:
What would the pixels of Shez's face look like if her skeleton were in this new pose?
It uses the original photo as a reference texture. It 'in-paints' and 'warps' the original image to fit the new frame, generating new pixels for parts of her that were occluded.
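A minimal sketch of the warp-then-fill idea, assuming the simplest possible motion (a rigid shift) in place of the dense flow field a real system derives from the skeleton; the NaN-marked holes are where a generative in-painter would hallucinate pixels that were occluded in the reference photo:

```python
import numpy as np

def warp_to_pose(photo, dx, dy, fill=0.0):
    """Backward-warp the reference photo by an (dx, dy) pixel shift.

    A real pipeline computes a dense flow field from the skeleton's
    new pose; 'fill' stands in for generative in-painting of pixels
    that have no source in the reference photo.
    """
    h, w = photo.shape
    frame = np.full((h, w), np.nan)
    ys, xs = np.mgrid[0:h, 0:w]
    src_y, src_x = ys - dy, xs - dx   # where each output pixel samples from
    valid = (0 <= src_y) & (src_y < h) & (0 <= src_x) & (src_x < w)
    frame[valid] = photo[src_y[valid], src_x[valid]]
    hole = np.isnan(frame)            # occluded: no reference pixel exists
    frame[hole] = fill                # stand-in for diffusion in-painting
    return frame, hole

photo = np.arange(16, dtype=float).reshape(4, 4)
frame, hole = warp_to_pose(photo, dx=1, dy=0)   # shift right one pixel
print(hole.sum())  # 4 — the newly revealed left column must be generated
```

This is why the step is so expensive: the warp itself is cheap, but every frame's holes need a full generative model pass, conditioned on the reference texture, to stay consistent with the person in the photo.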
Don't tell me. It also generated the terrible song?
This is what makes it feel real.
An LLM writes the cringe lyrics. A text-to-singing model generates the audio track. But they don't just happen separately.
The audio generation is *part of the condition* for the motion. The AI is told, 'Generate mouth movements that align with these phonemes and timestamps.' The model is forced to generate lip-sync that matches the generated sounds.
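Conditioning motion on timed phonemes can be illustrated with a phoneme-to-viseme lookup: each phoneme from the generated audio maps to a mouth shape, sampled per video frame. The openness values and function names here are illustrative, not taken from any real model:

```python
# Mouth openness per phoneme (a 'viseme'): values are illustrative.
VISEME_OPENNESS = {"AA": 1.0, "IY": 0.4, "M": 0.0, "UW": 0.6}

def mouth_track(phonemes, fps=60):
    """phonemes: list of (phoneme, start_sec, end_sec) as produced by
    the text-to-singing model; returns one mouth-openness value per
    video frame, which conditions the mouth keypoints of the skeleton."""
    duration = max(end for _, _, end in phonemes)
    frames = [0.0] * round(duration * fps)
    for ph, start, end in phonemes:
        for f in range(round(start * fps), round(end * fps)):
            frames[f] = VISEME_OPENNESS.get(ph, 0.5)
    return frames

track = mouth_track([("M", 0.0, 0.1), ("AA", 0.1, 0.3)], fps=10)
print(track)  # [0.0, 1.0, 1.0] — lips closed for 'M', wide open for 'AA'
```

Because the same timestamps drive both the audio track and the mouth keypoints, the lip-sync cannot drift: the video is generated to match the song, not stitched to it afterwards.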
So... someone uploaded my photo. The AI built a puppet of me, taught it a dance based on a text prompt, painted 600 new pictures of it, and synced it to a song it wrote on the spot.
Elegant, isn't it?
So it's not really me. It's a statistical ghost wearing my face as a mask.