Nelworks
Season 3

S3-EP08: When AI Dreams In Pictures

Understanding when AI dreams in pictures: vision-language models and multimodal AI. Learn about image generation, visual reasoning, and cross-modal understanding.

I found Waldo. But when I ask the AI where he is, it just says 'He is in the top right'.
That's useless! I want it to tell me he's 'behind the red tent'. Why is it so vague?
And this! I tried to upload this manual to find a wiring diagram. The AI timed out and said 'Image too large'. It's just a PDF!
You're running into two different generations of limitations at the same time. You're discovering that computers are actually legally blind.
Blind? It can generate a photorealistic astronaut riding a horse! How can it be blind?
Generating pictures is easy. *Seeing* is hard. To understand why your AI is failing today, we have to go back to how machines started seeing 20 years ago.
Meet the **CNN**. The Convolutional Neural Network. The grandfather of machine vision.
A CNN is a **Scanner**. It looks at a tiny group of pixels at a time. It finds an edge. Then a curve. Then a texture.
That sounds... logical? It builds the image from the ground up.
It is logical. But it lacks **Context**. The scanner sees a 'tail' on the left and an 'ear' on the right, but because they are far apart, it takes a long time for the math to connect them.
CNNs are great at textures, but bad at 'The Big Picture'. They struggle to understand relationships between distant objects. Like knowing that the tiny hand in the corner belongs to the person in the center.
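A toy sketch of that 'scanner' idea in PyTorch (an illustration of the concept, not any production model's code): a single 3x3 convolution only ever looks at a tiny neighbourhood of pixels at a time.

```python
import torch
import torch.nn as nn

# A toy "scanner": one convolutional layer with a 3x3 kernel.
# Each output value is computed from just a 3x3 patch of pixels,
# so distant parts of the image never interact in a single layer.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

image = torch.randn(1, 3, 224, 224)   # a fake 224x224 RGB image
features = conv(image)

print(features.shape)                 # torch.Size([1, 16, 224, 224])
# To relate a "tail" on the far left to an "ear" on the far right,
# you have to stack many of these layers so the receptive field
# slowly grows across the image.
```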
So, around 2020, researchers killed the Scanner. They replaced it with the **Vision Transformer (ViT)**.
Hey! You ruined it!
I **Patchified** it. This is how GPT-4 and Claude see.
The Transformer changed everything. It treats an image exactly like a sentence. It chops the 2D picture into small square patches, usually 16x16 pixels each, and reads them as a 1D sequence.
Square #1 is the word 'Ear'. Square #50 is the word 'Tail'. The AI doesn't see a grid. It reads a paragraph of visual words.
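Here is roughly what that chopping looks like in PyTorch (a simplified illustration; real models also project each patch into an embedding and add position information): a 224x224 image becomes a 'paragraph' of 196 visual words.

```python
import torch

image = torch.randn(3, 224, 224)          # channels, height, width
patch = 16

# Chop the 2D image into a 14x14 grid of 16x16 patches...
patches = image.unfold(1, patch, patch).unfold(2, patch, patch)
# ...then flatten the grid into a 1D sequence of "visual words".
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch * patch)

print(patches.shape)   # torch.Size([196, 768]): 196 visual words
```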
Why is chopping it up better than scanning?
Because of **Self-Attention**.
In a Transformer, every patch looks at every other patch *instantly*. It doesn't need to scan across. It grasps the **Global Context** immediately.
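A bare-bones sketch of that self-attention step (single head, random weights, purely for illustration): one matrix multiply scores every patch against every other patch, which is where the global context comes from.

```python
import torch
import torch.nn.functional as F

num_patches, dim = 196, 768
x = torch.randn(num_patches, dim)        # one embedding per patch

# Toy single-head attention: queries, keys, values from the same patches.
W_q, W_k, W_v = (torch.randn(dim, dim) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Every patch scores every other patch: a 196x196 matrix in one shot.
scores = Q @ K.T / dim ** 0.5
weights = F.softmax(scores, dim=-1)
out = weights @ V                        # each patch now mixes in global context

print(scores.shape)                      # torch.Size([196, 196])
```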
So it sees the whole cat at once! That's perfect!
It *would* be perfect. If it didn't have the **Quadratic Curse**.
Attention is expensive. If you compare every patch against every other patch... the math explodes. Double the number of patches and the cost goes up **4x**. Triple it, and it's **9x**.
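The arithmetic behind the curse, using the 16x16 patches from above (back-of-the-envelope only):

```python
def attention_cost(num_patches):
    # Every patch is scored against every other patch: n * n comparisons.
    return num_patches ** 2

# A 512x512 image cut into 16x16 patches gives (512 // 16) ** 2 = 1024 patches.
for n in (1024, 2048, 3072, 4096):
    print(f"{n:>4} patches -> {attention_cost(n):>12,} comparisons")

# 1024 ->  1,048,576
# 2048 ->  4,194,304   (2x the patches, 4x the work)
# 3072 ->  9,437,184   (3x the patches, 9x the work)
# 4096 -> 16,777,216
```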
So... if I feed it a 4K image?
Your GPU melts. The memory runs out instantly. So, engineers made a compromise.
They shrink the image. Before GPT-4 sees your photo, it resizes it. Usually to something like 512x512 pixels. Or it breaks it into a few 512x512 tiles.
That's why it couldn't find Waldo. To the AI, your high-res Waldo was just a single, blurry, brown pixel.
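A hedged sketch of that shrinking step using Pillow (the 512x512 target and the 4000x3000 original are just the kind of numbers in this conversation, not any specific model's real preprocessing pipeline):

```python
from PIL import Image

# Stand-in for a high-res Where's Waldo scan; a real script would Image.open() a file.
img = Image.new("RGB", (4000, 3000), "white")

# Before the model "sees" it, the picture is squeezed down to a fixed size.
small = img.resize((512, 512))

# A 50x50-pixel Waldo in the original survives as roughly
# 50 * (512 / 4000) ≈ 6 pixels: a smudge of red and brown.
print(img.size, "->", small.size)
```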
So the smartest AI on Earth sees the world through a dirty, low-res window because looking at high-res is too expensive.
Exactly. Until 2025.
DeepSeek just dropped a bomb on the industry. They figured out that we've been thinking about 'Visual Tokens' completely backwards.
Backwards how?
Traditionally, images are 'heavy'. Text is 'light'. If you have a page of text, it's cheap to process. If you take a *picture* of that page, it costs 10x more tokens to process the pixels.
Right. Because images are big files.
That's what we thought. So everyone used **OCR**. They would take an image, extract the text, and feed the text to the AI. Image -> Text -> Brain.
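That old pipeline, sketched with Pillow and pytesseract (assuming the Tesseract engine is installed; the blank page here is a stand-in for a real scan):

```python
from PIL import Image
import pytesseract   # assumes the Tesseract OCR engine is installed locally

# Stand-in for a scanned manual page.
page = Image.new("RGB", (1700, 2200), "white")

# Step 1: squeeze the pixels into plain text...
extracted_text = pytesseract.image_to_string(page)

# Step 2: ...and hand only the text to the language model.
prompt = "Find the wiring diagram in this manual page:\n\n" + extracted_text
print(len(extracted_text), "characters of text survive; the layout and diagrams do not.")
```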
But DeepSeek asked: 'What if the picture is actually *more efficient* than the words?'
Think about your own brain, Shez. When you remember a page in a textbook, do you memorize the ASCII string of every letter?
No... I remember the layout. I remember the chart was on the left, the bold text was in the middle.
Exactly. You use **Visual Spatial Memory**. It's a compressed representation. DeepSeek realized they could do the same thing inside their **Optical Character Recognition (OCR)** model.
They trained a new type of encoder. It doesn't chop the image into 'words'. It compresses the *visual concept* of the page.
The results are shocking. They found that 10,000 words of content could be represented by just **1,500 visual tokens**.
Wait... that's **10x compression**.
You're saying the picture of the book is smaller than the text of the book?
To the AI's brain? Yes. For the first time, seeing is cheaper than reading.
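The back-of-the-envelope math behind that claim, using only the figures quoted above plus the common rule of thumb of roughly 1.3 text tokens per English word:

```python
words = 10_000

# Rough rule of thumb for English: ~1.3 text tokens per word.
text_tokens = int(words * 1.3)          # ≈ 13,000 tokens if you paste the raw text

visual_tokens = 1_500                   # the figure quoted in this conversation

print(f"as text:   ~{text_tokens:,} tokens")
print(f"as pixels: ~{visual_tokens:,} tokens")
print(f"compression: ~{text_tokens / visual_tokens:.0f}x")   # ≈ 9x, rounded to 10x
```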
If visual tokens are cheap... that could solve the Context Window problem!
Completely. It blows the doors off.
Imagine you're a coder. You have a massive codebase. 100,000 lines of code. Instead of pasting the text and hitting the limit...
...you feed it the screenshots. The AI caches the 'visual memory' of your code.
It's like the physicist **Hans Bethe**. He memorized the periodic table so he never had to look it up. This allows the AI to 'memorize' your entire company's documentation visually.
But... does it understand it? Can it reason over pixels?
That's the trade-off. Can it do complex logic on a compressed visual memory? We're still finding out.
But for retrieval? For finding a needle in a haystack? It doesn't need to read the text 'wire A connects to wire B'. It just *sees* the line connecting them.
So we went from scanning pixels... to reading visual sentences... to basically implanting photographic memories.
We stopped trying to turn images into text. We started teaching the AI to dream in pictures.