Season 2
S2-EP08: Synthetic Data, Sim-to-Real Gap, and Distributional Shift
Understanding synthetic data, sim-to-real gap, and distributional shift. Learn about synthetic data generation, domain adaptation, and when simulated data fails in real applications.
Prepare for takeoff! The 'SkyMaster AI' is ready.
You trained this pilot entirely in `Unity`, didn't you?
10 million flight hours! I used **Synthetic Data Generation**.
I simulated wind, rain, snow, and even Godzilla attacks. This drone has seen everything.
It hasn't seen everything. It has seen a video game.
You trained it on your own imagination.
Data is data, Kurumi! The math is the same!
Launching in 3... 2... 1...
But... but...
In the simulation, it dodged lasers!
In the simulation, the air doesn't have turbulence unless you code it.
In the simulation, the gyroscope doesn't drift because of temperature changes.
I don't understand. The confidence score was 0.99! It *thought* it knew what to do!
That is exactly the problem.
When you flood the model with synthetic data, you create an echo chamber.
The model hears the same simplified message a million times. It becomes **Overconfident**.
It mistakes 'Repetition' for 'Truth.'
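The 'echo chamber' effect can be sketched numerically. Below is a toy demo (not from the episode, and a deliberate simplification): a one-feature logistic regression trained with a *summed* gradient, so duplicating the same clean synthetic points 100x pushes the weights further and inflates confidence on a borderline real input.

```python
import numpy as np

def train_logreg(X, y, lr=0.1, epochs=200):
    """Full-batch logistic regression using the gradient *sum* (no averaging),
    so repeated identical samples shout louder every epoch."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-(w * X + b)))
        w -= lr * np.sum((p - y) * X)
        b -= lr * np.sum(p - y)
    return w, b

# A tiny, perfectly separable "synthetic" dataset: class 0 at -1, class 1 at +1.
X = np.array([-1.0, 1.0])
y = np.array([0.0, 1.0])

w1, b1 = train_logreg(X, y)                              # seen once
w2, b2 = train_logreg(np.tile(X, 100), np.tile(y, 100))  # echoed 100x

x_real = 0.2  # a borderline real-world input
conf_once = 1 / (1 + np.exp(-(w1 * x_real + b1)))
conf_echo = 1 / (1 + np.exp(-(w2 * x_real + b2)))
print(conf_once, conf_echo)  # the echoed model is markedly more confident
```

Same information content, wildly different confidence: repetition masquerading as truth.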
Look at the camera feed. See that grain? That's **Sensor Noise**.
It's just a little static.
Your synthetic data was perfect. Zero noise.
Your model learned to navigate using *perfect pixels*.
To the model, this 'static' looks like obstacles. It looks like the world is dissolving.
It panicked because it has never seen a 'dirty' pixel before.
But I added Gaussian noise to the simulation! I added blur!
Real noise isn't just Gaussian, Shez.
Shot noise is Poisson: its strength scales with the light level. Thermal noise scales with the heat of the sensor.
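A quick numpy sketch (an illustration, not the episode's simulator) of why that matters: additive Gaussian noise has the same spread everywhere, while Poisson shot noise has variance equal to the mean signal, so bright pixels are noisier than dim ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two patches of a "scene": dim (10 photons/pixel) and bright (1000 photons/pixel).
dim    = np.full(100_000, 10.0)
bright = np.full(100_000, 1000.0)

# Additive Gaussian noise: the same sigma regardless of brightness.
gauss_dim    = dim    + rng.normal(0, 3, dim.shape)
gauss_bright = bright + rng.normal(0, 3, bright.shape)

# Shot noise: photon counts are Poisson, so variance equals the mean signal.
shot_dim    = rng.poisson(dim).astype(float)
shot_bright = rng.poisson(bright).astype(float)

print(gauss_dim.std(), gauss_bright.std())  # both ~3, flat everywhere
print(shot_dim.std(),  shot_bright.std())   # ~sqrt(10) vs ~sqrt(1000)
```

A model trained only on constant-sigma Gaussian blur has never seen noise that grows with brightness.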
This is **Distributional Shift**.
Your model spent 10 million hours learning Curve A.
It optimized its weights to be a god in Curve A.
When you put it in Curve B (Reality), its 'Prior Beliefs' are so strong that it ignores the new data.
It tries to force reality to fit the simulation.
The map says there should be a farm here! Why is there a skyscraper?!
You are using a map of a different world.
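One way to put a number on 'Curve A vs Curve B' is to compare feature distributions directly. Here is a minimal histogram-based KL divergence sketch (my own illustration; names like `kl_divergence` are made up for the demo): the sim-trained 'prior' is Q, the real-world data is P, and a large KL(P‖Q) signals distributional shift.

```python
import numpy as np

def kl_divergence(p_samples, q_samples, bins=30):
    """Histogram estimate of KL(P || Q): how surprised a model trained on Q
    (simulation) is by data from P (reality). Larger = bigger shift."""
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    p, _ = np.histogram(p_samples, bins=bins, range=(lo, hi))
    q, _ = np.histogram(q_samples, bins=bins, range=(lo, hi))
    p = (p + 1e-9) / (p + 1e-9).sum()  # smooth to avoid log(0)
    q = (q + 1e-9) / (q + 1e-9).sum()
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(1)
sim  = rng.normal(0.0, 1.0, 50_000)  # Curve A: what the model trained on
same = rng.normal(0.0, 1.0, 50_000)  # a held-out slice of more sim data
real = rng.normal(1.5, 2.0, 50_000)  # Curve B: what the sensors actually report

kl_same = kl_divergence(same, sim)   # near zero: same distribution
kl_real = kl_divergence(real, sim)   # large: shifted mean and variance
print(kl_same, kl_real)
```

Monitoring a statistic like this on live sensor data is one cheap way to notice you are flying with the wrong map.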
So synthetic data is useless?
No. It's useful. But you have to treat it like **Toxic Waste**.
It's powerful fuel. But if you don't contain it, it poisons the well.
You let the synthetic data overwhelm the real data.
We need to do two things.
1. **Domain Randomization**.
Make the domain... random?
What did you do?! It looks like a glitch!
I broke the physics. I broke the lighting.
If the simulator looks *nothing* like reality... the model can't overfit to the simulator.
By changing the textures to 'Pink' and 'Polka Dots,' we force the model to ignore color.
It has to learn the *shape* of the obstacle, because the color keeps changing.
You're forcing it to learn **Invariant Features**.
Yes. Features that are true in both the nightmare and the real world.
Geometry is real. Texture is a lie.
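Domain randomization can be as simple as a wrapper around the data generator. This is a toy sketch (the `randomize_domain` helper is hypothetical, not from any episode code): everything *except* geometry is randomized per sample, so appearance carries no stable signal and the model is pushed toward shape.

```python
import numpy as np

def randomize_domain(img, rng):
    """Randomize appearance but never geometry: lighting, tint, exposure, noise.
    Since labels depend only on shape, shape is the only invariant feature left."""
    img = img.astype(float)
    img = img * rng.uniform(0.2, 1.8)             # random lighting gain
    img = img + rng.uniform(-0.3, 0.3)            # random exposure offset
    img = img * rng.uniform(0.5, 1.5, (1, 1, 3))  # random per-channel tint ("pink sky" lives here)
    img = img + rng.normal(0, 0.05, img.shape)    # random sensor-ish noise
    return np.clip(img, 0.0, 1.0)

rng = np.random.default_rng(2)
# A toy 8x8 RGB "obstacle": a bright square on a dark background.
scene = np.zeros((8, 8, 3))
scene[2:6, 2:6] = 0.9

# Every epoch the model sees the same geometry under a different appearance.
variants = [randomize_domain(scene, rng) for _ in range(4)]
```

The square is in the same place in every variant; only its look changes, which is exactly the point.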
2. **Loss Re-Weighting**.
Weighting the real data higher?
Even if you only have 1 hour of real flight data, that hour is sacred.
Set $\beta = 100$. Make the model pay 100x more penalty for failing on real data.
So the synthetic data teaches it 'how to fly generally.'
And the real data teaches it 'how to survive specifically.'
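The $\beta = 100$ idea translates directly into a per-sample loss weight. A minimal numpy sketch (illustrative only; `weighted_mse` is an invented helper): a model that is perfect on simulation but wrong on the single real sample looks fine under a plain average, and terrible once the real sample is weighted 100x.

```python
import numpy as np

def weighted_mse(pred, target, is_real, beta=100.0):
    """Squared error with real-world samples weighted beta x.
    One real flight hour outvotes mountains of simulator footage."""
    w = np.where(is_real, beta, 1.0)
    return np.sum(w * (pred - target) ** 2) / np.sum(w)

# Model fits the 3 simulated samples perfectly, fails on the 1 real one.
pred    = np.array([1.0, 1.0, 1.0, 0.0])
target  = np.ones(4)
is_real = np.array([False, False, False, True])

unweighted = weighted_mse(pred, target, is_real, beta=1.0)    # 0.25: looks okay
weighted   = weighted_mse(pred, target, is_real, beta=100.0)  # ~0.97: alarm bells
print(unweighted, weighted)
```

With the weighting in place, gradient descent has no choice but to take the sacred real hour seriously.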
Okay. Retrained with Domain Randomization and Weighted Loss.
The confidence score dropped to 0.75.
Good. It's humble now.
A humble pilot checks the wind. An arrogant pilot crashes.
It's flying! It looks a bit shaky, but it's flying!
It's dealing with the noise. But it's actually *thinking* now, not just reciting a script.
Synthetic data is tricky.
It's a tool. Like a hammer.
Next time you want to use a simulator, remember:
Simulation is a hypothesis. Reality is the test. Never grade your own homework.
Can I keep the pink sky mode though? It looks cool.
It will start hallucinating cotton candy.