Nelworks
Season 1

EP03 - UMAP: Still Have To Be Smart With Compressing

Understanding UMAP limitations and best practices. Learn about parameter tuning, distance metrics, and when UMAP compression can mislead.

It's full of stars!
(Typing on a physical keyboard nearby) You're drooling on the headset.
I ditched PCA. I ditched the Linear Algebra dinosaurs.
I found the God Algorithm. **UMAP**.
It handles non-linearity! It unfolds the Swiss Roll! It respects the local geometry!
And best of all, it's topology! I didn't have to worry about the "Unit Problem."
"Topology doesn't care about units?"
You are absolutely wrong.
Shez, how does UMAP build this beautiful graph?
It... finds the nearest neighbors for each point... and then connects them with fuzzy edges... and...
Stop. You said *finds the nearest neighbors.*
How does it measure *Nearest*?
Uh... distance?
Which distance?
The... default distance?
Euclidean Distance. The Pythagorean Theorem. $a^2 + b^2 = c^2$.
To UMAP, Point A is **1000 times closer** to Point B than it is to Point C.
The algorithm thinks the Voltage Spike is "The Same State," but the tiny movement is "Far Away."
So... the neighbors...
The neighbors are determined entirely by the variable with the biggest numbers.
You haven't escaped the Unit Problem. You just hid it inside the Neighbor Search.
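A minimal sketch of the point being made here, using only the Python standard library. The three readings and their values are hypothetical, invented for illustration (position stored as large numbers, voltage as small ones); they are not the data from the episode.

```python
import math

# Hypothetical readings: (position, voltage). Values are made up for illustration.
readings = {
    "A": (5000.0, 0.1),
    "B": (5000.0, 0.9),  # same position as A, but a large voltage spike
    "C": (5050.0, 0.1),  # same voltage as A, but a 1% change in position
}

def euclidean(p, q):
    """Plain Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest(name, points):
    """Return the name of the point closest to `name` under Euclidean distance."""
    others = (other for other in points if other != name)
    return min(others, key=lambda other: euclidean(points[name], points[other]))

print(nearest("A", readings))  # B
```

With these numbers the distance from A to B is 0.8 and from A to C is 50, so the voltage spike counts as "the same state" while the small move counts as "far away". Any neighbor graph built on top of this distance inherits that ordering.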
But UMAP has local scaling! The $\sigma$ value! It adapts to the density!
It adapts to the *density* of the neighbors it found.
But if it found the *wrong* neighbors because your units were messed up, it's just adapting to garbage.
UMAP is like a camera that focuses automatically.
But if you leave the lens cap on (Wrong Units), it just focuses perfectly on the back of the lens cap.
Look at your "Islands."
Island 1 is "High Meters." Island 2 is "Low Meters."
Where is the Voltage info?
The voltage... is lost. It was treated as noise because it was too small.
Okay. So I *still* have to normalize? `StandardScaler`?
Yes. You always have to normalize if your units are different. There is no "God Algorithm" that fixes bad physics.
Fine. `StandardScaler`. Then UMAP.
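To see what the scaling step changes, here is a rough standard-library sketch of z-scoring (subtract each column's mean, divide by its standard deviation), which is the same idea `StandardScaler` applies. The `rows` data, `standardize`, and `nearest_index` are hypothetical names and values made up for this illustration.

```python
import math

# Hypothetical readings: (position, voltage). Row 0 is the query point.
rows = [
    (5000.0, 0.10),  # A
    (5000.0, 0.90),  # B: same position as A, large voltage spike
    (5050.0, 0.10),  # C: same voltage as A, small move
    (4950.0, 0.20),
    (5100.0, 0.15),
    (4900.0, 0.12),
]

def standardize(data):
    """Z-score each column: subtract the mean, divide by the population std."""
    cols = list(zip(*data))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c))
            for c, m in zip(cols, means)]
    return [tuple((x - m) / s for x, m, s in zip(row, means, stds))
            for row in data]

def nearest_index(data, i):
    """Index of the row closest to row i under Euclidean distance."""
    others = [j for j in range(len(data)) if j != i]
    return min(others, key=lambda j: math.dist(data[i], data[j]))

scaled = standardize(rows)
print(nearest_index(rows, 0))    # 1: on raw units, the voltage spike is the neighbor
print(nearest_index(scaled, 0))  # 2: after scaling, the small move is the neighbor
```

In practice, with scikit-learn and umap-learn installed, the same order of operations could be written as a pipeline such as `make_pipeline(StandardScaler(), umap.UMAP())`; that line is a sketch of the idea, not something taken from the episode.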
*Click.*
Okay, this looks better.
Let me run it again just to be sure.
*Click.*
What?!
It changed shape! Why did it change shape?! Now I can't find ID_420!
Welcome to **Stochasticity**.
PCA is Linear Algebra. It solves for Eigenvectors. It is the same every time (give or take a flipped sign on an axis).
UMAP is an Optimization Process. It uses random initialization and stochastic gradient descent.
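The run-to-run change can be illustrated with a toy optimizer. This is not UMAP: `toy_layout` is a hypothetical stand-in that places points on a line using random initialization and randomly sampled pairwise updates, just to show how the seed decides which layout comes out.

```python
import random

def toy_layout(n_points, seed, n_steps=500, lr=0.05):
    """Toy stochastic layout (NOT UMAP): place points on a line so that
    points i and j end up roughly |i - j| apart."""
    rng = random.Random(seed)
    # Random initialization: every seed starts from a different arrangement.
    coords = [rng.uniform(-1.0, 1.0) for _ in range(n_points)]
    for _ in range(n_steps):
        # Stochastic step: update one randomly chosen pair at a time.
        i, j = rng.sample(range(n_points), 2)
        diff = coords[i] - coords[j]
        sign = 1.0 if diff >= 0 else -1.0
        # Gradient of (|diff| - target)^2 with respect to coords[i].
        grad = 2.0 * (abs(diff) - abs(i - j)) * sign
        coords[i] -= lr * grad
        coords[j] += lr * grad
    return coords

run_a = toy_layout(5, seed=0)
run_b = toy_layout(5, seed=1)
run_a_again = toy_layout(5, seed=0)

print(run_a == run_a_again)  # True: same seed, same layout
print(run_a == run_b)        # False: different seed, different coordinates
```

Both runs are trying to satisfy the same pairwise targets, yet the coordinates differ, because a shifted or mirrored layout satisfies those targets equally well. If a repeatable picture is needed, umap-learn exposes a `random_state` argument on `umap.UMAP` for fixing the seed.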
But... which shape is the truth? The Wishbone or the Donut?
Neither. Or both.
They are just 2D shadows of a high-dimensional object. The "Global Structure" in UMAP is not guaranteed to be preserved.
**Global Structure vs. Local Structure.**
UMAP cares about *Neighbors*. It will tear the global map apart to keep neighbors together.
Imagine the Earth.
PCA tries to flatten it like a Mercator projection. It preserves the general "North is Up."
UMAP ensures that New York is next to New Jersey.
But it might put the USA next to China just because they fit together nicely on the table.
So I can trust that the points *in this cluster* are similar...
Yes.
But I can't trust that Cluster A is close to Cluster B?
Exactly. The distance *between* clusters is often meaningless in UMAP.
This "God Algorithm" has a lot of fine print.
It's a tool for **Visualization**, not for rigorous engineering.
So what do I do? I need to use this data for the robot's navigation! I can't feed it a stochastic donut!
If you need features for a model: Use **Autoencoders** (or PCA). They are parametric functions. Once trained, the same input always maps to the same output.
If you need to make a pretty chart for the CEO: Use UMAP.
And if you *must* use UMAP for analysis... change the metric.
Change it from Euclidean?
Mahalanobis distance. It accounts for the variance and covariance of the data.
It essentially "learns" the units for you.
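A small sketch of why that claim holds, written for two features with the standard library only. The helper names and the sample values are hypothetical. The Mahalanobis distance is $\sqrt{(p - q)^\top \Sigma^{-1} (p - q)}$, where $\Sigma$ is the covariance matrix of the data, so multiplying a feature by a constant is cancelled out by the matching change in $\Sigma^{-1}$.

```python
import math

def covariance_2d(rows):
    """Sample covariance of two-column data, returned as (sxx, sxy, syy)."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_2d(p, q, cov):
    """sqrt((p - q)^T Sigma^{-1} (p - q)) for a 2x2 covariance matrix."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    # Inverse of [[sxx, sxy], [sxy, syy]], written out by hand.
    ixx, ixy, iyy = syy / det, -sxy / det, sxx / det
    dx, dy = p[0] - q[0], p[1] - q[1]
    return math.sqrt(dx * dx * ixx + 2.0 * dx * dy * ixy + dy * dy * iyy)

# Hypothetical readings: (position, voltage).
rows = [(5000.0, 0.10), (5000.0, 0.90), (5050.0, 0.10),
        (4950.0, 0.20), (5100.0, 0.15), (4900.0, 0.12)]

# The same data with position expressed in a unit 1000 times smaller.
rescaled = [(x * 1000.0, v) for x, v in rows]

d_original = mahalanobis_2d(rows[0], rows[1], covariance_2d(rows))
d_rescaled = mahalanobis_2d(rescaled[0], rescaled[1], covariance_2d(rescaled))

print(math.isclose(d_original, d_rescaled, rel_tol=1e-6))  # True
```

Changing the unit of one column leaves the distance unchanged, which is the sense in which the metric "learns" the units. It is equivalent to taking Euclidean distance after whitening the data with its own covariance.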
It looks like a robot leg now.
Because you used a metric that understands how the variables relate to each other.
Don't be seduced by the "Manifold Learning". At the bottom of every fancy algorithm is a simple function calculating `Distance(A, B)`.
If you feed that function garbage units, it will build you a garbage castle.