Season 1
EP04 - K-means - Clustering That Is Outlier Sensitive
Understanding K-means clustering and outlier sensitivity. Learn about centroid-based clustering, outlier impact, and alternative clustering methods.
I created the future of logistics! I reorganized the entire inventory system!
By "reorganized," do you mean "randomized"?
No! I used Machine Learning! **K-Means Clustering**.
I grouped all 50,000 SKUs based on Price and Sales Velocity.
Explain that shelf.
Oh, that's Cluster_3. The "Mid-Tier" cluster.
You think a handbag and toilet paper belong in the same logistics group?
Mathematically, yes! They fell into the same Voronoi cell!
Shez, you used K-Means on `Price` vs `Volume`. Let me guess. You normalized the data first?
O-Of course! `StandardScaler`. I'm not an amateur.
And how did you pick `k`?
The Elbow Method! I plotted the inertia. The curve bent at `k=4`. I've been learning!
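(A minimal sketch of the pipeline Shez is describing; the file name and the `price` / `volume` column labels are placeholders, not from the episode.)

```python
# Scale the two features, sweep k, and plot inertia to look for the "elbow".
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("skus.csv")                      # hypothetical 50,000-row SKU table
X = StandardScaler().fit_transform(df[["price", "volume"]])

ks = range(1, 11)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)                  # within-cluster sum of squared distances

plt.plot(ks, inertias, marker="o")                # the bend ("elbow") suggests k
plt.xlabel("k")
plt.ylabel("inertia")
plt.show()
```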
Then you haven't finished your lesson. You used a spherical cookie cutter on dough that is shaped like a baguette.
Your data looks like this. Long streaks. Elongated shapes.
But K-Means assumes **Spheres**. It assumes the data is a ball.
Look at Cluster 3. It grabbed the expensive end of the cheap items and the cheap end of the expensive items.
But... the centroid minimizes the variance!
It minimizes *Euclidean distance*.
It draws a circle around a center. It doesn't care about the *shape* or *density* of the group.
K-Means places a gravity well. If a point is close to the center in straight-line distance, it gets sucked in.
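In symbols, the only quantity K-Means optimizes is the within-cluster sum of squared Euclidean distances: $\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$, where $\mu_i$ is the mean of cluster $S_i$. Shape and density never appear in that objective.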
But isn't that what we want? Things that are close together?
What if your data is shaped like a banana? Or a ring?
The average of a banana is in the empty space between the ends.
K-Means will put the centroid here. And it will split the banana in half.
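A quick way to see it, sketched on scikit-learn's synthetic two-moons data rather than the inventory set:

```python
# K-Means on banana-shaped (crescent) data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# The boundary between the two centroids is a straight line, so each crescent
# gets cut in half instead of being recovered as one cluster.
print(km.cluster_centers_)
```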
What is that?
That's the "Super Prototype Engine." We only sold one. It costs $1,000,000.
And why is it grouped with these $50 blenders?
Uh... well...
That Engine is an **Outlier**.
In K-Means, the centroid is the *Mean*. The Mean is a slave to outliers.
$(1 + 1 + 1 + 1000) / 4 = 250.75$.
The outlier dragged the average from 1 to roughly 251. Now the center of your cluster is nowhere near the actual data.
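The same arithmetic in code:

```python
# One extreme value moves the "center" far from where most of the data sits.
import numpy as np

prices = np.array([1, 1, 1, 1000])
print(prices.mean())   # 250.75 -- nowhere near the three $1 items
```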
So my "Mid-Tier" cluster is actually "Cheap Blenders + One Jet Engine"?
Yup. You wanted "Segmentation." K-Means gave you "Mathematical Averaging."
But everyone uses K-Means! It's in every Youtube tutorial! "Customer Segmentation 101"!
Because it's easy to teach. And it works... if your customers are spherical blobs floating in a vacuum.
Real data has covariance. Real data has shapes.
High-end items have high price variance but low volume variance. Cheap items are the opposite.
You need a shape that can stretch. You need to model the **Covariance Matrix**.
Enter **GMM** (Gaussian Mixture Models).
Gaussian... Mixture?
K-Means puts a hard fence around a circle. GMM puts a soft, probabilistic cloud over an ellipse.
GMM learns the *shape* of the cluster. It learns that "Luxury" means "High Price range, but narrow Volume range."
And because it's probabilistic, it knows the Jet Engine doesn't really fit. It's a "Low Likelihood" point.
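In symbols, a GMM models the data as a weighted sum of Gaussians, $p(x) = \sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)$, each with its own mean $\mu_j$ and covariance $\Sigma_j$. The full covariance is what lets a component stretch into an ellipse, and an outlier like the Jet Engine ends up with a tiny $p(x)$ under every component.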
It... it actually makes sense now.
So K-Means is... trash?
No. K-Means is a **Compression** algorithm.
It's great for compressing colors in an image (Vector Quantization). It's great when you want to partition space evenly for indexing.
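For instance, a rough sketch of color quantization with K-Means, using one of scikit-learn's bundled sample images:

```python
# Vector Quantization: compress an image down to 16 representative colors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_sample_image

img = load_sample_image("china.jpg") / 255.0      # (H, W, 3) floats in [0, 1]
pixels = img.reshape(-1, 3)                       # each pixel is a point in RGB space

km = KMeans(n_clusters=16, n_init=4, random_state=0).fit(pixels)
compressed = km.cluster_centers_[km.labels_].reshape(img.shape)
# 16 centroids stand in for every distinct color -- compression, not inference.
```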
But for **Inference**? For understanding the *structure* of your business? It is a blunt instrument.
I'm switching to GMM! Or maybe DBSCAN?
DBSCAN is good if you have weird shapes and noise, but it fails if the density varies too much.
Inventory data usually has varying density. GMM is your safest bet for this.
Just remember: **Covariance Type = 'full'**. Don't let it force spheres again.
`GaussianMixture(n_components=4, covariance_type='full')`.
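A fuller sketch of that swap (the file and column names are placeholders, as in the earlier sketch):

```python
# GMM with full covariance: ellipses instead of spheres, plus a likelihood per SKU.
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("skus.csv")                      # hypothetical SKU table
X = StandardScaler().fit_transform(df[["price", "volume"]])

gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0).fit(X)

labels = gmm.predict(X)          # hard cluster assignment per SKU
probs = gmm.predict_proba(X)     # soft membership across the 4 components
log_lik = gmm.score_samples(X)   # log-likelihood of each SKU under the fitted mixture

# The $1,000,000 engine shows up as a very low-likelihood point; flag the
# bottom 0.1% instead of letting it drag a cluster center around.
suspects = df[log_lik < np.percentile(log_lik, 0.1)]
```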
Running it now.
It works! The robot knows the engine is special!
Another crisis averted. You're welcome.
Wait! How do I pick `k` for GMM? Do I just guess?
Elbow. Silhouette score. **BIC** (Bayesian Information Criterion).
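A sketch of the BIC route, sweeping the number of components (same placeholder data names as before):

```python
# Fit a GMM for each candidate k and keep the one with the lowest BIC.
import pandas as pd
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("skus.csv")                      # hypothetical SKU table
X = StandardScaler().fit_transform(df[["price", "volume"]])

bics = [
    GaussianMixture(n_components=k, covariance_type="full", random_state=0)
    .fit(X)
    .bic(X)
    for k in range(1, 11)
]
best_k = 1 + bics.index(min(bics))   # BIC already penalizes extra components
print(best_k)
```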