Season 2
S2-EP01: Feature Importance (MDI, Permutation, SHAP) & Causal Misinterpretation
Understanding feature importance methods: MDI, Permutation, and SHAP. Learn about Gini importance bias, correlation vs causality, out-of-distribution problems, and why feature importance explains the model, not reality.
And finally, the AI has spoken! We now know exactly what drives our sales!
Look! `Transaction_ID` is the most important feature!
We need to optimize our Transaction IDs! If we make them bigger, we'll make more money!
Shez. The `Transaction_ID` is a unique number for every row.
Yeah! And the Random Forest loves it! `feature_importances_` gave it a score of 0.4!
Your model is as clueless as you.
This is **Gini Importance** (Mean Decrease in Impurity).
Trees are lazy. They want to separate the data into clean buckets as fast as possible. What is the easiest way to identify a specific row?
A 'Red Shirt' (Low Cardinality) is a weak splitter. An 'ID Card' (High Cardinality) is a perfect splitter.
The tree grabs the ID card every time because it guarantees a 'Pure' leaf node.
It mistakes 'Memorization' for 'Importance.'
So... `Transaction_ID` is just... noise?
It's high-cardinality garbage. The model overfitted to it.
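A minimal sketch of the failure mode (synthetic data, all column names assumed): a pure-noise ID column still earns a noticeable `feature_importances_` score simply because it is unique per row.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1_000
X = pd.DataFrame({
    "transaction_id": np.arange(n),          # unique per row: pure noise
    "price": rng.normal(50.0, 10.0, n),      # genuinely predictive
    "red_shirt": rng.integers(0, 2, n),      # low-cardinality, weak signal
})
y = (X["price"] + rng.normal(0.0, 10.0, n) > 50).astype(int)  # depends on price only

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for name, score in zip(X.columns, model.feature_importances_):
    print(f"{name:15s} {score:.3f}")  # the noise ID typically grabs a surprising share
```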
Okay, okay! Ignored. But look at the second one! 'Ice Cream Sales.'
It predicts 'Shark Attacks' perfectly!
We need to ban ice cream to stop the sharks!
(Facepalming) Noooo. Not the correlation fallacy again.
The Sun (Summer) causes Ice Cream Sales.
The Sun (Summer) causes Shark Attacks (people swimming).
Ice Cream and Sharks are **Correlated**, but not **Causal**.
Your Random Forest doesn't know what the Sun is. It only sees the Ice Cream data.
So it says: 'Ice Cream is an important predictor.'
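A hedged toy simulation of the confounder (all names invented): 'summer' drives both ice cream sales and shark attacks, but the model only ever sees ice cream, so it hands ice cream the credit.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
summer = rng.uniform(0, 1, 2_000)                # hidden confounder the model never sees
X = np.column_stack([
    100 * summer + rng.normal(0, 5, 2_000),      # ice cream sales (caused by summer)
    rng.normal(0, 1, 2_000),                     # unrelated noise feature
])
y = 3 * summer + rng.normal(0, 0.3, 2_000)       # shark attacks (also caused by summer)

model = RandomForestRegressor(random_state=1).fit(X, y)
print(model.feature_importances_)  # ice cream dominates, yet banning it changes nothing
```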
But... if the model works, does it matter?
It matters if you try to *act* on it. If you ban ice cream, the sharks will still eat you.
So `feature_importances_` is useless?
It's useful for **Debugging**. It tells you what the model is looking at.
It does *not* tell you how the universe works.
Is there a better way? A way to prove the feature actually does work?
We stop asking the tree what it *thinks*. We test what it *needs*.
Permutation... shuffling?
Imagine the model is a Jenga tower. We want to know which block is holding it up.
If I scramble (shuffle) a noise feature, the model's accuracy doesn't change.
Importance = 0.
If I scramble 'Price,' the model fails.
Therefore, 'Price' is **Load-Bearing**.
That makes sense! We break the data and see if the model cries!
`from sklearn.inspection import permutation_importance`.
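A minimal sketch of that call (toy data and names assumed): shuffle each column on a held-out split and measure how far the score drops.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame({"price": rng.normal(50, 10, 1_000),
                  "noise": rng.normal(0, 1, 1_000)})
y = (X["price"] > 50).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for name, drop in zip(X.columns, result.importances_mean):
    print(f"{name:8s} mean score drop: {drop:.3f}")  # ~0 for noise, large for price
```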
I'm going to shuffle everything!
Wait. There is a catch.
...there's always a catch...
What happens if you shuffle the 'Pregnant' column randomly?
...I'm not woke enough to think this is possible...
That is **Out-of-Distribution (OOD)** data.
When you permute correlated features, you create impossible monsters.
The model has never seen a pregnant man. It might output a crazy prediction, causing a huge error spike.
You will think 'Pregnant' is super important, but really, you just confused the model with nonsense.
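A hedged illustration of the OOD problem (hypothetical columns): shuffling 'pregnant' independently of 'sex' manufactures rows the model has never seen.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
sex = rng.choice(["male", "female"], size=1_000)
pregnant = np.where(sex == "female", rng.integers(0, 2, 1_000), 0)  # only coherent for females
df = pd.DataFrame({"sex": sex, "pregnant": pregnant})

shuffled = df.assign(pregnant=rng.permutation(df["pregnant"].to_numpy()))
impossible = ((shuffled["sex"] == "male") & (shuffled["pregnant"] == 1)).sum()
print(f"impossible rows created by the shuffle: {impossible}")  # pregnant men: pure OOD
```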
Standard importance is biased. Permutation creates monsters. Is nothing real?
Try **SHAP** (SHapley Additive exPlanations).
Imagine every feature is a player on a team, trying to pull the prediction away from the baseline.
We want to know: how much does Player A *actually* pull the rope?
To be fair, we try every lineup: Player A alone, with Player B, with Player C and D, every possible combo.
SHAP calculates the *marginal contribution* of each feature, averaged over all these coalitions.
If Player A and Player B always pull together, SHAP splits the credit fairly between them.
It's colorful. Lots of red and blue dots!
It even shows *direction*: High Price (red) might push probability left, Low Price (blue) pushes right.
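A minimal sketch with the `shap` package (toy regressor and column names assumed; the exact plotting call may vary slightly between shap versions):

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame({"price": rng.normal(50, 10, 500),
                  "red_shirt": rng.integers(0, 2, 500)})
y = 2 * X["price"] + rng.normal(0, 5, 500)

model = RandomForestRegressor(random_state=0).fit(X, y)
explainer = shap.TreeExplainer(model)   # fast path for tree ensembles
shap_values = explainer(X)              # one contribution per row and feature
shap.plots.beeswarm(shap_values)        # the red/blue dot summary, with direction
```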
So SHAP tells me the Truth?
No.
What?!
SHAP explains the **Model**. It faithfully describes the math inside the black box.
If your model uses 'Ice Cream' to predict 'Sharks,' SHAP will faithfully tell you: 'Ice Cream increases Shark Risk.'
It explains the *Model's Logic*. It does not validate the *Model's Logic*.
Feature Importance is a mirror. It reflects the biases, the shortcuts, and the correlations your model learned.
So I shouldn't call it 'Drivers of Revenue.'
I should just call it... 'Stuff the Model Likes.'
Yup. 'These are the levers the AI is pulling. We need to verify if they are actually attached to anything real.'
And delete the `Transaction_ID` column before the shareholder meeting. You're gonna crash our stock when the shareholders see it.
Deleting!
S2-EP02: Recommender System Feedback Loops and Popularity Bias
Understanding recommender system feedback loops and popularity bias. Learn about filter bubbles, recommendation algorithms, and how systems amplify existing preferences.