Season 2
S2-EP07: Feature Selection vs Causal Inference: Proxy Variables and Collider Bias
Understanding feature selection vs causal inference: proxy variables and collider bias. Learn about confounders, mediators, and why feature selection can introduce bias.
I've solved it, Kurumi! I know why the *Nebula Orchids* are dying!
Let me guess. You fed the sensor data into a model?
I used **Lasso Regression**. It zeroed out all the noise and left me with the one true killer.
The feature `leaf_color_yellow` has the highest coefficient!
The model says 'Yellow Leaves' predict death with 99% accuracy.
So, if I paint the leaves green, the plants will live! Causal intervention!
Shez. Put the paint down.
But the data says—
The data says *correlation*. You are about to commit a war crime against botany.
You used **Feature Selection**. You asked the model: 'What helps you *guess* the outcome?'
Right. And if it helps guess, it must be important!
In reality, the Fungus causes the Yellow Leaves. The Fungus *also* causes Death.
Yellow Leaves are a **Proxy**. (If the yellowing itself killed the plant, they'd be a Mediator.) Either way, they are a symptom.
But to your Lasso model, the 'Yellow Leaf' signal is super clear. It's easy to measure.
The 'Fungus' is hard to measure (it's underground). It's noisy.
So Lasso says: 'I don't need the Fungus variable. Yellow Leaves give me all the info I need.'
It drops the Cause and keeps the Symptom.
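(An aside, not part of the dialogue: a minimal sketch of that trap, with made-up numbers and variable names. The true cause is buried in sensor noise, the symptom is crisp, and Lasso tends to keep the symptom.)

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1_000

# Ground truth of this toy world: an underground fungus drives everything.
fungus = rng.binomial(1, 0.3, n).astype(float)

# The cause is hard to measure: the sensor reading is drowned in noise.
fungus_sensor = fungus + rng.normal(0, 2.0, n)

# The symptom is easy to measure: a clean proxy of the fungus.
leaf_color_yellow = fungus + rng.normal(0, 0.1, n)

# Death is caused by the fungus, never by the color of the leaves.
death = fungus + rng.normal(0, 0.3, n)

X = StandardScaler().fit_transform(
    np.column_stack([fungus_sensor, leaf_color_yellow]))
coefs = Lasso(alpha=0.1).fit(X, death).coef_
print(dict(zip(["fungus_sensor", "leaf_color_yellow"], coefs.round(3))))
# Lasso typically shrinks the noisy cause toward 0 and keeps the clean symptom.
```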
But... if I get rid of the yellow, doesn't it fix it?
No, you walnut.
Your model finds that 'Loud Beeping' is the #1 predictor of 'House Burning.'
So your solution is to cut the wires to the alarm.
By that logic, the fire should stop now.
The fire does not care about your regression coefficients.
Okay. So 'Yellow Leaves' is a proxy.
But what about this other feature? 'High Soil Moisture.' That's a cause, right?
Ah. Now we enter the **Collider Trap**.
Is that a wrestling move?
Two things cause wet soil. The Sprinkler. Or the Rain.
Your dataset has a **Selection Bias**. You only collect data when the sensors trigger 'Wet Soil' (because dry soil is boring).
You have 'Conditioned on a Collider.'
If the Soil is Wet, and I tell you 'It didn't Rain'... what do you know about the Sprinkler?
The sprinkler must be ON.
Exactly. By looking only at Wet Soil, you created a fake correlation between Rain and Sprinklers.
Your model thinks: 'Rain prevents Sprinklers.'
But modern Sprinklers *do* turn off when it rains!
Not these ones! These are dumb timers!
But the *data* makes them look smart because of how you filtered it.
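(A quick simulation of the collider, assuming nothing beyond two independent coin flips: Rain and Sprinkler are unrelated until you filter on Wet Soil.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two independent causes of wet soil: rain and dumb-timer sprinklers.
rain = rng.binomial(1, 0.3, n)
sprinkler = rng.binomial(1, 0.4, n)

# Wet soil is the collider: both arrows point into it.
wet_soil = (rain | sprinkler).astype(bool)

print(np.corrcoef(rain, sprinkler)[0, 1])  # ~0.0: truly independent
print(np.corrcoef(rain[wet_soil], sprinkler[wet_soil])[0, 1])
# Negative! Filtering on the collider manufactures 'Rain prevents Sprinklers'.
```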
Feature Selection is blind to structure.
It grabs whatever correlation exists in the filtered slice of reality you gave it.
It doesn't know if the link is causal, reverse-causal, or completely imaginary.
So... Lasso is useless? RFE is a lie?
No. They are for **Passive Prediction**.
If you just want to *bet* on the stock market or predict crop yield, use Lasso.
If you see Yellow Leaves, bet 'Short.' You will make money.
You don't need to know *why* it's dying to profit from its death.
But if you want to **Intervene**, e.g. *cure* the patient or save the crop, you need **Causality**.
You need to know `P(Y | do(X))`.
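(For the curious: when every confounder `Z` is measured, the backdoor adjustment `P(Y | do(X)) = Σ_z P(Y | X, Z=z) P(Z=z)` is computable by hand. A toy sketch with invented probabilities:)

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200_000

# A confounder Z drives both the treatment X and the outcome Y.
z = rng.binomial(1, 0.5, n)
x = rng.binomial(1, 0.2 + 0.6 * z)            # Z makes X more likely
y = rng.binomial(1, 0.1 + 0.3 * x + 0.4 * z)  # true effect of X is +0.3
df = pd.DataFrame({"z": z, "x": x, "y": y})

# Naive 'predictive' view: P(Y=1 | X=1) - P(Y=1 | X=0). Confounded.
naive = df.loc[df.x == 1, "y"].mean() - df.loc[df.x == 0, "y"].mean()

# Backdoor adjustment: P(Y=1 | do(X=x)) = sum_z P(Y=1 | X=x, Z=z) * P(Z=z)
p_z = df["z"].value_counts(normalize=True)
do = {xv: sum(df.loc[(df.x == xv) & (df.z == zv), "y"].mean() * p_z[zv]
              for zv in (0, 1))
      for xv in (0, 1)}
print(round(naive, 3), round(do[1] - do[0], 3))  # naive inflated; adjusted ~0.3
```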
How do I find the cause then? The model won't tell me.
You have to tell the model.
Me?
Draw the **DAG** (Directed Acyclic Graph).
Use your biology knowledge. Does Fungus cause Yellow Leaves? Or do Yellow Leaves cause Fungus?
Fungus causes Yellow Leaves.
Good. Now force the model to keep 'Fungus' and ignore 'Yellow Leaves' if you want to test a fungicide.
Use **Instrumental Variables** or **Propensity Score Matching**.
Do not just run `.fit(X, y)` and pray.
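(A sketch of what 'telling the model' can look like, with variable names borrowed from the orchid story: the regressors are chosen by the DAG, not by a regularizer.)

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1_000
fungus = rng.binomial(1, 0.3, n).astype(float)
df = pd.DataFrame({
    "fungus": fungus,
    "yellow_leaves": fungus + rng.normal(0, 0.1, n),  # downstream symptom
    "death": fungus + rng.normal(0, 0.3, n),
})

# DAG, drawn from biology rather than learned from data:
#   fungus -> yellow_leaves    and    fungus -> death
# To estimate fungus -> death: keep the cause, EXCLUDE its descendant.
# Controlling for the symptom would soak up the very effect we're measuring.
effect = LinearRegression().fit(df[["fungus"]], df["death"]).coef_[0]
print(round(effect, 3))  # ~1.0, the true effect in this toy world
```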
Painting the leaves green was... pretty stupid, huh?
It was a data-driven decision that turns stupid the moment it meets real-world context.
Now, if you want to save these orchids, stop looking at the feature importance chart.
Dig up the roots. Look for the rot.
Hey, Kurumi?
Yeah?
`Root_Rot` wasn't even in my dataset. The sensor is only above ground.
That is **Omitted Variable Bias**.
The most important feature is usually the one you forgot to measure.
So what now?
Recognize that most researchers have been exactly where you are. Now go correct the silly mistake. I'm off to check the flowers in the other greenhouse.