Nelworks
Season 2

S2-EP04: Post-Selection Inference and Multiple Comparisons Problem

Understanding post-selection inference and the multiple comparisons problem. Learn about p-hacking, the false discovery rate, and why data snooping invalidates statistical tests.

WE CURED IT!
We cured what? The common cold? Death? Or just your boredom?
Look! We screened 50,000 chemical compounds against the virus.
I found the 'Golden Three.'
Compound X-99. p-value: 0.001!
$p = 0.001$. That's very significant.
I know! The chance of this happening by luck is 1 in 1,000! It's virtually impossible that this is a fluke!
Shez. You tested 50,000 compounds.
Yeah! High-Throughput Screening! Big Data!
If you roll a 1,000-sided die... 50,000 times... how many times will you roll a '1'?
Uh... 50 times?
Exactly. You rolled the dice 50,000 times. You got about 50 'miracles' purely by chance.
And you picked the top 3 and showed them to me.
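A quick way to check the die-roll intuition in code. This is a minimal sketch (Python with NumPy, numbers invented for illustration): 50,000 compounds that all do nothing, screened at the $p < 0.001$ threshold from the story.

```python
import numpy as np

rng = np.random.default_rng(42)

n_compounds = 50_000   # every compound in this simulation is a dud
alpha = 0.001          # the 'very significant' threshold from the screen

# Under the null hypothesis, p-values are uniformly distributed on [0, 1],
# so a screen of 50,000 useless compounds is just 50,000 uniform draws.
p_values = rng.uniform(0, 1, size=n_compounds)

false_hits = np.sum(p_values < alpha)
print(f"'Significant' duds at p < {alpha}: {false_hits}")  # roughly 50, by luck alone
```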
But I filtered them! I threw away the failures!
That is exactly the crime.
Imagine a cowboy shooting at a barn. He has no aim. He just sprays bullets.
Then, *after* seeing where the bullets hit, he paints a target around them.
'Look! I'm a sniper!'
You aren't a sniper. You're a fraud. That's the **Texas Sharpshooter fallacy**: you conditioned the target on the result.
You cannot calculate a p-value on the winners if you ignore the losers.
You are assuming the hypothesis was 'Is X-99 effective?' *before* you started.
But your hypothesis was actually 'Is *Any* of these 50,000 effective?'
Well... yes. I wanted to find the best one.
If you use a 0.05 threshold... the probability of finding at least one fake 'cure' in this room is essentially **100%**.
So... my p-value isn't 0.001?
Your p-value is garbage. It's **Post-Selection Inference**.
You used the data to *generate* the hypothesis, then used the *same data* to *prove* it. That is **Double Dipping**.
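For the record, the '100%' claim a few lines up checks out: assuming 50,000 independent tests of useless compounds at the 0.05 threshold, $P(\text{at least one false positive}) = 1 - (1 - 0.05)^{50000} \approx 1$.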
But look at the effect size! It killed 90% of the virus! The average compound only killed 10%.
Even if it's lucky, it's also strong!
Enter the **Winner's Curse**.
Imagine a jumping contest. Everyone has a 'True Skill' (their average jump height) plus 'Luck' (wind, shoes, breakfast).
The score you see is real skill plus random noise: $\text{Score} = \text{Skill} + \text{Noise}$.
To be the absolute #1 winner out of 50,000, you need two things: High Skill... AND Massive Luck.
The winner is almost always the person who had the *most positive noise*.
Now, make him jump again.
Okay.
What happened? Did he get tired?
No. His luck ran out. The noise went back to zero, on average.
His score **Regressed to the Mean**.
This compound probably works a *little* bit. But your measurement says '90% efficacy.'
That 90% is likely 40% Reality + 50% Luck.
So if I send this to clinical trials...
It will fail, the way roughly 90% of drug candidates that enter clinical trials eventually do. Because you selected the outlier, not the truth.
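Here is a minimal sketch of the Winner's Curse in Python (all numbers invented): every compound has a modest true effect, we crown the best-looking one out of 50,000, then measure it again with fresh luck.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 50_000
skill = rng.normal(loc=10, scale=5, size=n)    # true % kill: modest for everyone
luck = rng.normal(loc=0, scale=20, size=n)     # measurement noise in the screen

score = skill + luck                           # Score = Skill + Noise

winner = np.argmax(score)
print(f"Winner's screen score: {score[winner]:.0f}%")   # looks spectacular
print(f"Winner's true effect:  {skill[winner]:.0f}%")   # much more modest

# Make him 'jump again': same skill, brand-new luck.
retest = skill[winner] + rng.normal(loc=0, scale=20)
print(f"Winner's retest score: {retest:.0f}%")           # regresses toward the mean
```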
So I can't screen? I can't look for the best?
Do I have to test one drug at a time for the rest of my life?
You can screen. But you have to pay the tax.
Option 1: **The Bonferroni Correction**.
Sounds like a pasta.
You divide your p-value threshold by the number of tests.
$0.05 / 50,000 = 0.000001$.
X-99's p-value is 0.001. The required threshold is 0.000001.
It fails. It's not significant anymore.
Correct. Bonferroni is brutal. It guards against even a single false positive anywhere in the 50,000 tests, so it assumes the worst case.
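The same arithmetic as a sketch in code, using the story's numbers, plus a check (assuming independent tests) of why dividing by 50,000 keeps the overall false-positive chance near 0.05:

```python
n_tests = 50_000
alpha = 0.05

threshold = alpha / n_tests          # Bonferroni threshold: 1e-06
p_x99 = 0.001

print("X-99 survives Bonferroni?", p_x99 < threshold)   # False

# Why it works: with the corrected threshold, the chance of even one
# false positive across all 50,000 (independent) tests stays near alpha.
fwer = 1 - (1 - threshold) ** n_tests
print(f"Family-wise error rate after correction: {fwer:.3f}")  # about 0.049
```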
That's too hard! I'll never find anything! Is there an easier way?
Option 2: **Data Splitting**. The Engineering Way.
You screened 50,000 compounds? Great.
Use the first half of the data (call it Set A) to **Select** the winners.
Okay, I picked X-99, Y-20, and Z-01 based on Set A.
Now, throw away all the data from Set A. Forget it exists.
Test those 3 winners on **Set B**.
In Set B, you are only testing 3 hypotheses. Not 50,000.
So you don't need a crazy correction. $0.05 / 3 \approx 0.017$ is easy to beat.
And because Set B is fresh... the 'Luck' from Set A doesn't carry over!
Exactly. If X-99 was just lucky noise in Set A, it will likely be average in Set B. The lie will reveal itself.
Splitting the dataset!
Training Set (Screening): Identifying candidates...
Validation Set (Confirmation): Testing the candidates...
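Roughly what that workflow looks like in code. A minimal sketch with simulated data (one genuinely active compound hidden among 50,000 duds; the compound index, effect size, and noise level are all invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

n_compounds, n_reps = 50_000, 8
true_effect = np.zeros(n_compounds)
true_effect[123] = 25.0   # the one real 'Z-01'; everything else does nothing

# Two independent halves of the experiment: kill-rate measurements with noise.
set_a = true_effect[:, None] + rng.normal(0, 10, size=(n_compounds, n_reps))
set_b = true_effect[:, None] + rng.normal(0, 10, size=(n_compounds, n_reps))

# Step 1: SELECT on Set A only, keeping the 3 best-looking compounds.
winners = np.argsort(set_a.mean(axis=1))[-3:]

# Step 2: CONFIRM on Set B only. Now it's just 3 hypotheses, so 0.05 / 3.
for c in winners:
    p = stats.ttest_1samp(set_b[c], popmean=0, alternative="greater").pvalue
    print(f"Compound {c}: p = {p:.4f}, confirmed = {p < 0.05 / 3}")

# Typically only the genuinely active compound survives the confirmation step.
```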
X-99 was a fraud! It regressed to the mean!
But Z-01... Z-01 is the real deal!
X-99 was the Texas Sharpshooter's target.
Z-01 is a verified hit.
We actually found one. A real one.
You found one because you stopped lying to yourself about the 49,999 failures.
Now run a replication study on a different day, with different reagents, by a different person.
What? Why?!
Because **Batch Effects** are a whole other nightmare, but perhaps we will leave that for another time.