Nelworks
Season 1

EP08 - THE TYRANNY OF THE MAJORITY (IMBALANCED RANDOM FORESTS)

Understanding imbalanced datasets and Random Forests. Learn about class imbalance, SMOTE, and why majority class bias affects model performance.

The bank is secure, Kurumi! My AI is active!
Secure?
Look at that score! I trained a Random Forest on the transaction logs. It's practically perfect.
99.9% accuracy on fraud detection. Okay. Let's test it.
Wait... hold on...
It... it didn't flag you.
I just robbed you, and your model held the door open for me.
But the accuracy! It's 99.9%! How can it be wrong if the math says it's right?
Silly, it's because you're suffering from **Class Imbalance**.
How many transactions happen a day?
One million.
999,000 are legit. 1,000 are fraud.
If I write a script that says `return "Legit"` for *everything*...
It would be right 999,000 times... and wrong 1,000 times.
...99.9% accuracy.
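That lazy script, as a back-of-the-envelope sketch using the episode's hypothetical daily volumes:

```python
# Episode's hypothetical daily volumes: 999,000 legit, 1,000 fraud.
legit, fraud = 999_000, 1_000

# A "model" that returns "Legit" for everything is correct on every
# legit row and wrong on every fraud row.
accuracy = legit / (legit + fraud)
print(f"Always-Legit accuracy: {accuracy:.1%}")  # 99.9%
```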
Nooo. My model isn't smart. It's just lazy. It figured out that "Safe" is always a safe bet.
It's **KPI Cosplay**. You optimized for a metric that hides crime.
But why did the Random Forest do this? It's an ensemble! It's supposed to be robust!
Random Forest uses **Bagging** (Bootstrap Aggregating). Each tree gets a random subset of the data.
Since Fraud is so rare (0.1%), most of these trees didn't even *see* a fraud case in their training bag.
They have no concept of theft. They only know honest citizens.
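A rough sketch of those odds, assuming (hypothetically) that each tree trains on a small random subsample of 500 rows (e.g. via scikit-learn's `max_samples`). With full-size bootstraps the odds of missing fraud entirely are far lower, but individual trees and nodes still see very few fraud cases:

```python
# Fraud is 0.1% of transactions.
p_fraud = 0.001

# Hypothetical per-tree subsample size.
n = 500

# Probability that all n draws land on legit rows.
p_no_fraud = (1 - p_fraud) ** n
print(f"P(tree never sees fraud) = {p_no_fraud:.1%}")  # ~60.6%
```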
"I saw a crime! I saw it!"
Maybe 1 or 2 trees actually got the fraud data.
Transaction Safe?
YES! YES! YES! (Deafening roar).
This is **Majoritarian Rule**. The forest takes a majority vote.
98 votes for Safe. 2 votes for Fraud.
Result: Safe.
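The vote above as a toy sketch (the 98-2 split is the episode's hypothetical):

```python
from collections import Counter

# 100 trees vote; only 2 ever learned what fraud looks like.
votes = ["Safe"] * 98 + ["Fraud"] * 2

# Majority rule: the most common vote wins.
verdict, count = Counter(votes).most_common(1)[0]
print(verdict, count)  # Safe 98
```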
I-It's the Tyranny of the Majority! The trees are suppressing the truth!
The algorithm is designed to minimize error. To the algorithm, the 2 Fraud trees look like **Noise**.
And it gets deeper. Look at how the trees split. **Gini Impurity**.
The measure of disorder?
The Tree wants to separate the Blue and Red balls. But it's hard to isolate 10 tiny Red balls in a sea of Blue.
The Tree says, "Great job! I purified the big bucket!"
It doesn't care that the Red balls are still mixed in. It reduced the *average* impurity.
So the Fraud gets treated as an accepted impurity?
Exactly. The model thinks, "It's just a few bad apples, not worth a dedicated branch."
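Here is the bucket arithmetic as a sketch, with hypothetical counts of 990 Blue (legit) and 10 Red (fraud):

```python
def gini(counts):
    # Gini impurity: 1 minus the sum of squared class proportions.
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Parent node: 990 Blue balls, 10 Red balls. Already looks "almost pure".
parent = gini([990, 10])  # ~0.0198

# A lazy split: one big pure bucket of 900 Blue, and a leftover bucket
# where all 10 Red are still mixed in with 90 Blue.
left, right = [900, 0], [90, 10]
weighted = (sum(left) * gini(left) + sum(right) * gini(right)) / 1000

# Average impurity drops even though Red was never isolated.
print(parent, weighted)
```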
I need to fix this! I need to copy-paste the fraud rows and duplicate them 100 times!
**Oversampling** (copy-pasting) can lead to overfitting. The model just memorizes those specific frauds; you're effectively *double-counting voters*.
What about **SMOTE**? Generating synthetic frauds?
SMOTE helps. It creates new examples *between* existing frauds.
But be careful. If you SMOTE *before* you split your Test Set, you leak data. You must split first, then SMOTE only the Training set.
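A minimal SMOTE-like sketch of the split-first rule. Real SMOTE (e.g. imbalanced-learn's `SMOTE`) interpolates between k-nearest minority neighbours; this toy just interpolates between two training frauds on a single hypothetical feature:

```python
import random

random.seed(0)

# Hypothetical 1-D feature values for the known fraud rows.
fraud_rows = [10.0, 11.0, 12.0, 13.0]

# Split FIRST: hold out a test row before any resampling, so no
# synthetic point can be interpolated from test data (leakage).
test_rows, train_rows = fraud_rows[-1:], fraud_rows[:-1]

# SMOTE-like step: new samples between two existing *training* frauds.
def synth(a, b):
    return a + random.random() * (b - a)

synthetic = [synth(train_rows[0], train_rows[1]) for _ in range(3)]
# Every synthetic value lies between training frauds only.
```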
But there is a more elegant way. We don't need more data. We need to rig the election.
Class weight?
We tell the model: "Being wrong about a Fraud is **50 times worse** than being wrong about a Legit transaction."
We give the minority voters a louder voice.
Now, one "No" from a Fraud tree cancels out 50 "Yes" from the Legit trees.
FRAUD DETECTED!
This modifies the Gini calculation. It forces the tree to make splits that isolate the Red balls, even if it's "inefficient" for the Blue ones.
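A sketch of how a 50x class weight changes that impurity arithmetic (the counts are hypothetical; in scikit-learn the knob is `class_weight={0: 1, 1: 50}` on `RandomForestClassifier`):

```python
def weighted_gini(counts, weights):
    # Each class contributes count * weight to the impurity calculation.
    w = [c * wt for c, wt in zip(counts, weights)]
    total = sum(w)
    return 1 - sum((x / total) ** 2 for x in w)

# Unweighted, 990 Legit vs 10 Fraud looks almost pure: not worth splitting.
plain = weighted_gini([990, 10], [1, 1])

# With Fraud weighted 50x, the same node looks badly mixed, so the
# tree is pushed to find splits that isolate the Red balls.
boosted = weighted_gini([990, 10], [1, 50])
print(plain, boosted)
```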
Nooo! My Accuracy dropped! It's down to 92%!
But look at your Recall. It went from 0% to 85%.
That means... I'm catching 85% of the thieves?
Yes. But you are also flagging some innocent people as fraud (False Positives). That's why accuracy dropped.
Is that okay?
Would you rather annoy a customer with a "Did you buy this?" text, or let a hacker drain their account?
Annoy the customer. Definitely.
Finally, stop using the default `0.5` threshold.
The probability cutoff?
Random Forest outputs a probability (e.g., 0.3 chance of fraud).
In a balanced world, 0.5 makes sense. In a fraud world, even a **0.1** (10%) chance is scary.
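A toy sketch of moving that cutoff, on five hypothetical per-transaction fraud probabilities:

```python
# Hypothetical fraud probabilities from the forest.
probs = [0.02, 0.30, 0.08, 0.55, 0.12]

flags_default = [p >= 0.5 for p in probs]  # balanced-world cutoff
flags_fraud = [p >= 0.1 for p in probs]    # fraud-world cutoff

print(sum(flags_default), sum(flags_fraud))  # 1 vs 3 flagged
```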
It works! It sees you!
Finally. Now you can brag about your model. Not about the Accuracy, but about the **F1-Score**.
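The brag, in numbers: a sketch with hypothetical confusion counts consistent with the episode's 85% recall on 1,000 frauds (the 400 false positives are made up):

```python
tp, fn, fp = 850, 150, 400  # caught, missed, innocent-but-flagged

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"recall={recall:.2f}, precision={precision:.2f}, F1={f1:.2f}")
```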
Now give me back the gold bar you pocketed from the start.
...How did you know?
My Recall is 100%.