Nelworks
Season 1

EP06 - Naive Bayes and the Independence Assumption on Correlated Features

Understanding Naive Bayes and its independence assumption on correlated features. Learn about conditional independence, feature correlation, and when Naive Bayes fails.

Kurumi! I just built the ultimate spam filter!
(Leaning against a sorting bin) Really? It looks like you're just throwing envelopes at the wall.
No! It's **Naive Bayes**. The classic!
I trained it on 50,000 emails. It uses TF-IDF vectors!
Shez. Why is this in the Inbox?
W-Well... "Bank" is a safe word. "Account" is a safe word. "Your" is neutral.
The math said $P(Safe) > P(Spam)$!
"Free Money." "Click Here."
Your model can't hear the chorus. It only hears solo artists.
This is how Naive Bayes sees an email. A **Bag of Words**.
The order doesn't matter. The structure doesn't matter.
But that's the assumption! Conditional Independence! It simplifies the computation!
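A minimal sketch of what that assumption looks like in code, assuming scikit-learn and a made-up toy corpus (the emails and labels below are purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus (made up). Order and structure are discarded:
# each email becomes a bag-of-words count vector.
emails = [
    "your bank account statement is ready",   # safe
    "free money click here now",              # spam
    "click here to claim your free prize",    # spam
    "dinner with family this friday",         # safe
]
labels = ["safe", "spam", "spam", "safe"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# MultinomialNB scores a class as P(class) * product over words of P(word | class):
# every word contributes an independent factor, regardless of its neighbors.
model = MultinomialNB()
model.fit(X, labels)

test = vectorizer.transform(["your bank account"])
print(model.predict(test), model.predict_proba(test))
```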
You simplified the computation by lobotomizing the context.
In the real world, if you see "Nigerian," the probability of seeing "Prince" next to it skyrockets. They are a **Gang**.
A gang?
They are correlated features. They move together.
But your "Naive" model assumes they are total strangers who just happened to walk into the room at the same time.
You are multiplying their probabilities separately.
Let's say "Prince" appears in fairy tales (Safe). Let's say "Nigerian" appears in geography emails (Safe).
Your model calculates: Safe $\times$ Safe = Very Safe.
But "Nigerian Prince" is 100% Spam. Your model missed the **Interaction**.
Okay, so it misses some phrases. But TF-IDF fixes that, right? It weights the important words!
TF-IDF just makes rare words louder. It doesn't make them friends.
TF-IDF makes "Microbiology" scream louder than "The." But if I say "Bad Microbiology," the model still calculates "Bad" and "Microbiology" separately.
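A quick sketch, assuming scikit-learn's TfidfVectorizer and toy documents, of how the weights change while the independence stays:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (made up). TF-IDF boosts rare terms and shrinks common ones.
docs = [
    "the bad microbiology results came back",
    "the meeting is at the usual time",
    "the report on the quarterly numbers",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# "microbiology" gets a large weight, "the" gets a small one -- but each
# column is still scored on its own. "bad" and "microbiology" remain two
# separate features; the phrase "bad microbiology" is never a feature at all
# (unless you explicitly add n-grams, e.g. TfidfVectorizer(ngram_range=(1, 2))).
weights = dict(zip(vec.get_feature_names_out(), X.toarray()[0]))
print(sorted(weights.items(), key=lambda kv: -kv[1]))
```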
Look what we have here. Seems like a case of **Bayesian Poisoning**.
Bayesian Poisoning?
You fool. This is how spammers outsmart you. They stuff the email with "Good Words."
"Love," "Miss," "Family," "Dinner."
Since Naive Bayes just sums the log-probabilities (or multiplies raw probabilities), enough feathers will outweigh the lead.
Because the model assumes "Love" and "Friend" are independent evidence, it counts them as *separate* proof of innocence.
In reality, "Love" and "Friend" are redundant. You are **Double Counting** the fluff.
So the independence assumption creates... overconfidence?
Massive overconfidence. The probabilities are usually pushed to 0.999 or 0.001. It has no nuance.
What happened?!
A new word. A word your training set has never seen.
If the probability is zero, and you multiply...
The whole equation becomes zero.
You must pretend you've seen every word at least once. That's Laplace smoothing, and it prevents the "Zero Frequency" nuke.
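A hand-rolled sketch of additive (Laplace) smoothing with made-up counts; scikit-learn's MultinomialNB exposes the same idea through its alpha parameter:

```python
import math

# Counts from a made-up training set: how often each word appeared in each class.
counts_safe = {"dinner": 10, "family": 8, "free": 0}   # "free" never seen in safe mail
counts_spam = {"dinner": 1, "family": 1, "free": 12}
vocab_size = 3

def smoothed_log_prob(counts, word, alpha=1.0):
    # Additive (Laplace) smoothing: pretend every word was seen alpha extra times.
    total = sum(counts.values())
    return math.log((counts[word] + alpha) / (total + alpha * vocab_size))

# Without smoothing, P("free" | safe) = 0/18 = 0, and a single occurrence of
# "free" multiplies the whole safe score down to exactly zero.
# With alpha = 1 it becomes a small but nonzero probability instead:
print(math.exp(smoothed_log_prob(counts_safe, "free")))   # ~0.048, not 0.0
```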
Okay, smoothing added. But how do I fix the "Gang" problem? How do I catch "Nigerian Prince"?
You stop using a model that treats words like lonely islands.
Trees.
A Decision Tree splits the universe.
"IF contains 'Nigerian' AND contains 'Prince' THEN Spam."
Trees capture **Non-Linear Interactions**. They capture the dependency.
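A minimal sketch, assuming scikit-learn and hand-built word-presence features, of a tree learning exactly that rule:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Binary presence features (made up): [contains "nigerian", contains "prince"]
X = [
    [1, 1],  # "nigerian prince"  -> spam
    [1, 0],  # geography email    -> safe
    [0, 1],  # fairy tale         -> safe
    [0, 0],  # ordinary email     -> safe
]
y = ["spam", "safe", "safe", "safe"]

# A tree splits on one word, then splits *again inside that branch* on the
# other, so it can express "IF nigerian AND prince THEN spam" directly.
tree = DecisionTreeClassifier()
tree.fit(X, y)

print(export_text(tree, feature_names=["nigerian", "prince"]))
print(tree.predict([[1, 1]]))  # ['spam'] -- the gang is caught together
```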
So I should just trash Naive Bayes?
Not necessarily.
Naive Bayes is fast. It's $O(N)$. It requires almost no RAM.
If you are filtering millions of messages per second for simple sentiment (Positive/Negative), it's fine.
But for adversarial attacks? For complex fraud where context is king? Then you need something like Transformers or BERT.
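A rough, made-up micro-benchmark illustrating why Naive Bayes is the cheap option (timings vary by machine; the Transformer side is only described in comments):

```python
import time
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training set (made up). Prediction is one sparse matrix product plus an
# argmax, so scoring a large stream of short messages stays cheap.
train = ["free money click here", "dinner with family tonight",
         "claim your free prize now", "see you at the meeting"]
labels = ["spam", "safe", "spam", "safe"]

vec = CountVectorizer()
model = MultinomialNB().fit(vec.fit_transform(train), labels)

stream = ["win free money now", "are we still on for dinner"] * 500_000  # 1M messages
X = vec.transform(stream)

start = time.perf_counter()
model.predict(X)
print(f"scored {len(stream)} messages in {time.perf_counter() - start:.2f}s")

# A Transformer (BERT-style) reads the whole sequence with attention, so it can
# use context and word order -- at the cost of far more compute per message.
```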
I'm switching to BERT!
Yeah, about that...
Ha! Take that, Nigerian Prince!
BERT is computationally more expensive, but it looks like it paid off this time.
Just remember, no compute is free. You are always trading accuracy for cost and latency.