Nelworks
Season 2

S2-EP09: Label Noise, Confident Learning, and Ground Truth Bias

Understanding label noise, confident learning, and ground truth bias. Learn about noisy labels, annotation errors, and how label quality affects model performance.

The accuracy... it's too high.
Last time I got 99%, I leaked the future. The time before that, I overfitted to a single user.
I'm not falling for it this time. I'm not a rookie anymore.
I'm going to check the 'False Negatives.' The ones the model got 'wrong.'
Show me where you failed, you silicon brat.
...What?
THE LABELS ARE WRONG.
I paid the annotators $50,000! I outsourced this to 'Premium Labeling Corp'!
And they just clicked random buttons?!
It's all garbage! The 'Ground Truth' is a lie!
If the labels are wrong, my loss function is meaningless! I'm optimizing for hallucinations!
My data is poisoned. My career is over. I can't train a model on lies.
You're screaming at spreadsheets again. Is this a new debugging technique?
Kurumi, it's over. The dataset is compromised.
Look at this! It's a Banana labeled as an Apple! 10% of the data is mislabeled!
I have to throw it all away.
You have 100,000 images. You want to throw them away because humans are incompetent?
Garbage In, Garbage Out! That's the first rule of engineering!
That rule is for junior engineers.
Senior engineers know that **Garbage is Information**.
Your model predicted 'Banana' with 98% confidence. The label said 'Apple.'
What does that tell you about your model?
That it's wrong! According to the loss function, it made a huge error!
No. It tells you your model is **Smarter than the Annotator**.
Smarter?
Your model knows math. The teacher is drunk.
If the student is **Confident** ($p=0.99$) and the teacher disagrees... trust the student.
But... the model learned from the teacher! How can it know better?
Because the model saw *thousands* of examples.
The 'Apple' mistake was random (or lazy).
But the visual features of a Banana are consistent across the other 9,000 bananas.
The model learned the **Pattern**, not the **Individual Labels**.
We assume the noise is mostly random, or at least class-conditional: the mistake depends on the class, not on which particular image it was.
Deep Learning models fit *easy* patterns (clean data) first. They memorize the *hard* patterns (noisy labels) last.
So... the fact that my model 'failed' on these specific images...
Is actually the model flagging the errors for you.
It's screaming: 'Hey boss, this label doesn't look like the others!'
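A minimal sketch of that flagging rule, before reaching for any library. The threshold and the function name are illustrative assumptions, not a canonical recipe:

```python
import numpy as np

def flag_confident_disagreements(labels, pred_probs, threshold=0.95):
    """Flag rows where the model confidently disagrees with the given label.

    labels:     (n,) integer array of the given (possibly noisy) labels
    pred_probs: (n, k) array of OUT-OF-SAMPLE predicted probabilities
    threshold:  illustrative confidence cutoff, not a canonical value
    """
    predicted = pred_probs.argmax(axis=1)     # what the student thinks
    confidence = pred_probs.max(axis=1)       # how sure the student is
    disagrees = predicted != labels           # student vs. teacher
    return np.flatnonzero(disagrees & (confidence >= threshold))
```

The probabilities must come from held-out predictions (cross-validation or a held-out fold): once a big model has memorized a bad label, it will confidently agree with it and the signal is gone.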
We are going to use the **Off-Diagonals**.
The what?
The off-diagonal entries of the noise matrix. For example, we estimate the chance that a true Apple gets mislabeled as a Banana at about 2%: $P(\tilde{y}=\text{Banana} \mid y^*=\text{Apple}) = 0.02$.
Most of the time, labels are correct, so the diagonal cells are close to 1. But some pairs—like Cat/Dog—can have slightly higher confusion, like 1.5%.
If this confusion probability is high—say, above 10%—your annotators are probably systematically confused.
But if it's low—like 1% or 2%—and the model is very confident, then it’s likely a specific, fixable data error.
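A rough sketch of how those off-diagonal numbers can be estimated from the model's own predictions, in the spirit of confident learning. cleanlab's "confident joint" does this more carefully; the per-class thresholds below are a simplification:

```python
import numpy as np

def estimate_noise_matrix(labels, pred_probs):
    """Estimate P(noisy label | true label). Rows are the (unknown) true
    class y*, approximated by confident model predictions; columns are the
    given noisy labels. A diagonal near 1 means annotators were mostly right."""
    n_classes = pred_probs.shape[1]
    # Per-class threshold: average confidence the model assigns to class j
    # on examples currently labeled j (assumes every class appears at least once).
    thresholds = np.array(
        [pred_probs[labels == j, j].mean() for j in range(n_classes)]
    )
    guessed = pred_probs.argmax(axis=1)
    confident = pred_probs.max(axis=1) >= thresholds[guessed]

    counts = np.zeros((n_classes, n_classes))
    for y_star, y_noisy in zip(guessed[confident], labels[confident]):
        counts[y_star, y_noisy] += 1

    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0             # avoid division by zero
    return counts / row_sums                  # row: y*, column: noisy label
```

An off-diagonal entry like $P(\tilde{y}=\text{Banana} \mid y^*=\text{Apple}) \approx 0.02$ is the 2% quoted above; a suspiciously hot off-diagonal cell points to a systematic annotator confusion rather than a one-off slip.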
`import cleanlab`.
`cleanlab.filter.find_label_issues(labels, pred_probs)`.
It's scanning the dataset... comparing the model's confidence to the human labels.
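The two lines above, expanded into a runnable sketch. The classifier and the variables `X` (features) and `labels` (noisy labels) are stand-ins for your own pipeline; the `find_label_issues` call follows cleanlab's documented 2.x API:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

# X: (n, d) feature matrix, labels: (n,) noisy integer labels -- your data.
# pred_probs must be out-of-sample: a model that has already memorized the
# bad labels will happily agree with them and hide the errors.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000),
    X,
    labels,
    cv=5,
    method="predict_proba",
)

# Indices of likely label errors, worst offenders first.
issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(f"{len(issue_indices)} suspect labels out of {len(labels)}")
```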
(Laughing hysterically) They labeled a cat as a hot dog!
Probably a 'Not Hot Dog' app gone wrong.
They are all errors! The model found them!
My model isn't broken. It's a **Whistleblower**.
Now, you have a choice.
1. Prune the bad data (delete the ~4,000 flagged rows).
2. Correct them (Human-in-the-loop review).
I'm going to prune them. I want a clean dataset.
`df_clean = df.drop(issue_indices)`
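Spelled out a little, with both paths from the list above. `issue_indices` is positional, so this assumes a default RangeIndex (or a `reset_index` first); the review-queue columns are placeholders:

```python
# Option 1: prune -- drop the flagged rows and keep a clean training set.
df_clean = df.drop(index=issue_indices).reset_index(drop=True)

# Option 2: correct -- route the worst offenders to human review instead
# of deleting them (column names here are illustrative).
review_queue = df.loc[issue_indices, ["image_path", "label"]]
review_queue.to_csv("relabel_queue.csv")
```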
Retraining...
It dropped. It was 99.2%. Now it's 97.5%.
Of course it dropped.
Before, you were graded by a drunk teacher. Now you are being graded by reality.
That 1.7% drop was the 'Hallucination Gap.'
You were memorizing the lies. Now you are learning the truth.