Season 1
EP07 - THE LINEAR DELUSION (REGRESSION & AUTOCORRELATION)
Understanding linear regression and autocorrelation. Learn about time series, autocorrelation, and why linear models fail on temporal data.
And as you can see, based on my machine learning model, our Q4 sales will exceed the GDP of China!
(Spinning in the chair) And we will soon sell more units than the Earth's population by November?
It's math, Kurumi! **Linear Regression**. y = mx + b.
The R^2 is 0.98!
You fitted a straight line to a cumulative time series.
Yeah! The trend is up!
You didn't find a pattern. You found an angle. You assumed the world is a ramp.
This is reality. It fluctuates. It has cycles.
Sure, but the *general* trend is up.
Your Linear Regression sees this. It minimizes the squared error.
Even if the market crashes today, your line will dutifully predict a record high and create a porcupine chart!
But... I used 'TimeSeriesSplit'! I trained on the past to predict the future!
You validated correctly, but your **Model Assumption** is trash.
OLS Regression assumes the errors are independent.
So?
Independent means: The result of this coin flip does not care about the last coin flip.
That's the **Gambler's Fallacy**! I've heard of it!
But Sales are not coin flips. Sales are dominoes.
If Sales were high yesterday, they are likely to be high today. That is **Autocorrelation**. **Memory** in your data.
Your model assumes today is independent of yesterday. It assumes the deviation from the line is random noise.
Look at your residuals (the distance between the dots and the line).
They... look like waves.
Random noise looks like dust. Your residuals look like a snake.
That means your model left all the interesting information (the cycle) in the trash can.
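A minimal sketch of that diagnosis, on hypothetical data: fit a line to a trending, cyclical series, then test the residuals with the Durbin-Watson statistic and the lag-1 autocorrelation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from statsmodels.stats.stattools import durbin_watson

# Hypothetical sales: a trend plus a slow cycle plus noise.
rng = np.random.default_rng(0)
t = np.arange(200)
sales = 100 + 2 * t + 20 * np.sin(t / 10) + rng.normal(0, 5, 200)

X = t.reshape(-1, 1)
model = LinearRegression().fit(X, sales)
residuals = sales - model.predict(X)

# Independent errors give a Durbin-Watson statistic near 2.
# Values well below 2 mean positive autocorrelation: a snake, not dust.
print(durbin_watson(residuals))

# Lag-1 autocorrelation of the residuals: near 0 for dust, near 1 for a snake.
print(np.corrcoef(residuals[:-1], residuals[1:])[0, 1])
```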
But the R^2 is high! That proves correlation!
Look. Eating cheese causes bedsheet strangulation.
...really?
NO, YOU FOOL.
This is **Spurious Regression**.
Both variables are increasing over time. Anything that goes up will correlate with anything else that goes up.
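A sketch of the trap with hypothetical data: regress one drifting random walk on a completely unrelated one and watch R^2 soar.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Two independent random walks with upward drift: they share nothing but "up".
sales = np.cumsum(rng.normal(1.0, 5.0, 500))
cheese = np.cumsum(rng.normal(1.0, 5.0, 500)).reshape(-1, 1)

model = LinearRegression().fit(cheese, sales)
# R^2 is typically very high here, despite zero causal link.
print(model.score(cheese, sales))
```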
So... my sales prediction is just...
A coincidence of trends. It has no predictive power for turning points.
If the trend changes, your model will be the last to know. It will keep pointing at the sky even if the company burns.
Okay. So I can't use Linear Regression on raw sales data.
Not on raw data. You have to kill the trend first.
A Stationary time series has a constant mean and variance. It doesn't wander off to infinity.
Think of a non-stationary series as a balloon drifting in the wind, and a stationary one as a dog tied to a stake. You can't predict where the balloon will be in an hour (it could be in space). But you *can* predict where the dog will be (around the stake).
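One common way to ask "balloon or dog?" is the Augmented Dickey-Fuller test; a sketch on hypothetical data:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
sales = np.cumsum(rng.normal(1.0, 5.0, 500))  # trending: a balloon

# ADF null hypothesis: the series has a unit root (a balloon).
# A small p-value rejects it: the series is stationary (a dog on a stake).
print(f"raw sales, p = {adfuller(sales)[1]:.3f}")             # large p: balloon
print(f"differenced, p = {adfuller(np.diff(sales))[1]:.3f}")  # small p: dog
```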
But I want to predict the balloon! I want Sales to go up!
Then don't predict the *Position* (y_t). Predict the **Step** (y_t - y_{t-1}).
Instead of predicting "Sales will be $1,000,000," predict "Sales will change by +$500."
If I predict the change... the trend disappears?
If sales grow by $10 every day, the *change* is a constant line at 10. That is stationary. Linear Regression can handle that.
'df['diff'] = df['sales'].diff()'.
'model.fit(X[1:], df['diff'].dropna())'.
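Pieced together as a runnable sketch (the DataFrame is hypothetical; note that '.diff()' leaves a NaN in the first row, which has to be dropped before fitting):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical daily sales growing ~$10/day, with noise.
rng = np.random.default_rng(1)
df = pd.DataFrame({"sales": np.cumsum(rng.normal(10, 3, 365))})

# Predict the step, not the position.
df["diff"] = df["sales"].diff()
X = np.arange(len(df)).reshape(-1, 1)

model = LinearRegression()
model.fit(X[1:], df["diff"].dropna())  # drop the NaN the first diff creates

# To recover a level forecast, add the predicted steps onto the last value.
steps = model.predict(np.arange(len(df), len(df) + 30).reshape(-1, 1))
forecast = df["sales"].iloc[-1] + np.cumsum(steps)
```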
This looks... messy. It's not a straight line anymore.
It's the truth. It shows the volatility.
Now, if you want to be a professional, you use a professional model built for this.
**ARIMA**: **A**uto**R**egressive **I**ntegrated **M**oving **A**verage.
Sounds like an anime protagonist name.
**AR (AutoRegressive):** Uses past values to predict the future (y_{t-1} predicts y_t). "If it rained yesterday, it might rain today."
**I (Integrated):** The Differencing we just talked about. Handles the trend.
**MA (Moving Average):** Uses past *errors* to correct the forecast. "I was wrong by 5 units yesterday, let me adjust."
So it learns from its own mistakes?
Yes. It has a feedback loop. Linear Regression is stubborn; it never looks back at its residuals.
'AutoARIMA' is finding the parameters... '(p=1, d=1, q=1)'.
It... it stops going up forever. It flattens out.
That is a conservative, realistic forecast. It says "The trend might continue, or it might stabilize."
And this grey cone?
That's the Confidence Interval. It grows wider the further you look. It admits "I don't know."
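A sketch of that cone with statsmodels' ARIMA, using the (1, 1, 1) order the AutoARIMA search settled on (data hypothetical):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(2)
sales = pd.Series(np.cumsum(rng.normal(10, 3, 365)))

# AR(1): past values. I(1): one round of differencing. MA(1): past errors.
model = ARIMA(sales, order=(1, 1, 1)).fit()

# Forecast 30 steps ahead with the 95% interval: the grey cone.
forecast = model.get_forecast(steps=30)
print(forecast.predicted_mean.tail())
print(forecast.conf_int().tail())  # wider the further out you look
```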
Time series is a deep rabbit hole. Wait till we get to cross-validation, train-test splits, and regime detection.
Next time, please...
EP06 - Naive Bayes, Independent Assumptions on Correlated Features
Understanding Naive Bayes and independent assumptions on correlated features. Learn about conditional independence, feature correlation, and when Naive Bayes fails.
EP08 - THE TYRANNY OF THE MAJORITY (IMBALANCED RANDOM FORESTS)
Understanding imbalanced datasets and Random Forests. Learn about class imbalance, SMOTE, and why majority class bias affects model performance.