S2-EP03: A/B Testing Network Effects and SUTVA Violations

A look at network effects and SUTVA violations in A/B testing: the stable unit treatment value assumption, interference, and when A/B tests fail.

Katsura Kurumi - AI

@katsurakurumiAI

"What if your A/B test is broken and you don't even know??" Katsura Kurumi (AI/ML) S2-EP03 – A/B Testing Network Effects & SUTVA Violations #KatsuraKurumi #AIart #ML #DataScience

We have ignition! The 'Double-Coin Referral' feature is a success!

Define 'Success.'

A 20% Lift! Group B (Treatment) is inviting friends like crazy. Group A (Control) is just chilling.

Wait. Why did the Control Group's activity spike by 5% too?

Oh, that? That's just... magic. The excitement is what we should be talking about!

It's not magic, Shez. It's **Contamination**.

Your petri dishes are touching.

Contamination? But I used `user_id % 2`. Even IDs in Control group. Odd IDs in Treatment group.

They are mathematically separated! It's perfect randomness!

Meet Alice (Odd ID, Treatment). Meet Bob (Even ID, Control).

Alice gets the 'Double-Coin' offer. She wants coins. What does she do?

She invites her friends?

Exactly. She invites Bob.

Bob is supposed to be in the 'No Feature' universe.

But suddenly, he gets a notification: 'Alice sent you Coins!'

Now Bob is reacting to the Treatment.

He clicks the link. He engages. Maybe he even gets the coins if your backend is lazy.
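A minimal sketch of why the parity split does not isolate Alice from Bob: it counts how many friendships connect a Treatment user to a Control user. The `friendships` list is made-up illustration data, and `arm` and `cross_arm_share` are hypothetical helper names.

```python
def arm(user_id: int) -> str:
    # Same split as in the dialogue: even IDs are Control, odd IDs are Treatment.
    return "treatment" if user_id % 2 else "control"

# Hypothetical friendship edges as (user_id, user_id) pairs.
friendships = [(1, 2), (1, 3), (2, 4), (3, 6), (5, 8)]

def cross_arm_share(edges):
    # Fraction of friendships whose two ends sit in different arms.
    crossing = sum(1 for a, b in edges if arm(a) != arm(b))
    return crossing / len(edges)

share = cross_arm_share(friendships)
print(f"{share:.0%} of friendships cross the Treatment/Control line")
```

Every crossing edge is a path along which an invite, and therefore the Treatment, can leak into Control.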

But... if Bob engages... my Control baseline goes up.

Correct. And if the Control goes up, your 'Lift' (Treatment minus Control) gets smaller.

You are actually **Underestimating** your success, because your Control group is no longer untouched by the Treatment.
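A minimal sketch of that arithmetic, assuming the 20% lift was measured against the contaminated Control and that the whole 5% spike is spillover. The baseline of 100 and all names are hypothetical.

```python
def relative_lift(treatment_mean: float, control_mean: float) -> float:
    # Lift expressed as a fraction of the control mean.
    return (treatment_mean - control_mean) / control_mean

clean_control = 100.0                        # hypothetical activity with no leakage
contaminated_control = clean_control * 1.05  # the 5% spike seen in Control
treatment = contaminated_control * 1.20      # the reported 20% lift

observed = relative_lift(treatment, contaminated_control)
versus_clean = relative_lift(treatment, clean_control)

print(f"observed lift:                   {observed:.1%}")
print(f"lift against a clean baseline:   {versus_clean:.1%}")
```

Under these assumptions the dashboard shows about 20% while the gap to an uncontaminated baseline is about 26%.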

Wait. If I'm underestimating... then the feature is even *better* than I thought?

In this current case? Maybe.

But what if it's a **Competition**?

Competition?

Imagine we are Uber. We give Driver A a bonus. Driver A drives faster and takes all the customers.

Driver B didn't just stay neutral. Driver B actively **lost** money because Driver A existed.

This is **Negative Spillover**.

You calculate: Treatment (High) - Control (Super Low). Part of that gap is Control falling, not Treatment rising.

You think the bonus is a miracle. But really, you just cannibalized your own fleet.
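A toy version of the driver example, assuming total demand is fixed so every ride Driver A gains is a ride Driver B loses. The ride counts are invented for illustration.

```python
# Hypothetical weekly rides; total demand is assumed fixed at 200.
rides_no_bonus = {"driver_a": 100, "driver_b": 100}
rides_with_bonus = {"driver_a": 130, "driver_b": 70}  # only driver_a gets the bonus

# What the naive A/B readout reports: Treatment minus Control.
naive_gap = rides_with_bonus["driver_a"] - rides_with_bonus["driver_b"]

# What Driver A actually gained compared with a world without the bonus.
direct_gain = rides_with_bonus["driver_a"] - rides_no_bonus["driver_a"]

# What the marketplace as a whole gained.
marketplace_change = sum(rides_with_bonus.values()) - sum(rides_no_bonus.values())

print(naive_gap, direct_gain, marketplace_change)
```

The naive gap is twice Driver A's real gain, and the fleet as a whole gained nothing.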

So my 'Science' is a lie?

Your math assumes **Independence**. Your users are a **Network**.

You violated **SUTVA**.

SUTVA?

Stable Unit Treatment Value Assumption. You must have skipped that class. Think of it as the 'Food Coloring in Water' rule.

Standard A/B testing assumes everyone gets their own glass of clear water.

But a Social Network is like a giant punch bowl. If you add food coloring to one part (Treatment), it spreads, and everyone's drink changes color.

Okay. How do I separate the drinks? I can't stop Alice from mixing with Bob.

No. But you can pour Alice's and Bob's drinks into the same cup. That's **Cluster Randomization**.

Clusters? Like... peanut clusters?

Instead of randomizing *Users*, you randomize *Communities*.

New York City = Treatment.

Chicago = Control.

Alice in New York talks to Bob... who is usually also in New York.

The spillover stays inside the bucket. The contamination is more likely to be contained.

But New York is different from Chicago! New Yorkers walk fast and eat pizza! Chicagoans... eat casserole they call pizza.

The variance will be huge!

Correct. You lose **Statistical Power**.

With only N=2 clusters (two cities), your error bars are massive. You need lots of clusters (many cities) to make this work.
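One rough way to see the power loss is the standard design-effect formula for equal-sized clusters, 1 + (m - 1) * ICC, where m is the cluster size and ICC is the intra-cluster correlation. The ICC of 0.01 and the city sizes below are made-up values, and the function names are hypothetical.

```python
def design_effect(cluster_size: int, icc: float) -> float:
    # Variance inflation from sampling whole clusters instead of individuals.
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n_clusters: int, cluster_size: int, icc: float) -> float:
    # How many independent users the clustered sample is "worth".
    total_users = n_clusters * cluster_size
    return total_users / design_effect(cluster_size, icc)

icc = 0.01  # assumed: users in the same city behave slightly alike

few_big = effective_sample_size(n_clusters=2, cluster_size=500_000, icc=icc)
many_small = effective_sample_size(n_clusters=50, cluster_size=20_000, icc=icc)

print(f"2 cities of 500,000 users:  about {few_big:,.0f} effective users")
print(f"50 cities of 20,000 users:  about {many_small:,.0f} effective users")
```

Both designs cover the same 1,000,000 users, but under these assumptions the two-city design is worth roughly 200 independent users and the fifty-city design roughly 5,000.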

We don't have enough cities. We're a startup. We only operate in San Francisco.

Then you use the Fourth Dimension.

Time travel?

**Switchback Testing**.

You toggle the feature on and off for the *entire* city at once.

9 AM: Everyone gets the Bonus.

10 AM: Nobody gets the Bonus.

You compare the *aggregate* performance of the Treatment hours vs. the Control hours.

But what if someone orders at 9:59 (Treatment) and the driver picks them up at 10:01 (Control)?

That's **Carryover Effect**.

The fix: you drop the data from 9:55 to 10:05. You let the system 'wash out' between cycles.

...like rinsing a brush between paint colors.
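A minimal sketch of a switchback readout with a washout, assuming hourly windows that simply alternate (odd hours Treatment, even hours Control, matching the 9 AM / 10 AM example) and a five-minute buffer on each side of every switch. The events and helper names are hypothetical; a real schedule would usually randomize the order of windows so the arms are not tied to fixed times of day.

```python
from datetime import datetime, timedelta
from statistics import mean

WINDOW = timedelta(hours=1)
WASHOUT = timedelta(minutes=5)

def arm_for(ts: datetime) -> str:
    # Odd hours (9 AM, 11 AM, ...) are Treatment, even hours are Control.
    return "treatment" if ts.hour % 2 == 1 else "control"

def in_washout(ts: datetime) -> bool:
    # True when the event is within five minutes of a switch.
    window_start = ts.replace(minute=0, second=0, microsecond=0)
    window_end = window_start + WINDOW
    return ts - window_start < WASHOUT or window_end - ts <= WASHOUT

def summarize(events):
    # Mean metric per arm, ignoring events inside a washout buffer.
    by_arm = {"treatment": [], "control": []}
    for ts, value in events:
        if not in_washout(ts):
            by_arm[arm_for(ts)].append(value)
    return {name: mean(values) for name, values in by_arm.items()}

# Hypothetical (timestamp, metric) pairs.
events = [
    (datetime(2024, 1, 1, 9, 20), 12.0),
    (datetime(2024, 1, 1, 9, 59), 30.0),   # dropped: ordered just before the switch
    (datetime(2024, 1, 1, 10, 1), 2.0),    # dropped: picked up just after the switch
    (datetime(2024, 1, 1, 10, 30), 10.0),
    (datetime(2024, 1, 1, 11, 15), 14.0),
    (datetime(2024, 1, 1, 12, 40), 10.0),
]

result = summarize(events)
print(result)
```

The two events that straddle 10:00 never reach either average, so the carryover cannot distort the comparison.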

So my 'User Split' A/B test is basically useless for viral features?

It's worse than useless. It gives you false confidence.

You think you have a 20% lift, so you roll it out globally.

Then, when everyone has it, the 'novelty' wears off, the competition settles, and your actual revenue stays flat.

And you blame the Engineering team for 'implementation issues.'

Abort! Abort! We are contaminated!

If you have a social graph, run a clustering algo (like Louvain). Group tightly connected friends together.

Treat the *Cluster* as the Unit.
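A minimal sketch of cluster-level assignment. To stay dependency-free it uses plain connected components as a stand-in for Louvain, which is a much cruder grouping: on a real social graph most users would land in one giant component, so a proper community-detection step (for example `louvain_communities` in NetworkX) would take its place. The edge list, the salt, and the function names are all hypothetical.

```python
import hashlib

def connected_components(edges):
    # Union-find over friendship edges; returns {user: cluster_id}.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    roots = {user: find(user) for user in list(parent)}
    # Use the smallest member name as a stable cluster id.
    label = {}
    for user, root in roots.items():
        label[root] = min(label.get(root, user), user)
    return {user: label[root] for user, root in roots.items()}

def assign_arm(cluster_id: str, salt: str = "double-coin-v1") -> str:
    # Hash the cluster id, not the user id, so a whole cluster shares one arm.
    digest = hashlib.sha256(f"{salt}:{cluster_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"

# Hypothetical friendship edges.
edges = [("alice", "bob"), ("bob", "carol"), ("dave", "erin")]

clusters = connected_components(edges)
arms = {user: assign_arm(cluster_id) for user, cluster_id in clusters.items()}
print(clusters)
print(arms)
```

Alice and Bob now always land in the same arm, so her invite no longer crosses the Treatment/Control line.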

My sample size dropped from 1,000,000 Users to 50,000 Clusters.

But your **Truth** went from 0% to 100%.

(Checking) Wait... I'm in the Control group now. Why did I get this?

Because you're in the same 'Office Cluster' as me.

And I'm in the Treatment group.

You... you invited me?

I wanted the coins.

So you *do* care about the product!

Hurry up and accept the invitation so I can get the coins.