Season 1
EP01 - WhatsApp Architecture
How WhatsApp scaled to 450 million users with 32 engineers. Learn about Erlang, BEAM VM, preemptive scheduling, store-and-forward messaging, and why WhatsApp used FreeBSD instead of cloud infrastructure.
My chat app just crashed the dev server again.
I only simulated 10,000 users! I added three more AWS instances and it's *still* choking.
Amazing. Your Docker swarm's leaking more memory than my friend's OF subs.
It's a modern stack! Microservices, Docker, React...
It's bloat. You're assigning an entire OS thread to every user connection.
Thread stack size is 1MB. 10,000 users = 10GB of RAM just for *existing*.
How else do you do it? You need a thread to listen for messages.
In 2014, WhatsApp had 450 million active users. They had 32 engineers.
And they didn't use AWS auto-scaling groups.
They ran it on bare metal. FreeBSD. And they hit **2 million connections per server**.
2 million? One box? Impossible. The TCP port limit is 65k.
Well, the 65k limit is on ports *per IP address*, and it really only caps *outbound* connections. An inbound connection is identified by the full 4-tuple (client IP, client port, server IP, server port), so a single listening port can hold millions of connections. And if you ever do run short, you just bind multiple virtual IPs to the same **Network Interface Card** (NIC).
But the real magic wasn't the network. It was the **Virtual Machine** (VM).
This is your server. Every user gets a Giant.
The Giant needs memory, a passport (Context Switch), and a schedule.
This is **Erlang**. The language Ericsson built in the 1980s to run telephone switches.
An Erlang process isn't an OS thread. It's a virtual actor. It weighs about 300 machine words, roughly 2.4 KB on a 64-bit system.
You can fit 500 Erlang ants in the space of one Java Giant.
So while you're crashing at 10k users, they're just warming up at 5 million.
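The ant-vs-giant arithmetic, as a quick sketch (the stack and process sizes are the round numbers quoted above, not measurements):

```python
# Back-of-envelope: RAM needed just to keep one connection "alive".
THREAD_STACK = 1 * 1024 * 1024   # ~1 MB default OS thread stack
ERLANG_PROC = 300 * 8            # ~300 machine words * 8 bytes = 2400 bytes

users = 10_000
print(f"OS threads:       {users * THREAD_STACK / 2**30:.1f} GiB")
print(f"Erlang processes: {users * ERLANG_PROC / 2**20:.1f} MiB")
# Ten thousand threads cost gigabytes; ten thousand Erlang
# processes cost a couple dozen megabytes.
```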
But... if one ant crashes? Doesn't the whole hill die?
That's the **'Let It Crash'** philosophy.
In C++, a thread that corrupts memory can take the whole process down with it. Segfault. In Java, every thread shares the same heap, so one misbehaving thread can poison state for all of them.
Erlang processes share *nothing*. No shared memory. They only talk by mailing letters (Message Passing).
If an ant dies, it dies alone. The Supervisor notices, sweeps up the body, and spawns a fresh clone. 1 microsecond. Nobody notices.
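A toy version of that supervisor loop, written in Python rather than Erlang (the function names and the `"<restarted>"` marker are illustrative, not OTP's API):

```python
import queue

def worker(msg):
    """One lightweight 'process': shares no state, crashes alone."""
    if msg == "boom":
        raise RuntimeError("worker died")
    return msg.upper()

def supervisor(mailbox):
    """If a worker crashes, replace it; other messages are unaffected."""
    results = []
    while not mailbox.empty():
        msg = mailbox.get()
        try:
            results.append(worker(msg))
        except RuntimeError:
            # Sweep up the body, spawn a fresh clone, keep going.
            results.append("<restarted>")
    return results

box = queue.Queue()
for m in ["hi", "boom", "bye"]:
    box.put(m)
print(supervisor(box))  # ['HI', '<restarted>', 'BYE']
```

The crash is contained to one message; the mailbox and every other worker carry on.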
Okay, fine. It's lightweight. But how does it handle CPU? If one user runs a heavy calculation, doesn't it block the core?
Ah. The **Scheduler**.
In Node.js, if you write `while(true)`, the server hangs. It's 'Cooperative Multitasking'. The code has to *choose* to yield.
Developers are stupid. They forget to yield. So one heavy request freezes the chat for everyone.
The BEAM (Erlang VM) is a dictator. It uses **Preemptive Scheduling**.
Every process gets 2,000 'Reductions'—basically function calls.
Hit 2,000? **BAM.** You're paused. Next process. I don't care if you're calculating Pi, you wait your turn.
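A toy model of reduction-counting preemption, sketched with Python generators standing in for processes (BEAM's real bookkeeping differs in detail, but the forced yield is the point):

```python
from collections import deque

BUDGET = 2_000  # "reductions" (roughly, function calls) per time slice

def run(processes):
    """processes: list of (name, generator). Round-robin with forced yields."""
    finished = []
    ready = deque(processes)
    while ready:
        name, proc = ready.popleft()
        try:
            for _ in range(BUDGET):
                next(proc)            # one "reduction"
        except StopIteration:
            finished.append(name)     # process completed its work
            continue
        ready.append((name, proc))    # budget spent: back of the queue
    return finished

def work(n):
    for _ in range(n):
        yield

# A heavy process can't starve a light one: the texter finishes first.
print(run([("video_upload", work(10_000)), ("texter", work(100))]))
# ['texter', 'video_upload']
```

No process gets to *choose* whether it yields; the scheduler cuts it off at the budget.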
So... perfect fairness?
Exactly. That's why you can upload a 4K video while 50,000 other people send text messages on the same core. The video upload doesn't starve the texts.
Okay. So we have millions of ants. But where do they store the messages? The database must be huge.
What database?
The... message database? The one with my chat history?
Sheesh. WhatsApp is a **Router**, not a **Vault**.
Server receives message -> Is Recipient Online? -> Yes -> Deliver -> **DELETE FROM SERVER**.
They delete it?
Why not? It's on your phone. It's on their phone.
The server only holds messages if the recipient is offline. The moment they connect... *whoosh*. Gone.
This keeps the table sizes small. It keeps the RAM free. That's how they ran 450 million users with 32 engineers.
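The routing rule above, sketched as a minimal Python model (illustrative only, not WhatsApp's actual code):

```python
from collections import defaultdict

class Router:
    """Store-and-forward: deliver if online, hold briefly if not, then delete."""

    def __init__(self):
        self.online = {}                  # user -> inbox (connected device)
        self.pending = defaultdict(list)  # offline queue, transient by design

    def connect(self, user):
        self.online[user] = []
        # Flush anything stored while offline, then forget it server-side.
        for msg in self.pending.pop(user, []):
            self.online[user].append(msg)

    def disconnect(self, user):
        self.online.pop(user, None)

    def send(self, to, msg):
        if to in self.online:
            self.online[to].append(msg)   # delivered: nothing kept on server
        else:
            self.pending[to].append(msg)  # held only until the next connect

r = Router()
r.send("bob", "hey")                     # bob offline: queued
r.connect("bob")                         # queue flushed and deleted
print(r.online["bob"], dict(r.pending))  # ['hey'] {}
```

The server's state shrinks back to nothing as soon as delivery succeeds, which is what keeps the tables small.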
But wait. I've seen 'Restoring Media' when I switch phones. The data has to be somewhere.
Ah, that.
That's stored in a blob bucket (like S3/Google Drive) linked to *your* account backup. It's not in the hot path of the message router.
The router is pure speed. **Mnesia**.
Mnesia?
Erlang's built-in database. It runs *inside* the same memory space as the app.
No SQL query over the network. It's just a memory lookup.
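A sketch of why that matters, with order-of-magnitude latency guesses (the numbers here are assumptions for illustration, not benchmarks):

```python
# An in-process, Mnesia-style lookup is just a read in the same
# address space; a database query is a network round trip.
LOCAL_LOOKUP_NS = 100        # rough cost of an in-memory table read
NETWORK_QUERY_NS = 500_000   # rough same-datacenter DB round trip

sessions = {"alice": "<0.42.0>"}  # user -> process id, in RAM with the app
pid = sessions["alice"]           # no SQL, no socket: a dict lookup

speedup = NETWORK_QUERY_NS // LOCAL_LOOKUP_NS
print(pid, f"~{speedup}x faster than a network query")
```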
But here is where it gets crazy. The **Distribution Layer**.
Erlang nodes talk to each other natively. You don't need a load balancer. You don't need Redis Pub/Sub.
Node A just says `Send to Process ID <0.42.0> on Node B`. And it happens.
So... no Kubernetes? No Kafka? No API Gateway?
Just code that understands the network.
Rick Reed and the team spent years tuning **FreeBSD** to make this work.
Default TCP listen queue is 128 connections. They bumped it to thousands.
Every socket is a file. The OS limits you to a few thousand open files. They cranked it to millions.
They had to patch the OS kernel to stop it from panicking under the load.
Why doesn't everyone do this? Why do we use all this heavy cloud stuff?
Because Erlang is hard. The syntax looks like Prolog. It has no loops, only recursion. Variables are immutable.
It scares people. It's easier to buy 500 servers from Amazon than to hire 5 engineers who understand how a CPU actually works.
So we trade efficiency for... comfort?
We trade efficiency for *mediocrity*.
WhatsApp wasn't built by average developers. It was built by people who cared about every byte.
Now fix your recursive loop. You're blocking the main thread.