EP08 - Quant Science

How quants help discover drugs. Learn about QSAR, molecular fingerprints, virtual screening, cheminformatics, and how quantitative methods accelerate pharmaceutical research.

This is **Science**. Real Science.

Quants deal with money. Abstract numbers. They are just degen gamblers!

...this again?

Here, we deal with physical reality. You can't "Arbitrage" a virus. You can't "Short Sell" a protein.

You know there are 10,000 molecules in that tray you're holding, right?

I know! It's going to take me all week! But this is Drug Discovery. We have to test every chemical to see if it binds to the protein. It's **brute force science**!

Ah, the **Search Space Problem**. If only quants had already figured out how to trade 10,000 different stocks in a day. Let's talk about QSAR.

Q-SAR? Quantitative Structure-Activity Relationship. It's just a regression, right? $Y = mX + b$?

Fundamentally, yes.

$X$ is the Molecule. $Y$ is the Potency (e.g., $IC_{50}$: the drug concentration needed to inhibit 50% of the target's activity).

There are an estimated $10^{60}$ possible drug-like molecules. Making even one copy of each would take more atoms than the Earth contains, let alone a lab big enough to test them all.

QSAR is a filter. It creates a "Shortlist."

I can test 1 billion molecules inside the GPU in an hour.

The model predicts the "Profit" (Biological Activity) for each molecule. We throw away the losers and physically test only the top 10 winners.
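The "score everything, keep the top 10" step can be sketched in a few lines. This is a minimal illustration, not a real QSAR model: the library, the feature vectors, and the names `WEIGHTS`, `BIAS`, and `predicted_activity` are all made up, and a real model would be fitted to assay data.

```python
import heapq
import random

def shortlist(molecules, score, k=10):
    """Return the k highest-scoring molecules without sorting the whole library."""
    return heapq.nlargest(k, molecules, key=score)

# Hypothetical stand-in library: each "molecule" is an ID plus a made-up feature vector.
random.seed(0)
library = [(f"mol_{i}", [random.random() for _ in range(4)]) for i in range(10_000)]

# Hypothetical linear QSAR weights (Y = w . X + b), chosen by hand for illustration.
WEIGHTS = [0.8, -0.3, 0.5, 0.1]
BIAS = 0.2

def predicted_activity(mol):
    """Predicted 'profit' (biological activity) for one molecule."""
    _, features = mol
    return sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS

# Throw away the losers: only these 10 would go to the physical lab.
top10 = shortlist(library, predicted_activity, k=10)
```

`heapq.nlargest` keeps only `k` candidates in memory at a time, which is the property that matters when the library is far larger than the shortlist.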

Fine, a quant model can screen for molecules. But how does it predict its "Profit"? A molecule is a **3D Shape**. It has geometry. Bonds. Angles!

A computer model takes **Vectors**. Lists of numbers. How do you turn a drawing into a spreadsheet without losing the soul of the molecule?

We **Vectorize** it.

We create a **Molecular Fingerprint**.

Imagine standing on this Nitrogen atom. Look at your neighbors.

Radius 0: "I am Nitrogen."

Radius 1: "I am attached to Carbon and Carbon."

Radius 2: "I see a Double Bond Oxygen."

We take these fragments—these local environments—and we **Hash** them.

I-It now looks like machine code?

It is an **ECFP (Extended Connectivity Fingerprint)**.

So... what does "Column 42" mean?

Column 42 might correspond to "A Benzene Ring attached to a Nitrogen."

If the molecule has that feature, the bit is **1**. If not, it is **0**.

W-wait... It's a **Bag of Words**. Like the NLP quants use to screen earnings transcripts!

Exactly. "Benzene" is a word. The molecule is a sentence.

We turned chemistry into a language problem.
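The "stand on an atom, look at your neighbors, hash what you see" idea can be sketched in plain Python. This is a toy version and not the real ECFP algorithm: it ignores bond orders, hydrogens, and duplicate-environment removal, and the function name `toy_fingerprint` and the 64-bit length are assumptions made for the example. In practice a cheminformatics library such as RDKit would generate the fingerprint.

```python
import hashlib

def toy_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """Hash each atom's neighborhood at radius 0..radius into a fixed-length bit list.

    atoms: dict of atom index -> element symbol
    bonds: list of (i, j) index pairs (bond order is ignored in this toy version)
    """
    neighbors = {i: [] for i in atoms}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Radius 0: each atom only knows its own element ("I am Nitrogen").
    labels = dict(atoms)
    bits = [0] * n_bits
    for r in range(radius + 1):
        # Hash every current environment label to a column and switch that bit on.
        for label in labels.values():
            digest = hashlib.sha256(f"{r}:{label}".encode()).digest()
            bits[int.from_bytes(digest[:4], "big") % n_bits] = 1
        # Grow by one bond: new label = own label plus the sorted neighbor labels.
        labels = {
            i: labels[i] + "(" + ",".join(sorted(labels[j] for j in neighbors[i])) + ")"
            for i in atoms
        }
    return bits

# Acetamide heavy atoms, CH3-C(=O)-NH2, written by hand as a graph.
atoms = {0: "C", 1: "C", 2: "O", 3: "N"}
bonds = [(0, 1), (1, 2), (1, 3)]
fp = toy_fingerprint(atoms, bonds)
```

Because neighbor labels are sorted before hashing, renumbering the atoms gives the same bits. Two different fragments can still collide on one column, which is why "Column 42" only *might* mean one specific substructure.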

Okay, so that handles the *structure*. We have **binary features**. Yes/No.

But stocks have continuous features. "P/E Ratio" isn't a Yes/No. "Volatility" is a float.

How would a quant's math for continuous variables work for biology?

Of course. The fingerprint is just the *shape*.

We also append **Physicochemical Descriptors** like "LogP" (Lipophilicity), "PSA" (Polar Surface Area), and "Molecular Weight". These are your floats.

**LogP**. It measures **Lipophilicity**. How much does the drug love fat vs. water?

High LogP = Loves Oil. Low LogP = Loves Water.

Why does that matter?

To get into a cell, the drug must cross the membrane (Oil).

If LogP is too low (loves water), it bounces off. It stays in the blood.

If LogP is too high (loves oil), it gets stuck in the fat and never leaves.

Next is **PSA** – Polar Surface Area. It measures the "water-loving" regions on the molecule: the parts that can form hydrogen bonds.

So, if LogP is about oil vs water, PSA is about… interaction with water?

Yes, PSA tells us if the molecule can slip through membranes or gets blocked because it loves water too much.

And then: **Molecular Weight**. Too small, and it might get flushed out before it works. Too large, and it might never get in.

So for a drug to work, all three dials have to be set "just right"?

Exactly. Too much or too little of any one—and the molecule fails as a medicine.
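The "three dials" check can be written as a simple range filter. The windows below are illustrative assumptions, not values from this episode: the upper bounds echo widely quoted rules of thumb (LogP at most 5, molecular weight at most 500 Da, PSA at most 140 square angstroms), and the lower bounds are invented for the sketch.

```python
from dataclasses import dataclass

@dataclass
class Descriptors:
    logp: float        # lipophilicity (octanol/water partition, log scale)
    psa: float         # polar surface area, in square angstroms
    mol_weight: float  # molecular weight, in daltons

# Illustrative windows only; lower bounds are assumptions made for this example.
WINDOWS = {
    "logp": (0.0, 5.0),
    "psa": (20.0, 140.0),
    "mol_weight": (150.0, 500.0),
}

def out_of_range(d: Descriptors) -> list[str]:
    """Return the names of the dials that fall outside their window."""
    return [name for name, (lo, hi) in WINDOWS.items()
            if not lo <= getattr(d, name) <= hi]

# Approximate values for a small aspirin-like molecule: every dial is in range.
small_drug = out_of_range(Descriptors(logp=1.2, psa=63.6, mol_weight=180.2))

# A made-up greasy, heavy molecule: two dials are off.
greasy = out_of_range(Descriptors(logp=7.5, psa=30.0, mol_weight=650.0))
```

Returning the list of failing dials, rather than a single pass/fail flag, shows which property a chemist would need to change.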

Because LogP, PSA, and Molecular Weight are continuous, we can use algorithms like **GMM** (Gaussian Mixture Models) or **UMAP** (Uniform Manifold Approximation and Projection).

To do what?

To find the **Safe Space**.

We cluster known drugs. We see a dense cloud in the middle.

This cloud is "Druglike." If your new molecule falls outside this cloud (e.g., LogP > 7), the model flags it as "High Risk."
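A rough sketch of the "inside or outside the cloud" test follows. To stay dependency-free it fits a single Gaussian with independent descriptors instead of a full GMM, so it is a simplification of the technique named above. The descriptor rows, the function names, and the threshold of 3 standard units are all assumptions.

```python
import math
import statistics

def fit_cloud(known_drugs):
    """Fit one independent Gaussian per descriptor: a list of (mean, stdev) pairs."""
    columns = list(zip(*known_drugs))
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]

def distance_from_cloud(cloud, molecule):
    """Standardized distance from the cloud center (diagonal-covariance Mahalanobis)."""
    return math.sqrt(sum(((x - mu) / sd) ** 2 for x, (mu, sd) in zip(molecule, cloud)))

def is_high_risk(cloud, molecule, threshold=3.0):
    """Flag molecules that sit far outside the cloud of known drugs."""
    return distance_from_cloud(cloud, molecule) > threshold

# Made-up (LogP, PSA, Molecular Weight) rows standing in for known drugs.
known_drugs = [
    (1.2, 64.0, 180.0), (2.5, 75.0, 310.0), (3.1, 50.0, 350.0),
    (0.9, 95.0, 270.0), (2.0, 60.0, 230.0), (3.8, 40.0, 410.0),
]
cloud = fit_cloud(known_drugs)

outlier_flag = is_high_risk(cloud, (7.5, 30.0, 650.0))   # LogP > 7: far from the herd
typical_flag = is_high_risk(cloud, (2.2, 62.0, 300.0))   # near the center of the cloud
```

A real GMM would fit several such clouds at once and score a molecule by its likelihood under the mixture, which handles drug classes that cluster in different regions.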

W-Wait, that's the same as how quants group similar stocks into risk classes!

Yup. Stocks have Volatility, Liquidity, and Beta. Molecules have LogP, Molecular Weight, and Activity. They can fit into **Optimization Curves** and be grouped into classes using **GMM** and dimensionally reduced by **UMAP**.

But that's just statistics. You stay in the herd to be safe.

In chemistry, two things might look exactly the same, but one kills you.

Seems like you've done plenty of homework with chemistry. What you just described is the **Activity Cliff** problem.

Take these two molecules: Propylene glycol (CH₃CH(OH)CH₂OH) and Ethylene glycol ((CH₂OH)₂).

Propylene glycol. One extra methyl group. Safe enough to eat.

Ethylene glycol. Sweet. Colorless. Lethal.

Both molecules have a **Tanimoto similarity** > 0.9. Most fingerprints call them neighbors.

But the liver doesn't see fingerprints. It sees chemistry.
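For reference, Tanimoto similarity is the number of shared on-bits divided by the number of bits set in either fingerprint. A minimal sketch, using made-up bit positions and not the real fingerprints of the two glycols:

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)

# Made-up on-bit positions for two nearly identical fingerprints.
fp_a = {3, 17, 42, 58, 101, 233, 377, 512, 640, 901}
fp_b = {3, 17, 42, 58, 101, 233, 377, 512, 640, 901, 955}

# 10 shared bits out of 11 distinct bits overall.
similarity = tanimoto(fp_a, fp_b)
```

The score only counts which columns overlap. It has no notion of which single differing bit might matter biologically, and that gap is where the cliff hides.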

A Linear Model ($Y = wX$) will assume smoothness. "Small change in X = Small change in Y."

It will predict Molecule B is "Mostly Safe."

Then the model kills the patient!

This is why we use **Non-Linear Models**. Random Forests. Deep Neural Networks.

A Tree can learn a hard cut.

`IF (Similarity > 0.9) AND (Has_Methyl_Group == False) THEN TOXIC.`

It can memorize the exception and map the Cliff.
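A one-level tree (a "decision stump") is enough to show how a hard cut gets learned from data. The sketch below uses invented binary features and labels that mirror the glycol example, where the molecule without the extra methyl group is the toxic one. It is a toy learner, not a Random Forest.

```python
def best_stump(rows, labels):
    """Find the single binary feature (and polarity) that best separates the labels.

    rows: list of dicts mapping feature name -> bool
    labels: list of bools, True meaning toxic
    Returns (feature, value_that_predicts_toxic, training_accuracy).
    """
    best = None
    for feature in rows[0]:
        for toxic_when in (True, False):
            # Predict toxic whenever the feature equals toxic_when; count correct calls.
            correct = sum((row[feature] == toxic_when) == label
                          for row, label in zip(rows, labels))
            accuracy = correct / len(rows)
            if best is None or accuracy > best[2]:
                best = (feature, toxic_when, accuracy)
    return best

# Toy training set; features and labels are invented for illustration.
rows = [
    {"is_diol": True, "has_methyl": True,  "has_ring": False},  # propylene-glycol-like
    {"is_diol": True, "has_methyl": False, "has_ring": False},  # ethylene-glycol-like
    {"is_diol": True, "has_methyl": True,  "has_ring": True},
    {"is_diol": True, "has_methyl": False, "has_ring": True},
]
labels = [False, True, False, True]  # True = toxic

feature, toxic_when, accuracy = best_stump(rows, labels)
```

On this toy data the stump picks `has_methyl` with `toxic_when = False`: the shared `is_diol` feature that makes the molecules look alike carries no signal, while the one differing bit separates them perfectly.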

Quants use the same Non-Linear Models to capture financial events like circuit breakers, stop losses, and margin calls.

In both industries, before we even make any profit for ourselves, we must first try to not kill our customers.

Okay. So we need complex models for complex physics.

But how do we actually *run* this? What does a "Trading Day" look like for a biologist?

In finance, we have a universe of 3,000 stocks. In pharma, we have a universe of 1 billion molecules.

In both industries, we reduce the universe down to a manageable size.

Then, we simulate our truncated universe to find the best strategy or candidates.

What is Molecular Docking? Is it like a boat?

It's a **Physics Engine**.

We take the 3D model of the drug and try to jam it into the 3D model of the protein.

The simulation calculates the **Binding Energy** ($\Delta G$).

It rotates the key, twists it, vibrates it. "Does it fit? Is it stable?"

This looks computationally expensive.

It is. It takes minutes per molecule. That's why we only Dock the survivors of the QSAR filter.
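A back-of-the-envelope calculation shows why docking only the survivors matters. The figures are assumptions: one minute per molecule on a single worker, a 1-billion-molecule library, and a hypothetical QSAR shortlist of 10,000.

```python
def docking_days(n_molecules, minutes_per_molecule=1.0, n_workers=1):
    """Wall-clock days to dock n_molecules, assuming perfect parallel scaling."""
    total_minutes = n_molecules * minutes_per_molecule / n_workers
    return total_minutes / (60 * 24)

full_library = docking_days(1_000_000_000)  # dock everything: about 694,000 days
survivors = docking_days(10_000)            # dock only the shortlist: about 7 days

# Expressed in years, the brute-force run is roughly 1,900 years on one worker.
full_library_years = full_library / 365.25
```

Even a thousand perfectly parallel workers would still need close to two years for the full library, so the cheap filter has to run first.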

In both industries, we execute the best strategy or candidates.

So... each of these tubes is like a trading strategy to validate?

Yes. The model gives you the Top 10 candidates. Now you test them in the real world.

If the model is good... one of these 10 should work?

Even if the Hit Rate is only 10%... that's 1 success in 10 tries. Compare that to your random guessing: 1 success in 10,000 tries.

At that rate, even if the disease didn't kill your patient, old age would have.

You increased my leverage by 1,000x.
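The 1,000x figure is the ratio of the two hit rates. A tiny sketch using exact fractions to avoid floating-point noise (the function name `enrichment` is just a label for this example):

```python
from fractions import Fraction

def enrichment(model_hits, model_tests, random_hits, random_tests):
    """Ratio of the model's hit rate to the baseline hit rate, as an exact fraction."""
    return Fraction(model_hits, model_tests) / Fraction(random_hits, random_tests)

# Model: 1 success in 10 tries. Random guessing: 1 success in 10,000 tries.
factor = enrichment(1, 10, 1, 10_000)
```

The same ratio applies to lab effort: for each hit, the screened shortlist costs 10 experiments where random testing would cost 10,000.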

I just improved your Sharpe Ratio: High Return (Effective Drug) / Low Risk (Wasted Lab Time).

`import rdkit`. `import deepchem`.

I'm downloading the ZINC database.

I could use this quant idea to run a virtual High-Throughput Screen! I can test 100,000 molecules in a day!

Why do you think Renaissance Technologies (the best hedge fund) hires physicists and biologists instead of FinBros?

Because they know science?

Because filtering a stock trend from market noise is mathematically identical to finding a drug candidate in a chemical library.

Wait... if I discover a drug using code... can I patent it?

That's a legal question. But you can definitely trade the biotech stock before you publish the paper.

HAHAHA now THAT is a Quant move. I could get rich before my drug even gets approved.

Sharpe catch. That is Insider Trading. See you in court.

Well, that's a tail risk event I must hedge.