TLDR:
The deeper you look at accelerated computing, the more a simple truth emerges -- silicon alone is not the moat. The real lock on the market is how those chips talk to each other.
NVLink, NVIDIA's proprietary high-bandwidth interconnect, is the glue that turns otherwise discrete GPUs into one giant, coherent accelerator.
In the process, NVIDIA gains control over the most lucrative workloads in Artificial Intelligence (AI), High-performance computing (HPC), and visualization.
I try to distill a complex topic as best I can. We start by exploring the bandwidth bottlenecks that plague PCIe-based systems and how NVLink's point-to-point links and non-blocking NVSwitch fabric rewrite the rules for multi-GPU scaling.1
From there we dive into NVLink Chip-to-Chip (C2C), the Grace Hopper coherence layer that lets GPUs sip from CPU memory while burning roughly 25× less energy per bit than a PCIe hop.2
Alongside the architectural tour, we track the real-world consequences -- faster LLM training runs, shared VRAM pools for cinematic renders, and scientific simulations that no longer crawl.3
Table of Contents
| Section | Key Topics Discussed |
|---|---|
| TLDR | NVLink as NVIDIA's real competitive moat; interconnect, not just silicon or CUDA. |
| Intro: The Bottleneck | Importance of high-speed interconnects; CPU/GPU communication dynamics; limitations of PCIe; demand for bandwidth and low latency. |
| What Is NVLink? | Overview of NVLink architecture; proprietary point-to-point technology; comparison with PCIe; basic hardware and topology principles. |
| Mesh Topologies and Switch Fabric | Switchless mesh connections; role of NVSwitch; NVLink 5.0 switch bandwidth; enabling GPU scale-up and multi-GPU systems. |
| NVLink C2C & Unified Memory | Grace Hopper NVLink Chip-to-Chip (C2C); CPU–GPU coherence; unified and expanded memory access; efficiency and power savings. |
| Real-World Impact | Effects on collective operations; accelerated AI training; dynamic GPU resource pooling; software adaptation; monopoly dynamics. |
| Interconnect Comparison | NVLink vs. PCIe, CXL, and AMD Infinity Fabric; evaluation of bandwidth, latency, openness, and vendor lock-in. |
| Evolution and Roadmap | Progression from NVLink 1.0 to 5.0; Fusion technology and licensing; positioning NVLink as an industry standard. |
| Conclusion | Interconnect speed as the core advantage; NVLink's dominance at the system level; outlook for heterogeneous computing. |
Key Glossary: Interconnects and Related Jargon
| Term | What It Means | Why It Matters in AI/HPC |
|---|---|---|
| NVLink | NVIDIA’s proprietary high-speed point-to-point interconnect for GPUs and CPUs. | Enables ultra-fast, direct communication and memory sharing between chips, powering large-scale AI/ML workloads. |
| PCIe (PCI Express) | Industry standard communication bus for connecting peripherals (like GPUs, SSDs) to CPUs. | Ubiquitous and open, but has lower bandwidth and higher latency vs. NVLink. Bottleneck in multi-GPU or large-model setups. |
| CXL (Compute Express Link) | Open interconnect standard designed for low-latency memory sharing between CPUs, GPUs, FPGAs, and accelerators. | Emerging competitor for next-gen, coherent heterogeneous computing. Aims to unify memory across devices from different vendors. |
| Infinity Fabric | AMD’s high-speed interconnect technology for connecting CPUs and GPUs. | Enables multi-GPU and CPU-GPU collaboration; alternative to NVLink in AMD systems. |
| Bandwidth | Amount of data that can be transferred per second (e.g., GB/s, TB/s). | Limits the speed of data movement; higher is better for large-scale AI/ML. |
| Latency | Time taken for data to travel from source to destination. | Lower latency means faster communication, critical for distributed training. |
| Switch Fabric / NVSwitch | Specialized hardware enabling many chips to connect and communicate simultaneously at full speed. | Removes bottlenecks when scaling beyond small mesh topologies. |
| Chip-to-Chip (C2C) | Direct connection between CPU and GPU without routing over PCIe. | Grace Hopper NVLink C2C enables unified, coherent memory sharing between CPU and GPU. |
| TCO (Total Cost of Ownership) | The comprehensive sum of buying, deploying, maintaining, and powering technology over its usable life. | Lower TCO is critical in data centers; specialized interconnects may raise or lower it depending on efficiency and vendor lock-in. |
| SerDes (Serializer/Deserializer) | Hardware that converts data between serial and parallel forms for transmission over high-speed links. | Essential for moving data rapidly across interconnects like NVLink, PCIe, and CXL with minimal signal loss. |
| Vendor Lock-in | A limitation from using proprietary technology tied to a specific company’s ecosystem. | NVLink delivers major advantages but only on NVIDIA hardware; open standards seek to avoid this lock-in. |
We compare NVLink with other chip interconnects, like PCIe Gen5, CXL, and AMD's Infinity Fabric. The focus: which moves data fastest, which has the lowest latency, and which solutions tie you to a single vendor versus keeping your options open.
Then we follow how NVLink has changed from version 1.0 to 5.0, and how NVIDIA is now letting others use its new NVLink Fusion technology -- hoping to make it the standard way chips connect in mixed systems.
If you buy the argument that interconnect speed defines what problems you can solve, NVLink stops being a specs table entry. It becomes the real monopoly.
Chapter 1: What Is NVLink?
NVLink is NVIDIA's proprietary, high-speed, point-to-point interconnect designed to link GPUs (and increasingly CPUs) into a unified complex that behaves as a single accelerator.
Instead of routing traffic through the CPU's PCI Express (PCIe) root complex, NVLink creates direct peer-to-peer paths between devices. Each "link" is a bundle of high-speed differential lanes that aggregate into a mezzanine-style bridge between neighboring GPUs.
By the time you reach Hopper-generation silicon, an NVLink endpoint can sustain up to 900 GB/s of bidirectional bandwidth per GPU without touching the PCIe fabric; Blackwell doubles that again to 1.8 TB/s.
From Pascal through Blackwell, NVLink has followed a deliberate doubling cadence that keeps PCIe -- and anything built on top of it -- permanently behind. The table below recaps the progression alongside the PCIe era each generation leapfrogged.
| NVLink Generation | GPU Architecture | Year | Total Bidirectional Bandwidth | Performance Trend |
|---|---|---|---|---|
| 1.0 | Pascal P100 | 2016 | 160 GB/s | ≈5× PCIe 3.0 x16 |
| 2.0 | Volta V100 | 2017 | 300 GB/s | ≈2× Gen 1.0 |
| 3.0 | Ampere A100 | 2020 | 600 GB/s | ≈2× Gen 2.0 |
| 4.0 | Hopper H100 | 2022 | 900 GB/s | ≈1.5× Gen 3.0 |
| 5.0 | Blackwell B200 | 2024 | 1.8 TB/s | ≈2× Gen 4.0 |
The jump to 1.8 TB/s keeps a ≈14× headroom over PCIe Gen5 x16 (~128 GB/s), which is why trillion-parameter LLM training still requires NVLink-class collectives rather than commodity buses.
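The headroom math is easy to verify against the table. A quick sketch in Python, assuming roughly 32 GB/s of bidirectional bandwidth for PCIe 3.0 x16 and the ~128 GB/s Gen5 figure used above:

```python
# Sanity-check the NVLink-vs-PCIe multiples quoted in the table above.
# PCIe figures are approximate bidirectional bandwidth for x16 links.
nvlink_bw = {"1.0 (P100)": 160, "2.0 (V100)": 300, "3.0 (A100)": 600,
             "4.0 (H100)": 900, "5.0 (B200)": 1800}   # GB/s, bidirectional
pcie3_x16 = 32    # GB/s, approx.
pcie5_x16 = 128   # GB/s, approx.

for gen, bw in nvlink_bw.items():
    print(f"NVLink {gen}: {bw:>4} GB/s  ≈ {bw / pcie5_x16:4.1f}× PCIe Gen5 x16")

# NVLink 1.0 against its contemporary bus, PCIe 3.0 x16:
print(f"NVLink 1.0 vs PCIe 3.0 x16: ≈{160 / pcie3_x16:.0f}×")
```

Running it reproduces the two multiples quoted in this section: roughly 5× for the first generation against PCIe 3.0, and roughly 14× for NVLink 5.0 against PCIe Gen5.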
Switchless mesh, switch fabric, and scale
NVLink started out connecting GPUs together in simple mesh or ring layouts -- each GPU had direct connections to a few others, so they could send data straight across without waiting their turn on the PCIe bus.
Later, NVIDIA added a hardware switch (NVSwitch) that lets many GPUs talk to each other at once instead of just in small groups. In a DGX-class box, for example, an NVSwitch fabric links 8 or 16 GPUs, and all of them can share data at full NVLink speed without getting in each other's way.
This architectural choice solves three headaches endemic to PCIe-based clusters (a small measurement sketch follows the list):
- Bandwidth: PCIe 5.0 x16 gives about 128 GB/s both ways; NVLink gives much more, so GPUs don't have to wait for data.4
- Latency: NVLink skips the CPU and is much faster inside the server -- delays drop to just a few microseconds.
- Coherence: NVLink keeps memory in sync between GPUs, so code can access GPU or CPU memory directly instead of copying data around.5
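Whether two GPUs in a box actually enjoy that peer path is easy to probe. A minimal sketch with PyTorch, assuming a node with at least two CUDA GPUs; `nvidia-smi topo -m` will tell you whether the link between them is NVLink or PCIe, and the measured rate reflects whichever path you have:

```python
# Probe peer-to-peer access and roughly measure GPU0 -> GPU1 copy bandwidth.
# The result reflects whatever fabric (NVLink or PCIe) connects the two devices.
import torch

assert torch.cuda.device_count() >= 2, "need at least two GPUs"

# Can GPU 0 access GPU 1's memory directly, without bouncing through the host?
print("P2P 0 <-> 1:", torch.cuda.can_device_access_peer(0, 1))

n_bytes = 1 << 30                                    # 1 GiB payload
src = torch.empty(n_bytes, dtype=torch.uint8, device="cuda:0")
dst = torch.empty(n_bytes, dtype=torch.uint8, device="cuda:1")

dst.copy_(src)                                       # warm-up transfer
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(10):
    dst.copy_(src)                                   # device-to-device copy
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1e3              # elapsed_time() is in ms
print(f"~{10 * n_bytes / seconds / 1e9:.1f} GB/s observed GPU0 -> GPU1")
```

On an NVLink-bridged pair the observed figure lands well above what a PCIe x16 slot can deliver; on a PCIe-only box it tops out near the bus limit.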
Beyond GPU-to-GPU: NVLink C2C
NVLink C2C (Chip-to-Chip) lets a Grace CPU and Hopper GPU connect directly, sharing memory at much higher speed and lower energy use than PCIe.
This means a GPU can use the CPU's memory when it runs out of local VRAM, without a big slowdown. So, bigger models can run without running into memory or power problems.
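There is a software-visible analogue of this today: CUDA managed (unified) memory lets a kernel touch more data than fits in VRAM, with pages migrating on demand. A minimal sketch using CuPy, assuming the `cupy` package, a CUDA GPU, and a platform that permits managed-memory oversubscription; NVLink C2C changes the economics of this pattern, not the programming model:

```python
# Oversubscribe GPU memory with CUDA managed (unified) memory via CuPy.
# On Grace Hopper, the same access pattern is served over hardware-coherent
# NVLink C2C instead of page migration across PCIe.
import cupy as cp

cp.cuda.set_allocator(cp.cuda.malloc_managed)     # allocations become managed memory

free, total = cp.cuda.runtime.memGetInfo()
print(f"local VRAM: {total / 1e9:.1f} GB")

# Ask for ~1.5x the GPU's total memory; pages migrate between host and device
# on demand as kernels touch them.
n_elems = int(1.5 * total) // 8                   # float64 elements
big = cp.zeros(n_elems, dtype=cp.float64)
big += 1.0                                        # GPU kernel touches every page
print("sum:", float(big.sum()))
```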
Why it matters
With NVLink, all the GPUs and CPUs act like one big team instead of isolated parts. This means the software can use all the hardware together, sharing memory and data fast. Important operations for AI training, like all-reduce and broadcasting, run quickly and smoothly. This leads to faster training, less wasted power, and easier programming.
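To make "all-reduce" concrete, here is a minimal multi-GPU sketch using PyTorch's NCCL backend, which rides NVLink/NVSwitch automatically when it is present and falls back to PCIe otherwise; the rendezvous address and port are placeholders:

```python
# Minimal sketch: NCCL all-reduce across all local GPUs with PyTorch.
# NCCL picks the fastest available path (NVLink/NVSwitch if present, else PCIe).
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"   # placeholder rendezvous address
    os.environ["MASTER_PORT"] = "29500"       # placeholder port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Each GPU contributes a gradient-like tensor; all-reduce sums them in place.
    grad = torch.full((1 << 20,), float(rank), device="cuda")
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: grad[0] = {grad[0].item()}")  # sum over all ranks
    dist.destroy_process_group()

if __name__ == "__main__":
    world = torch.cuda.device_count()
    mp.spawn(worker, args=(world,), nprocs=world)
```

The same handful of lines covers data-parallel training at any scale; what changes with NVLink is how little wall-clock time the all_reduce call steals from compute.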
NVIDIA's real advantage isn't just their software -- it's the speed, low lag, and efficiency of NVLink, which others haven't matched yet.
Chapter 2: Performance, Latency, and Power Efficiency
The economic penalty for skipping NVLink shows up long before a purchase order clears. Data center operators who wire their clusters with commodity PCIe interconnects discover two brutal truths:
- scaling stalls well below linear
- every idle microsecond bleeds watts.
2.1 The Energy Penalty
NVLink is built for fast, efficient data transfer; PCIe uses more energy per bit to move the same data.6 When transfers are slow, GPUs sit idle but still use power, so you pay extra both for the inefficient transfer and for hardware doing nothing. On big clusters, energy waste per useful compute skyrockets.
| Cost Metric | NVLink-Enabled Cluster | PCIe-Only Cluster | Financial Impact |
|---|---|---|---|
| Training Time | Baseline (X days) | X + 50% or more | Slower releases, higher burn rate |
| Power Efficiency | Baseline W per useful TFLOP | ~5× higher W per useful TFLOP | Larger monthly OPEX for the same job |
| Scaling Efficiency | ~85% | ~65% | CapEx wasted on underutilized hardware |
2.2 The TCO Calculation -- The Monopoly in Practice
At first, picking PCIe instead of NVLink seems like a smart cost-saving move. PCIe hardware is much cheaper up front, and the specs sound fast enough. But as soon as you start training a big model, the real costs show up.
Here's what happens:
- Slower Training: PCIe can't move data fast enough between GPUs, so jobs take way longer to finish. This means you wait extra days -- or even weeks -- to reach your goals, and miss out on new revenue or research because everything slips.
- Wasted Hardware: While GPUs are waiting to send or receive data, they burn power but don't actually compute. So you pay top dollar for GPUs that sit idle for a big chunk of time.
- Huge Power Bills: Longer training + idling GPUs = much higher electricity bills. The more you try to save on hardware, the more money you lose on operations.
Add it all up, and the “discount” vanishes fast. Over just a few projects, paying less up front can actually cost you more than the pricey NVLink setup. It's a hidden “tax”: wasted time, wasted energy, wasted hardware.
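The back-of-envelope math is easy to run yourself. A sketch with purely illustrative assumptions -- hardware prices, per-GPU power draw, and the electricity rate below are placeholders, and the 50%-longer PCIe run time echoes the table above:

```python
# Illustrative cost per completed training run. Every number here is an
# assumption for the sake of the arithmetic, not a vendor quote.
GPUS = 512
GPU_POWER_KW = 0.7          # assumed average draw per GPU, busy or stalled
ELECTRICITY = 0.12          # assumed $/kWh
AMORTIZE_YEARS = 3          # straight-line hardware amortization

def cost_per_run(hardware_cost, days_per_run, label):
    runs_per_year = 365 / days_per_run
    annual_power = GPUS * GPU_POWER_KW * 24 * 365 * ELECTRICITY
    annual_total = hardware_cost / AMORTIZE_YEARS + annual_power
    per_run = annual_total / runs_per_year
    print(f"{label}: {runs_per_year:.1f} runs/year, ${per_run:,.0f} per completed run")
    return per_run

nvlink = cost_per_run(25_000_000, days_per_run=20, label="NVLink cluster   ")
pcie = cost_per_run(20_000_000, days_per_run=30, label="PCIe-only cluster")  # ~20% cheaper hardware, ~50% slower jobs
print(f"PCIe premium per finished run: {100 * (pcie / nvlink - 1):.0f}%")
```

With these placeholder figures the "cheaper" cluster ends up costing roughly a fifth more per finished run, before counting the revenue lost to the extra calendar days.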
Here's the twist -- everyone says NVIDIA has a monopoly because of CUDA, but that's not really the trap. The real moat is NVLink. Their fabric unlocks performance you just can't get with PCIe, so you end up paying NVIDIA's prices whether you like it or not.
That's what makes this monopoly so deceptive: it hides in your power bills and your lost time, not just in hardware invoices.
Chapter 3: The Future of the Monopoly -- NVLink Fusion vs. CXL
Having won the present on raw performance, NVIDIA can harden its moat by licensing the fabric selectively and turning NVLink racks into the default blueprint for hyperscale AI factories.7
3.1 NVLink Fusion
NVLink Fusion is NVIDIA's new initiative that lets selected partners connect their own custom chips -- like CPUs, AI accelerators, or other devices -- directly into NVIDIA's ultra-high-speed GPU fabric using special NVLink chiplets and interface blocks.
Think of NVLink Fusion as an adapter that allows third-party silicon to join NVIDIA's super-fast hardware club, but on NVIDIA's terms.
Normally, different chips in a server talk to each other using standard connections like PCIe, which is much slower and struggles to keep up with AI workloads.
NVLink Fusion gives partners access to pieces of the NVLink technology -- such as chiplets (tiny, modular hardware blocks), SerDes (high-speed data links), and the rack-wide switch that ties everything together.
But crucially, NVIDIA does not hand over the full control of the underlying protocol, keeping the essential DNA of NVLink proprietary.8
| Aspect | Description |
|---|---|
| Core Benefit | NVLink Fusion lets custom hardware plug directly and efficiently into NVIDIA's GPUs, gaining the speed, low latency, and shared memory access normally reserved for NVIDIA hardware. |
| Control | NVIDIA determines who gets access and to what depth, preserving its control over the most valuable part of the stack. |
| Tight Coupling (Third-Party Silicon) | MediaTek, Marvell, and others can integrate CPUs or accelerators directly with Blackwell-class GPUs over NVLink C2C, inheriting 900 GB/s coherent links -- without needing to design a competing fabric.9 |
| Rack Lock-In | NVIDIA's GB200 NVL72 reference rack positions NVLink as the core of the system -- custom CPUs aiming for that scale must use NVLink backplanes, not PCIe.7 |
| Partner Dependency | Samsung Foundry and others creating non-x86 CPUs/XPUs are still anchored to NVLink for high-performance scale-up, even as customers pursue semi-custom solutions.10 |
| Ecosystem Impact | Fusion expands ecosystem participation, while ensuring that top-tier training workloads remain centered on NVIDIA GPUs. 11 |
3.2 The Challenger: CXL
CXL (Compute Express Link) is an industry-standard, high-speed interconnect protocol designed for coherently linking CPUs, GPUs, FPGAs, accelerators, and memory devices within servers and data centers.
It rides atop the PCIe physical layer (currently Gen5 and Gen6), but adds advanced memory coherency, direct load/store access, and fine-grained resource sharing between heterogeneous devices.
Unlike PCIe, which is optimized for general I/O and discrete transfers, CXL excels at pooling memory and connecting devices in a more unified, flexible memory space -- allowing system RAM, persistent memory, and accelerator memory to be shared or accessed transparently by multiple components.
This lets disaggregated servers and accelerators dynamically compose resources for workloads, particularly in cloud and inference environments, with much more flexibility than legacy architectures.
3.3 The Enduring Monopoly
NVLink's future is a two-track play: invite partners into the rack while keeping the only highway that makes the rack economic.12
- Fusion taps custom silicon, while CXL absorbs the commodity edge.
- NVIDIA keeps the ≈14× bandwidth differential that dictates performance-per-watt in training clusters.
Conclusion
If you strip away the marketing wars and the surface-level focus on flops, CUDA, or even raw chip supply, the reason NVIDIA dominates AI isn't just better silicon or clever software lock-in -- it's NVLink: the connective tissue that lets every dollar spent on compute deliver maximum throughput, minimal waiting, and scale that simply isn't feasible with PCIe or open standards alone.
At the start, I argued that interconnect speed defines what problems a datacenter can solve.
After examining the arc from PCIe bottlenecks to NVLink’s peer-to-peer fabric, and from NVSwitch and C2C to Fusion, the pattern is clear.
As other vendors talk about “openness,” NVIDIA has built a network that bakes their advantage right into the server rack. It’s not just about making their own GPUs faster; it’s about making any alternative uneconomical at serious scale.
While headlines cite CUDA as the moat, NVLink is the underwater reef that breaks any vessel not designed for NVIDIA’s ocean.
No competitor -- not CXL, not PCIe, not open accelerators -- can yet offer the seamless, ultra-low-latency, and high-bandwidth connections required for the largest AI models (and the biggest AI customers).
By licensing Fusion and letting others “plug in,” NVIDIA appears open, but the only highway that makes hyperscale AI profitable is still theirs.
As AI gets hungrier for bandwidth and systems grow ever larger, every new rack, every new standard, reinforces this advantage.
Even if you swap out the CPU or buy next year’s “open” accelerator, your data will still flow through NVLink’s lanes, routed by NVIDIA’s own switch fabric.
The AI race isn’t just about who makes the fastest chip -- it’s now about who owns the roads connecting them.
Footnotes
1. What are the Key Differences Between NVLink and PCIe? | AI FAQ - Jarvis Labs
2. NVLink vs PCIe: What's the Difference for AI Workloads - Hyperstack
3. How does NVLink improve the performance of AI model training compared to PCIe?
4. How does NVIDIA's NVLink impact cache coherence in multi-GPU systems? - Massed Compute
5. Scaling AI Inference Performance and Flexibility with NVIDIA NVLink and NVLink Fusion
6. NVIDIA Unveils NVLink Fusion for Industry to Build Semi-Custom AI Infrastructure with NVIDIA Partner Ecosystem
7. Reuters: Nvidia's Huang Set to Showcase Latest AI Tech at Taiwan's Computex
8. TechRadar: Samsung Will Help Nvidia Build Custom Non-x86 CPUs and XPUs
9. The Next Platform: Nvidia Licenses NVLink Memory Ports to CPU and Accelerator Makers