TLDR:
The deeper you look at accelerated computing, the more a simple truth emerges -- silicon alone is not the moat. The real lock on the market is how those chips talk to each other.
NVLink, NVIDIA's proprietary high-bandwidth interconnect, is the glue that turns otherwise discrete GPUs into one giant, coherent accelerator.
In the process, NVIDIA gains control over the most lucrative workloads in Artificial Intelligence (AI), High-performance computing (HPC), and visualization.
I try to distill a complex topic as best I can. We start by exploring the bandwidth bottlenecks that plague PCIe-based systems and how NVLink's point-to-point links and non-blocking NVSwitch fabric rewrite the rules for multi-GPU scaling.1
From there we dive into NVLink Chip-to-Chip (C2C), the Grace Hopper coherence layer that lets GPUs sip from CPU memory while burning roughly 25× less energy per bit than a PCIe hop.2
Alongside the architectural tour, we track the real-world consequences -- faster LLM training runs, shared VRAM pools for cinematic renders, and scientific simulations that no longer crawl.3
Table of Contents
| Section | Key Topics Discussed |
|---|---|
| TLDR | NVLink as NVIDIA's real competitive moat; interconnect, not just silicon or CUDA. |
| Intro: The Bottleneck | Importance of high-speed interconnects; CPU/GPU communication dynamics; limitations of PCIe; demand for bandwidth and low latency. |
| What Is NVLink? | Overview of NVLink architecture; proprietary point-to-point technology; comparison with PCIe; basic hardware and topology principles. |
| Mesh Topologies and Switch Fabric | Switchless mesh connections; role of NVSwitch; NVLink 5.0 switch bandwidth; enabling GPU scale-up and multi-GPU systems. |
| NVLink C2C & Unified Memory | Grace Hopper NVLink Chip-to-Chip (C2C); CPU–GPU coherence; unified and expanded memory access; efficiency and power savings. |
| Real-World Impact | Effects on collective operations; accelerated AI training; dynamic GPU resource pooling; software adaptation; monopoly dynamics. |
| Interconnect Comparison | NVLink vs. PCIe, CXL, and AMD Infinity Fabric; evaluation of bandwidth, latency, openness, and vendor lock-in. |
| Evolution and Roadmap | Progression from NVLink 1.0 to 5.0; Fusion technology and licensing; positioning NVLink as an industry standard. |
| Conclusion | Interconnect speed as the core advantage; NVLink's dominance at the system level; outlook for heterogeneous computing. |
Key Glossary: Interconnects and Related Jargon
| Term | What It Means | Why It Matters in AI/HPC |
|---|---|---|
| NVLink | NVIDIA’s proprietary high-speed point-to-point interconnect for GPUs and CPUs. | Enables ultra-fast, direct communication and memory sharing between chips, powering large-scale AI/ML workloads. |
| PCIe (PCI Express) | Industry standard communication bus for connecting peripherals (like GPUs, SSDs) to CPUs. | Ubiquitous and open, but has lower bandwidth and higher latency vs. NVLink. Bottleneck in multi-GPU or large-model setups. |
| CXL (Compute Express Link) | Open interconnect standard designed for low-latency memory sharing between CPUs, GPUs, FPGAs, and accelerators. | Emerging competitor for next-gen, coherent heterogeneous computing. Aims to unify memory across devices from different vendors. |
| Infinity Fabric | AMD’s high-speed interconnect technology for connecting CPUs and GPUs. | Enables multi-GPU and CPU-GPU collaboration; alternative to NVLink in AMD systems. |
| Bandwidth | Amount of data that can be transferred per second (e.g., GB/s, TB/s). | Limits the speed of data movement; higher is better for large-scale AI/ML. |
| Latency | Time taken for data to travel from source to destination. | Lower latency means faster communication, critical for distributed training. |
| Switch Fabric / NVSwitch | Specialized hardware enabling many chips to connect and communicate simultaneously at full speed. | Removes bottlenecks when scaling beyond small mesh topologies. |
| Chip-to-Chip (C2C) | Direct connection between CPU and GPU without routing over PCIe. | Grace Hopper NVLink C2C enables unified, coherent memory sharing between CPU and GPU. |
| TCO (Total Cost of Ownership) | The comprehensive sum of buying, deploying, maintaining, and powering technology over its usable life. | Lower TCO is critical in data centers; specialized interconnects may raise or lower it depending on efficiency and vendor lock-in. |
| SerDes (Serializer/Deserializer) | Hardware that converts data between serial and parallel forms for transmission over high-speed links. | Essential for moving data rapidly across interconnects like NVLink, PCIe, and CXL with minimal signal loss. |
| Vendor Lock-in | A limitation from using proprietary technology tied to a specific company’s ecosystem. | NVLink delivers major advantages but only on NVIDIA hardware; open standards seek to avoid this lock-in. |
We compare NVLink with other chip interconnects, like PCIe Gen5, CXL, and AMD's Infinity Fabric. The focus: which moves data fastest, which has the lowest latency, and which solutions tie you to a single vendor versus keeping your options open.
Then we follow how NVLink has changed from version 1.0 to 5.0, and how NVIDIA is now letting others use its new NVLink Fusion technology -- hoping to make it the standard way chips connect in mixed systems.
If you buy the argument that interconnect speed defines what problems you can solve, NVLink stops being a specs table entry. It becomes the real monopoly.
Chapter 1: What Is NVLink?
NVLink is NVIDIA's proprietary, high-speed, point-to-point interconnect designed to link GPUs (and increasingly CPUs) into a unified complex that behaves as a single accelerator.
Instead of routing traffic through the CPU's PCI Express (PCIe) root complex, NVLink creates direct peer-to-peer paths between devices. Each "link" is a bundle of high-speed differential lanes that aggregate into a mezzanine-style bridge between neighboring GPUs.
By the time you reach Hopper-generation silicon, an NVLink endpoint can sustain up to 900 GB/s of bidirectional bandwidth per GPU without touching the PCIe fabric; Blackwell doubles that again to 1.8 TB/s.
From Pascal through Blackwell, NVLink has followed a deliberate doubling cadence that keeps PCIe -- and anything built on top of it -- permanently behind. The table below recaps the progression alongside the PCIe era each generation leapfrogged.
| NVLink Generation | GPU Architecture | Year | Total Bidirectional Bandwidth | Performance Trend |
|---|---|---|---|---|
| 1.0 | Pascal P100 | 2016 | 160 GB/s | ≈5× PCIe 3.0 x16 |
| 2.0 | Volta V100 | 2017 | 300 GB/s | ≈2× Gen 1.0 |
| 3.0 | Ampere A100 | 2020 | 600 GB/s | ≈2× Gen 2.0 |
| 4.0 | Hopper H100 | 2022 | 900 GB/s | ≈1.5× Gen 3.0 |
| 5.0 | Blackwell B200 | 2024 | 1.8 TB/s | ≈2× Gen 4.0 |
The jump to 1.8 TB/s keeps a ≈14× headroom over PCIe Gen5 x16 (~128 GB/s), which is why trillion-parameter LLM training still requires NVLink-class collectives rather than commodity buses.
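The headroom math is easy to verify against the table. A quick sketch in Python, assuming roughly 32 GB/s of bidirectional bandwidth for PCIe 3.0 x16 and the ~128 GB/s Gen5 figure used above:

```python
# Sanity-check the NVLink-vs-PCIe multiples quoted in the table above.
# PCIe figures are approximate bidirectional bandwidth for x16 links.
nvlink_bw = {"1.0 (P100)": 160, "2.0 (V100)": 300, "3.0 (A100)": 600,
             "4.0 (H100)": 900, "5.0 (B200)": 1800}   # GB/s, bidirectional
pcie3_x16 = 32    # GB/s, approx.
pcie5_x16 = 128   # GB/s, approx.

for gen, bw in nvlink_bw.items():
    print(f"NVLink {gen}: {bw:>4} GB/s  ≈ {bw / pcie5_x16:4.1f}× PCIe Gen5 x16")

# NVLink 1.0 against its contemporary bus, PCIe 3.0 x16:
print(f"NVLink 1.0 vs PCIe 3.0 x16: ≈{160 / pcie3_x16:.0f}×")
```

Running it reproduces the two multiples quoted in this section: roughly 5× for the first generation against PCIe 3.0, and roughly 14× for NVLink 5.0 against PCIe Gen5.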
Switchless mesh, switch fabric, and scale
NVLink started out connecting GPUs together in simple mesh or ring layouts -- each GPU had direct connections to a few others, so they could send data straight across without waiting their turn on the PCIe bus.
Later, NVIDIA added a hardware switch (NVSwitch) that lets many GPUs talk to each other at once instead of just in small groups. In a DGX-class box, for example, an NVSwitch fabric links 8 or 16 GPUs, and all of them can share data at full NVLink speed without getting in each other's way.
This architectural choice solves three headaches endemic to PCIe-based clusters (a small measurement sketch follows the list):
- Bandwidth: PCIe 5.0 x16 gives about 128 GB/s both ways; NVLink gives much more, so GPUs don't have to wait for data.4
- Latency: NVLink skips the CPU and is much faster inside the server -- delays drop to just a few microseconds.
- Coherence: NVLink keeps memory in sync between GPUs, so code can access GPU or CPU memory directly instead of copying data around.5
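Whether two GPUs in a box actually enjoy that peer path is easy to probe. A minimal sketch with PyTorch, assuming a node with at least two CUDA GPUs; `nvidia-smi topo -m` will tell you whether the link between them is NVLink or PCIe, and the measured rate reflects whichever path you have:

```python
# Probe peer-to-peer access and roughly measure GPU0 -> GPU1 copy bandwidth.
# The result reflects whatever fabric (NVLink or PCIe) connects the two devices.
import torch

assert torch.cuda.device_count() >= 2, "need at least two GPUs"

# Can GPU 0 access GPU 1's memory directly, without bouncing through the host?
print("P2P 0 <-> 1:", torch.cuda.can_device_access_peer(0, 1))

n_bytes = 1 << 30                                    # 1 GiB payload
src = torch.empty(n_bytes, dtype=torch.uint8, device="cuda:0")
dst = torch.empty(n_bytes, dtype=torch.uint8, device="cuda:1")

dst.copy_(src)                                       # warm-up transfer
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(10):
    dst.copy_(src)                                   # device-to-device copy
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1e3              # elapsed_time() is in ms
print(f"~{10 * n_bytes / seconds / 1e9:.1f} GB/s observed GPU0 -> GPU1")
```

On an NVLink-bridged pair the observed figure lands well above what a PCIe x16 slot can deliver; on a PCIe-only box it tops out near the bus limit.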
Beyond GPU-to-GPU: NVLink C2C
NVLink C2C (Chip-to-Chip) lets a Grace CPU and Hopper GPU connect directly, sharing memory at much higher speed and lower energy use than PCIe.
This means a GPU can use the CPU's memory when it runs out of local VRAM, without a big slowdown. So, bigger models can run without running into memory or power problems.
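There is a software-visible analogue of this today: CUDA managed (unified) memory lets a kernel touch more data than fits in VRAM, with pages migrating on demand. A minimal sketch using CuPy, assuming the `cupy` package, a CUDA GPU, and a platform that permits managed-memory oversubscription; NVLink C2C changes the economics of this pattern, not the programming model:

```python
# Oversubscribe GPU memory with CUDA managed (unified) memory via CuPy.
# On Grace Hopper, the same access pattern is served over hardware-coherent
# NVLink C2C instead of page migration across PCIe.
import cupy as cp

cp.cuda.set_allocator(cp.cuda.malloc_managed)     # allocations become managed memory

free, total = cp.cuda.runtime.memGetInfo()
print(f"local VRAM: {total / 1e9:.1f} GB")

# Ask for ~1.5x the GPU's total memory; pages migrate between host and device
# on demand as kernels touch them.
n_elems = int(1.5 * total) // 8                   # float64 elements
big = cp.zeros(n_elems, dtype=cp.float64)
big += 1.0                                        # GPU kernel touches every page
print("sum:", float(big.sum()))
```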
Why it matters
With NVLink, all the GPUs and CPUs act like one big team instead of isolated parts. This means the software can use all the hardware together, sharing memory and data fast. Important operations for AI training, like all-reduce and broadcasting, run quickly and smoothly. This leads to faster training, less wasted power, and easier programming.
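To make "all-reduce" concrete, here is a minimal multi-GPU sketch using PyTorch's NCCL backend, which rides NVLink/NVSwitch automatically when it is present and falls back to PCIe otherwise; the rendezvous address and port are placeholders:

```python
# Minimal sketch: NCCL all-reduce across all local GPUs with PyTorch.
# NCCL picks the fastest available path (NVLink/NVSwitch if present, else PCIe).
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"   # placeholder rendezvous address
    os.environ["MASTER_PORT"] = "29500"       # placeholder port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Each GPU contributes a gradient-like tensor; all-reduce sums them in place.
    grad = torch.full((1 << 20,), float(rank), device="cuda")
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: grad[0] = {grad[0].item()}")  # sum over all ranks
    dist.destroy_process_group()

if __name__ == "__main__":
    world = torch.cuda.device_count()
    mp.spawn(worker, args=(world,), nprocs=world)
```

The same handful of lines covers data-parallel training at any scale; what changes with NVLink is how little wall-clock time the all_reduce call steals from compute.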
NVIDIA's real advantage isn't just their software -- it's the speed, low lag, and efficiency of NVLink, which others haven't matched yet.
Chapter 2: Performance, Latency, and Power Efficiency
The economic penalty for skipping NVLink shows up long before a purchase order clears. Data center operators who wire their clusters with commodity PCIe interconnects discover two brutal truths:
- scaling stalls well below linear
- every idle microsecond bleeds watts.
2.1 The Energy Penalty
NVLink is built for fast, efficient data transfer; PCIe uses more energy per bit to move the same data.6 When transfers are slow, GPUs sit idle but still use power, so you pay extra both for the inefficient transfer and for hardware doing nothing. On big clusters, energy waste per useful compute skyrockets.
| Cost Metric | NVLink-Enabled Cluster | PCIe-Only Cluster | Financial Impact |
|---|---|---|---|
| Training Time | Baseline (X days) | X + 50% or more | Slower releases, higher burn rate |
| Power Efficiency | Baseline W per useful TFLOP | ~5× higher W per useful TFLOP | Larger monthly OPEX for the same job |
| Scaling Efficiency | ~85% | ~65% | CapEx wasted on underutilized hardware |
2.2 The TCO Calculation -- The Monopoly in Practice
At first, picking PCIe instead of NVLink seems like a smart cost-saving move. PCIe hardware is much cheaper up front, and the specs sound fast enough. But as soon as you start training a big model, the real costs show up.
Here's what happens:
- Slower Training: PCIe can't move data fast enough between GPUs, so jobs take way longer to finish. This means you wait extra days -- or even weeks -- to reach your goals, and miss out on new revenue or research because everything slips.
- Wasted Hardware: While GPUs are waiting to send or receive data, they burn power but don't actually compute. So you pay top dollar for GPUs that sit idle for a big chunk of time.
- Huge Power Bills: Longer training + idling GPUs = much higher electricity bills. The more you try to save on hardware, the more money you lose on operations.
Add it all up, and the “discount” vanishes fast. Over just a few projects, paying less up front can actually cost you more than the pricey NVLink setup. It's a hidden “tax”: wasted time, wasted energy, wasted hardware.
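The back-of-envelope math is easy to run yourself. A sketch with purely illustrative assumptions -- hardware prices, per-GPU power draw, and the electricity rate below are placeholders, and the 50%-longer PCIe run time echoes the table above:

```python
# Illustrative cost per completed training run. Every number here is an
# assumption for the sake of the arithmetic, not a vendor quote.
GPUS = 512
GPU_POWER_KW = 0.7          # assumed average draw per GPU, busy or stalled
ELECTRICITY = 0.12          # assumed $/kWh
AMORTIZE_YEARS = 3          # straight-line hardware amortization

def cost_per_run(hardware_cost, days_per_run, label):
    runs_per_year = 365 / days_per_run
    annual_power = GPUS * GPU_POWER_KW * 24 * 365 * ELECTRICITY
    annual_total = hardware_cost / AMORTIZE_YEARS + annual_power
    per_run = annual_total / runs_per_year
    print(f"{label}: {runs_per_year:.1f} runs/year, ${per_run:,.0f} per completed run")
    return per_run

nvlink = cost_per_run(25_000_000, days_per_run=20, label="NVLink cluster   ")
pcie = cost_per_run(20_000_000, days_per_run=30, label="PCIe-only cluster")  # ~20% cheaper hardware, ~50% slower jobs
print(f"PCIe premium per finished run: {100 * (pcie / nvlink - 1):.0f}%")
```

With these placeholder figures the "cheaper" cluster ends up costing roughly a fifth more per finished run, before counting the revenue lost to the extra calendar days.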
Here's the twist -- everyone says NVIDIA has a monopoly because of CUDA, but that's not really the trap. The real moat is NVLink. Their fabric unlocks performance you just can't get with PCIe, so you end up paying NVIDIA's prices whether you like it or not.
That's what makes this monopoly so deceptive: it hides in your power bills and your lost time, not just in hardware invoices.
Chapter 3: The Future of the Monopoly -- NVLink Fusion vs. CXL
Having won the present on raw performance, NVIDIA can harden its moat by licensing the fabric selectively and turning NVLink racks into the default blueprint for hyperscale AI factories.7
3.1 NVLink Fusion
NVLink Fusion is NVIDIA's new initiative that lets selected partners connect their own custom chips -- like CPUs, AI accelerators, or other devices -- directly into NVIDIA's ultra-high-speed GPU fabric using special NVLink chiplets and interface blocks.
Think of NVLink Fusion as an adapter that allows third-party silicon to join NVIDIA's super-fast hardware club, but on NVIDIA's terms.
Normally, different chips in a server talk to each other using standard connections like PCIe, which is much slower and struggles to keep up with AI workloads.
NVLink Fusion gives partners access to pieces of the NVLink technology -- such as chiplets (tiny, modular hardware blocks), SerDes (high-speed data links), and the rack-wide switch that ties everything together.
But crucially, NVIDIA does not hand over the full control of the underlying protocol, keeping the essential DNA of NVLink proprietary.8
| Aspect | Description |
|---|---|
| Core Benefit | NVLink Fusion lets custom hardware plug directly and efficiently into NVIDIA's GPUs, gaining the speed, low latency, and shared memory access normally reserved for NVIDIA hardware. |
| Control | NVIDIA determines who gets access and to what depth, preserving its control over the most valuable part of the stack. |
| Tight Coupling (Third-Party Silicon) | MediaTek, Marvell, and others can integrate CPUs or accelerators directly with Blackwell-class GPUs over NVLink C2C, inheriting 900 GB/s coherent links -- without needing to design a competing fabric.9 |
| Rack Lock-In | NVIDIA's GB200 NVL72 reference rack positions NVLink as the core of the system -- custom CPUs aiming for that scale must use NVLink backplanes, not PCIe.7 |
| Partner Dependency | Samsung Foundry and others creating non-x86 CPUs/XPUs are still anchored to NVLink for high-performance scale-up, even as customers pursue semi-custom solutions.10 |
| Ecosystem Impact | Fusion expands ecosystem participation, while ensuring that top-tier training workloads remain centered on NVIDIA GPUs. 11 |
3.2 The Challenger: CXL
CXL (Compute Express Link) is an industry-standard, high-speed interconnect protocol designed for coherently linking CPUs, GPUs, FPGAs, accelerators, and memory devices within servers and data centers.
It rides atop the PCIe physical layer (currently Gen5 and Gen6), but adds advanced memory coherency, direct load/store access, and fine-grained resource sharing between heterogeneous devices.
Unlike PCIe, which is optimized for general I/O and discrete transfers, CXL excels at pooling memory and connecting devices in a more unified, flexible memory space -- allowing system RAM, persistent memory, and accelerator memory to be shared or accessed transparently by multiple components.
This lets disaggregated servers and accelerators dynamically compose resources for workloads, particularly in cloud and inference environments, with much more flexibility than legacy architectures.
3.3 The Enduring Monopoly
NVLink's future is a two-track play: invite partners into the rack while keeping the only highway that makes the rack economic.12
- Fusion taps custom silicon, while CXL absorbs the commodity edge.
- NVIDIA keeps the ≈14× bandwidth differential that dictates performance-per-watt in training clusters.
Conclusion
If you strip away the marketing wars and the surface-level focus on flops, CUDA, or even raw chip supply, the reason NVIDIA dominates AI isn't just better silicon or clever software lock-in -- it's NVLink: the connective tissue that lets every dollar spent on compute deliver maximum throughput, minimal waiting, and scale that simply isn't feasible with PCIe or open standards alone.
At the start, I argued that interconnect speed defines what problems a datacenter can solve.
After examining the arc from PCIe bottlenecks to NVLink’s peer-to-peer fabric, and from NVSwitch and C2C to Fusion, the pattern is clear.
As other vendors talk about “openness,” NVIDIA has built a network that bakes their advantage right into the server rack. It’s not just about making their own GPUs faster; it’s about making any alternative uneconomical at serious scale.
While headlines cite CUDA as the moat, NVLink is the underwater reef that breaks any vessel not designed for NVIDIA’s ocean.
No competitor -- not CXL, not PCIe, not open accelerators -- can yet offer the seamless, ultra-low-latency, and high-bandwidth connections required for the largest AI models (and the biggest AI customers).
By licensing Fusion and letting others “plug in,” NVIDIA appears open, but the only highway that makes hyperscale AI profitable is still theirs.
As AI gets hungrier for bandwidth and systems grow ever larger, every new rack, every new standard, reinforces this advantage.
Even if you swap out the CPU or buy next year’s “open” accelerator, your data will still flow through NVLink’s lanes, routed by NVIDIA’s own switch fabric.
The AI race isn’t just about who makes the fastest chip -- it’s now about who owns the roads connecting them.
Footnotes
1. What are the Key Differences Between NVLink and PCIe? | AI FAQ - Jarvis Labs
2. NVLink vs PCIe: What's the Difference for AI Workloads - Hyperstack
3. How does NVLink improve the performance of AI model training compared to PCIe?
4. How does NVIDIA's NVLink impact cache coherence in multi-GPU systems? - Massed Compute
5. Scaling AI Inference Performance and Flexibility with NVIDIA NVLink and NVLink Fusion
6. NVIDIA Unveils NVLink Fusion for Industry to Build Semi-Custom AI Infrastructure with NVIDIA Partner Ecosystem
7. Reuters: Nvidia's Huang Set to Showcase Latest AI Tech at Taiwan's Computex
8. TechRadar: Samsung Will Help Nvidia Build Custom Non-x86 CPUs and XPUs
9. The Next Platform: Nvidia Licenses NVLink Memory Ports to CPU and Accelerator Makers