TLDR:
The deeper you look at accelerated computing, the more a simple truth emerges -- silicon alone is not the moat. The real lock on the market is how those chips talk to each other.
NVLink, NVIDIA's proprietary high-bandwidth interconnect, is the glue that turns otherwise discrete GPUs into one giant, coherent accelerator.
In the process, NVIDIA gains control over the most lucrative workloads in Artificial Intelligence (AI), High-performance computing (HPC), and visualization.
The monopoly is not in the "brain"; it is in the "spine."
I will try to distill this complex topic as best I can.
We start by exploring the bandwidth bottlenecks that plague PCIe-based systems and how NVLink's point-to-point links and non-blocking NVSwitch fabric rewrite the rules for multi-GPU scaling.1
From there we dive into NVLink Chip-to-Chip (C2C), the Grace Hopper coherence layer that lets GPUs sip from CPU memory while burning roughly 25× less energy per bit than a PCIe hop.2
Then, we track the real-world consequences -- faster LLM training runs, shared VRAM pools for cinematic renders, and scientific simulations that no longer crawl.3
Table of Contents
| Section | Key Topics Discussed |
|---|---|
| TLDR | NVLink as NVIDIA's real competitive moat; interconnect, not just silicon or CUDA. |
| Intro: The Bottleneck | Importance of high-speed interconnects; CPU/GPU communication dynamics; limitations of PCIe; demand for bandwidth and low latency. |
| What Is NVLink? | Overview of NVLink architecture; proprietary point-to-point technology; comparison with PCIe; basic hardware and topology principles. |
| Mesh Topologies and Switch Fabric | Switchless mesh connections; role of NVSwitch; NVLink 5.0 switch bandwidth; enabling GPU scale-up and multi-GPU systems. |
| NVLink C2C & Unified Memory | Grace Hopper NVLink Chip-to-Chip (C2C); CPU–GPU coherence; unified and expanded memory access; efficiency and power savings. |
| Real-World Impact | Effects on collective operations; accelerated AI training; dynamic GPU resource pooling; software adaptation; monopoly dynamics. |
| Interconnect Comparison | NVLink vs. PCIe, CXL, and AMD Infinity Fabric; evaluation of bandwidth, latency, openness, and vendor lock-in. |
| Evolution and Roadmap | Progression from NVLink 1.0 to 5.0; Fusion technology and licensing; positioning NVLink as an industry standard. |
| Conclusion | Interconnect speed as the core advantage; NVLink's dominance at the system level; outlook for heterogeneous computing. |
Key Glossary: Interconnects and Related Jargon
| Term | What It Means | Why It Matters in AI/HPC |
|---|---|---|
| NVLink | NVIDIA’s proprietary high-speed point-to-point interconnect for GPUs and CPUs. | Enables ultra-fast, direct communication and memory sharing between chips, powering large-scale AI/ML workloads. |
| PCIe (PCI Express) | Industry standard communication bus for connecting peripherals (like GPUs, SSDs) to CPUs. | Ubiquitous and open, but has lower bandwidth and higher latency vs. NVLink. Bottleneck in multi-GPU or large-model setups. |
| CXL (Compute Express Link) | Open interconnect standard designed for low-latency memory sharing between CPUs, GPUs, FPGAs, and accelerators. | Emerging competitor for next-gen, coherent heterogeneous computing. Aims to unify memory across devices from different vendors. |
| Infinity Fabric | AMD’s high-speed interconnect technology for connecting CPUs and GPUs. | Enables multi-GPU and CPU-GPU collaboration; alternative to NVLink in AMD systems. |
| Bandwidth | Amount of data that can be transferred per second (e.g., GB/s, TB/s). | Limits the speed of data movement; higher is better for large-scale AI/ML. |
| Latency | Time taken for data to travel from source to destination. | Lower latency means faster communication, critical for distributed training. |
| Switch Fabric / NVSwitch | Specialized hardware enabling many chips to connect and communicate simultaneously at full speed. | Removes bottlenecks when scaling beyond small mesh topologies. |
| Chip-to-Chip (C2C) | Direct connection between CPU and GPU without routing over PCIe. | Grace Hopper NVLink C2C enables unified, coherent memory sharing between CPU and GPU. |
| TCO (Total Cost of Ownership) | The comprehensive sum of buying, deploying, maintaining, and powering technology over its usable life. | Lower TCO is critical in data centers; specialized interconnects may raise or lower it depending on efficiency and vendor lock-in. |
| SerDes (Serializer/Deserializer) | Hardware that converts data between serial and parallel forms for transmission over high-speed links. | Essential for moving data rapidly across interconnects like NVLink, PCIe, and CXL with minimal signal loss. |
| Vendor Lock-in | A limitation from using proprietary technology tied to a specific company’s ecosystem. | NVLink delivers major advantages but only on NVIDIA hardware; open standards seek to avoid this lock-in. |
If you buy the argument that interconnect speed defines what problems you can solve, then NVLink stops being a spec-sheet entry and becomes the real monopoly.
Chapter 1: What Is NVLink?
NVLink is NVIDIA's proprietary, high-speed, point-to-point interconnect designed to link GPUs (and increasingly CPUs) into a unified complex that behaves as a single accelerator.
Instead of routing traffic through the CPU's PCI Express (PCIe) root complex, NVLink creates direct peer-to-peer paths between devices.
Each "link" is a bundle of high-speed differential lanes, and a GPU exposes several of them -- carried either over a bridge connector between neighboring cards or through the board-level fabric of SXM modules.
By the time you reach Blackwell-generation silicon, an NVLink endpoint can sustain up to 1.8 TB/s of bidirectional bandwidth per GPU.
| NVLink Generation | GPU Architecture | Year | Total Bidirectional Bandwidth | Performance Trend |
|---|---|---|---|---|
| 1.0 | Pascal P100 | 2016 | 160 GB/s | ≈5× PCIe 3.0 x16 |
| 2.0 | Volta V100 | 2017 | 300 GB/s | ≈2× Gen 1.0 |
| 3.0 | Ampere A100 | 2020 | 600 GB/s | ≈2× Gen 2.0 |
| 4.0 | Hopper H100 | 2022 | 900 GB/s | ≈1.5× Gen 3.0 |
| 5.0 | Blackwell B200 | 2024 | 1.8 TB/s | ≈2× Gen 4.0 |
The jump to 1.8 TB/s keeps a ≈14× headroom over PCIe Gen5 x16 (~128 GB/s), which is why trillion-parameter LLM training still requires NVLink-class collectives rather than commodity buses.
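For readers who like to sanity-check the ratios, here is a tiny Python sketch that reproduces the headroom math; the only inputs are the per-GPU bidirectional figures quoted in the table above and the approximate PCIe Gen5 x16 rate.

```python
# Back-of-the-envelope headroom: NVLink 5.0 vs. a PCIe Gen5 x16 slot,
# using the per-GPU bidirectional figures from the table above.
nvlink5_bidir_gbps = 1800          # ~1.8 TB/s per Blackwell GPU
pcie_gen5_x16_bidir_gbps = 128     # ~64 GB/s each direction

headroom = nvlink5_bidir_gbps / pcie_gen5_x16_bidir_gbps
print(f"NVLink 5.0 headroom over PCIe Gen5 x16: ~{headroom:.0f}x")  # ~14x

# Generation-over-generation growth from the same table.
gens = [("1.0", 160), ("2.0", 300), ("3.0", 600), ("4.0", 900), ("5.0", 1800)]
for (prev_name, prev_bw), (name, bw) in zip(gens, gens[1:]):
    print(f"NVLink {name} is ~{bw / prev_bw:.1f}x NVLink {prev_name}")
```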
Switchless mesh, switch fabric, and scale
NVLink started out connecting GPUs together in simple mesh or ring layouts -- each GPU had direct connections to a few others, so they could send data straight across without waiting their turn on the PCIe bus.
Later, NVIDIA added a hardware switch (NVSwitch) that lets far more GPUs talk to each other at once, instead of just in small groups. For example, the NVSwitch fabric inside a single box can link up to 16 GPUs, all of them sharing data at full NVLink speed without getting in each other's way.
This architectural choice solves three headaches endemic to PCIe-based clusters:
- Bandwidth: PCIe 5.0 x16 tops out around 128 GB/s bidirectional; NVLink delivers an order of magnitude more per GPU, so GPUs spend far less time waiting for data.4
- Latency: NVLink bypasses the CPU's root complex, so intra-server hops drop to the low-microsecond range.
- Coherence: NVLink keeps memory coherent across GPUs, so code can load and store directly into a peer GPU's (or the CPU's) memory instead of staging copies through host RAM (a quick peer-access check is sketched after this list).5
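As a quick sanity check of the peer-to-peer path, here is a minimal sketch -- assuming PyTorch and a machine with at least two NVIDIA GPUs, neither of which the article prescribes -- that asks whether GPU 0 can address GPU 1 directly and times a device-to-device copy. On NVLink-connected pairs the measured rate lands far above what a PCIe-only hop can deliver; exact numbers depend on the system.

```python
# Minimal sketch (assumes PyTorch and >= 2 NVIDIA GPUs): check direct peer
# access between GPU 0 and GPU 1, then time a 1 GiB device-to-device copy.
import torch

assert torch.cuda.device_count() >= 2, "needs at least two GPUs"
print("peer access 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))

n_bytes = 1 << 30  # 1 GiB payload
src = torch.empty(n_bytes, dtype=torch.uint8, device="cuda:0")
dst = torch.empty(n_bytes, dtype=torch.uint8, device="cuda:1")

dst.copy_(src)              # warm-up copy
torch.cuda.synchronize(0)
torch.cuda.synchronize(1)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
dst.copy_(src, non_blocking=True)   # goes peer-to-peer when the GPUs allow it
end.record()
torch.cuda.synchronize(0)
torch.cuda.synchronize(1)

seconds = start.elapsed_time(end) / 1000.0  # elapsed_time returns milliseconds
print(f"GPU0 -> GPU1 copy: {n_bytes / seconds / 1e9:.1f} GB/s")
```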
Beyond GPU-to-GPU: NVLink C2C
NVLink C2C (Chip-to-Chip) lets a Grace CPU and Hopper GPU connect directly, sharing memory at much higher speed and lower energy use than PCIe.
This means a GPU can spill into the CPU's memory when it exhausts local VRAM without a crippling slowdown, so larger models fit without hitting memory or power walls.
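To illustrate what that looks like from code, here is a sketch using CuPy's managed-memory allocator; CuPy and the 1.5× oversubscription factor are my own illustrative choices, not something the platform mandates. With CUDA managed memory, an array can exceed the GPU's free VRAM; on a coherent NVLink-C2C system such as Grace Hopper the spill lands in CPU memory at far less cost than paging over PCIe, while on an ordinary PCIe box the same code still runs, only slower.

```python
# Minimal sketch (assumes CuPy): route allocations through CUDA managed
# memory so a single array can be larger than the GPU's free VRAM.
import cupy as cp

# Every CuPy allocation from here on comes from managed (unified) memory.
cp.cuda.set_allocator(cp.cuda.MemoryPool(cp.cuda.malloc_managed).malloc)

free_bytes, total_bytes = cp.cuda.runtime.memGetInfo()
print(f"GPU VRAM: {total_bytes / 1e9:.1f} GB total, {free_bytes / 1e9:.1f} GB free")

# Illustrative oversubscription: ~1.5x the currently free VRAM, as float64.
n_elems = int(1.5 * free_bytes) // 8
x = cp.zeros(n_elems, dtype=cp.float64)
x += 1.0                      # touch every page from the GPU
print("elements:", n_elems, "sum:", float(x.sum()))
```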
TLDR
NVLink lets a horizontal row of GPUs "scale vertically" in pooled VRAM and throughput, which improves both training and inference costs. That said, a set of linked GPUs should not be mistaken for the equivalent of a single unified GPU.

Chapter 2: Performance, Latency, and Power Efficiency
The economic penalty for skipping NVLink shows up long before a purchase order clears. Data center operators who wire their clusters with commodity PCIe interconnects discover two brutal truths:
- Scaling stalls well below linear.
- Every idle microsecond bleeds watts.
2.1 The Energy Penalty
NVLink is built for fast, efficient data transfer; PCIe uses more energy per bit to move the same data.6
When transfers are slow, GPUs sit idle but still draw power, so you pay twice: once for the inefficient transfer and again for hardware doing nothing. On big clusters, the energy spent per unit of useful compute skyrockets.
| Cost Metric | NVLink-Enabled Cluster | PCIe-Only Cluster | Financial Impact |
|---|---|---|---|
| Training Time | Baseline (X days) | X + 50% or more | Slower releases, higher burn rate |
| Power Efficiency | Low W/Useful TFLOP | ~5× higher W/TFLOP | Larger monthly OPEX for the same job |
| Scaling Efficiency | ~85% | ~65% | CapEx wasted on underutilized hardware |
2.2 The TCO Calculation
At first, picking PCIe instead of NVLink seems like a smart cost-saving move. PCIe hardware is much cheaper up front, and the specs sound fast enough. But as soon as you start training a big model, the real costs show up.
Here's what happens:
- Slower Training: PCIe can't move data fast enough between GPUs, so jobs take way longer to finish. This means you wait extra days -- or even weeks -- to reach your goals, and miss out on new revenue or research because everything slips.
- Wasted Hardware: While GPUs are waiting to send or receive data, they burn power but don't actually compute. So you pay top dollar for GPUs that sit idle for a big chunk of time.
- Huge Power Bills: Longer training + idling GPUs = much higher electricity bills. The more you try to save on hardware, the more money you lose on operations.
Add it all up, and the "discount" vanishes fast. Over just a few projects, paying less up front can actually cost you more than the pricey NVLink setup.
The hidden "tax" is wasted time, wasted energy, wasted hardware.
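To see how the hidden tax compounds, here is a back-of-the-envelope sketch in Python. Every input -- GPU count, power draw, electricity price, per-GPU-day cost, the 85% vs. 65% scaling efficiencies from the table above, and the 20-day baseline job -- is an illustrative assumption, not a measured benchmark; the point is the shape of the calculation, not the exact dollars.

```python
# Hypothetical back-of-the-envelope TCO per training run. All inputs are
# illustrative assumptions; substitute your own cluster's numbers.
def training_run_cost(gpu_count, scaling_efficiency, baseline_days,
                      gpu_watts=700, price_per_kwh=0.12, gpu_day_rate=30.0):
    # Poorer scaling stretches the same job across more wall-clock days.
    days = baseline_days / scaling_efficiency
    energy_kwh = gpu_count * gpu_watts / 1000 * 24 * days
    power_cost = energy_kwh * price_per_kwh
    gpu_time_cost = gpu_count * gpu_day_rate * days  # amortized or rented GPU time
    return days, power_cost + gpu_time_cost

for label, eff in [("NVLink-class fabric", 0.85), ("PCIe-only cluster", 0.65)]:
    days, cost = training_run_cost(gpu_count=512, scaling_efficiency=eff,
                                   baseline_days=20)
    print(f"{label}: ~{days:.0f} days, ~${cost:,.0f} per run")
```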
Everyone says NVIDIA has a monopoly because of CUDA, but that's not really the trap. The real moat is NVLink. Their fabric unlocks performance you just can't get with PCIe, so you end up paying NVIDIA's prices whether you like it or not.
That's what makes their market dominance so deceptive -- it hides in your power bills and your lost time, not just in hardware invoices.

Chapter 3: NVLink Fusion vs. CXL
Having won the present on raw performance, NVIDIA can harden its moat by licensing the fabric selectively and turning NVLink racks into the default blueprint for hyperscale AI factories.7
3.1 NVLink Fusion
NVLink Fusion is NVIDIA's new initiative that lets selected partners connect their own custom chips -- like CPUs, AI accelerators, or other devices -- directly into NVIDIA's ultra-high-speed GPU fabric using special NVLink chiplets and interface blocks.
NVLink Fusion is like an adapter that allows third-party silicon to join NVIDIA's super-fast hardware club, but on NVIDIA's terms.
Normally, different chips in a server talk to each other using standard connections like PCIe, which is much slower and struggles to keep up with AI workloads.
NVLink Fusion gives partners access to pieces of the NVLink technology -- such as chiplets (tiny, modular hardware blocks), SerDes (high-speed data links), and the rack-wide switch that ties everything together.
But crucially, NVIDIA does not hand over full control of the underlying protocol, keeping the essential DNA of NVLink proprietary.8
| Aspect | Description |
|---|---|
| Core Benefit | NVLink Fusion lets custom hardware plug directly and efficiently into NVIDIA's GPUs, gaining the speed, low latency, and shared memory access normally reserved for NVIDIA hardware. |
| Control | NVIDIA determines who gets access and to what depth, preserving its control over the most valuable part of the stack. |
| Tight Coupling (Third-Party Silicon) | MediaTek, Marvell, and others can integrate CPUs or accelerators directly with Blackwell-class GPUs over NVLink C2C, inheriting 900 GB/s coherent links -- without needing to design a competing fabric.9 |
| Rack Lock-In | NVIDIA's GB200 NVL72 reference rack positions NVLink as the core of the system -- custom CPUs aiming for that scale must use NVLink backplanes, not PCIe.7 |
| Partner Dependency | Samsung Foundry and others creating non-x86 CPUs/XPUs are still anchored to NVLink for high-performance scale-up, even as customers pursue semi-custom solutions.10 |
| Ecosystem Impact | Fusion expands ecosystem participation, while ensuring that top-tier training workloads remain centered on NVIDIA GPUs.11 |
3.2 The Challenger: CXL
CXL (Compute Express Link) is an industry-standard, high-speed interconnect protocol designed for coherently linking CPUs, GPUs, FPGAs, accelerators, and memory devices within servers and data centers.
It rides atop the PCIe physical layer (currently Gen5 and Gen6), but adds advanced memory coherency, direct load/store access, and fine-grained resource sharing between heterogeneous devices.
Unlike PCIe, which is optimized for general I/O and discrete transfers, CXL excels at pooling memory and connecting devices in a more unified, flexible memory space -- allowing system RAM, persistent memory, and accelerator memory to be shared or accessed transparently by multiple components.
This lets disaggregated servers and accelerators dynamically compose resources for workloads, particularly in cloud and inference environments, with much more flexibility than legacy architectures.
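One concrete way CXL memory pooling surfaces to software today: on Linux, CXL-attached DRAM commonly appears as a CPU-less, memory-only NUMA node that the OS and applications can allocate from like slower ordinary RAM. The sketch below, which assumes a Linux host and simply reads sysfs, lists the nodes and flags the CPU-less ones; it is a rough diagnostic, not a CXL-specific API.

```python
# Minimal sketch (assumes Linux): list NUMA nodes and flag CPU-less ones,
# which is how CXL Type 3 memory expanders commonly appear to the OS.
import glob
import os

for node_path in sorted(glob.glob("/sys/devices/system/node/node*")):
    node = os.path.basename(node_path)
    with open(os.path.join(node_path, "cpulist")) as f:
        cpulist = f.read().strip()
    with open(os.path.join(node_path, "meminfo")) as f:
        mem_total_kb = int(f.read().split("MemTotal:")[1].split()[0])
    kind = f"CPUs {cpulist}" if cpulist else "CPU-less (possibly CXL-attached memory)"
    print(f"{node}: {mem_total_kb / 1024 / 1024:.1f} GiB, {kind}")
```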

3.3 NVIDIA Not Dethroned Anytime Soon
NVLink's future is a two-track play:
- Invite partners into the rack -- Fusion taps custom silicon while CXL absorbs the commodity edge.
- Keep the only highway that makes the rack economic: the ~14× bandwidth differential that dictates performance-per-watt in training clusters.12

Conclusion
If you strip away the marketing wars and the discussions of FLOPs, CUDA, or even raw chip supply, you notice that the reason NVIDIA dominates AI isn't just better silicon or clever software lock-in -- it's NVLink.
While other vendors talk about "openness," NVIDIA has built a fabric that bakes its advantage right into the server rack.
No competitor -- not CXL, not PCIe, not open accelerators -- can yet offer the seamless, ultra-low-latency, and high-bandwidth connections required for the largest AI models (and the biggest AI customers).
By licensing Fusion and letting others "plug in," NVIDIA appears open, but the only highway that makes hyperscale AI profitable is still theirs.
Even if you swap out the CPU or buy next year’s "open" accelerator, your data will still flow through NVLink’s lanes, routed by NVIDIA’s own switch fabric.
The AI race isn't just about who makes the fastest chip -- it's now about who owns the roads connecting them.
Footnotes
- What are the Key Differences Between NVLink and PCIe? | AI FAQ - Jarvis Labs
- NVLink vs PCIe: What's the Difference for AI Workloads - Hyperstack
- How does NVLink improve the performance of AI model training compared to PCIe?
- How does NVIDIA's NVLink impact cache coherence in multi-GPU systems? - Massed Compute
- Scaling AI Inference Performance and Flexibility with NVIDIA NVLink and NVLink Fusion
- NVIDIA Unveils NVLink Fusion for Industry to Build Semi-Custom AI Infrastructure with NVIDIA Partner Ecosystem
- Reuters: Nvidia's Huang Set to Showcase Latest AI Tech at Taiwan's Computex
- TechRadar: Samsung Will Help Nvidia Build Custom Non-x86 CPUs and XPUs
- The Next Platform: Nvidia Licenses NVLink Memory Ports to CPU and Accelerator Makers