Embedding Projector
Visualise high-dimensional embeddings in 3D and compare PCA vs UMAP projection. Mammoth point cloud, Fashion MNIST image embeddings, GloVe word vectors, and real blog post sentence embeddings — all precomputed, instant switching.
Every AI model that does something useful — classify images, translate text, recommend products, answer questions — first converts its input into a high-dimensional vector. That vector is the model's internal representation of meaning. The problem: humans cannot look at a 384-dimensional vector and understand what it means.
Projection is the practice of compressing those vectors down to 2D or 3D while preserving as much of the original structure as possible — so you can see the geometry with your eyes.
Two methods dominate:
- PCA — finds the directions of maximum variance and projects linearly. Fast, globally consistent, deterministic. Blind to nonlinear structure.
- UMAP — builds a topological graph of local neighbourhoods in high-dimensional space, then optimises a low-dimensional layout to match it. Slower. Clusters survive.
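To make the PCA bullet concrete, here is a minimal sketch using NumPy's SVD. It is not the code behind this page: the function name `pca_project` and the random 384-dimensional input are placeholders I made up for illustration.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project rows of X onto the top principal directions (plain SVD-based PCA)."""
    Xc = X - X.mean(axis=0)                      # centre each feature
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 384))                  # stand-in for 200 embedding vectors
Y = pca_project(X, n_components=3)
print(Y.shape)                                   # (200, 3)
```

Running it twice on the same input gives the same output, which is the "deterministic" property mentioned above.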
All projections here are precomputed. Nothing runs in your browser.
Projector
What each dataset shows
🦣 Mammoth — point cloud
10,000 points sampled from the surface of a woolly mammoth skeleton, sourced from PAIR-code/understanding-umap and downsampled to 2,000 points via Farthest Point Sampling (FPS). Why downsample? So the page renders faster in your browser.
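For the curious, Farthest Point Sampling can be sketched in a few lines. This is a generic greedy version under my own assumptions (Euclidean distance, arbitrary starting point, random stand-in data), not necessarily the exact script used to prepare the mammoth.

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    """Greedily pick k indices so each new point is as far as possible from those already chosen."""
    chosen = np.empty(k, dtype=int)
    chosen[0] = start
    # Distance from every point to its nearest chosen point so far
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for i in range(1, k):
        nxt = int(np.argmax(min_dist))           # the point farthest from the current sample
        chosen[i] = nxt
        d = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, d)
    return chosen

rng = np.random.default_rng(0)
cloud = rng.normal(size=(10_000, 3))             # stand-in for the 10,000-point cloud
idx = farthest_point_sampling(cloud, 2_000)
sample = cloud[idx]
print(sample.shape)                              # (2000, 3)
```

Compared with random subsampling, this tends to keep thin parts of the shape (tusks, legs) represented, because sparse regions are always "far" from what has been picked.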
The mammoth is the canonical benchmark for projection algorithms because its surface is a connected 2D manifold embedded in 3D.
Every segment connects to its neighbours: legs to body, neck to head, head to tusks.
PCA sees a blob with an elongated axis. It is like shining a light through the mammoth and looking at the shadow.
UMAP learns the topology and roughly preserves it. The connected segments in 3D remain neighbours in 2D.
The buttons inside the UMAP chart expose how the two hyperparameters trade off:
| | Low | High |
|---|---|---|
| n_neighbors | Local structure. Tight clusters, may fragment the global shape. | Global structure. Body parts stay connected. |
| min_dist | Tight, dense clusters. | Spread-out, diffuse layout. |
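The n_neighbors row can be illustrated without UMAP itself. The sketch below is only an analogy: it counts the connected components of a plain k-nearest-neighbour graph on toy data I invented (two groups of points with a gap between them). With a small k the graph falls apart into separate pieces; with a large k, edges bridge the gap.

```python
import numpy as np

def knn_components(X, k):
    """Count connected components of the undirected k-nearest-neighbour graph of X."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                  # a point is not its own neighbour
    neighbours = np.argsort(D, axis=1)[:, :k]

    parent = list(range(n))                      # union-find over points
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in neighbours[i]:
            parent[find(i)] = find(int(j))
    return len({find(i) for i in range(n)})

# Toy data: two groups of 10 evenly spaced points with a gap between them
left = np.arange(10, dtype=float)
right = np.arange(10, dtype=float) + 30.0
X = np.concatenate([left, right])[:, None]

print(knn_components(X, k=3))    # 2: small neighbourhoods, the groups stay separate
print(knn_components(X, k=12))   # 1: large neighbourhoods, edges bridge the gap
```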
👗 Fashion MNIST — image embeddings
2,000 items from Fashion MNIST (Zalando Research): T-shirts, trousers, dresses, sneakers, bags.
Each item is a 28×28 greyscale image: 784 dimensions of raw pixel values. The projection compresses those 784 dimensions down to 3 or 2. Hover any point to see its thumbnail.
PCA in 2D blurs the categories together because the first two variance axes don't align with the garment-type boundaries.
UMAP separates footwear from tops from trousers into distinct islands, because the pixel-level similarity structure of shoes is genuinely different from the pixel-level structure of shirts, even in raw pixel space.
Why this matters: this is how AI can reliably reproduce certain styles of images from visual similarity alone, instead of relying on a language tag.
It is also how similar images cluster together in the latent space of a vision model, which also has consequences for MusicAI.
"Wait what? MusicAI? But it is IMAGE!"
But MusicAI IS an image. Basically, the song is transcribed into a spectrogram. Mind blown? Check out Audio Forensics to learn more.
💬 Word Embeddings — GloVe-50D
160 words from GloVe-6B-50D (Stanford NLP), sampled across 8 semantic clusters: animals, countries, emotions, food, tech, royalty, sports, geography.
Each word is a 50-dimensional vector trained on co-occurrence statistics across Wikipedia and Gigaword. Hover any point to see the word label.
This is the oldest technique on this page, from back when we were still figuring out whether text could be turned into vectors.
The geometry makes word analogies work: the vector offset king → queen is approximately the same as man → woman, because royalty and gender are encoded along roughly independent directions in the 50D space.
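The analogy arithmetic looks like this in code. The vectors below are hand-made 3D toys, NOT real GloVe values: I set axis 0 to mean "royalty" and axis 1 to mean "gender" so the mechanism is visible. With real 50D GloVe vectors the match is approximate rather than exact.

```python
import numpy as np

# Hand-made toy vectors, NOT real GloVe values:
# axis 0 ~ "royalty", axis 1 ~ "gender", axis 2 ~ everything else
vectors = {
    "king":  np.array([0.9,  0.8, 0.1]),
    "queen": np.array([0.9, -0.8, 0.1]),
    "man":   np.array([0.1,  0.8, 0.0]),
    "woman": np.array([0.1, -0.8, 0.0]),
    "apple": np.array([0.0,  0.0, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the three inputs."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("man", "woman", "king"))   # queen
```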
Now, you see the clear advantage of UMAP over PCA. UMAP reveals the clusters; PCA smears them along variance axes that don't align with semantic boundaries.
Limitation: GloVe is a static embedding — "bank" gets one vector regardless of context (riverbank vs. financial bank). Modern models like BERT produce context-dependent vectors. The geometry is more complex but also more useful.
📝 Blog Posts — sentence embeddings
593 paragraph chunks from 22 posts published on this blog across 2025, embedded with all-MiniLM-L6-v2 (384 dimensions).
Each point is a 3-sentence sliding window over my blog content. Colour = source post. Hover to read the chunk.
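A 3-sentence sliding window can be sketched as below. The stride of 1 and the naive regex sentence splitter are my assumptions for illustration. A real pipeline would likely use a proper sentence tokenizer, and each chunk would then be passed to the embedding model.

```python
import re

def sentence_windows(text, size=3, stride=1):
    """Split text into sentences and return overlapping windows of `size` sentences."""
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= size:
        return [" ".join(sentences)] if sentences else []
    return [" ".join(sentences[i:i + size])
            for i in range(0, len(sentences) - size + 1, stride)]

post = "PCA is linear. UMAP is not. Both reduce dimensions. Each has trade-offs."
for chunk in sentence_windows(post):
    print(chunk)
# PCA is linear. UMAP is not. Both reduce dimensions.
# UMAP is not. Both reduce dimensions. Each has trade-offs.
```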
The previous three tabs use published datasets that I borrowed for this page. This tab uses my own content, and the intention is to reflect how modern AI applications work.
When you build a RAG (Retrieval-Augmented Generation) pipeline, a semantic search index, or a recommendation engine, this is what the embedding space looks like internally: your content, chunked, embedded, and laid out by meaning.
Notice what clusters together and what doesn't. Posts on economics and power cluster across article boundaries. Tech posts on software migration and system design sit near each other even though they were written weeks apart.
The model has never read these articles; it infers proximity entirely from what it learned about language during pretraining.
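"Proximity" here usually means cosine similarity, and retrieval is just "find the nearest chunks to the query vector". A minimal sketch, with random vectors standing in for the real chunk embeddings (the helper name `top_k` is mine, not from any library):

```python
import numpy as np

def top_k(query, chunks, k=3):
    """Return indices and scores of the k chunk vectors most cosine-similar to the query."""
    q = query / np.linalg.norm(query)
    C = chunks / np.linalg.norm(chunks, axis=1, keepdims=True)
    scores = C @ q                                # cosine similarity to every chunk
    order = np.argsort(-scores)[:k]               # highest similarity first
    return order, scores[order]

rng = np.random.default_rng(0)
chunk_vecs = rng.normal(size=(593, 384))          # random stand-ins for 593 chunk embeddings
query_vec = chunk_vecs[42] + 0.1 * rng.normal(size=384)   # a query close to chunk 42
idx, scores = top_k(query_vec, chunk_vecs)
print(idx[0])                                     # 42
```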
The same idea is also used by recommendation systems to cluster users and items into communities, as in Twitter's SimClusters. That is one reason left-wing people tend to see more left-wing content and right-wing people more right-wing content.
Why people do this
Projection is used at every stage of the modern AI development cycle:
Debugging model representations. Before fine-tuning, project the base model's embeddings on your task data. If the classes already separate in UMAP, a lightweight classifier head will probably work. If they overlap, you likely need more training data or a different encoder.
Understanding retrieval quality. In a RAG pipeline, bad retrieval is the most common failure mode. Projecting your document chunks shows you whether semantically similar chunks are actually close in embedding space — or whether the chunking strategy is fragmenting meaning across too many pieces.
Detecting data drift. Embed a production sample and overlay it on your training distribution. Points that land outside the training cluster are likely out-of-distribution, which means the model is extrapolating, not interpolating.
Content and recommendation systems. User preference vectors, item embeddings, and behaviour traces all live in the same high-dimensional space. Projection makes it visible which items are substitutes vs. complements, and whether user segments form natural clusters.
Cluster discovery. Before labelling a new dataset, project it. Natural groupings appear as islands. You can then label representatives of each cluster rather than labelling every point — a significant reduction in annotation cost.
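The drift check above can be backed by a number as well as a picture. One simple option, sketched here under my own assumptions (toy 8D data, nearest-neighbour distance as the score, a 99th-percentile threshold chosen arbitrarily), is to score each production point in the original embedding space and flag the ones that sit unusually far from any training point.

```python
import numpy as np

def nn_distance(points, reference, skip_self=False):
    """Distance from each row of `points` to its nearest row in `reference`."""
    D = np.linalg.norm(points[:, None, :] - reference[None, :, :], axis=-1)
    if skip_self:                                 # points is reference: ignore zero self-distance
        np.fill_diagonal(D, np.inf)
    return D.min(axis=1)

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 8))                 # stand-in training embeddings
prod = np.vstack([rng.normal(size=(5, 8)),        # 5 points like the training data
                  rng.normal(size=(5, 8)) + 6.0]) # 5 points from a shifted distribution

# Threshold: 99th percentile of nearest-neighbour distances inside the training set
threshold = np.percentile(nn_distance(train, train, skip_self=True), 99)
scores = nn_distance(prod, train)
flags = scores > threshold
print(flags)                                      # the last five should all be True
```

The projection then tells you where the flagged points sit relative to everything else.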
Why UMAP beats PCA for embeddings
PCA is optimal for one task: finding the linear subspace that minimises reconstruction error in L2.
For data whose structure is nonlinear, PCA spreads clusters out and merges them. Examples are word clusters that are semantically coherent but not linearly separable, and garment categories that differ in texture more than in average pixel brightness.
TLDR: You can't turn "Content Universe" into a shadow projection!
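A toy version of the shadow problem, using data I made up: two concentric rings in 2D, projected to 1D with PCA. No straight line separates the rings, so a chunk of the outer ring lands on top of the inner ring in the projection, while a nonlinear feature (the radius) separates them cleanly.

```python
import numpy as np

theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
inner = np.c_[np.cos(theta), np.sin(theta)]       # ring of radius 1
outer = 3 * np.c_[np.cos(theta), np.sin(theta)]   # ring of radius 3
X = np.vstack([inner, outer])
labels = np.repeat([0, 1], 100)

# PCA to 1D: centre, then project onto the first principal direction
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
z = Xc @ Vt[0]

# Fraction of outer-ring points whose 1D coordinate falls inside the inner ring's range
lo, hi = z[labels == 0].min(), z[labels == 0].max()
overlap = np.mean((z[labels == 1] >= lo) & (z[labels == 1] <= hi))
print(f"{overlap:.0%} of outer-ring points land inside the inner ring's 1D range")

# A nonlinear feature (the radius) separates the rings cleanly
radius = np.linalg.norm(X, axis=1)
print(radius[labels == 0].max() < radius[labels == 1].min())   # True
```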
UMAP models the data as a weighted graph: each point connects to its n_neighbors nearest neighbours, with edge weights that decay with distance according to a fuzzy-set membership function.
It then finds a low-dimensional graph that minimises cross-entropy against the high-dimensional one.
The result is a layout that preserves which points are neighbours, not just which directions have the most variance.
If that is a lot of words, the short version: UMAP finds the neighbours of every point and tries to keep them neighbours in the layout.
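Here is a heavily simplified sketch of those two steps: build the fuzzy neighbour graph, then score a layout against it. It is not the umap-learn implementation. I use exact dense nearest neighbours, fix the low-dimensional curve parameters to a = b = 1 (the real library fits them from min_dist), pick the bisection bounds by hand, and only evaluate the cross-entropy instead of optimising it.

```python
import numpy as np

def fuzzy_graph(X, n_neighbors=5, n_iter=64):
    """Symmetric fuzzy neighbour graph in the style of UMAP (simplified, dense, exact kNN)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = np.zeros((n, n))
    target = np.log2(n_neighbors)
    for i in range(n):
        idx = np.argsort(D[i])[1:n_neighbors + 1]    # nearest neighbours, skipping self
        d = D[i, idx]
        rho = d[0]                                   # distance to the closest neighbour
        lo, hi = 1e-6, 1e3                           # assumed search bounds for sigma
        for _ in range(n_iter):                      # bisection: weights should sum to log2(k)
            sigma = 0.5 * (lo + hi)
            if np.exp(-(d - rho) / sigma).sum() > target:
                hi = sigma
            else:
                lo = sigma
        W[i, idx] = np.exp(-(d - rho) / sigma)       # membership decays with distance
    return W + W.T - W * W.T                         # fuzzy union of the two directed edges

def cross_entropy(W, Y, a=1.0, b=1.0, eps=1e-9):
    """Cross-entropy of layout Y against graph W, dropping terms that don't depend on Y."""
    d = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    Q = np.clip(1.0 / (1.0 + a * d ** (2 * b)), eps, 1 - eps)
    off_diag = ~np.eye(len(Y), dtype=bool)
    ce = -(W * np.log(Q) + (1 - W) * np.log(1 - Q))
    return ce[off_diag].sum()

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(20, 5)), rng.normal(size=(20, 5)) + 10.0])  # two toy groups in 5D
W = fuzzy_graph(X)

good = X[:, :2].copy()                 # a layout that keeps the two groups apart
bad = good.copy()
bad[:10] = good[20:30]                 # swap half of each group:
bad[20:30] = good[:10]                 # neighbours now sit far apart
print(cross_entropy(W, good) < cross_entropy(W, bad))   # True
```

The layout that keeps high-dimensional neighbours close scores lower, which is the quantity UMAP's optimiser pushes down.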
What's the catch?
The catch is that UMAP results vary with the random seed, are sensitive to hyperparameters, and take roughly O(n log n) time with approximate nearest neighbours.
With 10k points, a run takes 30–90 seconds in Python on a CPU. That is why the projections here are precomputed offline.
I use the mammoth to show the difference concretely. Both algorithms see the same 2K points.
PCA sees a 3D blob and flattens it along its axes of greatest variance onto the page, so you see a "shadow".
UMAP sees a surface, the skeleton, and attempts to unfold it. Switching between the two views is the clearest demonstration of what structure each method preserves.
When PCA still wins
UMAP is not always the right tool. The choice depends on what you need to do with the result — not just how it looks.
| Scenario | Use | Why |
|---|---|---|
| Preprocessing before UMAP | PCA | UMAP is slow on very high-D data. Running PCA to 50D first (e.g. 784D → 50D → 2D) cuts runtime by 10–100× with minimal information loss |
| Reproducible pipelines | PCA | PCA is deterministic. UMAP varies with random seed — two runs on the same data produce different layouts, making automated comparison unreliable |
| Anomaly detection | PCA | Reconstruction error (distance from the PCA subspace) is a clean, interpretable anomaly score. UMAP has no equivalent |
| Feature importance | PCA | Each principal component is a linear combination of original features. You can read off which input dimensions drive each axis |
| Very small datasets | PCA | UMAP's nearest-neighbour graph is unstable below ~200 points. PCA works at any size |
| Global distance matters | PCA | UMAP distorts long-range distances to pull clusters apart. If you need point A genuinely farther from B than from C, PCA preserves that ordering better |
| Visualising cluster separation | UMAP | Nonlinear clusters that PCA merges become visually distinct islands |
| Debugging encoder quality | UMAP | Class structure invisible in PCA often appears in UMAP — revealing whether a backbone has already learned the task |
| RAG / retrieval audit | UMAP | Chunk proximity in embedding space is nonlinear; UMAP shows retrieval neighbours more honestly than PCA |
The practical heuristic: use PCA to understand the data globally, UMAP to understand it locally. In production systems, PCA appears in the data pipeline (whitening, compression, anomaly scoring); UMAP appears in the exploration and debugging stage, not in the serving path.
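The anomaly-detection row of the table is easy to show in code. This is a sketch on toy data I generated (points near a 2D plane inside a 10D space, plus one point that ignores the plane). The helper names are mine, and what counts as a "high" score is left to you.

```python
import numpy as np

def fit_pca(X, n_components):
    """Return the data mean and the top principal directions."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def reconstruction_error(X, mean, components):
    """L2 distance between each row and its projection onto the PCA subspace."""
    Xc = X - mean
    recon = Xc @ components.T @ components
    return np.linalg.norm(Xc - recon, axis=1)

rng = np.random.default_rng(0)
# Toy data: 300 points lying near a 2D plane inside a 10D space
basis = rng.normal(size=(2, 10))
normal_data = rng.normal(size=(300, 2)) @ basis + 0.05 * rng.normal(size=(300, 10))
outlier = rng.normal(size=(1, 10)) * 3.0          # a point with no reason to lie on that plane

mean, comps = fit_pca(normal_data, n_components=2)
normal_scores = reconstruction_error(normal_data, mean, comps)
outlier_score = reconstruction_error(outlier, mean, comps)[0]
print(normal_scores.max(), outlier_score)         # the outlier's score is far larger
```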
Sources
- PAIR-code/understanding-umap — mammoth point cloud dataset and the interactive UMAP explainer that inspired this lab
- McInnes, L., Healy, J., Melville, J. — UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction (2018)
- Pennington, J., Socher, R., Manning, C. — GloVe: Global Vectors for Word Representation (2014)
- Xiao, H., Rasul, K., Vollgraf, R. — Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms (2017)
- Reimers, N., Gurevych, I. — Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (2019) — basis for all-MiniLM-L6-v2