Nelworks
Season 2

EP01 - AI Music Generation

How AI generates full songs instantly. Learn about multi-stage pipelines, unified spectrograms, neural vocoders, latent space compression, and cost optimization in AI music models like Suno.

Making music is hard.
That's... a legitimate banger.
Kurumi! How? It's not just a melody. It has layers. It sounds... produced.
Music AI is an efficient piece of signal processing. It can generate music faster than you can listen to it.
Because it is. But not the way you think.
It's not an automated band in a studio. It's more like a single machine painting.
Painting? It's a sound I'm hearing.
Your prompt goes to an LLM. It generates thematic lyrics and tags like `[Verse]` and `[Chorus]`.
The lyrics and style tags are converted into a set of abstract mathematical tokens. A tiny map that encodes the overall vibe and flow.
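A minimal sketch of that first stage, with entirely hypothetical names (`lyrics_llm`, `style_tokenizer`, and `SongBlueprint` are illustrations, not Suno's API). The point is that the output is a few kilobytes of text and tokens, not audio.

```python
from dataclasses import dataclass

@dataclass
class SongBlueprint:
    lyrics: str           # text with structure tags like [Verse] / [Chorus]
    style_tokens: list    # small sequence of integers encoding the vibe

def plan_song(user_prompt: str, lyrics_llm, style_tokenizer) -> SongBlueprint:
    """Stage 1: turn a short prompt into a compact, text-only blueprint."""
    # The LLM writes themed lyrics and inserts section tags.
    lyrics = lyrics_llm.generate(
        f"Write song lyrics about: {user_prompt}. "
        "Mark sections with [Verse], [Chorus], [Bridge]."
    )
    # Lyrics plus the style description become a tiny token sequence
    # (kilobytes of text, not audio) that conditions the audio stage later.
    style_tokens = style_tokenizer.encode(f"{user_prompt}\n{lyrics}")
    return SongBlueprint(lyrics=lyrics, style_tokens=style_tokens)
```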
Okay, but how does it handle all the instruments?
This is the clever cheat code. Instead of building up every instrument or stem, it generates a *single* mel-spectrogram. The full, finished picture of the whole song, all at once.
Just one?
Yup. No multi-track recorder, no separate diffusion magic for drums or vocals. It's more like a futuristic flatbed scanner for music.
The model starts with noise and sharpens it, one pass, using the blueprint as guidance. Everything—the kick, the pad, the harmonies—gets painted in at once.
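Conceptually, that painting step looks like the loop below. This is a simplified sketch: `denoiser` stands in for a hypothetical conditioned U-Net, and the update rule ignores the real noise schedule a DDPM/DDIM sampler would use.

```python
import torch

@torch.no_grad()
def generate_mel(denoiser, blueprint_tokens, steps=50, n_mels=128, n_frames=2048):
    """Paint one mel-spectrogram for the whole song: start from pure noise
    and sharpen it over a fixed number of passes, guided by the blueprint."""
    mel = torch.randn(1, n_mels, n_frames)              # the noisy canvas
    for t in reversed(range(steps)):
        # Predict the noise still present at step t, given the blueprint.
        predicted_noise = denoiser(mel, t, cond=blueprint_tokens)
        # Simplified update: peel away a fraction of the predicted noise.
        # A real DDPM/DDIM sampler uses a proper noise schedule here.
        mel = mel - predicted_noise / steps
    return mel    # kick, pads, harmonies, vocals all live in this one image
```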
Okay, so it's one big picture. How do you turn it into sound?
The whole spectrogram goes through a neural vocoder pass, something like EnCodec or HiFi-GAN. No fancy mixdowns, no per-instrument rendering.
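To make the spectrogram-to-waveform step concrete, here is a stand-in using librosa's Griffin-Lim-based mel inversion. Real systems use a neural vocoder such as HiFi-GAN or a neural codec decoder such as EnCodec, which sound far better, but the shape of the step is the same: one picture in, one waveform out, in a single pass. The `generated_mel.npy` file is hypothetical.

```python
import librosa
import numpy as np
import soundfile as sf

# Stand-in for the vocoder stage: classical mel inversion instead of a
# neural vocoder, purely to show the single-pass spectrogram -> audio step.
sr = 22050
mel = np.load("generated_mel.npy")   # hypothetical output of the diffusion stage
audio = librosa.feature.inverse.mel_to_audio(
    mel, sr=sr, n_fft=2048, hop_length=512
)
sf.write("song.wav", audio, sr)
```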
Hold on. If everything is baked into one spectrogram and vocoded in a single pass… shouldn't it sound like mud? Kick drum fighting the bass, vocals drowning in reverb. How does it still sound so good?
It should. And with 2022 models it did. But 2025 models are scary good for one simple reason…
Look here — see how the bass energy dips exactly where the kick hits? See the 3–5 kHz notch when the vocal comes in? That's not a mixing engineer. That's the U-Net hallucinating negative space.
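You can check that "negative space" claim on any mix, AI or human, with a few lines of analysis. This is illustrative only: the band edges and the ducking interpretation are assumptions, not a detector.

```python
import numpy as np
import librosa

# Measure how low-band (kick/bass) and mid-band (vocal presence) energy
# move against each other over time in a mixed track.
y, sr = librosa.load("song.wav", sr=22050)           # hypothetical file
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512)) ** 2
freqs = librosa.fft_frequencies(sr=sr, n_fft=2048)

def band_energy(S, freqs, lo, hi):
    """Sum spectral power between lo and hi Hz for every frame."""
    mask = (freqs >= lo) & (freqs < hi)
    return S[mask].sum(axis=0)

low = band_energy(S, freqs, 40, 120)      # kick / bass fundamentals
mid = band_energy(S, freqs, 3000, 5000)   # vocal presence region

# A negative correlation suggests "ducking": one band making room for the other.
corr = np.corrcoef(low, mid)[0, 1]
print(f"low-band vs 3-5 kHz correlation: {corr:.2f}")
```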
But you're right that this architecture can't produce perfect sound. Artifacts sneak in. That's how AI-generated songs get detected.
We call these spectral peaks or diffusion artifacts. They're leftover deconvolution errors from the denoising steps. Human ears pick them up as a subtle metallic, glassy, or 'sparkly' quality, especially on hi-hats, vocal sibilance, and reverb tails.
A real reverb tail decays chaotically. An AI tail decays in perfect little steps because the model was denoising in fixed iterations. Your brain notices the grid pattern even if you can't consciously hear it.
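A toy way to quantify that intuition, offered as a heuristic for illustration rather than a real AI-music detector: measure how irregularly a decay falls off in dB. The file name is a placeholder.

```python
import numpy as np
import librosa

# Energy envelope of an isolated decay (e.g. a reverb tail), in dB.
y, sr = librosa.load("tail.wav", sr=22050)           # hypothetical clip
rms = librosa.feature.rms(y=y, frame_length=2048, hop_length=512)[0]
env_db = librosa.amplitude_to_db(rms, ref=np.max)

# Roughness of the decay: variance of frame-to-frame slope changes.
# A natural tail decays irregularly; a suspiciously smooth, stair-stepped
# envelope scores low here.
slope = np.diff(env_db)
roughness = np.var(np.diff(slope))
print(f"decay roughness: {roughness:.4f}  (lower = more grid-like)")
```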
That's creepy. It's like hearing the pixel grid in audio.
But who knows what happens in a year or two. The artifacts are getting smaller every model release. One day even golden ears won't be able to tell.
So... Music AI isn't an artist or an orchestra. It's a compression wizard.
But is it... creative?
Wait a sec.
I work with cloud budgets. GPUs aren't free. Neither is bandwidth for audio.
And all this is just to let me mess around with rap prompts that trick it into saying the N-word for free? How is Suno still profitable?
It wouldn't be, if they weren't obsessed with optimizing. They attack the three cost monsters: **Compute**, **Storage**, and **Bandwidth**.
Okay, break down Compute. How do they make it not outrageously expensive?
Latent space. They compose and diffuse in a tiny, compressed version of the song—a thumbnail, not a mural.
If they tried to generate the final, detailed waveform directly, it would eat their GPU bill alive. Diffusing in the latent space means the hard part happens fast, small, and cheap. Scaling up with a vocoder is quick and efficient.
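Back-of-envelope math makes the gap obvious. The latent frame rate and channel count below are illustrative assumptions, not Suno's published figures.

```python
# How many values the model has to generate per song, latent vs. raw waveform.
duration_s = 180                      # a 3-minute song

# Raw waveform: 44.1 kHz stereo samples.
waveform_values = duration_s * 44_100 * 2

# Latent: assume ~50 latent frames per second, 64 channels each.
latent_values = duration_s * 50 * 64

print(f"waveform values: {waveform_values:,}")    # ~15.9 million
print(f"latent values:   {latent_values:,}")      # ~0.58 million
print(f"ratio:           {waveform_values / latent_values:.0f}x smaller")
```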
They shrink the work. But storage? All those generated songs must add up fast.
They would, if Suno actually kept them. Most songs get played once and abandoned, so the full audio just gets thrown away.
It just deletes it?!
It deletes the *audio*, but it keeps the *recipe*. The original prompt and the latent tokens—the tiny thumbnail sketch from before. That's only a few kilobytes of text.
If you come back a week later and want to play it, the system just re-generates it on the fly from that recipe. A little bit of GPU cost is cheaper than paying to store a million abandoned songs forever.
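As a sketch, the storage policy amounts to treating audio as a disposable cache and the recipe as the source of truth. Everything here (`SongRecipe`, `SongStore`, `regenerate_fn`) is hypothetical, just to show the trade: a little GPU time on replay instead of permanent storage.

```python
from dataclasses import dataclass

@dataclass
class SongRecipe:
    """What gets kept forever: a few kilobytes of text, not audio."""
    prompt: str
    lyrics: str
    latent_tokens: bytes     # the tiny compressed "thumbnail" of the song

class SongStore:
    """Toy storage policy: audio is a cache, the recipe is the truth."""
    def __init__(self, regenerate_fn):
        self.recipes = {}        # song_id -> SongRecipe (kept forever)
        self.audio_cache = {}    # song_id -> bytes (evicted when unpopular)
        self.regenerate = regenerate_fn

    def save_recipe(self, song_id, recipe: SongRecipe):
        self.recipes[song_id] = recipe

    def evict_unpopular(self, song_id):
        # Drop the heavy audio; the recipe stays.
        self.audio_cache.pop(song_id, None)

    def play(self, song_id):
        if song_id not in self.audio_cache:
            # Pay a little GPU time instead of permanent storage.
            recipe = self.recipes[song_id]
            self.audio_cache[song_id] = self.regenerate(recipe)
        return self.audio_cache[song_id]
```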
That's brutal. What about bandwidth for playback?
The final trick: streaming. They use the most compressed audio codec they can get away with.
Not MP3, not WAV. They use **Opus**, the same thing Discord uses for voice. They can stream a song to you using a fraction of the data.
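Rough numbers for one three-minute song: the PCM figure is exact for CD-quality audio, while the Opus bitrate is a typical music-streaming setting, not a number Suno has published.

```python
# Bandwidth for streaming one 3-minute song.
duration_s = 180

wav_kbps  = 44_100 * 16 * 2 / 1000     # 1411.2 kbps, uncompressed CD-quality
opus_kbps = 96                         # a common Opus bitrate for music

wav_mb  = wav_kbps  * duration_s / 8 / 1000
opus_mb = opus_kbps * duration_s / 8 / 1000

print(f"WAV:  {wav_mb:.1f} MB")        # ~31.8 MB
print(f"Opus: {opus_mb:.1f} MB")       # ~2.2 MB
```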
So: They compute on a tiny version of the song. They delete the full version if it's not popular. And they vacuum-seal the data before they send it to you.
I can't believe engineers can sometimes be more creative than artists!