Audio Forensics

Eight instruments for reading audio provenance: waveform, spectrogram, spectral profile, noise floor, MFCC trajectory, dynamics, spectral flux, and feature profile. A teaching tool for understanding how statistical audio analysis works.

Every audio signal carries a statistical fingerprint. Not in what you hear, but in the numbers underneath.

The shape of the spectrum, the smoothness of the MFCC trajectory, the variance of spectral flux between frames. These patterns differ between a recording of physical acoustics, a digital synthesiser render, and a diffusion model's vocoder output.

Audio forensics is the science of reading those patterns.

These instruments are naive implementations using standard signal processing features: Meyda's MFCC, manual STFT, per-frame spectral statistics.

They demonstrate the logic of audio forensics and give you something concrete to interact with. A production classifier would train on thousands of labelled audio pairs, use learned embeddings rather than hand-computed features, and validate against adversarial examples.

What you see here is the approach, not the capability.

All analysis runs in your browser. No audio data is uploaded or transmitted. Decoding and feature extraction happen locally via the Web Audio API and Meyda.



How a computer hears audio

To a computer, audio is a sequence of numbers: a Float32Array of pressure values sampled at a fixed rate (44,100 times per second for CD quality).

There is no concept of "this sounds like a guitar" or "this sounds AI-generated." There are only those numbers.

Everything on this page is a mathematical operation on that sequence. The question is not "does this sound real?" but "do these numbers behave like a diffusion model's output, or like a synthesiser's?"
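To make "only those numbers" concrete, here is a minimal sketch that builds one second of audio by hand and measures it. The helper names (makeSine, peakAmplitude, rmsAmplitude, toDbfs) are made up for this illustration and are not part of the tool or of Meyda.

```typescript
// Illustrative helpers only; names are hypothetical, not taken from this page's code.
const SAMPLE_RATE = 44100; // CD-quality rate quoted above

// Base-10 log written out so the sketch does not depend on newer runtime libraries.
const log10 = (v: number): number => Math.log(v) / Math.LN10;

// A sine tone stored exactly the way decoded audio is stored: a Float32Array.
function makeSine(freqHz: number, amplitude: number, seconds: number): Float32Array {
  const out = new Float32Array(Math.round(SAMPLE_RATE * seconds));
  for (let n = 0; n < out.length; n++) {
    out[n] = amplitude * Math.sin((2 * Math.PI * freqHz * n) / SAMPLE_RATE);
  }
  return out;
}

function peakAmplitude(x: Float32Array): number {
  let p = 0;
  for (let n = 0; n < x.length; n++) p = Math.max(p, Math.abs(x[n]));
  return p;
}

function rmsAmplitude(x: Float32Array): number {
  let sumSq = 0;
  for (let n = 0; n < x.length; n++) sumSq += x[n] * x[n];
  return Math.sqrt(sumSq / x.length);
}

// Full scale is 1.0, so dBFS = 20 * log10(amplitude).
const toDbfs = (a: number): number => 20 * log10(a);

const tone = makeSine(440, 0.5, 1); // 440 Hz at half of full scale
console.log(tone.length);                             // 44100 numbers for one second
console.log(toDbfs(peakAmplitude(tone)).toFixed(2));  // about -6.02
console.log(toDbfs(rmsAmplitude(tone)).toFixed(2));   // about -9.03
```

Nothing in the array says "sine" or "440 Hz". Those facts have to be recovered by further arithmetic, which is what every instrument on this page does.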


How AI audio is generated

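Many current AI music systems learn patterns on mel spectrograms, i.e. compressed visual representations of sound, rather than on raw waveforms. AudioLDM is a documented example, and this page assumes Suno and Udio work similarly, since their architectures are not published. MusicGen is an exception: it generates EnCodec tokens autoregressively.

Generating a track means generating one of those images (yes, the AI creates an image of the sound you are going to hear later), then converting it back to audio using a vocoder.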

That conversion step is where the forensically detectable artifacts appear.

| Step | What happens | Analogy | What gets lost | Forensic consequence |
| --- | --- | --- | --- | --- |
| Record / synthesise | Real acoustic event or DAW plugin produces a waveform with precise phase relationships between harmonics | The original painting: every brushstroke is there | | Phase encodes the physics of the source |
| Mel spectrogram | Waveform is compressed into a 2D image: time × mel frequency bins. Amplitude is preserved; phase is mostly discarded | A photograph of the painting: composition intact, brushstroke texture gone | Phase information | The representation is lossy; reconstruction must invent what was discarded |
| Diffusion model | Model learns to generate plausible mel spectrograms from text prompts, trained to produce smooth gradual changes | An artist who has only ever seen photographs, now asked to paint from scratch | Sharp note-boundary transitions | MFCC trajectories are unnaturally smooth; spectral flux variance is lower than real synthesis |
| Vocoder decode | Learned algorithm reconstructs a waveform from the mel spectrogram, inventing phase | Recreating the canvas from the photograph: colours match, brushstrokes are guessed | Cannot recover physical phase relationships | Characteristic smearing above 4–8 kHz, visible in spectrogram; shimmer in broadband noise |
| Loudness normalisation | Platform applies gain to hit −14 LUFS target before delivery | Printing every photo at the same exposure regardless of the original lighting | Dynamic range | Compressed crest factor; waveform looks uniformly dense |
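Why does discarding phase matter? A small sketch, under simplifying assumptions, shows two frames with identical magnitude spectra but different waveforms. The frame length, the eight equal-amplitude harmonics, and the Schroeder-style phase formula are all illustrative choices, not anything this page's tool computes.

```typescript
// Illustrative sketch: same magnitudes, different phases, different waveform.
const N = 256;       // frame length in samples (assumed for the demo)
const HARMONICS = 8; // harmonics 1..8, each with amplitude 1

// Sum of cosines at exact DFT bin frequencies, one chosen phase per harmonic.
function buildFrame(phases: number[]): number[] {
  const x: number[] = [];
  for (let n = 0; n < N; n++) {
    let s = 0;
    for (let k = 1; k <= HARMONICS; k++) {
      s += Math.cos((2 * Math.PI * k * n) / N + phases[k - 1]);
    }
    x.push(s);
  }
  return x;
}

// Naive O(N^2) DFT magnitudes for bins 0..N/2; fine for a 256-sample demo.
function dftMagnitudes(x: number[]): number[] {
  const mags: number[] = [];
  for (let k = 0; k <= N / 2; k++) {
    let re = 0;
    let im = 0;
    for (let n = 0; n < N; n++) {
      const a = (2 * Math.PI * k * n) / N;
      re += x[n] * Math.cos(a);
      im -= x[n] * Math.sin(a);
    }
    mags.push(Math.sqrt(re * re + im * im));
  }
  return mags;
}

const peakOf = (x: number[]): number => Math.max(...x.map((v) => Math.abs(v)));

const alignedPhases: number[] = [];
const spreadPhases: number[] = [];
for (let k = 1; k <= HARMONICS; k++) {
  alignedPhases.push(0); // every harmonic starts at its crest
  spreadPhases.push((-Math.PI * k * (k - 1)) / HARMONICS); // Schroeder-style spreading
}

const aligned = buildFrame(alignedPhases);
const spread = buildFrame(spreadPhases);

console.log(peakOf(aligned)); // 8: all harmonics peak together at n = 0
console.log(peakOf(spread));  // lower: the same energy is spread across the frame
```

A magnitude-only representation cannot tell these two frames apart, so whatever reconstructs the waveform has to choose a phase arrangement on its own.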

What each instrument reads

| Instrument | What the computer measures | DAW render | AI (diffusion vocoder) |
| --- | --- | --- | --- |
| Waveform | Peak-envelope amplitude over time; crest factor (peak ÷ RMS) | High crest factor if unmastered; natural amplitude variation | Lower crest factor: loudness-normalised before delivery |
| Spectrogram | Power at each frequency over time (STFT) | Clean harmonic series from synthesised instruments; discrete note attacks | Horizontal smearing above 4–8 kHz from vocoder phase reconstruction; blurred harmonic lines |
| Spectral Profile | Average power spectrum across all frames; centroid, rolloff, flatness | Determined by synthesis engine and mixing decisions | Characteristic rolloff shape from mel filterbank + vocoder bandwidth limit |
| Noise Floor | 10th-percentile power at each frequency bin | Near-zero digital floor (no sensor noise in DAW renders) | Near-zero digital floor, same as DAW; this tab distinguishes both from mic recordings |
| MFCC | 13 Mel-Frequency Cepstral Coefficients over time | Stochastic variation at note boundaries; sharp transitions from synthesiser envelopes | Unnaturally smooth trajectories: diffusion model learned gradual spectral evolution |
| Dynamics | Per-frame RMS energy; dynamic range; loudness histogram | Wide dynamic range if unmastered; natural energy contour | Compressed dynamic range (6–10 dB typical); uniform loudness histogram |
| Spectral Flux | Frame-to-frame change in magnitude spectrum; onset detection | Sharp onsets at note attacks; high flux variance | Smoother flux; fewer high-magnitude onset peaks |
| Feature Profile | All of the above as summary statistics | | |
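Several of the per-frame statistics in the table reduce to a few lines of arithmetic on one magnitude spectrum. The sketch below uses textbook definitions; the function names are hypothetical, and Meyda's own implementations may differ in weighting or normalisation.

```typescript
// Textbook-style definitions for illustration; not Meyda's exact formulas.

// mags[i] is the magnitude of bin i, centred at i * binHz.
function spectralCentroid(mags: number[], binHz: number): number {
  let weighted = 0;
  let total = 0;
  mags.forEach((m, i) => {
    const p = m * m; // power
    weighted += i * binHz * p;
    total += p;
  });
  return total > 0 ? weighted / total : 0;
}

// Geometric mean over arithmetic mean of the power spectrum.
function spectralFlatness(mags: number[]): number {
  const eps = 1e-12; // avoids log(0) on empty bins
  let logSum = 0;
  let sum = 0;
  for (const m of mags) {
    const p = m * m + eps;
    logSum += Math.log(p);
    sum += p;
  }
  return Math.exp(logSum / mags.length) / (sum / mags.length);
}

// One common flux definition: Euclidean distance between consecutive spectra.
function spectralFlux(prev: number[], curr: number[]): number {
  let sumSq = 0;
  for (let i = 0; i < curr.length; i++) {
    const d = curr[i] - prev[i];
    sumSq += d * d;
  }
  return Math.sqrt(sumSq);
}

const tonalFrame = [0, 0, 0, 1, 0, 0, 0, 0]; // all energy in bin 3
const noisyFrame = [1, 1, 1, 1, 1, 1, 1, 1]; // energy spread evenly

console.log(spectralCentroid(tonalFrame, 100));    // 300
console.log(spectralFlatness(noisyFrame));         // close to 1 (noise-like)
console.log(spectralFlatness(tonalFrame));         // close to 0 (tonal)
console.log(spectralFlux(tonalFrame, tonalFrame)); // 0 (nothing changed)
console.log(spectralFlux(tonalFrame, noisyFrame)); // about 2.65 (sudden change)
```

The forensic reading comes from how these values vary across hundreds of frames, for example the variance of flux over a whole track, not from any single frame.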

The samples

The preset samples are two versions of the same composition:

  • DAW sample — a direct render from a digital audio workstation. No physical microphone. Clean digital floor. Unmastered. The dynamic range and spectral balance reflect the raw mix, not a commercially delivered track.
  • AI cover — a Suno generation referencing the same composition, produced using prompts from the Suno Builder tool on this site. Delivered at streaming loudness. Generated via diffusion model with vocoder decode.

What this comparison can and cannot show:

| Tab | Diagnostic for AI vs DAW? | Confound |
| --- | --- | --- |
| Waveform / Dynamics | Partially | Detects mastering decisions as much as generation method; an unmastered DAW render will have wider dynamic range than normalised AI output regardless of origin |
| Spectrogram / MFCC | Yes | Vocoder artifacts and MFCC smoothness are independent of mastering |
| Spectral Flux | Yes | Onset sharpness reflects generation method, not post-processing |
| Noise Floor | No (for these samples) | Both are digital sources, so neither has sensor noise; the tab is meaningful when comparing either to a microphone recording |
| Spectral Profile | Partially | Rolloff shape is partly a vocoder fingerprint, partly a mix/EQ decision |
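The noise-floor row is easier to judge once you see how a percentile floor behaves. This is a minimal sketch of a 10th-percentile estimate per frequency bin; the data layout (frames[t][k]), the rounding rule, and the toy values are assumptions for illustration.

```typescript
// Illustrative sketch of a per-bin percentile noise floor.
// frames[t][k] = power in frequency bin k at frame t (assumed layout).

function percentile(values: number[], fraction: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  // Floor-based index, clamped to the valid range (one of several conventions).
  const idx = Math.min(sorted.length - 1, Math.max(0, Math.floor(fraction * (sorted.length - 1))));
  return sorted[idx];
}

function noiseFloor(frames: number[][], fraction = 0.1): number[] {
  const bins = frames[0].length;
  const floorPerBin: number[] = [];
  for (let k = 0; k < bins; k++) {
    floorPerBin.push(percentile(frames.map((f) => f[k]), fraction));
  }
  return floorPerBin;
}

// Toy data: a note sounds on every third frame.
// Bin 0 imitates a microphone source (constant hiss between notes).
// Bin 1 imitates a digital source (true silence between notes).
const demoFrames: number[][] = [];
for (let t = 0; t < 10; t++) {
  const noteOn = t % 3 === 0;
  demoFrames.push([noteOn ? 1.0 : 0.01, noteOn ? 1.0 : 0]);
}

console.log(noiseFloor(demoFrames)); // [0.01, 0]
```

Taking a low percentile instead of the minimum keeps a single unusually quiet frame from defining the floor, while loud notes are ignored entirely. Two digital sources both land near zero, which is why this tab cannot separate the preset samples.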

Tools of the trade

Understanding the analysis is easier once you know what tools were used at each stage of production, and what traces they leave.

Mixing

| Tool | What it does | Forensic signature |
| --- | --- | --- |
| EQ (equaliser) | Boosts or cuts specific frequency bands; shapes tonal balance | Spectral profile shows shelves and notches at the cut/boost frequencies; centroid shifts |
| Compression | Reduces dynamic range by attenuating peaks above a threshold | Lower crest factor; waveform looks more uniform; dynamics histogram narrows |
| Reverb / delay | Adds spatial impression by convolving with a room impulse response or feedback | Spectrogram shows frequency-dependent decay tails after transients |
| Saturation / distortion | Adds harmonic content by soft-clipping the signal | Spectral profile shows energy added at harmonics; flatness increases |
| Panning / stereo width | Places audio in the stereo field | Affects inter-channel amplitude ratio; not visible in mono-channel analysis on this page |

Mastering

| Tool | What it does | Forensic signature |
| --- | --- | --- |
| Multiband compression | Independent compression per frequency band; often used to tighten low-end | Spectral profile becomes more uniform across bands; per-band dynamic range narrows |
| Limiter | Hard ceiling on peak amplitude (used to maximise loudness) | Very low crest factor; waveform flat-topped near 0 dBFS; peak and mean RMS converge |
| Loudness normalisation (LUFS) | Adjusts overall gain to a target integrated loudness (e.g. −14 LUFS for streaming) | Predictable RMS level; dynamic range preserved but absolute level is standardised |
| Stereo enhancement | Widens stereo image via mid/side processing or Haas effect | Not visible in single-channel analysis |
| Dithering | Adds shaped noise to mask quantisation artifacts when reducing bit depth | Adds a characteristic noise floor shape in the high frequencies |
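The limiter and loudness-normalisation rows leave different traces, and a few lines of arithmetic show why. The sketch below is a simplification: it uses plain RMS instead of a true LUFS measurement, and a crude hard clip stands in for a real limiter, which would use look-ahead and smoothed gain. All names are hypothetical.

```typescript
// Illustrative sketch: pure gain vs. a crude limiter, measured by crest factor.
const log10 = (v: number): number => Math.log(v) / Math.LN10;

function crestFactorDb(x: number[]): number {
  const pk = Math.max(...x.map((v) => Math.abs(v)));
  const rmsVal = Math.sqrt(x.reduce((s, v) => s + v * v, 0) / x.length);
  return 20 * log10(pk / rmsVal);
}

// Pure gain: every sample is multiplied by the same factor.
const applyGainDb = (x: number[], gainDb: number): number[] =>
  x.map((v) => v * Math.pow(10, gainDb / 20));

// Hard clip at a ceiling. Assumption: stands in for a limiter in this demo only.
const hardLimit = (x: number[], ceiling: number): number[] =>
  x.map((v) => Math.max(-ceiling, Math.min(ceiling, v)));

// Test signal: a quiet sine with one loud transient in the middle.
const burst: number[] = [];
for (let n = 0; n < 1000; n++) {
  burst.push(0.1 * Math.sin((2 * Math.PI * n) / 50));
}
burst[500] = 0.9;

const louder = applyGainDb(burst, 6);  // normalisation-style gain
const limited = hardLimit(burst, 0.2); // limiter-style ceiling

console.log(crestFactorDb(burst).toFixed(1));   // original crest factor
console.log(crestFactorDb(louder).toFixed(1));  // same value: gain scales peak and RMS equally
console.log(crestFactorDb(limited).toFixed(1)); // much lower: the transient was flattened
```

Gain alone cannot change the crest factor, so a low crest factor on a delivered track points to limiting or compression somewhere in the chain, not to the level adjustment itself.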

What AI generation replaces

| Stage | Human workflow | AI (Suno / diffusion model) |
| --- | --- | --- |
| Composition | DAW arrangement, MIDI, synthesis | Prompted from text; no MIDI or per-instrument tracks |
| Mixing | Manual EQ, compression, panning per track | Implicit: baked into the model's training distribution |
| Mastering | Separate mastering chain, LUFS targeting | Loudness normalisation applied at delivery; no separate mastering stage |
| Codec / delivery | Export to MP3/WAV/FLAC | Vocoder decode → MP3 at platform bitrate |

The key forensic consequence: AI-generated audio has no explicit mix or master stage.

Its spectral characteristics were learned jointly, not applied as discrete steps. This is why the vocoder artifacts appear simultaneously with the loudness normalisation signature.


What should I look for?

| Principle | Applied to audio forensics |
| --- | --- |
| Converge across instruments | No single feature is conclusive. Spectrogram smearing + smooth MFCC + compressed dynamics all pointing the same direction is a composite case. One anomaly can't tell you if something is AI. |
| Control for mastering | Dynamics and waveform tabs reflect mastering as much as generation method. Compare only tracks at similar loudness targets, or treat dynamics as a secondary signal. |
| The baseline is the source | Generative model fingerprints differ by architecture: Suno, Udio, and MusicGen each have characteristic spectral shapes. A classifier trained on one architecture won't generalise to another without retraining. |
| Digital is not AI | DAW renders, digital synthesisers, and sample-based music all have near-zero noise floors and clean digital characteristics. These instruments cannot distinguish digital synthesis from AI generation based on the noise floor alone. |
| Adversarial training is closing the gap | Vocoders trained with perceptual losses are reducing spectral smearing artifacts. The most durable tells are MFCC smoothness and spectral flux variance: faking them requires learning genuinely different generation dynamics, not just perceptual filtering. Given good enough training, though, even current forensic tools will eventually fail. |

What a real audio forensics pipeline looks like

| Component | What this page uses | What production uses |
| --- | --- | --- |
| Feature extraction | Meyda per-frame features (MFCC, spectral flux, RMS): hand-selected, 13 coefficients | Learned embeddings from models like CLAP, EnCodec, or AudioMAE, trained on millions of audio examples and capturing patterns no human specified |
| Classifier | None. We cannot load a giant classifier model in your browser, and we don't want to use our cloud resources. | Trained binary or multi-class classifier with calibrated probability estimates; validated on held-out test sets from multiple model families |
| Temporal modelling | Per-frame statistics averaged or plotted; no sequence model | LSTM, Transformer, or Conformer over the full feature sequence, detecting patterns that span multiple seconds, not just single-frame statistics |
| Vocoder fingerprinting | Not implemented | Each vocoder architecture (HiFi-GAN, EnCodec, DAC) has a characteristic phase-reconstruction pattern detectable by a model trained on its outputs |
| Adversarial robustness | Not tested | Red-teamed against: pitch shifting, time stretching, re-encoding at different bitrates, adding noise, re-recording through a speaker |

Glossary

| Term | What it means |
| --- | --- |
| Waveform | The raw audio signal: a sequence of pressure values over time. What you see when you open an audio file in a DAW. Amplitude on the Y axis, time on the X axis. |
| Sample rate | How many times per second the pressure is measured. 44,100 Hz (44.1 kHz) is CD quality: 44,100 numbers per second per channel. Why? A sample rate can only capture frequencies below half its value (the Nyquist frequency, 22.05 kHz here), which comfortably covers the roughly 20 kHz upper limit of human hearing. |
| dBFS | Decibels relative to Full Scale. 0 dBFS = the loudest possible digital value. Everything else is negative. −6 dBFS is roughly half the peak amplitude. |
| RMS | Root Mean Square: the square root of the average of squared amplitude values. A proxy for perceived loudness, more meaningful than peak. |
| Crest factor | Peak amplitude ÷ RMS, expressed in dB. High crest factor = lots of headroom between the average level and the peaks (natural). Low = compressed or limited. |
| LUFS | Loudness Units relative to Full Scale. A perceptually weighted loudness measure. Streaming platforms (Spotify, YouTube) normalise to −14 LUFS. |
| Fourier transform / STFT | A mathematical operation that decomposes a signal into its frequency components: "how much of each frequency is present." STFT (Short-Time Fourier Transform) does this in short overlapping windows to track how the frequency content changes over time. |
| Spectrogram | A 2D visualisation of the STFT result: time on the X axis, frequency on the Y axis, energy (loudness at that frequency) shown as colour or brightness. A sustained note shows as a horizontal band; a transient (drum hit) shows as a vertical streak. |
| Mel scale | A perceptual frequency scale. Human hearing distinguishes pitches more finely at low frequencies than at high frequencies. The mel scale compresses the high-frequency range to reflect this: 1 kHz vs 2 kHz sounds like a bigger jump than 8 kHz vs 9 kHz, even though the Hz difference is the same. |
| Mel spectrogram | A spectrogram whose frequency axis is on the mel scale, with frequency bins grouped through triangular mel filterbanks. Used by AI systems because it matches how humans perceive pitch. The compression discards phase and fine frequency detail. |
| Vocoder | Originally a device for encoding and reconstructing speech. In modern AI audio: a neural network that converts a mel spectrogram back into a raw waveform. HiFi-GAN is a widely used example; the decoder of a neural codec such as EnCodec plays a similar role, reconstructing a waveform from a compressed representation. The vocoder must invent the phase that the mel spectrogram discarded. |
| Phase | The timing offset of a wave cycle relative to a reference. Two sine waves at the same frequency but different phases sound identical in isolation but interact (add or cancel) when mixed. Real acoustic instruments produce harmonic phase relationships determined by physics; vocoders invent them. |
| MFCC | Mel-Frequency Cepstral Coefficients. A compact description of the shape of the spectrum at a given moment; roughly, the "texture" of the sound. Computed by: mel spectrogram → log → discrete cosine transform (DCT) → first 13 coefficients. Widely used in speech recognition and audio classification. Coefficient 0 tracks loudness; coefficients 1–12 track timbral shape from broad to fine. |
| Spectral centroid | The frequency-weighted average of the power spectrum: the "centre of mass." Low centroid = bass-heavy / dark sound. High centroid = treble-heavy / bright sound. |
| Spectral flatness | Ratio of the geometric mean to the arithmetic mean of the power spectrum. Near 0 = tonal (energy concentrated at discrete harmonics). Near 1 = noise-like (energy spread evenly across frequencies). |
| Spectral flux | The frame-to-frame change in the magnitude spectrum. High flux = rapid spectral change (note attacks, transients). Low flux = sustained, static content. Used here to measure onset sharpness. |
| Noise floor | The residual energy present in a signal even during silence. For microphone recordings, this is dominated by preamp and sensor noise. For digital sources (DAW renders, AI audio), it reflects codec quantisation noise, which is much lower and shaped differently. |
| Diffusion model | A generative model trained to iteratively remove noise from a noisy input. At inference, it starts from random noise and progressively refines it into a structured output (image, mel spectrogram, etc.). AudioLDM uses a variant of this approach, and Suno and Udio are assumed here to do the same (their architectures are not published); MusicGen instead generates EnCodec tokens autoregressively. |
| Modulation spectrum | The spectrum of the amplitude envelope of a frequency band: "how fast is the loudness of this frequency band oscillating?" A peak in the modulation spectrum at 50–200 Hz in the high-frequency band indicates periodic amplitude ripple: the vocoder shimmer artifact. |
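Two glossary entries, the mel scale and MFCC, are easier to trust with numbers attached. The sketch below assumes the HTK-style mel formula (other definitions exist) and a plain unnormalised DCT-II; the 26-band input is an arbitrary illustrative choice, and none of this is claimed to match Meyda's internals.

```typescript
// Illustrative sketch: mel conversion and the final DCT step of an MFCC.
const log10 = (v: number): number => Math.log(v) / Math.LN10;

// HTK-style mel formula (an assumption; several mel definitions are in use).
function hzToMel(hz: number): number {
  return 2595 * log10(1 + hz / 700);
}

// Unnormalised DCT-II of log mel-band energies, keeping the first `count` terms.
function dct2(logEnergies: number[], count: number): number[] {
  const M = logEnergies.length;
  const out: number[] = [];
  for (let c = 0; c < count; c++) {
    let sum = 0;
    for (let m = 0; m < M; m++) {
      sum += logEnergies[m] * Math.cos((Math.PI * c * (m + 0.5)) / M);
    }
    out.push(sum);
  }
  return out;
}

// Same 1 kHz step in Hz, very different step in mels.
const lowJump = hzToMel(2000) - hzToMel(1000);  // about 521 mels
const highJump = hzToMel(9000) - hzToMel(8000); // about 123 mels
console.log(lowJump.toFixed(0), highJump.toFixed(0));

// A flat log spectrum: only coefficient 0 (overall level) is non-zero.
const flatBands: number[] = [];
// A spectrum that tilts downward with frequency: coefficient 1 picks up the tilt.
const tiltedBands: number[] = [];
for (let m = 0; m < 26; m++) {
  flatBands.push(2.0);
  tiltedBands.push(2.0 - 0.05 * m);
}

const flatCoeffs = dct2(flatBands, 13);
const tiltedCoeffs = dct2(tiltedBands, 13);
console.log(flatCoeffs[0], Math.abs(flatCoeffs[1]) < 1e-9); // level only, no shape
console.log(tiltedCoeffs[1] > 0);                           // broad tilt shows up in c1
```

This is why coefficient 0 follows loudness while the higher coefficients describe spectral shape: each one measures how much the log spectrum resembles a cosine of increasing ripple rate.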

Further reading

Signal processing foundations

  • Rabiner, L., Juang, B.-H. — Fundamentals of Speech Recognition (1993). The standard reference for MFCC computation and cepstral analysis. Chapter 3 covers the mel filterbank derivation.
  • McFee, B. et al. — librosa: Audio and Music Signal Analysis in Python (2015). The de facto Python audio analysis library. Readable source code for STFT, MFCC, spectral features.

Meyda

  • Rawlinson, H., Segal, N., Fiala, J. — Meyda: An Audio Feature Extraction Library for the Web Audio API (2015). The library used on this page. meyda.js.org

AI audio generation and detection

  • Défossez, A. et al. — High Fidelity Neural Audio Compression (2022). EnCodec — the codec underlying many current AI audio systems. Explains how audio is quantised into a latent space and reconstructed.
  • Kong, J. et al. — HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis (2020). One of the most widely used vocoders. The phase reconstruction approach that produces the spectrogram artifacts visible in the spectrogram tab.
  • Kang, X. et al. — Deepfake Audio Detection (2023). Survey of audio deepfake detection methods. Covers the limitations of spectral feature approaches against adversarially trained models.