Stylometrics

Four instruments for reading your own writing — Zipf fingerprint, predictability waveform, vocabulary drift, and cognitive framing analysis. A teaching tool for understanding how forensic stylometry works, not a production classifier.

Every writer leaves a fingerprint. Not in what they say — in how they say it: the words they reach for first, the rhythm their sentences fall into, the concepts they orbit without naming.

These patterns are below the threshold of conscious editing. You cannot choose your Zipf distribution any more than you can choose your gait.

Or you can just look for the em-dashes (I left them in on purpose, or maybe this is Claude writing this post?)

Stylometrics is the science of reading those patterns. It was used to resolve the authorship of the Federalist Papers. It surfaces in forensic linguistics, plagiarism detection, and authorship attribution. The same instruments that identify a forger can help a writer see themselves.

These instruments are naive classifiers — fixed word lists, simple frequency counts, hand-chosen dimensions. They demonstrate the logic of forensic pattern-mining and give you something to interact with. They are not what a real forensics pipeline looks like. A production system would use transformer embeddings, topic models trained on thousands of documents, and classifiers validated against labelled ground truth. What you see here is the concept, not the capability.

These tools are entirely client-side. Nothing you paste here is transmitted anywhere.


What each instrument measures

| Instrument | What it does | Key signal |
| --- | --- | --- |
| ① Zipf Mirror | Plots word frequency distribution on a log-log scale, the shape all natural language takes | Words above the ideal line appear more often than their rank predicts; those are your fingerprint |
| ② Predictability Scanner | Computes per-sentence information content from your own vocabulary | High-surprise sentences use words rare relative to your text; that is where you were present, not on autopilot |
| ③ Vocabulary Drift | Compares two writing samples from different periods | New words entering your top vocabulary and old ones dropping out measure how much Period A resembles Period B |
| ④ Decision Signature | Scans for word-level markers across four cognitive framing dimensions: certainty, time horizon, individual vs collective, loss sensitivity | Which pole each dimension's vocabulary leans toward; four dimensions chosen to be illustrative, not authoritative |
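
As a rough illustration of the Predictability Scanner idea, the sketch below scores each sentence by the mean information content of its words, using only the text's own word counts. The tokenizer, the sentence splitter, and the choice of a per-word mean are assumptions made for this example; the page does not say how the actual tool aggregates a sentence.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens; a deliberately crude tokenizer (assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def split_sentences(text):
    """Naive split after ., ! or ? followed by whitespace (assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if tokenize(p)]

def sentence_surprise(text):
    """Return (sentence, mean bits per word) pairs from in-text frequencies."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    results = []
    for sentence in split_sentences(text):
        words = tokenize(sentence)
        bits = [-math.log2(counts[w] / total) for w in words]
        results.append((sentence, sum(bits) / len(bits)))
    return results

sample = "The cat sat. The cat sat. A zebra yawned."
scores = sentence_surprise(sample)
for sentence, bits in scores:
    print(f"{bits:.2f} bits/word  {sentence}")
```

The repeated sentences score lower than the one built from words that occur once, which is the whole mechanism: rarity relative to the rest of the same text.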

Instrument glossary

| Instrument | What it computes | Signal strength | Where it breaks down |
| --- | --- | --- | --- |
| Zipf Mirror | Log-log rank-frequency curve; Zipf exponent α; words above the ideal line | Topic-independent: works across any subject matter | LLMs mimic the global distribution, so α converges with human text |
| Predictability Scanner | Per-sentence information content (−log₂ p(word)) using the text's own vocabulary | Partially topic-independent: uses in-text vocabulary, not a fixed dictionary | Short texts have unstable probability estimates |
| Vocabulary Drift | Zipf-distinctive content words compared across two periods; overlap coefficient | Same-domain texts only: measures preoccupation shift, not topic shift | Comparing different-topic texts inflates drift score regardless of authorship |
| Decision Signature | Word-list frequency scan across four cognitive dimensions: certainty markers, temporal framing, pronoun type, gain/loss vocabulary | Partially topic-independent: lexical, not syntactic | Small fixed word lists; easily defeated by domain vocabulary or deliberate word choice; dimensions are illustrative, not empirically derived |
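
One common way to estimate a Zipf exponent is an ordinary least-squares line through the points (log rank, log frequency), taking α as the negated slope and R² as the fit quality. The sketch below does that; it is an assumption that the Zipf Mirror fits its line this way, and `zipf_fit` is a name invented for this example.

```python
import math
import re
from collections import Counter

def zipf_fit(text):
    """Least-squares fit of log(frequency) against log(rank).

    Returns (alpha, r_squared), where alpha is the negated slope.
    """
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return -slope, r_squared

# Synthetic text whose counts are exactly 60 / rank, so the fit is exact.
demo = " ".join(
    " ".join([word] * count)
    for word, count in zip(["aa", "bb", "cc", "dd", "ee", "ff"],
                           [60, 30, 20, 15, 12, 10])
)
alpha, r_squared = zipf_fit(demo)
print(round(alpha, 3), round(r_squared, 3))  # 1.0 1.0
```

Real text will not sit on the line this cleanly, and on very short samples the long tail of words that occur once drags the fit around.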

Key terms:

  • Zipf exponent α: steepness of the rank-frequency slope. Natural language clusters around 1.0. Higher α means vocabulary is concentrated on fewer words. Lower α means broader spread.
  • Type-Token Ratio (TTR): unique words ÷ total words. A rough proxy for lexical diversity. Length-dependent: longer texts tend to produce lower TTR, so only compare samples of similar length.
  • Information content (IC): how surprising a word is, computed as −log₂(count/total). A word appearing once in 1,000 carries ~10 bits. A word appearing 100 times in 1,000 carries ~3.3 bits.
  • Characteristic words: content words that appear at least 15% above their expected Zipf frequency. These are the words you reach for more than the baseline predicts, and they are your actual fingerprint.
  • Overlap coefficient: |A ∩ B| ÷ min(|A|, |B|). 0 = no shared words, 1 = one set is a subset of the other. Used in Vocabulary Drift to measure how much the fingerprint changed.
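
Three of the definitions above are short enough to check by hand. The following is a minimal sketch with function names chosen for this example; returning 0.0 for an empty set in `overlap_coefficient` is an assumption, since the formula is undefined there.

```python
import math

def type_token_ratio(tokens):
    """Unique words divided by total words."""
    return len(set(tokens)) / len(tokens)

def information_content(count, total):
    """Bits of surprise for a word seen `count` times among `total` words."""
    return -math.log2(count / total)

def overlap_coefficient(a, b):
    """|A ∩ B| / min(|A|, |B|); 0.0 when either set is empty (assumption)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

print(round(information_content(1, 1000), 2))    # 9.97
print(round(information_content(100, 1000), 2))  # 3.32
print(overlap_coefficient({"rent", "yield"}, {"rent", "yield", "tariff"}))  # 1.0
print(type_token_ratio("the cat saw the dog".split()))  # 0.8
```

The third line shows why the overlap coefficient reads 1 whenever the smaller set sits entirely inside the larger one, even though the sets are not identical.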

How much text do you need?

The relationship between sample size and analytical quality is non-linear. The first 500 words buy most of the signal. Past ~5,000 words, adding more text barely moves the numbers.

| Instrument | Too little (noise) | Minimum viable | Reliable | Saturates |
| --- | --- | --- | --- | --- |
| Zipf Mirror | < 100 words | ~200 words | ~1,000 words | ~5,000 words |
| Predictability Scanner | < 5 sentences | 10–20 sentences (~200 words) | 50+ sentences (~750 words) | ~200 sentences |
| Vocabulary Drift (per period) | < 300 words | ~600 words | ~1,500 words | ~5,000 words |
| Decision Signature | < 80 words | ~150 words | ~500 words | ~2,000 words (word-list saturation) |
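
If you want to apply the table mechanically before pasting a sample, a lookup like the one below would do. The word counts are transcribed from the table; the "marginal" label for the gap between the first two columns and the treatment of the approximate figures as hard cut-offs are assumptions made for this sketch. The Predictability Scanner is left out because its thresholds are in sentences.

```python
# (noise_below, minimum_viable, reliable, saturates) in words, from the table.
THRESHOLDS = {
    "Zipf Mirror": (100, 200, 1000, 5000),
    "Vocabulary Drift (per period)": (300, 600, 1500, 5000),
    "Decision Signature": (80, 150, 500, 2000),
}

def grade_sample(instrument, n_words):
    """Name the table column a sample of n_words falls into."""
    noise_below, viable, reliable, saturates = THRESHOLDS[instrument]
    if n_words < noise_below:
        return "too little"
    if n_words < viable:
        return "marginal"  # the table leaves this gap unnamed (assumption)
    if n_words < reliable:
        return "minimum viable"
    if n_words < saturates:
        return "reliable"
    return "saturated"

for n in (50, 250, 1200, 6000):
    print(n, grade_sample("Zipf Mirror", n))
```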

What "too little" looks like in practice:

  • Zipf Mirror with < 100 words: only ~50–80 unique words, Zipf fit R² drops below 70%, distinctive word list is empty or arbitrary.
  • Predictability Scanner with 3–4 sentences: the vocabulary probability estimates are too sparse — every word looks equally surprising.
  • Vocabulary Drift with short texts: fewer than 8 characteristic words are found, so the overlap is either 0% or 100% by chance. The tool falls back to raw top-N in this case, which is the less meaningful comparison.
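
The fallback described in the last bullet can be sketched as follows. Several things here are assumptions rather than the tool's actual behavior: the "expected Zipf frequency" is taken as the top word's count divided by rank (an ideal line with α = 1), the stopword list is a tiny placeholder, and the top-N size of 20 is arbitrary. Only the 15% margin and the cut-off of 8 characteristic words come from the text.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "that", "this", "for", "on", "with", "as", "was", "are", "be"}
MIN_CHARACTERISTIC = 8  # below this, fall back to raw top-N
EXCESS = 1.15           # at least 15% above the expected frequency

def ranked_counts(text):
    """(word, count) pairs, most frequent first."""
    return Counter(re.findall(r"[a-z']+", text.lower())).most_common()

def characteristic_words(text, alpha=1.0):
    """Content words whose count beats the ideal Zipf line by the margin."""
    ranked = ranked_counts(text)
    if not ranked:
        return set()
    top = ranked[0][1]
    found = set()
    for rank, (word, count) in enumerate(ranked, start=1):
        expected = top / rank ** alpha
        if word not in STOPWORDS and count >= EXCESS * expected:
            found.add(word)
    return found

def top_content_words(text, n=20):
    """The n most frequent non-stopwords."""
    words = [w for w, _ in ranked_counts(text) if w not in STOPWORDS]
    return set(words[:n])

def fingerprint_overlap(text_a, text_b, n=20):
    """Overlap coefficient of the two fingerprints, plus the mode used."""
    a, b = characteristic_words(text_a), characteristic_words(text_b)
    mode = "characteristic"
    if min(len(a), len(b)) < MIN_CHARACTERISTIC:
        a, b = top_content_words(text_a, n), top_content_words(text_b, n)
        mode = "top-n fallback"
    if not a or not b:
        return None, mode
    return len(a & b) / min(len(a), len(b)), mode

overlap, mode = fingerprint_overlap("markets rise and markets fall",
                                    "markets rise again")
print(mode, round(overlap, 2))  # top-n fallback 0.67
```

With two texts this short the function lands in the fallback branch immediately, which is the situation the bullet warns about.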

What "too much" looks like:

There is no accuracy penalty for longer texts — but past saturation the numbers stop moving. A 50,000-word corpus gives you the same Zipf α as a 10,000-word one from the same author. The marginal cost is milliseconds of compute, not quality.

The topic confound in Vocabulary Drift:

Vocabulary Drift measures what you write about as much as how you write. If Period A is economics and Period B is personal essays, the drift score reflects the domain change, not the author's evolution.

For a meaningful reading: use samples from the same subject area, different time periods. The sample loader uses two economics posts from different years as a baseline. For your own writing, sort by tag before comparing.


What should I look for?

There is no right number to cross. There is no threshold above which a drift score becomes meaningful or below which a Zipf α becomes suspicious.

Anyone who gives you one is an overconfident idiot who is overfitting to a specific dataset in a specific context.

| Principle | What it means in practice |
| --- | --- |
| Never use a single number | A Zipf exponent of 0.62 is not evidence of anything on its own. Forensic analysts look for convergence: multiple independent instruments pointing the same direction simultaneously. One unusual score is noise. Three unusual scores across different methods, on the same text, in the same direction, begin to be signal. |
| The baseline is you, not a universal standard | Your Zipf α of 0.66 is meaningful only relative to your other texts, not to some abstract "human range." A forensic examiner builds a reference corpus of known writing before drawing conclusions from disputed material. These tools tell you something useful only after you've run several of your own texts and seen your normal range. Deviations from your own baseline matter. Deviations from a population average do not. |
| Express probability, not proof | A real forensic report does not say "this text was written by Person X." It says "the features are consistent with Person X's known writing and not inconsistent with the hypothesis." That careful language reflects that authorship attribution has a known false-positive rate and must survive cross-examination. A high drift score means the vocabulary changed. It does not mean you became a different person. |
| Control for confounds explicitly | Before comparing two texts, document: What is the genre of each? Is one formal, one informal? Are the topics different? Is one heavily edited? Each uncontrolled confound is a possible alternative explanation for any finding. Vocabulary Drift between an economics post and a travel essay tells you the topic changed and nothing more. |
| Applied to these tools | Run them on your own writing first. Build a personal baseline across several pieces. Then use the instruments comparatively: not to produce a verdict, but to raise questions worth investigating by hand. The instrument surfaces the anomaly. You supply the interpretation. |
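
One way to make "the baseline is you" concrete is to express a new text's score as a distance from your own earlier scores, in units of your own spread. The sketch below does that for two instruments. Every number in it is made up for illustration, `sentence_cv` is a hypothetical metric name, and no cut-off is applied, since the point above is that a fixed threshold would be unjustified.

```python
from statistics import mean, stdev

def baseline_deviation(baseline_values, new_value):
    """Standard deviations between new_value and the author's own baseline."""
    if len(baseline_values) < 2:
        raise ValueError("need at least two baseline texts")
    spread = stdev(baseline_values)
    if spread == 0:
        raise ValueError("baseline has no spread; add more varied texts")
    return (new_value - mean(baseline_values)) / spread

# Made-up scores for illustration only; not measurements of any real text.
my_baseline = {
    "zipf_alpha": [0.64, 0.66, 0.68],
    "sentence_cv": [0.80, 0.90, 1.00],
}
disputed = {"zipf_alpha": 0.62, "sentence_cv": 0.60}

report = {
    name: baseline_deviation(my_baseline[name], disputed[name])
    for name in disputed
}
for name, z in report.items():
    print(f"{name}: {z:+.1f} standard deviations from my own baseline")
```

Two instruments moving the same way is a reason to reread the text by hand, not a verdict. With only three baseline texts the spread estimate is itself shaky, so more reference pieces help more than finer arithmetic.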

What to watch for — without reading

These patterns are detectable in seconds on any text, before you've processed the meaning:

| Tell | What to look for | Human vs AI |
| --- | --- | --- |
| Sentence rhythm | Skim paragraph lengths and sentence lengths | Human: high variance, with one-sentence paragraphs next to six-sentence ones. AI: metronomically even, 3–4 sentences per paragraph, 15–25 words per sentence. Coefficient of variation (std ÷ mean): 0.3–0.5 for AI, 0.7–1.2 for human |
| Transition word density | Scan for: however, furthermore, notably, importantly, additionally, it is worth noting, in conclusion, to summarize | Human: sparse and inconsistent. AI: structural scaffolding, often once per paragraph at the opening. Three or more in a 500-word passage is a tell. |
| Em-dash usage | Count em-dashes | Human: used for rhythm and interruption. AI (especially post-2023): used as a clause separator everywhere, at 3–8× the human baseline |
| Hedging without stakes | Read the uncertainty language | Human hedging is specific: "I'm not sure about the 2019 numbers." AI hedging is generic: "it is important to note that," "this is a complex topic." The difference is whether the uncertainty points at something concrete. |
| Distinctive word character | Look at what words appear repeatedly beyond common English | Human: "that jestergooner got mogged by a foid goycel, spiking his cortisol." AI: abstract nouns and evaluative adjectives such as nuanced, robust, comprehensive, framework, landscape, ecosystem, paradigm. "You are absolutely right!" |
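
The first three tells are simple counts, so they are easy to sketch. The transition phrases below are the ones listed in the table; the sentence splitter is naive, and using the population standard deviation for the coefficient of variation is an assumption, since the table only says std ÷ mean.

```python
import re
from statistics import mean, pstdev

TRANSITIONS = ["however", "furthermore", "notably", "importantly",
               "additionally", "it is worth noting", "in conclusion",
               "to summarize"]

def sentence_lengths(text):
    """Word counts per sentence, using a naive split on ., ! and ?."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    lengths = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    return [n for n in lengths if n > 0]

def rhythm_cv(text):
    """Coefficient of variation of sentence length; None if under 2 sentences."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return None
    return pstdev(lengths) / mean(lengths)

def transition_count(text):
    """Total occurrences of the listed transition phrases."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
        for phrase in TRANSITIONS
    )

def em_dash_count(text):
    """Number of U+2014 characters in the text."""
    return text.count("\u2014")

sample = ("Stop. However, the long sentence keeps going for quite a while "
          "here. Notably short.")
print(sentence_lengths(sample))     # [1, 11, 2]
print(round(rhythm_cv(sample), 2))  # 0.96
print(transition_count(sample))     # 2
```

To compare against the "three or more in a 500-word passage" figure, divide `transition_count` by the word count and scale to 500.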

What a real forensics pipeline looks like

The instruments on this page are pedagogically useful but technically naive. Here is the gap between what you see here and what a production system would use:

| Component | What this page uses | What production uses |
| --- | --- | --- |
| Topic discovery | 4 hand-chosen dimensions (certainty, time, frame, loss) | BERTopic, LDA, or NMF trained on thousands of documents; these discover dimensions empirically from the data. Discovered topics may number in the dozens and may not correspond to any human-named category. |
| Vocabulary coverage | ~20 words per pole | LIWC (Linguistic Inquiry and Word Count): 90+ categories, thousands of words, validated against human judges. Or dense LLM embeddings that capture semantic proximity: "mitigation strategies" and "being careful" both express caution; word-list matching catches one, embeddings catch both. |
| Classifier training | hi-word count ÷ total, a bag-of-words ratio with the same logic as a 1990s spam filter | Trained end-to-end on labelled author pairs, cross-validated to estimate false-positive rates, with held-out test sets. Weights reflect what actually predicts authorship in training data, not what a human assumed would matter. |
| Adversarial robustness | Not tested against adversarial examples | Real forensic tools are red-teamed. Every instrument on this page can be defeated in under a minute: delete hedging words to move the certainty score, pad with high-information sentences to flatten the predictability waveform. |
| Right expectation | Illustrative numbers, not authoritative verdicts | These instruments explain the approach: writing carries measurable statistical regularities that differ between authors and between human and AI writers. The idea is real and the field is active. The specific numbers this page produces are illustrative, not authoritative. |
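
To see how thin the "hi-word count ÷ total" logic is, here is a minimal sketch of one dimension. The word lists are hypothetical stand-ins, not the lists the page uses, and "total" is assumed to mean all markers matched at either pole rather than all words in the text.

```python
import re

# Hypothetical word lists for the certainty dimension (not the page's lists).
CERTAINTY = {
    "high": {"definitely", "certainly", "always", "never", "clearly"},
    "low": {"maybe", "perhaps", "possibly", "might", "seems"},
}

def pole_ratio(text, poles):
    """High-pole matches divided by all matches; None if no markers found."""
    words = re.findall(r"[a-z']+", text.lower())
    high = sum(w in poles["high"] for w in words)
    low = sum(w in poles["low"] for w in words)
    total = high + low
    return None if total == 0 else high / total

hedged = "It will definitely rain, maybe. Perhaps not."
edited = "It will definitely rain."
print(pole_ratio(hedged, CERTAINTY))  # 0.3333333333333333
print(pole_ratio(edited, CERTAINTY))  # 1.0
```

Deleting two hedging words moves the score from one third to 1.0, which is the under-a-minute defeat the table describes.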

Further reading

The field splits into classical authorship attribution (pre-LLM) and the newer AI-detection literature. They rest on different assumptions and have different failure modes.

Foundations

  • Mosteller & Wallace (1964), Inference and Disputed Authorship: The Federalist. The book-length study that proved statistical stylometry works at scale. They identified Madison as the author of the disputed Federalist Papers using function word frequencies alone.
  • George Zipf — Human Behavior and the Principle of Least Effort (1949). The original formulation. More interesting than the equation: the argument that word frequency distributions are an equilibrium between speaker effort and listener effort.

Authorship attribution

  • Patrick Juola's work on unmasking J.K. Rowling as Robert Galbraith (2013): a readable case study in how stylometry works in practice. The features used: the most common words, word length distributions, character n-grams, and word pairs.
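  • Unmasking, from Koppel, Schler & Bonchek-Dokow, Measuring Differentiability: Unmasking Pseudonymous Authors (2007): the technique of repeatedly training a classifier to tell two texts apart, removing its most discriminating features each round, and watching how fast accuracy degrades. Accuracy collapses quickly when both texts share an author and holds up when they do not.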

AI detection

  • Mitchell et al. — DetectGPT (2023). Uses the observation that LLM-generated text sits near local maxima of the model's log-probability surface. Human text does not. Requires model access; not client-side.
  • Kirchenbauer et al. — A Watermark for Large Language Models (2023). Proposes embedding a statistical watermark in LLM output by biasing token selection. Detectable without the model, undetectable to human readers.
  • The loper-os predictor (linked below) remains one of the most elegant demonstrations of what a simple predictor can reveal about human entropy — built decades before LLMs.

Building real forensic tools

  • Pennebaker, J. W., Boyd, R. L., Jordan, K., & Blackburn, K. — The Development and Psychometric Properties of LIWC2015 (2015). The standard word-count psycholinguistic dictionary: 90+ categories, thousands of words, validated against human judges. What Decision Signature approximates with 20 words per category.
  • Grootendorst, M. — BERTopic: Neural topic modeling with a class-based TF-IDF procedure (2022). Topic discovery from embeddings rather than word counts — discovers dimensions the analyst did not specify in advance.
  • Devlin, J. et al. — BERT: Pre-training of Deep Bidirectional Transformers (2018). The backbone of modern semantic similarity and authorship embedding approaches. The step beyond bag-of-words that captures "being careful" and "mitigation strategies" as the same concept.