Stylometrics
Four instruments for reading your own writing — Zipf fingerprint, predictability waveform, vocabulary drift, and cognitive framing analysis. A teaching tool for understanding how forensic stylometry works, not a production classifier.
Every writer leaves a fingerprint. Not in what they say — in how they say it: the words they reach for first, the rhythm their sentences fall into, the concepts they orbit without naming.
These patterns are below the threshold of conscious editing. You cannot choose your Zipf distribution any more than you can choose your gait.
Or you can just look for the em-dashes. (I left them in intentionally, or maybe Claude is writing this post?)
Stylometrics is the science of reading those patterns. It was used to resolve the authorship of the Federalist Papers. It surfaces in forensic linguistics, plagiarism detection, and authorship attribution. The same instruments that identify a forger can help a writer see themselves.
These instruments are naive classifiers — fixed word lists, simple frequency counts, hand-chosen dimensions. They demonstrate the logic of forensic pattern-mining and give you something to interact with. They are not what a real forensics pipeline looks like. A production system would use transformer embeddings, topic models trained on thousands of documents, and classifiers validated against labelled ground truth. What you see here is the concept, not the capability.
These tools are entirely client-side. Nothing you paste here is transmitted anywhere.
What each instrument measures
| Instrument | What it does | Key signal |
|---|---|---|
| ① Zipf Mirror | Plots word frequency distribution on a log-log scale — the shape all natural language takes | Words above the ideal line appear more often than their rank predicts — those are your fingerprint |
| ② Predictability Scanner | Computes per-sentence information content from your own vocabulary | High-surprise sentences use words rare relative to your text — that is where you were present, not on autopilot |
| ③ Vocabulary Drift | Compares two writing samples from different periods | New words entering your top vocabulary and old ones dropping out measure how much Period A resembles Period B |
| ④ Decision Signature | Scans for word-level markers across four cognitive framing dimensions: certainty, time horizon, individual vs collective, loss sensitivity | Which pole each dimension's vocabulary leans toward — four dimensions chosen to be illustrative, not authoritative |
Instrument glossary
| Instrument | What it computes | Signal strength | Where it breaks down |
|---|---|---|---|
| Zipf Mirror | Log-log rank-frequency curve; Zipf exponent α; words above the ideal line | Topic-independent — works across any subject matter | LLMs mimic the global distribution — α converges with human text |
| Predictability Scanner | Per-sentence information content (−log₂ p(word)) using the text's own vocabulary | Partially topic-independent — uses in-text vocabulary, not a fixed dictionary | Short texts have unstable probability estimates |
| Vocabulary Drift | Zipf-distinctive content words compared across two periods; overlap coefficient | Same-domain texts only — measures preoccupation shift, not topic shift | Comparing different-topic texts inflates drift score regardless of authorship |
| Decision Signature | Word-list frequency scan across four cognitive dimensions: certainty markers, temporal framing, pronoun type, gain/loss vocabulary | Partially topic-independent — lexical, not syntactic | Small fixed word lists; easily defeated by domain vocabulary or deliberate word choice; dimensions are illustrative, not empirically derived |
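
As a rough illustration of the Zipf Mirror row, the exponent α can be estimated with an ordinary least-squares line through the (log rank, log frequency) points. This is a minimal sketch, not the page's actual code: the tokenizer, the function names, and the choice of plain least squares are all assumptions made for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens; a deliberately crude tokenizer (assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def zipf_fit(tokens):
    """Fit a straight line to (log rank, log frequency).

    Returns (alpha, r_squared), where alpha is the negated slope.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # R^2 of a simple linear fit equals the squared correlation.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return -slope, r_squared
```

Calling `zipf_fit(tokenize(my_text))` on a few of your own pieces gives a feel for how much α and R² move between samples of different lengths.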
Key terms:
- Zipf exponent α — steepness of the rank-frequency slope. Natural language clusters around 1.0. Higher α means vocabulary is concentrated on fewer words. Lower α means broader spread.
- Type-Token Ratio (TTR) — unique words ÷ total words. A rough proxy for lexical diversity. Length-dependent: longer texts tend to produce lower TTR because common words keep repeating, so only compare samples of similar length.
- Information content (IC) — how surprising a word is: −log₂(count/total). A word appearing once in 1,000 carries ~10 bits. A word appearing 100 times in 1,000 carries ~3.3 bits.
- Characteristic words — content words that appear at 15%+ above their expected Zipf frequency. These are the words you reach for more than the baseline predicts — your actual fingerprint.
- Overlap coefficient — |A ∩ B| ÷ min(|A|, |B|). 0 = no shared words, 1 = one set is a subset of the other. Used in Vocabulary Drift to measure how much the fingerprint changed.
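The three formulas in the key terms above are short enough to write out directly. The sketch below uses hypothetical function names and assumes the text has already been split into a list of word tokens.

```python
import math

def type_token_ratio(tokens):
    """Unique words divided by total words."""
    return len(set(tokens)) / len(tokens)

def information_content(count, total):
    """Surprise of a word in bits: -log2(count / total)."""
    return -math.log2(count / total)

def overlap_coefficient(a, b):
    """|A ∩ B| / min(|A|, |B|); defined here as 0.0 if either set is empty."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))
```

For example, `information_content(1, 1000)` is about 9.97 bits and `information_content(100, 1000)` is about 3.32 bits, matching the figures in the definition.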
How much text do you need?
The relationship between sample size and analytical quality is non-linear. The first 500 words buy most of the signal. Past ~5,000 words, adding more text changes almost nothing.
| Instrument | Too little — noise | Minimum viable | Reliable | Saturates |
|---|---|---|---|---|
| Zipf Mirror | < 100 words | ~200 words | ~1,000 words | ~5,000 words |
| Predictability Scanner | < 5 sentences | 10–20 sentences (~200 words) | 50+ sentences (~750 words) | ~200 sentences |
| Vocabulary Drift (per period) | < 300 words | ~600 words | ~1,500 words | ~5,000 words |
| Decision Signature | < 80 words | ~150 words | ~500 words | ~2,000 words (word-list saturation) |
What "too little" looks like in practice:
- Zipf Mirror with < 100 words: only ~50–80 unique words, Zipf fit R² drops below 70%, distinctive word list is empty or arbitrary.
- Predictability Scanner with 3–4 sentences: the vocabulary probability estimates are too sparse — every word looks equally surprising.
- Vocabulary Drift with short texts: fewer than 8 characteristic words are found, so the overlap is either 0% or 100% by chance. The tool falls back to raw top-N in this case, which is the less meaningful comparison.
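The Predictability Scanner failure mode is easy to reproduce. The sketch below scores each sentence by the mean information content of its words, using only counts from the text itself. The sentence splitter, the tokenizer, and the use of a per-sentence mean are assumptions made for the example, not a description of the page's code.

```python
import math
import re
from collections import Counter

def sentence_surprise(text):
    """Mean information content (bits) per sentence, from in-text word counts."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    counts = Counter(w for words in tokenized for w in words)
    total = sum(counts.values())
    scores = []
    for words in tokenized:
        if not words:
            continue  # skip fragments with no word tokens
        bits = [-math.log2(counts[w] / total) for w in words]
        scores.append(sum(bits) / len(bits))
    return scores
```

On a text as short as `"Cats sleep. Dogs bark loudly."` every word occurs exactly once, so every sentence gets the same score, log₂ of the word count. That flat line is the "every word looks equally surprising" problem.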
What "too much" looks like:
There is no accuracy penalty for longer texts, but past saturation the numbers stop moving. A 50,000-word corpus gives you essentially the same Zipf α as a 10,000-word one from the same author. The marginal cost is milliseconds of compute, not quality.
The topic confound in Vocabulary Drift:
Vocabulary Drift measures what you write about as much as how you write. If Period A is economics and Period B is personal essays, the drift score reflects the domain change, not the author's evolution.
For a meaningful reading: use samples from the same subject area, different time periods. The sample loader uses two economics posts from different years as a baseline. For your own writing, sort by tag before comparing.
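The topic confound can be shown with a few lines of code. This sketch implements only the raw top-N fallback comparison, with a tiny hypothetical stopword list, and it assumes drift is reported as one minus the overlap coefficient. The page may define its score differently.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real one would be far longer.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
             "that", "for", "on", "as", "with", "by", "are", "was"}

def top_content_words(text, n=8):
    """The n most frequent non-stopword tokens, as a set."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return {w for w, _ in counts.most_common(n)}

def drift(text_a, text_b, n=8):
    """1 - overlap coefficient of the two top-n sets; None if a set is empty."""
    a = top_content_words(text_a, n)
    b = top_content_words(text_b, n)
    if not a or not b:
        return None
    overlap = len(a & b) / min(len(a), len(b))
    return 1.0 - overlap
```

Two short economics snippets share most of their top words and score low. Swapping one for a travel snippet pushes the score to 1.0 even if the same person wrote all three, which is why the samples need to come from the same subject area.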
What should I look for?
There is no right number to cross. There is no threshold above which a drift score becomes meaningful or below which a Zipf α becomes suspicious.
Anyone who gives you one is an overconfident idiot who is overfitting to a specific dataset in a specific context.
| Principle | What it means in practice |
|---|---|
| Never use a single number | A Zipf exponent of 0.62 is not evidence of anything on its own. Forensic analysts look for convergence — multiple independent instruments pointing the same direction simultaneously. One unusual score is noise. Three unusual scores across different methods, on the same text, in the same direction, begin to be signal. |
| The baseline is you, not a universal standard | Your Zipf α of 0.66 is meaningful only relative to your other texts — not to some abstract "human range." A forensic examiner builds a reference corpus of known writing before drawing conclusions from disputed material. These tools tell you something useful only after you've run several of your own texts and seen your normal range. Deviations from your own baseline matter. Deviations from a population average do not. |
| Express probability, not proof | A real forensic report does not say "this text was written by Person X." It says "the features are consistent with Person X's known writing and not inconsistent with the hypothesis." That careful language reflects that authorship attribution has a known false-positive rate and must survive cross-examination. A high drift score means the vocabulary changed. It does not mean you became a different person. |
| Control for confounds explicitly | Before comparing two texts, document: What is the genre of each? Is one formal, one informal? Are the topics different? Is one heavily edited? Each uncontrolled confound is a possible alternative explanation for any finding. Vocabulary Drift between an economics post and a travel essay tells you the topic changed — nothing more. |
| Applied to these tools | Run them on your own writing first. Build a personal baseline across several pieces. Then use the instruments comparatively — not to produce a verdict, but to raise questions worth investigating by hand. The instrument surfaces the anomaly. You supply the interpretation. |
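
One way to make "the baseline is you" concrete is to express a new reading as a distance from your own earlier readings, in units of their spread. The sketch below is an assumption about how you might do that by hand; the function name and the three-sample minimum are illustrative, and the result is a prompt for closer reading, not a verdict.

```python
from statistics import mean, stdev

def deviation_from_baseline(baseline_values, new_value):
    """How many sample standard deviations new_value sits from your own mean."""
    if len(baseline_values) < 3:
        raise ValueError("build a baseline from at least three of your own texts")
    mu = mean(baseline_values)
    sigma = stdev(baseline_values)
    if sigma == 0:
        raise ValueError("baseline has no spread; add more varied samples")
    return (new_value - mu) / sigma
```

With earlier Zipf α readings of 0.64, 0.66 and 0.68, a new reading of 0.70 sits about two standard deviations above your own mean. Whether that matters depends on what the other instruments show for the same text.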
What to watch for — without reading
These patterns are detectable in seconds on any text, before you've processed the meaning:
| Tell | What to look for | Human vs AI |
|---|---|---|
| Sentence rhythm | Skim paragraph lengths and sentence lengths | Human: high variance — one-sentence paragraphs next to six-sentence ones. AI: metronomically even, 3–4 sentences per paragraph, 15–25 words per sentence. Coefficient of variation (std ÷ mean): 0.3–0.5 for AI, 0.7–1.2 for human |
| Transition word density | Scan for: however, furthermore, notably, importantly, additionally, it is worth noting, in conclusion, to summarize | Human: sparse and inconsistent. AI: structural scaffolding, often once per paragraph at the opening. Three or more in a 500-word passage is a tell. |
| Em-dash usage | Count em-dashes | Human: used for rhythm and interruption. AI (especially post-2023): used as a clause separator everywhere, at 3–8× the human baseline |
| Hedging without stakes | Read the uncertainty language | Human hedging is specific: "I'm not sure about the 2019 numbers." AI hedging is generic: "it is important to note that," "this is a complex topic." The difference is whether the uncertainty points at something concrete. |
| Distinctive word character | Look at what words appear repeatedly beyond common English | Human: "that jestergooner got mogged by a foid goycel, spiking his cortisol." AI: abstract nouns and evaluative adjectives — nuanced, robust, comprehensive, framework, landscape, ecosystem, paradigm. "You are absolutely right!" |
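
Three of the tells in the table are simple counts. The sketch below is a rough approximation under stated assumptions: sentences are split on terminal punctuation, the coefficient of variation uses the population standard deviation, and the transition list is copied from the table row above.

```python
import re
from statistics import mean, pstdev

TRANSITIONS = ["however", "furthermore", "notably", "importantly",
               "additionally", "it is worth noting", "in conclusion",
               "to summarize"]

def sentence_length_cv(text):
    """Coefficient of variation (std / mean) of sentence lengths in words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    return pstdev(lengths) / mean(lengths)

def transition_count(text):
    """Total occurrences of the listed transition words and phrases."""
    lowered = text.lower()
    return sum(len(re.findall(r"\b" + re.escape(t) + r"\b", lowered))
               for t in TRANSITIONS)

def em_dashes_per_1000_words(text):
    """Em-dash count normalised to a 1,000-word passage."""
    words = len(text.split())
    return 1000 * text.count("\u2014") / words if words else 0.0
```

Three equal-length sentences give a coefficient of variation of 0.0, while a one-word sentence followed by an eleven-word one gives about 0.83. Treat the outputs as things to compare against your own writing, not as cut-offs.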
What a real forensics pipeline looks like
The instruments on this page are pedagogically useful but technically naive. Here is the gap between what you see here and what a production system would use:
| Component | What this page uses | What production uses |
|---|---|---|
| Topic discovery | 4 hand-chosen dimensions (certainty, time, frame, loss) | BERTopic, LDA, or NMF trained on thousands of documents — discovers dimensions empirically from the data. Discovered topics may number in the dozens and may not correspond to any human-named category. |
| Vocabulary coverage | ~20 words per pole | LIWC (Linguistic Inquiry and Word Count): 90+ categories, thousands of words, validated against human judges. Or dense LLM embeddings that capture semantic proximity — "mitigation strategies" and "being careful" both express caution; word-list matching catches one, embeddings catch both. |
| Classifier training | hi-word count ÷ total — a bag-of-words ratio, same logic as a 1990s spam filter | Trained end-to-end on labelled author pairs, cross-validated to estimate false-positive rates, with held-out test sets. Weights reflect what actually predicts authorship in training data, not what a human assumed would matter. |
| Adversarial robustness | Not tested against adversarial examples | Real forensic tools are red-teamed. Every instrument on this page can be defeated in under a minute: delete hedging words to move the certainty score, pad with high-information sentences to flatten the predictability waveform. |
| Right expectation | Illustrative numbers, not authoritative verdicts | These instruments explain the approach — writing carries measurable statistical regularities that differ between authors and between human and AI writers. The idea is real and the field is active. The specific numbers this page produces are illustrative, not authoritative. |
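
The bag-of-words ratio in the classifier row, and how easily it is defeated, fits in a few lines. The word lists below are hypothetical stand-ins for one dimension (certainty), and the sketch assumes "total" means all matched markers on both poles, which the page does not spell out.

```python
import re

# Hypothetical marker lists for the certainty dimension; illustrative only.
HIGH_CERTAINTY = {"always", "never", "certainly", "definitely", "clearly", "must"}
LOW_CERTAINTY = {"maybe", "perhaps", "possibly", "might", "could", "seems"}

def certainty_score(text):
    """Share of matched markers on the high-certainty pole.

    Returns None when the text contains no markers from either list.
    """
    words = re.findall(r"[a-z']+", text.lower())
    hi = sum(w in HIGH_CERTAINTY for w in words)
    lo = sum(w in LOW_CERTAINTY for w in words)
    total = hi + lo
    return hi / total if total else None
```

`"Perhaps rates might rise, but it clearly could stall."` scores 0.25. Delete the three hedging words and the same sentence scores 1.0, which is the under-a-minute defeat described in the adversarial robustness row.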
Further reading
The field splits into classical authorship attribution (pre-LLM) and the newer AI-detection literature. They rest on different assumptions and have different failure modes.
Foundations
- Mosteller & Wallace (1964) — Inference and Disputed Authorship: The Federalist. The book that showed statistical stylometry works at scale. They attributed the disputed Federalist Papers to Madison using function word frequencies alone.
- George Zipf — Human Behavior and the Principle of Least Effort (1949). The original formulation. More interesting than the equation: the argument that word frequency distributions are an equilibrium between speaker effort and listener effort.
Authorship attribution
- Patrick Juola's work on unmasking J.K. Rowling as Robert Galbraith (2013) — a readable case study in how stylometry works in practice. The method used: function word frequencies, sentence length distributions, and character n-grams.
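- Unmasking, from Koppel, Schler & Bonchek-Dokow (2007), Measuring Differentiability: Unmasking Pseudonymous Authors — the technique of repeatedly training a classifier to separate two texts, removing the most discriminating features each round, and watching how quickly accuracy degrades. Accuracy collapses fast when both texts share an author, because only a handful of superficial features separate them.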
AI detection
- Mitchell et al. — DetectGPT (2023). Uses the observation that LLM-generated text tends to sit in negative-curvature regions of the model's log-probability function (roughly, near local maxima), while human-written text tends not to. Requires model access; not client-side.
- Kirchenbauer et al. — A Watermark for Large Language Models (2023). Proposes embedding a statistical watermark in LLM output by biasing token selection. Detectable without access to the model's parameters by anyone who knows the watermarking scheme, and invisible to human readers.
- The loper-os predictor (linked below) remains one of the most elegant demonstrations of what a simple predictor can reveal about human entropy — built decades before LLMs.
Building real forensic tools
- Pennebaker, J. W., Boyd, R. L., Jordan, K., & Blackburn, K. — The Development and Psychometric Properties of LIWC2015 (2015). The standard word-count psycholinguistic dictionary: 90+ categories, thousands of words, validated against human judges. What Decision Signature approximates with 20 words per category.
- Grootendorst, M. — BERTopic: Neural topic modeling with a class-based TF-IDF procedure (2022). Topic discovery from embeddings rather than word counts — discovers dimensions the analyst did not specify in advance.
- Devlin, J. et al. — BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018). The backbone of modern semantic similarity and authorship embedding approaches. The step beyond bag-of-words that captures "being careful" and "mitigation strategies" as the same concept.