Stylometrics

Four instruments for reading your own writing — Zipf fingerprint, predictability waveform, vocabulary drift, and cognitive framing analysis. A teaching tool for understanding how forensic stylometry works, not a production classifier.

Every writer leaves a fingerprint. Not in what they say — in how they say it: the words they reach for first, the rhythm their sentences fall into, the concepts they orbit without naming.

These patterns are below the threshold of conscious editing. You cannot choose your Zipf distribution any more than you can choose your gait.

Or you can just look for the em-dashes (I intentionally left them there, or maybe this is Claude writing this post?)

Stylometrics is the science of reading those patterns. It was used to attribute the disputed essays among the Federalist Papers. It surfaces in forensic linguistics, plagiarism detection, and authorship attribution. The same instruments that identify a forger can help a writer see themselves.

These instruments are naive classifiers — fixed word lists, simple frequency counts, hand-chosen dimensions. They demonstrate the logic of forensic pattern-mining and give you something to interact with. They are not what a real forensics pipeline looks like. A production system would use transformer embeddings, topic models trained on thousands of documents, and classifiers validated against labelled ground truth. What you see here is the concept, not the capability.

These tools are entirely client-side. Nothing you paste here is transmitted anywhere.

What each instrument measures

Instrument	What it does	Key signal
① Zipf Mirror	Plots word frequency distribution on a log-log scale — the shape all natural language takes	Words above the ideal line appear more often than their rank predicts — those are your fingerprint
② Predictability Scanner	Computes per-sentence information content from your own vocabulary	High-surprise sentences use words rare relative to your text — that is where you were present, not on autopilot
③ Vocabulary Drift	Compares two writing samples from different periods	Which words enter your top vocabulary and which drop out; the overlap between the two sets shows how closely Period A resembles Period B
④ Decision Signature	Scans for word-level markers across four cognitive framing dimensions: certainty, time horizon, individual vs collective, loss sensitivity	Which pole each dimension's vocabulary leans toward — four dimensions chosen to be illustrative, not authoritative

Instrument glossary

Instrument	What it computes	Signal strength	Where it breaks down
Zipf Mirror	Log-log rank-frequency curve; Zipf exponent α; words above the ideal line	Topic-independent — works across any subject matter	LLMs mimic the global distribution — α converges with human text
Predictability Scanner	Per-sentence information content (−log₂ p(word)) using the text's own vocabulary	Partially topic-independent — uses in-text vocabulary, not a fixed dictionary	Short texts have unstable probability estimates
Vocabulary Drift	Zipf-distinctive content words compared across two periods; overlap coefficient	Same-domain texts only — measures preoccupation shift, not topic shift	Comparing different-topic texts inflates drift score regardless of authorship
Decision Signature	Word-list frequency scan across four cognitive dimensions: certainty markers, temporal framing, pronoun type, gain/loss vocabulary	Partially topic-independent — lexical, not syntactic	Small fixed word lists; easily defeated by domain vocabulary or deliberate word choice; dimensions are illustrative, not empirically derived
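
The Zipf Mirror row above can be sketched in a few lines of Python. This is a minimal illustration and not the page's actual code: the regex tokenizer, the ordinary least-squares fit in log-log space, and the names `tokenize` and `zipf_fit` are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (an assumption; the page's tokenizer may differ)."""
    return re.findall(r"[a-z']+", text.lower())


def zipf_fit(tokens):
    """Fit log(freq) = c - alpha * log(rank) by least squares.

    Returns (alpha, r_squared). Raises ValueError with fewer than two
    distinct words, because a line needs at least two points.
    """
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct words")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    # When every word has the same count the fit is flat; report R^2 as 0.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return -slope, r_squared
```

Call it as `zipf_fit(tokenize(your_text))`. On very short texts most words appear exactly once, so the curve is nearly flat and the fit is poor, which is the breakdown the sample-size section describes.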

Key terms:

Zipf exponent α — steepness of the rank-frequency slope. Natural language clusters around 1.0. Higher α means vocabulary is concentrated on fewer words. Lower α means broader spread.
Type-Token Ratio (TTR) — unique words ÷ total words. A rough proxy for lexical diversity. Length-dependent: TTR tends to fall as a text gets longer, because common words keep repeating, so only compare samples of similar length.
Information content (IC) — how surprising a word is: −log₂(count/total). A word appearing once in 1,000 carries ~10 bits. A word appearing 100 times in 1,000 carries ~3.3 bits.
Characteristic words — content words that appear at 15%+ above their expected Zipf frequency. These are the words you reach for more than the baseline predicts — your actual fingerprint.
Overlap coefficient — |A ∩ B| ÷ min(|A|, |B|). 0 = no shared words, 1 = one set is a subset of the other. Used in Vocabulary Drift to measure how much the fingerprint changed.
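
Three of the definitions above translate directly into code. The sketch below follows the formulas as written; the function names are illustrative, and treating two empty sets as an overlap of 0 is an assumption.

```python
import math


def type_token_ratio(tokens):
    """Unique words divided by total words."""
    return len(set(tokens)) / len(tokens)


def information_content(count, total):
    """Bits of surprise for a word seen `count` times in `total` tokens."""
    return -math.log2(count / total)


def overlap_coefficient(a, b):
    """|A ∩ B| / min(|A|, |B|); returns 0.0 if either set is empty (assumption)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


print(round(information_content(1, 1000), 2))    # 9.97
print(round(information_content(100, 1000), 2))  # 3.32
print(overlap_coefficient({"risk", "debt"}, {"risk", "debt", "yield"}))  # 1.0
```

The last line shows why the coefficient reaches 1 whenever one set sits entirely inside the other, even though the sets are not identical.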

How much text do you need?

The relationship between sample size and analytical quality is non-linear. The first 500 words buy most of the signal. Past ~5,000 words, adding more text changes very little.

Instrument	Too little — noise	Minimum viable	Reliable	Saturates
Zipf Mirror	< 100 words	~200 words	~1,000 words	~5,000 words
Predictability Scanner	< 5 sentences	10–20 sentences (~200 words)	50+ sentences (~750 words)	~200 sentences
Vocabulary Drift (per period)	< 300 words	~600 words	~1,500 words	~5,000 words
Decision Signature	< 80 words	~150 words	~500 words	~2,000 words (word-list saturation)

What "too little" looks like in practice:

Zipf Mirror with < 100 words: only ~50–80 unique words, Zipf fit R² drops below 70%, distinctive word list is empty or arbitrary.
Predictability Scanner with 3–4 sentences: the vocabulary probability estimates are too sparse — every word looks equally surprising.
Vocabulary Drift with short texts: fewer than 8 characteristic words are found, so the overlap is either 0% or 100% by chance. The tool falls back to raw top-N in this case, which is the less meaningful comparison.
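
The Predictability Scanner failure is easy to reproduce. Below is a minimal sketch of per-sentence scoring, assuming the score is the mean information content of a sentence's words, with probabilities estimated from the text itself. The sentence splitter, the averaging, and the name `sentence_surprise` are assumptions, not the page's implementation.

```python
import math
import re
from collections import Counter


def sentence_surprise(text):
    """Mean information content (bits per word) for each sentence."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    counts = Counter(w for words in tokenized for w in words)
    total = sum(counts.values())
    scores = []
    for words in tokenized:
        if not words:
            continue  # skip fragments with no word characters
        bits = [-math.log2(counts[w] / total) for w in words]
        scores.append(sum(bits) / len(bits))
    return scores


# Three short sentences, every word used once: all scores are identical.
print(sentence_surprise("Red fox jumps. Blue bird sings. Green frog hops."))
```

With nine words that each appear once, every word has probability 1/9, so every sentence scores log₂ 9 bits per word and the waveform is a flat line.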

What "too much" looks like:

There is no accuracy penalty for longer texts, but past saturation the numbers stop moving. A 50,000-word corpus gives you roughly the same Zipf α as a 10,000-word one from the same author. The extra text costs milliseconds of compute and buys no additional quality.

The topic confound in Vocabulary Drift:

Vocabulary Drift measures what you write about as much as how you write. If Period A is economics and Period B is personal essays, the drift score reflects the domain change, not the author's evolution.

For a meaningful reading: use samples from the same subject area, different time periods. The sample loader uses two economics posts from different years as a baseline. For your own writing, sort by tag before comparing.
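
To see the topic confound in numbers, here is a minimal sketch of the raw top-N comparison (the fallback mode described earlier), not the Zipf-distinctive version. The tiny stopword list, the default `n=20`, and the function names are assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real list is much longer).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "that", "on", "for"}


def top_content_words(text, n=20):
    """The n most frequent words after dropping stopwords."""
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOPWORDS]
    return {w for w, _ in Counter(words).most_common(n)}


def vocabulary_drift(period_a, period_b, n=20):
    """Overlap coefficient plus the words that entered and dropped out."""
    a = top_content_words(period_a, n)
    b = top_content_words(period_b, n)
    smaller = min(len(a), len(b))
    overlap = len(a & b) / smaller if smaller else 0.0
    return {"overlap": overlap,
            "entered": sorted(b - a),
            "dropped": sorted(a - b)}
```

Feed it an economics post and a travel essay by the same author and the `entered` list fills with place names. That is a change of topic, and the overlap score cannot tell it apart from a change of style.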

What should I look for?

There is no right number to cross. There is no threshold above which a drift score becomes meaningful or below which a Zipf α becomes suspicious.

Anyone who gives you one is an overconfident idiot who is overfitting to a specific dataset in a specific context.

Principle	What it means in practice
Never use a single number	A Zipf exponent of 0.62 is not evidence of anything on its own. Forensic analysts look for convergence — multiple independent instruments pointing the same direction simultaneously. One unusual score is noise. Three unusual scores across different methods, on the same text, in the same direction, begins to be signal.
The baseline is you, not a universal standard	Your Zipf α of 0.66 is meaningful only relative to your other texts — not to some abstract "human range." A forensic examiner builds a reference corpus of known writing before drawing conclusions from disputed material. These tools tell you something useful only after you've run several of your own texts and seen your normal range. Deviations from your own baseline matter. Deviations from a population average do not.
Express probability, not proof	A real forensic report does not say "this text was written by Person X." It says the features are "consistent with Person X's known writing" and stops short of asserting identity. That careful language reflects that authorship attribution has a non-zero false-positive rate and must survive cross-examination. A high drift score means the vocabulary changed. It does not mean you became a different person.
Control for confounds explicitly	Before comparing two texts, document: What is the genre of each? Is one formal, one informal? Are the topics different? Is one heavily edited? Each uncontrolled confound is a possible alternative explanation for any finding. Vocabulary Drift between an economics post and a travel essay tells you the topic changed — nothing more.
Applied to these tools	Run them on your own writing first. Build a personal baseline across several pieces. Then use the instruments comparatively — not to produce a verdict, but to raise questions worth investigating by hand. The instrument surfaces the anomaly. You supply the interpretation.
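
One way to make "the baseline is you" concrete is to express a new reading as a distance from your own earlier readings. The sketch below uses a plain z-score, which is an assumption and only one of several reasonable choices; the function name and the sample values are hypothetical.

```python
import statistics


def deviation_from_baseline(baseline_values, new_value):
    """How many sample standard deviations `new_value` sits from the
    mean of your own earlier readings. Needs at least two baseline values.
    Returns None when the baseline has no spread at all.
    """
    mean = statistics.mean(baseline_values)
    spread = statistics.stdev(baseline_values)
    if spread == 0:
        return None
    return (new_value - mean) / spread


# Hypothetical Zipf exponents from three of your earlier pieces.
my_alphas = [0.60, 0.64, 0.68]
print(deviation_from_baseline(my_alphas, 0.72))
```

The result is a prompt to go and read the text by hand, not a verdict. With only a handful of baseline pieces the spread itself is poorly estimated, so treat the number loosely.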

What to watch for — without reading

These patterns are detectable in seconds on any text, before you've processed the meaning:

Tell	What to look for	Human vs AI
Sentence rhythm	Skim paragraph lengths and sentence lengths	Human: high variance — one-sentence paragraphs next to six-sentence ones. AI: metronomically even, 3–4 sentences per paragraph, 15–25 words per sentence. Coefficient of variation (std ÷ mean): 0.3–0.5 for AI, 0.7–1.2 for human
Transition word density	Scan for: however, furthermore, notably, importantly, additionally, it is worth noting, in conclusion, to summarize	Human: sparse and inconsistent. AI: structural scaffolding, often once per paragraph at the opening. Three or more in a 500-word passage is a tell.
Em-dash usage	Count em-dashes	Human: used for rhythm and interruption. AI (especially post-2023): used as a clause separator everywhere, at 3–8× the human baseline
Hedging without stakes	Read the uncertainty language	Human hedging is specific: "I'm not sure about the 2019 numbers." AI hedging is generic: "it is important to note that," "this is a complex topic." The difference is whether the uncertainty points at something concrete.
Distinctive word character	Look at what words appear repeatedly beyond common English	Human: "that jestergooner got mogged by a foid goycel, spiking his cortisol." AI: abstract nouns and evaluative adjectives — nuanced, robust, comprehensive, framework, landscape, ecosystem, paradigm. "You are absolutely right!"
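
Three of the tells in the table reduce to simple counts. The sketch below is illustrative only: the sentence splitter, the use of population standard deviation, and the per-500 and per-1,000-word normalisations are assumptions. The transition list is copied from the table row above.

```python
import re
import statistics

TRANSITIONS = ["however", "furthermore", "notably", "importantly",
               "additionally", "it is worth noting", "in conclusion",
               "to summarize"]


def sentence_length_cv(text):
    """Coefficient of variation (std / mean) of sentence lengths in words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.pstdev(lengths) / statistics.mean(lengths)


def transitions_per_500_words(text):
    """Transition-phrase hits, scaled to a 500-word passage."""
    lowered = text.lower()
    hits = sum(len(re.findall(r"\b" + re.escape(t) + r"\b", lowered))
               for t in TRANSITIONS)
    n_words = len(text.split())
    return 500 * hits / n_words if n_words else 0.0


def em_dashes_per_1000_words(text):
    """Em-dash count (U+2014), scaled to 1,000 whitespace-separated words."""
    n_words = len(text.split())
    return 1000 * text.count("\u2014") / n_words if n_words else 0.0
```

Run all three on several of your own pieces before reading anything into a single value. The ranges quoted in the table are rough guides, and each of these counts moves with genre and editing.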

What a real forensics pipeline looks like

The instruments on this page are pedagogically useful but technically naive. Here is the gap between what you see here and what a production system would use:

Component	What this page uses	What production uses
Topic discovery	4 hand-chosen dimensions (certainty, time, frame, loss)	BERTopic, LDA, or NMF trained on thousands of documents — discovers dimensions empirically from the data. Discovered topics may number in the dozens and may not correspond to any human-named category.
Vocabulary coverage	~20 words per pole	LIWC (Linguistic Inquiry and Word Count): 90+ categories, thousands of words, validated against human judges. Or dense LLM embeddings that capture semantic proximity — "mitigation strategies" and "being careful" both express caution; word-list matching catches one, embeddings catch both.
Classifier training	hi-word count ÷ total — a bag-of-words ratio, same logic as a 1990s spam filter	Trained end-to-end on labelled author pairs, cross-validated to estimate false-positive rates, with held-out test sets. Weights reflect what actually predicts authorship in training data, not what a human assumed would matter.
Adversarial robustness	Not tested against adversarial examples	Real forensic tools are red-teamed. Every instrument on this page can be defeated in under a minute: delete hedging words to move the certainty score, pad with high-information sentences to flatten the predictability waveform.
Right expectation	Illustrative numbers, not authoritative verdicts	These instruments explain the approach — writing carries measurable statistical regularities that differ between authors and between human and AI writers. The idea is real and the field is active. The specific numbers this page produces are illustrative, not authoritative.
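
The bag-of-words ratio in the classifier row, and how easily it is defeated, can be shown in a few lines. The marker lists below are hypothetical placeholders because the page's real lists are not given in the text. Dividing by total marker hits rather than by total words is also an assumption.

```python
import re

# Hypothetical marker lists for one dimension (certainty).
HIGH_CERTAINTY = {"definitely", "certainly", "always", "never", "clearly"}
LOW_CERTAINTY = {"maybe", "perhaps", "possibly", "might", "seems"}


def pole_ratio(text, high=HIGH_CERTAINTY, low=LOW_CERTAINTY):
    """Share of matched markers that fall on the high pole (0.0 to 1.0).

    Returns None when no marker from either list appears in the text.
    """
    words = re.findall(r"[a-z']+", text.lower())
    hi = sum(w in high for w in words)
    lo = sum(w in low for w in words)
    total = hi + lo
    return hi / total if total else None


original = "This is definitely right, maybe. Clearly it always works."
edited = original.replace(", maybe", "")  # delete one hedging word

print(pole_ratio(original))  # 0.75
print(pole_ratio(edited))    # 1.0
```

Deleting a single word moves the score from 0.75 to 1.0. That is the adversarial weakness the table describes, and it is why a trained and validated classifier replaces this ratio in production.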