Why Personality Tests Mischaracterize LLM Behavior

Quick answer

Scoring an LLM on a human personality or values questionnaire tells you almost nothing about how it behaves on ordinary queries. Across eight open models, the rank agreement between Likert self-reports and behavior measured from generation probabilities on everyday prompts was only Spearman 0.31 for values (PVQ) and 0.26 for the Big Five (BFI). For reference, two versions of the same survey agree at 0.74-0.77. The questionnaire profile is internally tidy but behaviorally hollow, because the items leak obvious lexical cues that let the model answer the way it “should.”

What the study actually measures

The Seoul National University team profiles each model two ways and checks whether the two pictures match. The first is the standard trick the LLM-personality literature uses: hand the model a validated human instrument — the 40- and 21-item Portrait Values Questionnaire (PVQ) and the 44- and 10-item Big Five Inventory (BFI) — and read off its Likert self-ratings. The second is meant to approximate real use: present everyday, value-laden situations and read the model’s generation probabilities over response options that lean toward one trait or another. If questionnaires were valid proxies, the two profiles would line up. They do not.
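The comparison described above comes down to ranking the constructs under each method and correlating the two rankings. Here is a minimal sketch of that step in plain Python. The trait scores are invented for illustration and are not taken from the paper, and the helper names (average_ranks, spearman) are hypothetical.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical profiles for one model (illustrative numbers only).
traits = ["openness", "conscientiousness", "extraversion", "agreeableness", "neuroticism"]
self_report = [4.6, 4.4, 3.1, 4.8, 1.9]       # mean Likert rating per trait
generation = [0.22, 0.18, 0.24, 0.20, 0.16]   # mean probability mass on trait-leaning options

rho = spearman(self_report, generation)
print(f"Spearman rho between the two profiles: {rho:.2f}")
```

With these made-up numbers the two methods agree on the lowest-ranked trait but reorder the rest, which is the kind of partial overlap a low Spearman value describes.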

The four findings, with numbers

Profiles diverge. Construct-ranking agreement between self-report and generation is Spearman 0.31 (PVQ) and 0.26 (BFI). The within-method ceiling (PVQ-40 vs PVQ-21, BFI-44 vs BFI-10) sits at 0.74 and 0.77. Switching the measurement method therefore reorders a model’s profile far more than switching between two versions of the same questionnaire does.
Item consistency evaporates. On the real questionnaires, items that belong to the same construct cluster tightly: eta-squared is 0.526. On generation probabilities for realistic prompts it collapses to 0.040. The neat factor structure psychometrics depends on simply is not there in behavior.
Models recognize survey items, not real situations. Asked to map an item back to its intended construct, models hit F1 of 0.69-0.83 on established PVQ/BFI items but 0.09 on realistic “Value Portrait” scenarios. The survey wording is a giveaway; real life is not.
Personas exaggerate on paper only. Steering a model with a demographic persona shifts its questionnaire answers in human-like directions (direction match 62/80, p < 0.001) but barely moves its actual generation behavior (40/80, p = 0.54, i.e. chance).
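To see why 62/80 is decisive and 40/80 is not, the direction-match counts can be compared against a coin flip. The summary does not say which test produced the reported p-values, so the sketch below assumes a one-sided exact binomial test against a 50% match rate; that assumption is consistent with the reported p = 0.54 for 40/80. The function name binomial_upper_tail is hypothetical.

```python
from math import comb

def binomial_upper_tail(successes, trials, p=0.5):
    """Exact P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Assumption: each of the 80 comparisons is an independent match / no-match trial.
p_questionnaire = binomial_upper_tail(62, 80)  # persona shifts on PVQ-40
p_generation = binomial_upper_tail(40, 80)     # persona shifts on realistic prompts

print(f"62/80 direction matches: p = {p_questionnaire:.2g}")
print(f"40/80 direction matches: p = {p_generation:.2f}")
```

Under this assumption, 40 matches out of 80 is exactly what guessing would produce, while 62 out of 80 lies far in the tail.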

Why the questionnaire lies

The mechanism is the interesting part, and it is not “models have no personality.” It is that questionnaire items contain explicit lexical cues, phrasings such as “It is important to him to be loyal,” that announce which trait is being probed. A model can pattern-match the socially desirable answer without that disposition driving anything. Strip the cue away, as a genuine user query does, and the apparent trait stops predicting the model’s choices. The instrument measures cue-recognition dressed up as personality.

Why it matters

A growing line of work assigns LLMs Big Five scores, builds “personality-controlled” agents from survey prompts, and audits model “values” with PVQ. This paper is direct evidence that those scores can look psychometrically sound by human standards (consistent, factorable, persona-responsive) and still fail to predict behavior in real interactions, which is the only thing that matters. My honest read: if you are using a questionnaire to certify that an aligned model “holds” certain values, you are likely measuring its sensitivity to leading wording, not its conduct. Behavioral probes on realistic prompts are the harder but more trustworthy bar.

Key results

Self-report vs generation rank agreement: Spearman 0.31 (PVQ values), 0.26 (BFI personality).
Within-survey agreement (the same-method ceiling): 0.74 (PVQ-40 vs PVQ-21), 0.77 (BFI-44 vs BFI-10).
Within-construct item consistency: eta-squared 0.526 on questionnaires vs 0.040 on real-query generation (p < 0.01).
Construct recognition F1: 0.69-0.83 on established items vs 0.09 on realistic scenarios.
Persona steering direction match: 62/80 on PVQ-40 (p < 0.001) vs 40/80 on realistic prompts (p = 0.54).
Models: Gemma 3 (4B, 27B), Qwen 2.5 (7B, 72B), Qwen 3 (30B-A3B, 235B-A22B), GPT-OSS (20B, 120B).
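The within-construct consistency figure in the list above can be read as the share of item-score variance explained by which construct an item belongs to. The sketch below assumes eta-squared is the one-way ANOVA effect size (between-construct sum of squares over total sum of squares); the summary does not spell out the paper's exact computation. The item scores are invented for illustration and the function name eta_squared is hypothetical.

```python
from statistics import mean

def eta_squared(groups):
    """groups maps construct name -> list of item-level scores.

    Returns SS_between / SS_total, the fraction of variance explained
    by construct membership (0 = none, 1 = all).
    """
    all_scores = [s for scores in groups.values() for s in scores]
    grand = mean(all_scores)
    ss_total = sum((s - grand) ** 2 for s in all_scores)
    ss_between = sum(
        len(scores) * (mean(scores) - grand) ** 2 for scores in groups.values()
    )
    return ss_between / ss_total

# Hypothetical questionnaire-style scores: items of one construct agree closely.
tidy = {
    "benevolence": [5, 5, 4],
    "power": [2, 1, 2],
    "security": [4, 3, 4],
}

# Hypothetical realistic-prompt scores: items of one construct scatter widely.
diffuse = {
    "benevolence": [3, 5, 2],
    "power": [4, 2, 3],
    "security": [2, 5, 4],
}

print(f"tidy items:    eta^2 = {eta_squared(tidy):.3f}")
print(f"diffuse items: eta^2 = {eta_squared(diffuse):.3f}")
```

A value near 1 means items from the same construct move together. A value near 0 means knowing the construct tells you almost nothing about an item's score, which is the pattern reported for real-query generation.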

Limits and open questions

The study covers eight open models from four families; closed frontier models (GPT-4-class, Claude, Gemini) are absent, and they may behave differently under either probe. “Generation probability on value-laden options” is itself a constructed measure, not raw deployment behavior, so it is a better proxy than a survey but still a proxy. The work shows that questionnaires fail to predict behavior, not that LLMs have no stable dispositions at all; a behaviorally grounded instrument might still recover something real. And the everyday-query scenarios are curated by the authors, so their coverage and cultural framing remain open questions. What it does establish cleanly: a tidy survey score is not evidence of how a model will act.

FAQ

Do personality questionnaires like the Big Five work on LLMs?

Not as behavior predictors. This study found BFI self-report profiles agreed with real-query behavior at only Spearman 0.26, versus 0.77 between two BFI versions. The score is internally consistent but does not track what the model actually does.

Why do psychometric questionnaires mischaracterize LLM behavior?

Because survey items contain explicit lexical cues that signal which trait is being measured. Models recognize the intended construct of an established item with F1 0.69-0.83, so they can pick the socially desirable answer. Real user queries lack those cues (F1 0.09), and the apparent trait stops predicting behavior.

Which LLMs were tested in this paper?

Eight open models across four families: Gemma 3 (4B, 27B), Qwen 2.5 (7B, 72B), Qwen 3 (30B-A3B and 235B-A22B MoE), and GPT-OSS (20B, 120B).

Should I use PVQ or BFI to audit an LLM’s values?

For certifying behavior, no. The paper shows persona-steered questionnaire answers shift in human-like directions (62/80, p < 0.001) while real generation barely moves (40/80, p = 0.54). Use behavioral probes on realistic prompts instead.

One line: a clean Big Five or values score on an LLM measures its sensitivity to leading wording, not how it behaves. Read the original paper on arXiv.