Hidden activations know when self-report does not
A linear readout of layer-19 residual activations recovers claim-level correctness signal that Llama 3.1 8B's structured self-report misses across legal QA and biography factuality.
STEPHEN CASELLA · APRIL 2026
Llama 3.1 8B's structured self-report does not reliably reveal which claims in a long-form answer are unreliable. On MAUD merger-agreement QA under a GPT-5.4 judge, self-report reaches 0.511 AUROC; a logistic regression on layer-19 residual activations reaches 0.771, paired delta +0.260 [0.137, 0.387]. On FActScore biographies under human atomic-fact labels, self-report reaches 0.541 and the residual probe reaches 0.802, paired delta +0.262 [0.219, 0.302].
The result is not "the model knows truth." It is more specific. Hidden activations carry a recoverable claim-level correctness signal that the model's verbal confidence does not surface, and the effect size is nearly identical across two settings that differ in label source, generation source, and content domain. The boundary matters too. FELM points the same way but is statistically inconclusive, and its matched-generation repair shows that reference-bounded labels can collapse into evidence-coverage measurements once generated claims exceed the reference scope.
When self-report is not enough
Long answers fail locally. One sentence is right, the next is partly right, the next is confidently wrong. A user who depends on the answer needs to know which is which. A single answer-level confidence number is not enough.
The simplest claim-level baseline is to ask the model. Show Llama 3.1 8B a fixed claim from its own answer and ask for structured confidence. If self-report ranks correct claims above incorrect ones, a probe-based system is wasted complexity. If it does not, the alternative has to come from somewhere else inside the same forward pass.
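As a concrete reference, here is a minimal sketch of that baseline, assuming a generic `generate` callable for Llama 3.1 8B Instruct; the prompt wording and JSON schema are illustrative, not the study's exact template.

```python
import json

# Hypothetical self-report template; the study's exact wording is not shown here.
SELF_REPORT_TEMPLATE = (
    "Here is a claim extracted from your earlier answer:\n\n{claim}\n\n"
    'Reply with JSON only: {{"confidence": <float in [0, 1] that the claim is correct>}}'
)

def self_report_score(generate, claim: str) -> float:
    """Ask the model for structured confidence in one fixed claim."""
    reply = generate(SELF_REPORT_TEMPLATE.format(claim=claim))
    return float(json.loads(reply)["confidence"])
```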
Does Llama's structured self-report rank claim correctness? And if not, does a linear readout of its hidden activations on the same claims recover that signal? In the settings studied here the answers are no and yes, and the gap is large.
The probe is deliberately simple: layer-19 residual-stream vectors pooled over the claim span, standardized, fed to L2 logistic regression. No attention head selection, no SAE feature search, no learned pooling. The point is to test whether the basic geometric assumption holds before reaching for anything more expressive.
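A minimal sketch of that pipeline, assuming mean pooling over the claim's token span (the article says "pooled" without naming the operator) and an illustrative regularization strength:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def pool_claim_span(resid_layer19: np.ndarray, span: tuple[int, int]) -> np.ndarray:
    """Mean-pool layer-19 residual vectors over the claim's token span.
    resid_layer19: (seq_len, d_model); span: (start, end) token indices."""
    start, end = span
    return resid_layer19[start:end].mean(axis=0)

# X: (n_claims, d_model) pooled vectors; y: binary correctness labels.
probe = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000),  # C is illustrative
)
# probe.fit(X_train, y_train); scores = probe.predict_proba(X_test)[:, 1]
```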
Setup — one probed model, three label regimes
The probed model is always Llama 3.1 8B Instruct. What changes is the dataset and label source. MAUD provides 150 fixed legal claims scored by two LLM judges. FActScore provides 4,886 human-labeled atomic facts mapped to parent biography sentences. FELM provides human segment labels on world-knowledge QA, where the matched-generation repair eventually exposed an annotation-target problem: most repaired labels collapsed into not_enough_evidence.
The two protocols differ because the sample sizes do. MAUD's 150 claims do not support a clean train/validation/test split, so the headline MAUD probe uses leave-one-out evaluation over the frozen full set. FActScore's 4,886 facts comfortably support a 70/15/15 split by biography, with 3,353 train, 791 validation, and 742 test claims. Both choices are conventional for their respective regimes; the comparison the article makes is over the qualitative gap, not over identical pipelines.
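A sketch of both protocols with scikit-learn, reusing the `probe` pipeline from the earlier sketch; `X_maud`, `y_maud`, `X_fs`, `y_fs`, and `bio_ids` are hypothetical arrays standing in for the real data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupShuffleSplit, LeaveOneOut, cross_val_predict

# MAUD: leave-one-out probabilities over the frozen 150-claim set, then one AUROC.
loo_scores = cross_val_predict(
    probe, X_maud, y_maud, cv=LeaveOneOut(), method="predict_proba"
)[:, 1]
maud_auroc = roc_auc_score(y_maud, loo_scores)

# FActScore: 70/15/15 split grouped by biography so no person straddles splits.
gss = GroupShuffleSplit(n_splits=1, train_size=0.70, random_state=0)
train_idx, rest_idx = next(gss.split(X_fs, y_fs, groups=bio_ids))
gss2 = GroupShuffleSplit(n_splits=1, train_size=0.50, random_state=0)
val_rel, test_rel = next(gss2.split(X_fs[rest_idx], groups=bio_ids[rest_idx]))
val_idx, test_idx = rest_idx[val_rel], rest_idx[test_rel]
```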
The headline — same gap, two domains
The simplest evaluation asks whether each score ranks correct claims above non-correct ones. AUROC is 0.5 at chance; higher is better. Brier score on the same scores measures how well the probability magnitudes match the binary labels; lower is better. A sketch of both metrics follows the table.
| Dataset | Label source | Self-report | Residual probe | Paired Δ |
|---|---|---|---|---|
| MAUD | GPT-5.4 judge proxy | 0.511 [0.411, 0.597] | 0.771 [0.687, 0.838] | +0.260 [0.137, 0.387] |
| FActScore | Human atomic facts | 0.541 [0.514, 0.570] | 0.802 [0.769, 0.833] | +0.262 [0.219, 0.302] |
| FELM-wk | Human segments | 0.511 [0.393, 0.673] | 0.652 [0.493, 0.815] | +0.141 [-0.063, 0.320] |
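A runnable sketch of the two metrics on toy arrays, including the constant-baseline reference used later in the FActScore discussion:

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

y = np.array([1, 0, 1, 1, 0])              # toy labels: 1 = correct claim
s = np.array([0.9, 0.2, 0.7, 0.6, 0.4])    # toy scores from one method

auroc = roc_auc_score(y, s)      # rank-only: 0.5 at chance, higher is better
brier = brier_score_loss(y, s)   # squared error of probabilities, lower is better

# Reference point: always predicting 0.5 scores Brier 0.25 on any binary labels.
assert brier_score_loss(y, np.full(len(y), 0.5)) == 0.25
```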
Why this is a replication, not a copy
MAUD uses leave-one-out over a small judge-labeled legal claim set. FActScore uses a 70/15/15 split by biography over a much larger human-labeled biography set. The protocols are not identical, and they should not be: 150 rows do not support what 4,886 do. What replicates is the qualitative gap between self-report and residual activations, with paired effect sizes of +0.260 and +0.262 AUROC. Treat the agreement as evidence about the signal, not the pipeline.
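The paired deltas above resample claims jointly. The article does not state its exact CI construction, so this percentile-bootstrap sketch is one plausible version:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def paired_delta_ci(y, s_probe, s_self, n_boot=10_000, seed=0):
    """Percentile bootstrap CI for AUROC(probe) - AUROC(self-report),
    resampling claims jointly so the comparison stays paired."""
    rng = np.random.default_rng(seed)
    deltas = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        if np.unique(y[idx]).size < 2:   # AUROC needs both classes present
            continue
        deltas.append(roc_auc_score(y[idx], s_probe[idx])
                      - roc_auc_score(y[idx], s_self[idx]))
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    return float(np.mean(deltas)), (float(lo), float(hi))
```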
Second judge — does the ordering survive?
MAUD's original labels come from a GPT-5.4 judge-proxy protocol. Probes trained on those labels could be fitting whatever GPT-5.4 happens to score consistently rather than something more general. The cleanest test is to relabel the same fixed claims with a different judge family and rescore the same probes without retraining. We use Kimi K2.6 through Prime Intellect Inference, with the prompt structure and claim serialization preserved. A rescoring sketch follows the table.
| Method | GPT-5.4 AUROC | Kimi AUROC | Agreement-set AUROC | Agreement Brier |
|---|---|---|---|---|
| Llama self-report | 0.511 [0.411, 0.597] | 0.466 [0.413, 0.604] | 0.486 [0.392, 0.598] | 0.509 [0.400, 0.600] |
| GPT-5.4 scorer | 0.944 [0.906, 0.984] | 0.872 [0.807, 0.926] | 0.981 [0.931, 1.000] | 0.101 [0.064, 0.144] |
| Residual probe | 0.771 [0.687, 0.838] | 0.707 [0.627, 0.792] | 0.793 [0.709, 0.870] | 0.231 [0.166, 0.299] |
| SAE probe | 0.677 [0.582, 0.763] | 0.652 [0.563, 0.735] | 0.712 [0.616, 0.812] | 0.268 [0.193, 0.350] |
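Rescoring without retraining is the crux. A minimal sketch, assuming hypothetical rubric-label arrays from both judges and the frozen leave-one-out probe scores; binarizing as true = 1, else 0 follows the article's correct-vs-non-correct framing but is an assumption here:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# labels_gpt, labels_kimi: hypothetical arrays over the same 150 claims with
# values in {"true", "partially_true", "false"}; loo_scores: frozen probe scores.
y_kimi = (labels_kimi == "true").astype(int)
auroc_judge2 = roc_auc_score(y_kimi, loo_scores)   # same probe, new labels

agree = labels_gpt == labels_kimi                  # the 110-claim agreement set
auroc_agree = roc_auc_score(y_kimi[agree], loo_scores[agree])
```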
Where the judges actually disagree
The two judges agree on 110 of 150 claims. Cohen's kappa is 0.572 [0.456, 0.682]. The disagreement is not uniform across the rubric. Stratifying by GPT-5.4's label shows where the boundary is hardest.
| GPT-5.4 label | n claims | Kimi agrees | Kimi disagrees | Agreement rate |
|---|---|---|---|---|
| true | 62 | 54 | 8 | 0.871 |
| partially_true | 68 | 41 | 27 | 0.603 |
| false | 20 | 15 | 5 | 0.750 |
| full set | 150 | 110 | 40 | 0.733 |
The judges agree most on clearly true claims, less on clearly false claims, and least on partially-true claims. That is exactly where the rubric boundary should be hardest, and it is the regime that drives the kappa down toward 0.57. The 110-claim agreement subset is therefore not just a robustness slice; it is the most interpretable subset MAUD has, because it strips out labels where two different judges meaningfully disagree.
On the 110-claim agreement set, every strong method gets cleaner: GPT-5.4 scoring rises to 0.981, the residual probe rises to 0.793, the SAE probe rises to 0.712. If a probe were primarily fitting GPT-5.4's quirks, restricting evaluation to claims another judge family also accepts should not help. The improvement is consistent with probes tracking signal that survives judge disagreement.
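The kappa and the stratified agreement rates are straightforward to reproduce; a sketch over the same hypothetical label arrays:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

kappa = cohen_kappa_score(labels_gpt, labels_kimi)   # ~0.57 on the full set

# Agreement rate stratified by the GPT-5.4 label, as in the table above.
for lab in ("true", "partially_true", "false"):
    mask = labels_gpt == lab
    rate = (labels_kimi[mask] == lab).mean()
    print(f"{lab}: n={mask.sum()}, agreement={rate:.3f}")
```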
FActScore probe diagnostics — not a fragile fit
FActScore changes everything except the probed model: human atomic-fact labels instead of LLM judges, ChatGPT generations instead of Llama generations, biography factuality instead of merger-agreement QA, atomic facts pooled to parent sentence spans instead of extracted legal claims. Despite all of that, the residual probe's held-out AUROC lands at 0.802. The diagnostic question is whether that test number reflects real generalization or a lucky validation split.
The Brier scores tell the same story without the rank-only filter. Self-report Brier is 0.298, worse than the 0.25 that a constant 0.5 prediction scores on any binary labels, so the probabilities Llama emits are actively miscalibrated; the residual probe Brier is 0.171, comfortably below that baseline. AUROC and Brier disagreeing would flag a rank-versus-magnitude trade-off; here they agree, and the probe wins on both.
The boundary — when labels measure coverage
FELM-wk is the third dataset, and it is the cleanest case for the cautionary half of the article. The minimum-viable run reaches 0.652 residual probe AUROC against 0.511 for self-report, with paired delta +0.141 [-0.063, 0.320]. Direction is positive. The interval crosses zero. That is what "directionally consistent but inconclusive" looks like in numbers.
The interesting finding in FELM is not the held-out AUROC. It is what happened when we tried to repair the annotation. Llama-generated FELM-style answers go beyond FELM's curated reference snippets often enough that calling them false against those snippets is the wrong target. We added fuller readability-extracted references and a fourth not_enough_evidence label. The repaired pilot's 52 segments came back dominated by the new label.
not_enough_evidence at 57.7% is not noise. It is a structural finding: when generated claims exceed the reference scope, a reference-bounded annotator cannot adjudicate them. The label tells us whether the bundle supports the claim, not whether the claim is true. That is a coverage measurement, not a correctness measurement, and it is the wrong target for a probe trained on claim correctness. We therefore did not scale the matched-generation FELM run; instead, FELM stays in the article as a documented boundary that explains why label provenance matters more than domain similarity.
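One operational consequence, sketched with hypothetical `pilot_segments` and `pilot_labels` arrays (only the 57.7% not_enough_evidence share of the 52 segments is reported): a correctness probe should train only on segments the annotator could actually adjudicate.

```python
from collections import Counter

counts = Counter(pilot_labels)    # four-way labels from the repaired protocol
nee_share = counts["not_enough_evidence"] / sum(counts.values())

# Coverage-vs-correctness guard: drop segments the reference bundle cannot settle.
adjudicable = [(seg, lab) for seg, lab in zip(pilot_segments, pilot_labels)
               if lab != "not_enough_evidence"]
```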
The methodological lesson
The strongest lesson is not "legal QA transfers to biography QA." It is that label provenance dominates. MAUD works because fixed claims are judged by a clear rubric. FActScore works because human atomic-fact labels directly capture factual support. FELM is weaker because segment labels and reference coverage are coarser than the claim-level correctness target the probe needs. Claim construction and label provenance affect probe trainability as much as the probe architecture does.
For a probe-based uncertainty system, the hard questions come before the classifier ever runs. What is the claim? What does the label actually capture? Does the activation span match the label unit? When those three line up, a regularized linear readout of the residual stream recovers a sizeable correctness signal. When they do not, a stronger probe will mostly learn a muddier target.
Bounds on the claim
- Does say: residual activations recover claim-level correctness signal that structured Llama self-report misses on MAUD and FActScore. The paired AUROC delta lands at +0.260 and +0.262 across the two settings.
- Does not say: probes detect legal or factual truth in an absolute sense. MAUD remains judge-proxy evidence until the frozen 30-claim human audit is analyzed.
- Does say: the GPT-5.4 external scorer is the strongest MAUD method, but its 0.944 AUROC inflates from same-family scorer-judge coupling. The defensible estimate is 0.872 under the independent judge.
- Does not say: residual probes reliably beat SAE probes. Residual is directionally stronger, but residual-minus-SAE intervals cross zero under judge 2 and on the agreement set.
- Does not say: the residual correctness direction decomposes into a small number of Goodfire layer-19 SAE features. A separate analysis (probe_interpretation.md) found the signal distributed; the probe is a useful behavioral readout, not a compact mechanism story.
- Does not say: the result generalizes to other model families, other layers, or other claim extractors. All probes are layer 19 of Llama 3.1 8B Instruct. CUAD failed at the claim-extraction step before this protocol could run.
What's next
Three follow-ups would most sharpen the claim, in roughly this order.
Close the human-audit loop
The 30-claim MAUD audit packet is frozen at data/annotations/maud_human_audit_packet.jsonl. Completed expert labels would convert MAUD's claim from "stable across two LLM judges" to "stable across two LLM judges, anchored to expert legal labels," or expose a systematic disagreement worth knowing about. Either outcome is publishable.
Matched-generation transfer that does not collapse
FActScore weakens the generation-mismatch worry but does not eliminate it. A matched-generation run on a dataset where Llama's claims do not routinely outrun the reference scope would close that gap. The FELM repair shows the failure mode to avoid: reference-bounded labels driven into not_enough_evidence.
Layer sweep and second model family
All probes are layer 19, chosen because the Goodfire SAE was trained there. A small layer sweep would document how localized the signal is. A second model family at 7-13B parameters would say whether the gap is a Llama representation quirk or a more general property of instruction-tuned residual streams.
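A sweep is a small loop over the same pipeline; a sketch assuming a hypothetical `resid` mapping from layer index to pooled claim vectors, with `probe`, `y`, `train_idx`, and `val_idx` as in the earlier sketches:

```python
from sklearn.metrics import roc_auc_score

sweep = {}
for layer in range(32):                    # Llama 3.1 8B has 32 decoder layers
    probe.fit(resid[layer][train_idx], y[train_idx])
    val_scores = probe.predict_proba(resid[layer][val_idx])[:, 1]
    sweep[layer] = roc_auc_score(y[val_idx], val_scores)
best_layer = max(sweep, key=sweep.get)     # how localized is the signal?
```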
The result is a bounded interpretability win. A linear readout of hidden activations sees something the model's own structured confidence misses, and the gap is the same size on legal QA and on biography factuality. It does not give us a compact SAE feature story, a legal-truth adjudicator, or a domain-general hallucination detector.
The remaining questions sit on label provenance and matched-generation transfer, not on probe architecture. Downstream tools building on this should expose claim construction and label-provenance assumptions to users, rather than hiding them behind opaque confidence scores. The probe is only as honest as the labels it was trained on.