April 2026

Hidden activations know when self-report does not

A classifier on Llama 3.1 8B's internal activations recovers claim-level correctness signal that the model's own confidence answers miss across legal QA and biography factuality.

TL;DR

Llama 3.1 8B misses unreliable claims in its own long-form answers when you ask it directly. A classifier trained on the model's internal activations does much better. On MAUD merger-agreement QA, asking the model reaches AUROC 0.511 (chance is 0.5); the classifier reaches 0.771, a paired bootstrap gain of +0.260 [0.137, 0.387]. On FActScore biographies under human atomic-fact labels, the gap reproduces: self-report 0.541, classifier 0.802, paired delta +0.262 [0.219, 0.302].

The narrow claim: hidden activations carry recoverable claim-level correctness signal that the model's verbal confidence does not surface. The effect size matches across two settings with different label sources, generation sources, and content domains. FELM points the same way but remains statistically inconclusive; its matched-generation repair shows reference-bounded labels collapsing into evidence-coverage measurements once generated claims exceed the reference scope.

Self-report misses local failure

Long answers fail claim by claim. One sentence is right, the next is partly right, the next is confidently wrong. A user who depends on the answer needs to know which is which. A single answer-level confidence number cannot do that job.

The simplest claim-level baseline is to ask the model. Show Llama 3.1 8B a fixed claim from its own answer and ask for structured confidence. If self-report ranks correct claims above incorrect ones, a probe-based system is wasted complexity. When self-report fails, the alternative has to come from somewhere else inside the same forward pass.
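A minimal sketch of what a structured self-report call could look like. The source does not give the prompt wording or the reply schema, so the prompt text, the JSON field name `confidence`, and both function names are assumptions made for illustration only.

```python
import json
import math


def build_self_report_prompt(claim: str) -> str:
    """Hypothetical prompt; the actual wording used in the study is not given."""
    return (
        "Here is a claim taken from your earlier answer:\n"
        f"{claim}\n"
        'Reply with JSON only: {"confidence": <number between 0 and 1>}'
    )


def parse_confidence(reply: str, default: float = 0.5) -> float:
    """Read a confidence in [0, 1] from a JSON reply; fall back to `default`."""
    try:
        value = float(json.loads(reply)["confidence"])
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing field, or a non-numeric value.
        return default
    if math.isnan(value):
        return default
    # Clamp out-of-range values instead of discarding them.
    return min(1.0, max(0.0, value))
```

Falling back to 0.5 on an unparseable reply keeps every claim in the evaluation set, at the cost of adding ties; dropping such claims instead would change which claims the comparison covers.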

The question

Llama's structured self-report fails to rank claim correctness in these runs. A linear readout of its hidden activations on the same claims recovers that signal, with a large gap.

The probe stays simple: layer-19 residual-stream vectors pooled over the claim span, standardized, fed to L2 logistic regression. No attention head selection, no SAE feature search, no learned pooling. I wanted to test the basic geometric assumption before reaching for anything more expressive.
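The recipe above can be sketched in a few lines. This is an illustration under stated assumptions, not the pipeline used for the reported numbers: mean pooling is assumed (the text says only "pooled"), the optimizer is plain gradient descent written in NumPy to keep the example dependency-light, and `pool_span`, `fit_probe`, and `predict_proba` are hypothetical names. The penalty is scaled so that `C` plays the same role as the inverse regularization strength in common logistic-regression libraries.

```python
import numpy as np


def pool_span(hidden: np.ndarray, start: int, end: int) -> np.ndarray:
    """Mean-pool hidden states of shape (tokens, d_model) over tokens [start, end)."""
    return hidden[start:end].mean(axis=0)


def fit_probe(X, y, C=0.01, lr=0.1, steps=2000):
    """Standardize features, then fit L2-regularized logistic regression.

    Minimizes mean log-loss + ||w||^2 / (2 * C * n); the intercept is unpenalized.
    """
    mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-8
    Z = (X - mu) / sigma
    n = len(y)
    w, b = np.zeros(Z.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))
        grad_w = Z.T @ (p - y) / n + w / (C * n)
        grad_b = float(np.mean(p - y))
        w -= lr * grad_w
        b -= lr * grad_b
    return mu, sigma, w, b


def predict_proba(X, mu, sigma, w, b):
    """Apply the training-set standardization, then the logistic readout."""
    return 1.0 / (1.0 + np.exp(-(((X - mu) / sigma) @ w + b)))
```

The fixed step size is tuned for small toy inputs; on real residual-stream vectors with thousands of correlated dimensions, a library solver would be the safer choice.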

Setup: one probed model, three label regimes

The probed model is Llama 3.1 8B Instruct throughout. The dataset and label source change. MAUD provides 150 fixed legal claims scored by two LLM judges. FActScore provides 4,886 human-labeled atomic facts mapped to parent biography sentences. FELM provides human segment labels on world-knowledge QA, where the matched-generation repair exposed an annotation-target problem.

01 · MAUD. Legal QA, judge-proxy labels. 150 frozen claim units from merger-agreement QA. GPT-5.4 produces the primary correctness target; Kimi K2.6 supplies an independent second-judge sensitivity pass on the same fixed claims.
02 · FActScore. Biographies, human labels. 157 ChatGPT biographies, 4,886 non-IR human atomic facts. Atomic facts almost never appear verbatim in the generation, so activations are pooled over the parent annotated sentence span.
03 · FELM-wk. Boundary case. Human-annotated world-knowledge segments. Probes beat self-report in the point estimate, but the paired delta crosses zero, and the matched-generation repair returns 30 of 52 segments as not_enough_evidence.
04 · Baseline. Structured self-report. Llama is shown the fixed claim and asked for structured confidence. The baseline is the easy version of self-report: failures cannot be blamed on implicit claim boundaries or unstructured confidence language.

The two protocols differ because the sample sizes differ. MAUD's 150 claims do not support a clean train/validation/test split, so the headline MAUD probe uses leave-one-out evaluation over the frozen full set. FActScore's 4,886 facts support a 70/15/15 split by biography, with 3,353 train, 791 validation, and 742 test claims. I compare the qualitative gap, not identical pipelines.
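Both protocols are easy to state in code. The sketch below is an assumption-laden illustration: the function names, the seed, and the rounding rule are hypothetical, and the fractions are applied to biographies rather than to claims.

```python
import random
from collections import defaultdict


def split_by_group(group_ids, fractions=(0.70, 0.15, 0.15), seed=0):
    """Assign rows to train/val/test so that no group appears in two splits."""
    groups = sorted(set(group_ids))
    random.Random(seed).shuffle(groups)
    n_train = round(fractions[0] * len(groups))
    n_val = round(fractions[1] * len(groups))
    assignment = {}
    for i, group in enumerate(groups):
        if i < n_train:
            assignment[group] = "train"
        elif i < n_train + n_val:
            assignment[group] = "val"
        else:
            assignment[group] = "test"
    splits = defaultdict(list)
    for row, group in enumerate(group_ids):
        splits[assignment[group]].append(row)
    return {name: splits[name] for name in ("train", "val", "test")}


def leave_one_out(n):
    """Yield (train_rows, held_out_row) pairs for a small frozen set."""
    for i in range(n):
        yield [j for j in range(n) if j != i], i
```

Because the split is over biographies, the claim counts in each split need not be exactly 70/15/15 of the total, which is consistent with the reported 3,353 / 791 / 742 breakdown of 4,886 facts.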

Same gap, two domains

The simplest evaluation is whether each score ranks correct claims above non-correct claims. AUROC is 0.5 at chance. Higher is better. Brier score on the same scores measures how well the probability magnitudes match the binary label.
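For readers who want the two metrics pinned down, here is a small pure-Python sketch. The function names are illustrative; the pairwise AUROC is quadratic in the number of claims, which is acceptable at the dataset sizes discussed here but would be replaced by a rank-based version on larger sets.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, so a constant score gives exactly 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(labels, probs):
    """Mean squared difference between predicted probability and binary label."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


labels = [1, 0, 1, 0]
print(auroc(labels, [0.9, 0.8, 0.4, 0.3]))  # 0.75: three of four pairs ranked correctly
print(brier(labels, [0.5, 0.5, 0.5, 0.5]))  # 0.25: the constant-0.5 reference
```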

MAUD ΔAUROC: +0.260. Residual probe minus self-report, paired bootstrap [0.137, 0.387].
FActScore ΔAUROC: +0.262. Same direction, [0.219, 0.302], n=742 test facts.
FELM ΔAUROC: +0.141. [-0.063, 0.320]; interval crosses zero.
Self-report ceiling: 0.541. Best AUROC any dataset; barely above chance.
Chart: Claim-level AUROC, self-report vs residual probe, for MAUD, FActScore, and FELM-wk, with a chance line at 0.50. Whiskers show 95% paired-bootstrap intervals (1000 resamples). FELM bars are dimmed because the paired delta interval includes zero.
Figure 1. Residual activations carry more claim-level correctness signal than Llama self-report on MAUD and FActScore. The two paired AUROC deltas are within 0.002 of each other despite the protocols differing in label source, generation source, and content domain. I treat FELM as a boundary case because its interval crosses zero.
Table 01. Cross-dataset probe vs self-report comparison

| Dataset | Label source | Self-report | Residual probe | Paired Δ |
| --- | --- | --- | --- | --- |
| MAUD | GPT-5.4 judge proxy | 0.511 [0.411, 0.597] | 0.771 [0.687, 0.838] | +0.260 [0.137, 0.387] |
| FActScore | Human atomic facts | 0.541 [0.514, 0.570] | 0.802 [0.769, 0.833] | +0.262 [0.219, 0.302] |
| FELM-wk | Human segments | 0.511 [0.393, 0.673] | 0.652 [0.493, 0.815] | +0.141 [-0.063, 0.320] |
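The paired deltas and their intervals can be sketched as follows. The text specifies a paired bootstrap with 1000 resamples; the percentile interval, the seed, and the function names below are assumptions. "Paired" here means that each resample draws one set of claim indices and scores both methods on exactly those claims.

```python
import random


def auroc(labels, scores):
    """Pairwise AUROC with ties counted as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_delta(labels, scores_a, scores_b, n_resamples=1000, seed=0):
    """95% percentile interval for AUROC(scores_b) - AUROC(scores_a)."""
    if len(set(labels)) < 2:
        raise ValueError("need both classes to compute AUROC")
    rng = random.Random(seed)
    n = len(labels)
    deltas = []
    while len(deltas) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # AUROC is undefined when a resample has only one class
        a = auroc(y, [scores_a[i] for i in idx])
        b = auroc(y, [scores_b[i] for i in idx])
        deltas.append(b - a)
    deltas.sort()
    lower = deltas[int(0.025 * n_resamples)]
    upper = deltas[int(0.975 * n_resamples) - 1]
    return lower, upper
```

An interval that excludes zero, as for MAUD and FActScore, says the gap holds up under resampling of claims; an interval that includes zero, as for FELM-wk, does not.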

Replication scope

MAUD uses leave-one-out over a small judge-labeled legal claim set. FActScore uses a 70/15/15 split by biography over a much larger human-labeled biography set. I do not force 150 rows to imitate 4,886. The replicated object is the qualitative gap between self-report and residual activations, with paired effect sizes of +0.260 and +0.262 AUROC. Treat the agreement as evidence about the signal rather than the pipeline.

Second judge sensitivity

MAUD's first labels come from a GPT-5.4 judge-proxy protocol. Probes trained on those labels could be fitting whatever GPT-5.4 happens to score consistently rather than something more general. The cleanest test is to relabel the same fixed claims with a different judge family and rescore the same probes without retraining. I use Kimi K2.6 through Prime Intellect Inference, with the prompt structure and claim serialization preserved.

Chart: MAUD judge sensitivity by method, AUROC under GPT-5.4 labels vs Kimi K2.6 labels. Self-report 0.511 to 0.466 (−0.045); SAE probe 0.677 to 0.652 (−0.025); residual probe 0.771 to 0.707 (−0.064); GPT-5.4 scorer 0.944 to 0.872 (−0.072). The GPT-5.4 scorer's drop has paired-bootstrap CI [0.019, 0.149].
Figure 2. Method ranking survives the judge-family change: GPT-5.4 scorer > residual probe > SAE probe > self-report under both judges. The GPT-5.4 scorer loses 0.072 AUROC moving to the independent judge, the cleanest evidence of same-family scorer-judge coupling. Probes also drop, but probes were never trained against the second judge.
Table 02. MAUD method comparison with 1000-resample paired CIs

| Method | GPT-5.4 AUROC | Kimi AUROC | Agreement-set AUROC | Agreement Brier |
| --- | --- | --- | --- | --- |
| Llama self-report | 0.511 [0.411, 0.597] | 0.466 [0.413, 0.604] | 0.486 [0.392, 0.598] | 0.509 [0.400, 0.600] |
| GPT-5.4 scorer | 0.944 [0.906, 0.984] | 0.872 [0.807, 0.926] | 0.981 [0.931, 1.000] | 0.101 [0.064, 0.144] |
| Residual probe | 0.771 [0.687, 0.838] | 0.707 [0.627, 0.792] | 0.793 [0.709, 0.870] | 0.231 [0.166, 0.299] |
| SAE probe | 0.677 [0.582, 0.763] | 0.652 [0.563, 0.735] | 0.712 [0.616, 0.812] | 0.268 [0.193, 0.350] |

Judge disagreement

The two judges agree on 110 of 150 claims. Cohen's kappa is 0.572 [0.456, 0.682]. Disagreement clusters around the rubric boundary. Stratifying by GPT-5.4's label shows the hard cases.

Table 03. MAUD judge agreement, conditioned on GPT-5.4 label

| GPT-5.4 label | n claims | Kimi agrees | Kimi disagrees | Agreement rate |
| --- | --- | --- | --- | --- |
| true | 62 | 54 | 8 | 0.871 |
| partially_true | 68 | 41 | 27 | 0.603 |
| false | 20 | 15 | 5 | 0.750 |
| full set | 150 | 110 | 40 | 0.733 |

The judges agree most on true claims, less on false claims, and least on partially-true claims. That boundary drives kappa down toward 0.57. The 110-claim agreement subset gives the cleanest MAUD slice because it removes labels where two judge families disagree.
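For reference, Cohen's kappa corrects raw agreement for the agreement two raters would reach by chance given their label frequencies. A minimal sketch, with a hypothetical function name and toy labels (Table 03 alone does not contain Kimi's full label distribution, so the reported kappa cannot be recomputed from it):

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both raters pick the same label independently.
    p_expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy example: 3 of 4 items agree, chance agreement is 0.5.
print(cohens_kappa(["t", "t", "f", "f"], ["t", "f", "f", "f"]))  # 0.5
```

Working backwards from the reported figures, observed agreement of 110/150 ≈ 0.733 together with kappa 0.572 implies chance agreement of roughly 0.38, since (0.733 − 0.572) / (1 − 0.572) ≈ 0.377.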

Sanity check

On the 110-claim agreement set, every strong method gets cleaner: GPT-5.4 scoring rises to 0.981, the residual probe rises to 0.793, the SAE probe rises to 0.712. A probe fitting GPT-5.4's quirks should not improve when evaluation keeps only claims another judge family accepts. The improvement fits the interpretation that probes track signal surviving judge disagreement.
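The judge-sensitivity and agreement-set checks amount to scoring one frozen set of scores against different label sets. The sketch below makes two assumptions not stated in the text: only the label `true` counts as correct when binarizing, and "agreement" means the two judges assign the identical three-way label. The function names are hypothetical.

```python
def auroc(labels, scores):
    """Pairwise AUROC with ties counted as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def binarize(label):
    """Assumption: partially_true and false both count as non-correct."""
    return 1 if label == "true" else 0


def judge_sensitivity(scores, judge_a, judge_b):
    """Score the same frozen scores under each judge and on the agreement subset."""
    agree = [i for i in range(len(scores)) if judge_a[i] == judge_b[i]]
    return {
        "judge_a": auroc([binarize(x) for x in judge_a], scores),
        "judge_b": auroc([binarize(x) for x in judge_b], scores),
        "agreement_set": auroc(
            [binarize(judge_a[i]) for i in agree], [scores[i] for i in agree]
        ),
        "n_agree": len(agree),
    }
```

Nothing is refit in this function: the scores are inputs, so any change in AUROC comes from the labels alone.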

FActScore probe diagnostics

FActScore changes everything except the probed model: human atomic-fact labels instead of LLM judges, ChatGPT generations instead of Llama generations, biography factuality instead of merger-agreement QA, atomic facts pooled to parent sentence spans instead of extracted legal claims. Despite all of that, the residual probe's held-out AUROC lands at 0.802. The diagnostic question is whether that test number reflects real generalization or a lucky split.

Chart: FActScore residual probe AUROC by split. Train 0.939 (3,353 claims), validation 0.728 (791 claims), test 0.802 (742 claims); self-report test 0.541, chance 0.500. Splits are by biography, not by atomic fact, to prevent fact leakage. C=0.01 was selected on validation, then the probe was refit on train+validation.
Figure 3. Probe AUROC across the FActScore train, validation, and test splits. Test above validation is not the headline finding; it follows from biographies in different splits varying in difficulty and fact density. The C=0.01 model is regularized hard enough that the small absolute test-validation gap is consistent with across-biography variation rather than overfitting to the validation set.

The Brier scores tell the same story without the rank-only filter. Self-report Brier is 0.298, worse than the 0.25 that a constant prediction of 0.5 would score, so Llama's probability magnitudes are no help either; the residual probe Brier is 0.171, well below that reference. AUROC and Brier disagreeing would flag a rank-versus-magnitude trade-off; here they agree, and the probe wins on both.
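As a reference point for reading Brier scores, it helps to write out what a constant prediction earns. The notation is introduced here for illustration: y_i ∈ {0, 1} is the label of claim i, N is the number of claims, π is the fraction of claims labeled correct, and q is a constant predicted probability.

```latex
\mathrm{BS}(q) \;=\; \frac{1}{N}\sum_{i=1}^{N} (q - y_i)^2
\;=\; \pi\,(1-q)^2 + (1-\pi)\,q^2,
\qquad
\mathrm{BS}(0.5) \;=\; 0.25\,\pi + 0.25\,(1-\pi) \;=\; 0.25 .
```

So a constant 0.5 scores 0.25 regardless of the base rate: scores below 0.25 beat that baseline and scores above it do not. The best constant is q = π, which scores π(1 − π) ≤ 0.25, so the base-rate predictor is a stricter reference whenever the classes are imbalanced.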

The boundary: labels can measure coverage

FELM-wk gives the cautionary case. The minimum-viable run reaches 0.652 residual probe AUROC against 0.511 for self-report, with paired delta +0.141 [-0.063, 0.320]. The point estimate is positive. The interval crosses zero.

The useful FELM finding came from the annotation repair. Llama-generated FELM-style answers exceed FELM's curated reference snippets often enough that scoring them false against those snippets targets the wrong object. I added fuller readability-extracted references and a fourth not_enough_evidence label. The repaired pilot's 52 segments came back like this:

true: 4 (7.7% of pilot segments)
partially_true: 12 (23.1%)
false: 6 (11.5%; 28 of the old pilot's 43 false labels moved)
not_enough_evidence: 30 (57.7%; reference scope is the bottleneck)

A 57.7% not_enough_evidence rate exposes the target mismatch. Generated claims can exceed the reference scope; then a reference-bounded annotator cannot adjudicate them. The label tells me whether the bundle supports the claim, not whether the claim is true. That gives a coverage measurement, not a correctness measurement, and it is the wrong target for a probe trained on claim correctness. I did not scale the matched-generation FELM run; FELM stays in the paper as a documented boundary for label provenance.
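One way to keep the two measurements apart is to report reference coverage separately and restrict any correctness evaluation to the segments the references can adjudicate. This is a sketch of that bookkeeping under that assumption, with a hypothetical function name; it uses the pilot counts given above.

```python
from collections import Counter


def coverage_report(labels):
    """Separate reference coverage from correctness labels."""
    counts = Counter(labels)
    n = len(labels)
    adjudicable = [x for x in labels if x != "not_enough_evidence"]
    return {
        "shares": {k: counts[k] / n for k in counts},
        "coverage": len(adjudicable) / n,  # fraction the references can judge
        "adjudicable": adjudicable,  # only these carry a correctness label
    }


pilot = (
    ["true"] * 4
    + ["partially_true"] * 12
    + ["false"] * 6
    + ["not_enough_evidence"] * 30
)
report = coverage_report(pilot)
print(round(100 * report["shares"]["not_enough_evidence"], 1))  # 57.7
print(len(report["adjudicable"]))  # 22 of 52 segments remain for correctness scoring
```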

The methodological lesson

Label provenance dominates the transfer story. MAUD works because fixed claims are judged by a clear rubric. FActScore works because human atomic-fact labels capture factual support. FELM is weaker because segment labels and reference coverage are coarser than the claim-level correctness target the probe needs. Claim construction and label provenance affect probe trainability as much as the probe architecture does.

A probe-based uncertainty system starts before the classifier runs. Define the claim. Name what the label captures. Align the activation span with the label unit. With those three pieces aligned, a regularized linear readout of the residual stream recovers a sizeable correctness signal. With a muddier target, a stronger probe mostly learns the mud.
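Aligning the activation span with the label unit usually comes down to mapping a character span (a claim or its parent sentence) onto token indices. A minimal sketch, assuming per-token character offsets are available as `(start, end)` pairs, for example from a tokenizer that can return an offset mapping; the function name is hypothetical.

```python
def char_span_to_token_span(offsets, char_start, char_end):
    """Return [tok_start, tok_end) covering every token overlapping [char_start, char_end)."""
    hits = [
        i
        for i, (start, end) in enumerate(offsets)
        if start < char_end and end > char_start
    ]
    if not hits:
        raise ValueError("span does not overlap any token")
    return hits[0], hits[-1] + 1
```

Zero-width offsets such as `(0, 0)`, which some tokenizers assign to special tokens, never satisfy the overlap test for a non-negative `char_start`, so they are excluded from the pooled span.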

Bounds on the claim

Honest scope

Next work

Three follow-ups would most sharpen the claim, in roughly this order.

Close the human-audit loop

The 30-claim MAUD audit packet is frozen at data/annotations/maud_human_audit_packet.jsonl. Completed expert labels would convert MAUD's claim from "stable across two LLM judges" to "stable across two LLM judges, anchored to expert legal labels," or expose a systematic disagreement worth knowing about. Either outcome is publishable.

Matched-generation transfer

FActScore reduces the generation-mismatch worry while leaving some of it open. A matched-generation run on a dataset where Llama's claims stay inside the reference scope would close that gap. The FELM repair shows the failure mode to avoid: reference-bounded labels driven into not_enough_evidence.

Layer sweep and second model family

All probes are layer 19, chosen because the Goodfire SAE was trained there. A small layer sweep would document how localized the signal is. A second model family at 7-13B parameters would say whether the gap is a Llama representation quirk or a more general property of instruction-tuned residual streams.

· · ·

The result is a bounded interpretability win. A linear readout of hidden activations sees something the model's own structured confidence misses, and the gap is the same size on legal QA and on biography factuality. I do not claim a compact SAE feature story, a legal-truth adjudicator, or a domain-general hallucination detector.

The remaining questions sit on label provenance and matched-generation transfer, not on probe architecture. Downstream tools building on this should expose claim construction and label-provenance assumptions to users rather than hiding them behind opaque confidence scores. The probe is only as honest as the labels it was trained on.