Hidden activations know when self-report does not
A linear readout of layer-19 residual activations recovers claim-level correctness signal that Llama 3.1 8B's structured self-report misses across legal QA and biography factuality.
STEPHEN CASELLA · APRIL 2026
Llama 3.1 8B's structured self-report does not reliably reveal which claims in a long-form answer are unreliable. On MAUD merger-agreement QA under a GPT-5.4 judge, self-report reaches 0.511 AUROC; a logistic regression on layer-19 residual activations reaches 0.771, paired delta +0.260 [0.137, 0.387]. On FActScore biographies under human atomic-fact labels, self-report reaches 0.541 and the residual probe reaches 0.802, paired delta +0.262 [0.219, 0.302].
The result is not "the model knows truth." It is more specific. Hidden activations carry a recoverable claim-level correctness signal that the model's verbal confidence does not surface, and the effect size is nearly identical across two settings that differ in label source, generation source, and content domain. The boundary matters too. FELM points the same way but is statistically inconclusive, and its matched-generation repair shows that reference-bounded labels can collapse into evidence-coverage measurements once generated claims exceed the reference scope.
When self-report is not enough
Long answers fail locally. One sentence is right, the next is partly right, the next is confidently wrong. A user who depends on the answer needs to know which is which. A single answer-level confidence number is not enough.
The simplest claim-level baseline is to ask the model. Show Llama 3.1 8B a fixed claim from its own answer and ask for structured confidence. If self-report ranks correct claims above incorrect ones, a probe-based system is wasted complexity. If it does not, the alternative has to come from somewhere else inside the same forward pass.
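As a concrete reference, here is a minimal sketch of that baseline, assuming a generic `generate` callable for Llama 3.1 8B Instruct; the prompt wording and JSON schema are illustrative, not the study's exact template.

```python
import json

# Hypothetical self-report template; the study's exact wording is not shown here.
SELF_REPORT_TEMPLATE = (
    "Here is a claim extracted from your earlier answer:\n\n{claim}\n\n"
    'Reply with JSON only: {{"confidence": <float in [0, 1] that the claim is correct>}}'
)

def self_report_score(generate, claim: str) -> float:
    """Ask the model for structured confidence in one fixed claim."""
    reply = generate(SELF_REPORT_TEMPLATE.format(claim=claim))
    return float(json.loads(reply)["confidence"])
```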
Does Llama's structured self-report rank claim correctness? And if not, does a linear readout of its hidden activations on the same claims recover that signal? In the settings studied here the answers are no and yes, and the gap is large.
The probe is deliberately simple: layer-19 residual-stream vectors pooled over the claim span, standardized, fed to L2 logistic regression. No attention head selection, no SAE feature search, no learned pooling. The point is to test whether the basic geometric assumption holds before reaching for anything more expressive.
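A minimal sketch of that pipeline, assuming mean pooling over the claim's token span (the article says "pooled" without naming the operator) and an illustrative regularization strength:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def pool_claim_span(resid_layer19: np.ndarray, span: tuple[int, int]) -> np.ndarray:
    """Mean-pool layer-19 residual vectors over the claim's token span.
    resid_layer19: (seq_len, d_model); span: (start, end) token indices."""
    start, end = span
    return resid_layer19[start:end].mean(axis=0)

# X: (n_claims, d_model) pooled vectors; y: binary correctness labels.
probe = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000),  # C is illustrative
)
# probe.fit(X_train, y_train); scores = probe.predict_proba(X_test)[:, 1]
```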
Setup — one probed model, three label regimes
The probed model is always Llama 3.1 8B Instruct. What changes is the dataset and label source. MAUD provides 150 fixed legal claims scored by two LLM judges. FActScore provides 4,886 human-labeled atomic facts mapped to parent biography sentences. FELM provides human segment labels on world-knowledge QA, where the matched-generation repair eventually exposed an annotation-target problem: most repaired labels collapsed into not_enough_evidence.
The two protocols differ because the sample sizes do. MAUD's 150 claims do not support a clean train/validation/test split, so the headline MAUD probe uses leave-one-out evaluation over the frozen full set. FActScore's 4,886 facts comfortably support a 70/15/15 split by biography, with 3,353 train, 791 validation, and 742 test claims. Both choices are conventional for their respective regimes; the comparison the article makes is over the qualitative gap, not over identical pipelines.
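A sketch of both protocols with scikit-learn, reusing the `probe` pipeline from the earlier sketch; `X_maud`, `y_maud`, `X_fs`, `y_fs`, and `bio_ids` are hypothetical arrays standing in for the real data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupShuffleSplit, LeaveOneOut, cross_val_predict

# MAUD: leave-one-out probabilities over the frozen 150-claim set, then one AUROC.
loo_scores = cross_val_predict(
    probe, X_maud, y_maud, cv=LeaveOneOut(), method="predict_proba"
)[:, 1]
maud_auroc = roc_auc_score(y_maud, loo_scores)

# FActScore: 70/15/15 split grouped by biography so no person straddles splits.
gss = GroupShuffleSplit(n_splits=1, train_size=0.70, random_state=0)
train_idx, rest_idx = next(gss.split(X_fs, y_fs, groups=bio_ids))
gss2 = GroupShuffleSplit(n_splits=1, train_size=0.50, random_state=0)
val_rel, test_rel = next(gss2.split(X_fs[rest_idx], groups=bio_ids[rest_idx]))
val_idx, test_idx = rest_idx[val_rel], rest_idx[test_rel]
```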
The headline — same gap, two domains
The simplest evaluation asks whether each score ranks correct claims above non-correct ones. AUROC is 0.5 at chance; higher is better. Brier score on the same scores measures how well the probability magnitudes match the binary labels; lower is better. A sketch of both metrics follows the table.
| Dataset | Label source | Self-report | Residual probe | Paired Δ |
|---|---|---|---|---|
| MAUD | GPT-5.4 judge proxy | 0.511 [0.411, 0.597] | 0.771 [0.687, 0.838] | +0.260 [0.137, 0.387] |
| FActScore | Human atomic facts | 0.541 [0.514, 0.570] | 0.802 [0.769, 0.833] | +0.262 [0.219, 0.302] |
| FELM-wk | Human segments | 0.511 [0.393, 0.673] | 0.652 [0.493, 0.815] | +0.141 [-0.063, 0.320] |
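A runnable sketch of the two metrics on toy arrays, including the constant-baseline reference used later in the FActScore discussion:

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

y = np.array([1, 0, 1, 1, 0])              # toy labels: 1 = correct claim
s = np.array([0.9, 0.2, 0.7, 0.6, 0.4])    # toy scores from one method

auroc = roc_auc_score(y, s)      # rank-only: 0.5 at chance, higher is better
brier = brier_score_loss(y, s)   # squared error of probabilities, lower is better

# Reference point: always predicting 0.5 scores Brier 0.25 on any binary labels.
assert brier_score_loss(y, np.full(len(y), 0.5)) == 0.25
```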
Why this is a replication, not a copy
MAUD uses leave-one-out over a small judge-labeled legal claim set. FActScore uses a 70/15/15 split by biography over a much larger human-labeled biography set. The protocols are not identical, and they should not be: 150 rows do not support what 4,886 do. What replicates is the qualitative gap between self-report and residual activations, with paired effect sizes of +0.260 and +0.262 AUROC. Treat the agreement as evidence about the signal, not the pipeline.
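The paired deltas above resample claims jointly. The article does not state its exact CI construction, so this percentile-bootstrap sketch is one plausible version:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def paired_delta_ci(y, s_probe, s_self, n_boot=10_000, seed=0):
    """Percentile bootstrap CI for AUROC(probe) - AUROC(self-report),
    resampling claims jointly so the comparison stays paired."""
    rng = np.random.default_rng(seed)
    deltas = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        if np.unique(y[idx]).size < 2:   # AUROC needs both classes present
            continue
        deltas.append(roc_auc_score(y[idx], s_probe[idx])
                      - roc_auc_score(y[idx], s_self[idx]))
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    return float(np.mean(deltas)), (float(lo), float(hi))
```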
Second judge — does the ordering survive?
MAUD's original labels come from a GPT-5.4 judge-proxy protocol. Probes trained on those labels could be fitting whatever GPT-5.4 happens to score consistently rather than something more general. The cleanest test is to relabel the same fixed claims with a different judge family and rescore the same probes without retraining. We use Kimi K2.6 through Prime Intellect Inference, with the prompt structure and claim serialization preserved. A rescoring sketch follows the table.
| Method | GPT-5.4 AUROC | Kimi AUROC | Agreement-set AUROC | Agreement Brier |
|---|---|---|---|---|
| Llama self-report | 0.511 [0.411, 0.597] | 0.466 [0.413, 0.604] | 0.486 [0.392, 0.598] | 0.509 [0.400, 0.600] |
| GPT-5.4 scorer | 0.944 [0.906, 0.984] | 0.872 [0.807, 0.926] | 0.981 [0.931, 1.000] | 0.101 [0.064, 0.144] |
| Residual probe | 0.771 [0.687, 0.838] | 0.707 [0.627, 0.792] | 0.793 [0.709, 0.870] | 0.231 [0.166, 0.299] |
| SAE probe | 0.677 [0.582, 0.763] | 0.652 [0.563, 0.735] | 0.712 [0.616, 0.812] | 0.268 [0.193, 0.350] |
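Rescoring without retraining is the crux. A minimal sketch, assuming hypothetical rubric-label arrays from both judges and the frozen leave-one-out probe scores; binarizing as true = 1, else 0 follows the article's correct-vs-non-correct framing but is an assumption here:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# labels_gpt, labels_kimi: hypothetical arrays over the same 150 claims with
# values in {"true", "partially_true", "false"}; loo_scores: frozen probe scores.
y_kimi = (labels_kimi == "true").astype(int)
auroc_judge2 = roc_auc_score(y_kimi, loo_scores)   # same probe, new labels

agree = labels_gpt == labels_kimi                  # the 110-claim agreement set
auroc_agree = roc_auc_score(y_kimi[agree], loo_scores[agree])
```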
Where the judges actually disagree
The two judges agree on 110 of 150 claims. Cohen's kappa is 0.572 [0.456, 0.682]. The disagreement is not uniform across the rubric. Stratifying by GPT-5.4's label shows where the boundary is hardest.
| GPT-5.4 label | n claims | Kimi agrees | Kimi disagrees | Agreement rate |
|---|---|---|---|---|
| true | 62 | 54 | 8 | 0.871 |
| partially_true | 68 | 41 | 27 | 0.603 |
| false | 20 | 15 | 5 | 0.750 |
| full set | 150 | 110 | 40 | 0.733 |
The judges agree most on clearly true claims, less on clearly false claims, and least on partially-true claims. That is exactly where the rubric boundary should be hardest, and it is the regime that drives the kappa down toward 0.57. The 110-claim agreement subset is therefore not just a robustness slice; it is the most interpretable subset MAUD has, because it strips out labels where two different judges meaningfully disagree.
On the 110-claim agreement set, every strong method gets cleaner: GPT-5.4 scoring rises to 0.981, the residual probe rises to 0.793, the SAE probe rises to 0.712. If a probe were primarily fitting GPT-5.4's quirks, restricting evaluation to claims another judge family also accepts should not help. The improvement is consistent with probes tracking signal that survives judge disagreement.
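The kappa and the stratified agreement rates are straightforward to reproduce; a sketch over the same hypothetical label arrays:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

kappa = cohen_kappa_score(labels_gpt, labels_kimi)   # ~0.57 on the full set

# Agreement rate stratified by the GPT-5.4 label, as in the table above.
for lab in ("true", "partially_true", "false"):
    mask = labels_gpt == lab
    rate = (labels_kimi[mask] == lab).mean()
    print(f"{lab}: n={mask.sum()}, agreement={rate:.3f}")
```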
FActScore probe diagnostics — not a fragile fit
FActScore changes everything except the probed model: human atomic-fact labels instead of LLM judges, ChatGPT generations instead of Llama generations, biography factuality instead of merger-agreement QA, atomic facts pooled to parent sentence spans instead of extracted legal claims. Despite all of that, the residual probe's held-out AUROC lands at 0.802. The diagnostic question is whether that test number reflects real generalization or a lucky validation split.
The Brier scores tell the same story without the rank-only filter. Self-report Brier is 0.298, worse than the 0.25 that a constant 0.5 prediction scores on any binary labels, so the probabilities Llama emits are actively miscalibrated; the residual probe Brier is 0.171, comfortably below that baseline. AUROC and Brier disagreeing would flag a rank-versus-magnitude trade-off; here they agree, and the probe wins on both.
The boundary — when labels measure coverage
FELM-wk is the third dataset, and it is the cleanest case for the cautionary half of the article. The minimum-viable run reaches 0.652 residual probe AUROC against 0.511 for self-report, with paired delta +0.141 [-0.063, 0.320]. Direction is positive. The interval crosses zero. That is what "directionally consistent but inconclusive" looks like in numbers.
The interesting finding in FELM is not the held-out AUROC. It is what happened when we tried to repair the annotation. Llama-generated FELM-style answers go beyond FELM's curated reference snippets often enough that calling them false against those snippets is the wrong target. We added fuller readability-extracted references and a fourth not_enough_evidence label. The repaired pilot's 52 segments came back dominated by the new label.
not_enough_evidence at 57.7% is not noise. It is a structural finding: when generated claims exceed the reference scope, a reference-bounded annotator cannot adjudicate them. The label tells us whether the bundle supports the claim, not whether the claim is true. That is a coverage measurement, not a correctness measurement, and it is the wrong target for a probe trained on claim correctness. We therefore did not scale the matched-generation FELM run; instead, FELM stays in the article as a documented boundary that explains why label provenance matters more than domain similarity.
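One operational consequence, sketched with hypothetical `pilot_segments` and `pilot_labels` arrays (only the 57.7% not_enough_evidence share of the 52 segments is reported): a correctness probe should train only on segments the annotator could actually adjudicate.

```python
from collections import Counter

counts = Counter(pilot_labels)    # four-way labels from the repaired protocol
nee_share = counts["not_enough_evidence"] / sum(counts.values())

# Coverage-vs-correctness guard: drop segments the reference bundle cannot settle.
adjudicable = [(seg, lab) for seg, lab in zip(pilot_segments, pilot_labels)
               if lab != "not_enough_evidence"]
```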
The methodological lesson
The strongest lesson is not "legal QA transfers to biography QA." It is that label provenance dominates. MAUD works because fixed claims are judged by a clear rubric. FActScore works because human atomic-fact labels directly capture factual support. FELM is weaker because segment labels and reference coverage are coarser than the claim-level correctness target the probe needs. Claim construction and label provenance affect probe trainability as much as the probe architecture does.
For a probe-based uncertainty system, the hard questions come before the classifier ever runs. What is the claim? What does the label actually capture? Does the activation span match the label unit? When those three line up, a regularized linear readout of the residual stream recovers a sizeable correctness signal. When they do not, a stronger probe will mostly learn a muddier target.
Bounds on the claim
- Does say: residual activations recover claim-level correctness signal that structured Llama self-report misses on MAUD and FActScore. The paired AUROC delta lands at +0.260 and +0.262 across the two settings.
- Does not say: probes detect legal or factual truth in an absolute sense. MAUD remains judge-proxy evidence until the frozen 30-claim human audit is analyzed.
- Does say: the GPT-5.4 external scorer is the strongest MAUD method, but its 0.944 AUROC inflates from same-family scorer-judge coupling. The defensible estimate is 0.872 under the independent judge.
- Does not say: residual probes reliably beat SAE probes. Residual is directionally stronger, but residual-minus-SAE intervals cross zero under judge 2 and on the agreement set.
- Does not say: the residual correctness direction decomposes into a small number of Goodfire layer-19 SAE features. A separate analysis (probe_interpretation.md) found the signal distributed; the probe is a useful behavioral readout, not a compact mechanism story.
- Does not say: the result generalizes to other model families, other layers, or other claim extractors. All probes are layer 19 of Llama 3.1 8B Instruct. CUAD failed at the claim-extraction step before this protocol could run.
What's next
Three follow-ups would most sharpen the claim, in roughly this order.
Close the human-audit loop
The 30-claim MAUD audit packet is frozen at data/annotations/maud_human_audit_packet.jsonl. Completed expert labels would convert MAUD's claim from "stable across two LLM judges" to "stable across two LLM judges, anchored to expert legal labels," or expose a systematic disagreement worth knowing about. Either outcome is publishable.
Matched-generation transfer that does not collapse
FActScore weakens the generation-mismatch worry but does not eliminate it. A matched-generation run on a dataset where Llama's claims do not routinely outrun the reference scope would close that gap. The FELM repair shows the failure mode to avoid: reference-bounded labels driven into not_enough_evidence.
Layer sweep and second model family
All probes are layer 19, chosen because the Goodfire SAE was trained there. A small layer sweep would document how localized the signal is. A second model family at 7-13B parameters would say whether the gap is a Llama representation quirk or a more general property of instruction-tuned residual streams.
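A sweep is a small loop over the same pipeline; a sketch assuming a hypothetical `resid` mapping from layer index to pooled claim vectors, with `probe`, `y`, `train_idx`, and `val_idx` as in the earlier sketches:

```python
from sklearn.metrics import roc_auc_score

sweep = {}
for layer in range(32):                    # Llama 3.1 8B has 32 decoder layers
    probe.fit(resid[layer][train_idx], y[train_idx])
    val_scores = probe.predict_proba(resid[layer][val_idx])[:, 1]
    sweep[layer] = roc_auc_score(y[val_idx], val_scores)
best_layer = max(sweep, key=sweep.get)     # how localized is the signal?
```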
The result is a bounded interpretability win. A linear readout of hidden activations sees something the model's own structured confidence misses, and the gap is the same size on legal QA and on biography factuality. It does not give us a compact SAE feature story, a legal-truth adjudicator, or a domain-general hallucination detector.
The remaining questions sit on label provenance and matched-generation transfer, not on probe architecture. Downstream tools building on this should expose claim construction and label-provenance assumptions to users, rather than hiding them behind opaque confidence scores. The probe is only as honest as the labels it was trained on.