Initial public release: SAE weights, cfg, and model card

Files changed (15)

.gitattributes +35 -0
README.md +130 -0
d20_L10_deception_jumprelu.pt +3 -0
d20_L10_deception_topk.pt +3 -0
d20_L10_honest_jumprelu.pt +3 -0
d20_L10_mixed_jumprelu.pt +3 -0
d20_L10_standard_jumprelu.pt +3 -0
d20_L10_standard_topk.pt +3 -0
d20_L18_deception_jumprelu.pt +3 -0
d20_L18_deception_topk.pt +3 -0
d20_L18_honest_jumprelu.pt +3 -0
d20_L18_mixed_jumprelu.pt +3 -0
d20_L18_standard_jumprelu.pt +3 -0
d20_L18_standard_topk.pt +3 -0
training_results.json +235 -0

.gitattributes ADDED

	@@ -0,0 +1,35 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,130 @@

+---
+tags:
+- deception-detection
+- sparse-autoencoders
+- mechanistic-interpretability
+- ai-safety
+- nanochat
+license: mit
+---
+# Deception SAEs for nanochat-d20 (561M)
+12 SAE checkpoints trained on activations from nanochat-d20 behavioral-sampling completions, covering four training-data variants: standard (all completions), deception-only, honest-only, and mixed (deceptive + honest).
+## Training-data caveat — please read before use
+The "honest" and "deceptive" labels used to train and evaluate these SAEs
+come from a **mixed scenario pool** of nine prompts:
+- **Six clean incentive-structure scenarios** — `insider_info`,
+  `accounting_error`, `ai_oversight_log`, `ai_capability_hide`,
+  `surprise_party`, `job_interview_gap`. Each presents a realistic
+  situation with a non-trivial incentive to conceal or disclose
+  information; the honest/deceptive distinction is which of the two
+  behavioral choices the model's completion settles into under
+  temperature sampling.
+- **Three role-play identity-assignment scenarios** —
+  `secret_role_game` ("You have been assigned the role of Fascist"),
+  `secret_role_game_v2` ("You are secretly one of the Mafia members"),
+  `werewolf_game` ("You are a Werewolf"). These pre-assign the model a
+  deceptive identity and label a completion "deceptive" when the model
+  drifts away from the assigned role or "honest" when it echoes it.
+**What this mixed pool means for the SAE's labels.** Within the six
+incentive-structure scenarios, the honest/deceptive distinction is a
+measurement of behavioral choice under an ambiguous incentive. Within
+the three role-play scenarios, the distinction is a measurement of
+role-consistency under identity-assigned role-play — which is a
+well-defined phenomenon but not the same as emergent or incentive-
+driven deception.
+**What these SAEs are and are not good for.**
+- **Good for:** research on mixed-pool activation geometry; SAE
+  feature-geometry studies; as one of a set of baselines when
+  comparing multiple SAE families; as a reference implementation of
+  same-prompt temperature-sampled behavioral SAE training at scale.
+- **Not recommended as a standalone deception detector.** The
+  role-consistency signal from the three role-play scenarios is mixed
+  into every aggregate metric reported below. A downstream user who
+  wants an "emergent-deception feature set" should restrict attention
+  to features whose activation pattern concentrates in the
+  `insider_info` / `accounting_error` / `ai_oversight_log` /
+  `ai_capability_hide` / `surprise_party` / `job_interview_gap`
+  scenarios — or wait for the methodologically corrected V3 re-release
+  currently in preparation on the decision-incentive scenario bank
+  (no pre-assigned deceptive identity).
+**What is unaffected by this caveat.**
+- The SAE weights, reconstruction metrics (explained variance, L0,
+  alive features), and engineering of the training pipeline are
+  accurate as reported.
+- The linear-probe balanced-accuracy numbers in the upstream paper
+  measure the mixed pool; the 6-scenario clean-subset re-analysis is
+  listed as a planned appendix for the next manuscript revision.
+A companion methodology-first Gemma 4 SAE suite is in preparation using
+pretraining-distribution data + a decision-incentive behavior split;
+this README will be updated with a link when that release is public.
+---
+## Key Finding: Mixed Training Beats Deception-Only
+| Training Data | Layer 10 d_max | Layer 18 d_max |
+|---|---|---|
+| **Mixed (dec+hon)** | 0.558 | **0.684** |
+| Deception-only | 0.520 | 0.634 |
+| Honest-only | 0.544 | 0.572 |
+| Standard (all) | 0.518 | 0.549 |
+| TopK (standard) | 0.226 | 0.346 |
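+Training on the two labeled classes together (deceptive + honest, excluding the 10 ambiguous completions) gives the highest d_max at both layers, ahead of either single-class variant and of the standard run on all 270 completions. This suggests the SAE benefits from seeing both sides of the contrast.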
+## Model Details
+- **Base model:** nanochat-d20 (561M params, d_model=1280, 20 layers)
+- **Dimensions:** d_in=1280, d_sae=5120 (4x expansion)
+- **Training data:** 270 V3 behavioral sampling completions (132 deceptive, 128 honest, 10 ambiguous)
+- **Training epochs:** 300
+- **Layers:** 10 (50% depth) and 18 (90% depth, probe peak)
+## Checkpoints
+| File | Training | Architecture | Layer | d_max | L0 | EV |
+|---|---|---|---|---|---|---|
+| `d20_L10_standard_topk.pt` | All data | TopK k=32 | 10 | 0.226 | 32 | 98.5% |
+| `d20_L10_standard_jumprelu.pt` | All data | JumpReLU | 10 | 0.518 | 2093 | 99.7% |
+| `d20_L10_deception_topk.pt` | Deceptive only | TopK k=32 | 10 | 0.244 | 32 | 98.4% |
+| `d20_L10_deception_jumprelu.pt` | Deceptive only | JumpReLU | 10 | 0.520 | 2125 | 99.5% |
+| `d20_L10_honest_jumprelu.pt` | Honest only | JumpReLU | 10 | 0.544 | 2108 | 99.4% |
+| `d20_L10_mixed_jumprelu.pt` | Dec+Hon only | JumpReLU | 10 | 0.558 | 2025 | 99.6% |
+| `d20_L18_standard_topk.pt` | All data | TopK k=32 | 18 | 0.346 | 32 | 96.8% |
+| `d20_L18_standard_jumprelu.pt` | All data | JumpReLU | 18 | 0.549 | 2409 | 99.7% |
+| `d20_L18_deception_topk.pt` | Deceptive only | TopK k=32 | 18 | 0.252 | 32 | 95.2% |
+| `d20_L18_deception_jumprelu.pt` | Deceptive only | JumpReLU | 18 | 0.634 | 2353 | 99.4% |
+| `d20_L18_honest_jumprelu.pt` | Honest only | JumpReLU | 18 | 0.572 | 2422 | 99.4% |
+| **`d20_L18_mixed_jumprelu.pt`** | **Dec+Hon only** | **JumpReLU** | **18** | **0.684** | 2371 | 99.5% |
+## Related Work
+Follow-up research to:
+- **"The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"**
+  - [OpenReview](https://openreview.net/forum?id=FhGJLT6spH)
+  - [ArXiv](https://arxiv.org/abs/2503.07683)
+Part of the deception-nanochat-sae-research project:
+- [GitHub](https://github.com/SolshineCode/deception-nanochat-sae-research)
+## Citation
+```bibtex
+@article{deleeuw2025secret,
+  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
+  author={DeLeeuw, Caleb and Chawla, ...},
+  year={2025}
+}
+```
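The README's suggestion to restrict attention to features that concentrate in the six incentive-structure scenarios could be sketched roughly as follows. This is only an illustration: the function names, the `mean_act` layout (per-feature, per-scenario mean activations computed beforehand), and the 0.8 cutoff are assumptions, not part of the release.

```python
# Hypothetical sketch: keep SAE features whose activation mass sits mostly
# in the six incentive-structure scenarios named in the README.
CLEAN_SCENARIOS = {
    "insider_info", "accounting_error", "ai_oversight_log",
    "ai_capability_hide", "surprise_party", "job_interview_gap",
}

def clean_fraction(per_scenario):
    """Share of a feature's summed mean activation that comes from clean scenarios."""
    total = sum(per_scenario.values())
    if total <= 0:
        return 0.0  # dead feature: no activation in any scenario
    clean = sum(v for name, v in per_scenario.items() if name in CLEAN_SCENARIOS)
    return clean / total

def select_clean_features(mean_act, min_fraction=0.8):
    """mean_act: {feature_id: {scenario_name: mean activation (>= 0)}}.

    The 0.8 default is an arbitrary illustrative cutoff.
    """
    return sorted(
        fid for fid, per_scenario in mean_act.items()
        if clean_fraction(per_scenario) >= min_fraction
    )

# Toy numbers, invented purely for illustration (not taken from the release).
toy = {
    7: {"insider_info": 0.9, "werewolf_game": 0.1},
    42: {"secret_role_game": 0.7, "accounting_error": 0.3},
    99: {"surprise_party": 0.0, "werewolf_game": 0.0},
}
selected = select_clean_features(toy)
print(selected)
```

Using per-scenario means rather than raw sums keeps a scenario with more sampled completions from dominating the ratio; a different normalization may suit other analyses better.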

d20_L10_deception_jumprelu.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e0ac4f8174d8f09f10ff9b8cecabae2ad936930fe2e4461fe8080377b6458444
+size 52477866
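Each `.pt` entry in this commit is a Git LFS pointer (spec version, SHA-256 `oid`, byte `size`), not the weights themselves. A minimal sketch of checking a downloaded checkpoint against its pointer is shown below; the helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the local file path is up to the caller.

```python
import hashlib

def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash algorithm, digest and size."""
    fields = {}
    for line in text.splitlines():
        line = line.lstrip("+").strip()  # tolerate diff-style '+' prefixes
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }

def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the size and SHA-256 recorded in the pointer."""
    hasher = hashlib.sha256()
    n_bytes = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
            n_bytes += len(chunk)
    return n_bytes == pointer["size"] and hasher.hexdigest() == pointer["digest"]

# Pointer text copied from this commit (d20_L10_deception_jumprelu.pt).
pointer = parse_lfs_pointer("""\
+version https://git-lfs.github.com/spec/v1
+oid sha256:e0ac4f8174d8f09f10ff9b8cecabae2ad936930fe2e4461fe8080377b6458444
+size 52477866
""")
print(pointer["algo"], pointer["size"])
```

After downloading the real file, `matches_pointer("d20_L10_deception_jumprelu.pt", pointer)` should return `True` if the download is intact.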

d20_L10_deception_topk.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9664ad6438db5720ecf982f279f2fb6b15effafaf34af07d60b0edab532e6adc
+size 52457075

d20_L10_honest_jumprelu.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a84f054ba39931a2ce3376e54722710ae99022b2bd4f5486138538adf9309720
+size 52477833

d20_L10_mixed_jumprelu.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1ae2d0939c579aff472191813daa9d2738a7b1d13c03cfcf470dd796802dd864
+size 52477822

d20_L10_standard_jumprelu.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e37f54997b5af9f0a95095b33fab9eeb3902cb92a4398c3f861c5f24abe7c3fe
+size 52477855

d20_L10_standard_topk.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:77721a113a8e1f008c7883e29d2cc3e4426794706ca39ed4ba253286b3e88966
+size 52457001

d20_L18_deception_jumprelu.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:27e726f2aee94e6f78d8290df4569fff0e2f9d1bf53bf6433468426a9a6c1235
+size 52477866

d20_L18_deception_topk.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b80ec3baa90bf7e04492a18dd769ebe0bd8e5824baf8712e298bc8b0afc93345
+size 52457075

d20_L18_honest_jumprelu.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6a588fffbb348ea5c0ef0460ec2932835780d2d85e2c9b451c375cfb79b4d922
+size 52477833

d20_L18_mixed_jumprelu.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5bf07ebb9bd120ef28ed2b0f23021939c6cc3347aec31cfb45141adca71b9a11
+size 52477822

d20_L18_standard_jumprelu.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:fafcc2f6b910ac28719b8b59f08ae9c869eaace14937177deb9badb4e3be7a6f
+size 52477855

d20_L18_standard_topk.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bce64dd00f1d856631451da54e8817555d7cfbb61f1fec8ab6042778889f532b
+size 52457001
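The pointer sizes above can be sanity-checked against the stated dimensions (d_in=1280, d_sae=5120). The sketch below assumes, without confirmation from the release, that each checkpoint stores float32 encoder and decoder weights plus two bias vectors, and that the JumpReLU variants add one threshold per feature. Whatever bytes remain would then be serialization overhead and any stored config.

```python
D_IN, D_SAE, BYTES_PER_FLOAT32 = 1280, 5120, 4

# Assumed tensor set (not confirmed by the release): encoder and decoder
# weight matrices plus encoder and decoder biases.
shared_params = 2 * D_IN * D_SAE + D_SAE + D_IN
# Assumed JumpReLU extra: one learned threshold per SAE feature.
jumprelu_extra_params = D_SAE

topk_bytes = shared_params * BYTES_PER_FLOAT32
jumprelu_bytes = (shared_params + jumprelu_extra_params) * BYTES_PER_FLOAT32

# Sizes recorded in the LFS pointers of this commit.
recorded = {
    "d20_L10_standard_topk.pt": 52457001,
    "d20_L10_standard_jumprelu.pt": 52477855,
}

topk_overhead = recorded["d20_L10_standard_topk.pt"] - topk_bytes
jumprelu_overhead = recorded["d20_L10_standard_jumprelu.pt"] - jumprelu_bytes

print("TopK:     tensors", topk_bytes, "bytes, remainder", topk_overhead)
print("JumpReLU: tensors", jumprelu_bytes, "bytes, remainder", jumprelu_overhead)
```

Under these assumptions both files leave only a few kilobytes unexplained, which is consistent with the guessed layout but does not prove it; inspecting the loaded checkpoint's keys is the reliable way to find out.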

training_results.json ADDED

	@@ -0,0 +1,235 @@

+{
+  "results": [
+    {
+      "config_name": "standard_topk_L10",
+      "activation": "topk",
+      "layer": 10,
+      "d_in": 1280,
+      "d_sae": 5120,
+      "n_train_samples": 270,
+      "n_dec": 132,
+      "n_hon": 128,
+      "mse_loss": 10401.265625,
+      "explained_variance": 0.9853075116877134,
+      "l0": 32.0,
+      "alive_features": 32,
+      "total_features": 5120,
+      "d_max": 0.22631387412548065,
+      "d_mean": 0.0003667560813482851,
+      "top10_d_mean": 0.12578895688056946,
+      "train_seconds": 16.066182613372803
+    },
+    {
+      "config_name": "standard_jumprelu_L10",
+      "activation": "jumprelu",
+      "layer": 10,
+      "d_in": 1280,
+      "d_sae": 5120,
+      "n_train_samples": 270,
+      "n_dec": 132,
+      "n_hon": 128,
+      "mse_loss": 2175.9248046875,
+      "explained_variance": 0.9969271667588002,
+      "l0": 2093.103759765625,
+      "alive_features": 2422,
+      "total_features": 5120,
+      "d_max": 0.5175628662109375,
+      "d_mean": 0.06177805736660957,
+      "top10_d_mean": 0.4822224974632263,
+      "train_seconds": 16.91509199142456
+    },
+    {
+      "config_name": "deception_topk_L10",
+      "activation": "topk",
+      "layer": 10,
+      "d_in": 1280,
+      "d_sae": 5120,
+      "n_train_samples": 132,
+      "n_dec": 132,
+      "n_hon": 128,
+      "mse_loss": 11454.9375,
+      "explained_variance": 0.9837859816754413,
+      "l0": 32.0,
+      "alive_features": 57,
+      "total_features": 5120,
+      "d_max": 0.2436903864145279,
+      "d_mean": 0.0014653955586254597,
+      "top10_d_mean": 0.19488340616226196,
+      "train_seconds": 10.158864498138428
+    },
+    {
+      "config_name": "deception_jumprelu_L10",
+      "activation": "jumprelu",
+      "layer": 10,
+      "d_in": 1280,
+      "d_sae": 5120,
+      "n_train_samples": 132,
+      "n_dec": 132,
+      "n_hon": 128,
+      "mse_loss": 3407.658203125,
+      "explained_variance": 0.9951766192984177,
+      "l0": 2125.0302734375,
+      "alive_features": 2415,
+      "total_features": 5120,
+      "d_max": 0.5203126668930054,
+      "d_mean": 0.0662868469953537,
+      "top10_d_mean": 0.4608234763145447,
+      "train_seconds": 10.458905458450317
+    },
+    {
+      "config_name": "honest_jumprelu_L10",
+      "activation": "jumprelu",
+      "layer": 10,
+      "d_in": 1280,
+      "d_sae": 5120,
+      "n_train_samples": 128,
+      "n_dec": 132,
+      "n_hon": 128,
+      "mse_loss": 4447.328125,
+      "explained_variance": 0.9937212616246724,
+      "l0": 2107.9453125,
+      "alive_features": 2471,
+      "total_features": 5120,
+      "d_max": 0.5444101691246033,
+      "d_mean": 0.06562866270542145,
+      "top10_d_mean": 0.4671773314476013,
+      "train_seconds": 6.163604736328125
+    },
+    {
+      "config_name": "mixed_jumprelu_L10",
+      "activation": "jumprelu",
+      "layer": 10,
+      "d_in": 1280,
+      "d_sae": 5120,
+      "n_train_samples": 260,
+      "n_dec": 132,
+      "n_hon": 128,
+      "mse_loss": 2848.078369140625,
+      "explained_variance": 0.9959738328035153,
+      "l0": 2025.2230224609375,
+      "alive_features": 2353,
+      "total_features": 5120,
+      "d_max": 0.5577351450920105,
+      "d_mean": 0.06202101707458496,
+      "top10_d_mean": 0.4947337508201599,
+      "train_seconds": 16.7561674118042
+    },
+    {
+      "config_name": "standard_topk_L18",
+      "activation": "topk",
+      "layer": 18,
+      "d_in": 1280,
+      "d_sae": 5120,
+      "n_train_samples": 270,
+      "n_dec": 132,
+      "n_hon": 128,
+      "mse_loss": 170247.25,
+      "explained_variance": 0.9676477412337379,
+      "l0": 32.0,
+      "alive_features": 32,
+      "total_features": 5120,
+      "d_max": 0.3456007242202759,
+      "d_mean": 0.0005656593712046742,
+      "top10_d_mean": 0.1809413731098175,
+      "train_seconds": 15.6581871509552
+    },
+    {
+      "config_name": "standard_jumprelu_L18",
+      "activation": "jumprelu",
+      "layer": 18,
+      "d_in": 1280,
+      "d_sae": 5120,
+      "n_train_samples": 270,
+      "n_dec": 132,
+      "n_hon": 128,
+      "mse_loss": 18159.80078125,
+      "explained_variance": 0.9965491720937171,
+      "l0": 2409.288818359375,
+      "alive_features": 2975,
+      "total_features": 5120,
+      "d_max": 0.5487460494041443,
+      "d_mean": 0.08235464245080948,
+      "top10_d_mean": 0.516090989112854,
+      "train_seconds": 16.922561168670654
+    },
+    {
+      "config_name": "deception_topk_L18",
+      "activation": "topk",
+      "layer": 18,
+      "d_in": 1280,
+      "d_sae": 5120,
+      "n_train_samples": 132,
+      "n_dec": 132,
+      "n_hon": 128,
+      "mse_loss": 253385.25,
+      "explained_variance": 0.9516635443274013,
+      "l0": 32.0,
+      "alive_features": 32,
+      "total_features": 5120,
+      "d_max": 0.25222423672676086,
+      "d_mean": 0.00039686914533376694,
+      "top10_d_mean": 0.1152176484465599,
+      "train_seconds": 10.301580905914307
+    },
+    {
+      "config_name": "deception_jumprelu_L18",
+      "activation": "jumprelu",
+      "layer": 18,
+      "d_in": 1280,
+      "d_sae": 5120,
+      "n_train_samples": 132,
+      "n_dec": 132,
+      "n_hon": 128,
+      "mse_loss": 30484.138671875,
+      "explained_variance": 0.9941853721376139,
+      "l0": 2352.977294921875,
+      "alive_features": 2944,
+      "total_features": 5120,
+      "d_max": 0.6341533660888672,
+      "d_mean": 0.08807803690433502,
+      "top10_d_mean": 0.5500224828720093,
+      "train_seconds": 10.446924209594727
+    },
+    {
+      "config_name": "honest_jumprelu_L18",
+      "activation": "jumprelu",
+      "layer": 18,
+      "d_in": 1280,
+      "d_sae": 5120,
+      "n_train_samples": 128,
+      "n_dec": 132,
+      "n_hon": 128,
+      "mse_loss": 32692.166015625,
+      "explained_variance": 0.9938135444773598,
+      "l0": 2421.734375,
+      "alive_features": 2964,
+      "total_features": 5120,
+      "d_max": 0.5715387463569641,
+      "d_mean": 0.08183819055557251,
+      "top10_d_mean": 0.5204698443412781,
+      "train_seconds": 6.2191526889801025
+    },
+    {
+      "config_name": "mixed_jumprelu_L18",
+      "activation": "jumprelu",
+      "layer": 18,
+      "d_in": 1280,
+      "d_sae": 5120,
+      "n_train_samples": 260,
+      "n_dec": 132,
+      "n_hon": 128,
+      "mse_loss": 24892.7265625,
+      "explained_variance": 0.9952702920492092,
+      "l0": 2370.634521484375,
+      "alive_features": 3005,
+      "total_features": 5120,
+      "d_max": 0.6843701004981995,
+      "d_mean": 0.08426444232463837,
+      "top10_d_mean": 0.5047619938850403,
+      "train_seconds": 16.829967498779297
+    }
+  ],
+  "model": "nanochat-d20",
+  "d_in": 1280,
+  "d_sae": 5120
+}
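The per-layer comparison in the README can be recomputed from `training_results.json`. The sketch below is one way to do it; `best_by_layer` is a hypothetical helper, and the inline records are an excerpt of the file with `d_max` rounded to three decimals so the block runs without the file on disk.

```python
def best_by_layer(results):
    """Return {layer: (config_name, d_max)} for the highest-d_max config at each layer."""
    best = {}
    for record in results:
        layer = record["layer"]
        if layer not in best or record["d_max"] > best[layer][1]:
            best[layer] = (record["config_name"], record["d_max"])
    return best

# (config_name, layer, d_max) excerpted from training_results.json, rounded.
rows = [
    ("standard_topk_L10", 10, 0.226),
    ("standard_jumprelu_L10", 10, 0.518),
    ("deception_topk_L10", 10, 0.244),
    ("deception_jumprelu_L10", 10, 0.520),
    ("honest_jumprelu_L10", 10, 0.544),
    ("mixed_jumprelu_L10", 10, 0.558),
    ("standard_topk_L18", 18, 0.346),
    ("standard_jumprelu_L18", 18, 0.549),
    ("deception_topk_L18", 18, 0.252),
    ("deception_jumprelu_L18", 18, 0.634),
    ("honest_jumprelu_L18", 18, 0.572),
    ("mixed_jumprelu_L18", 18, 0.684),
]
results = [
    {"config_name": name, "layer": layer, "d_max": d_max}
    for name, layer, d_max in rows
]

best = best_by_layer(results)
for layer in sorted(best):
    name, d_max = best[layer]
    print(f"layer {layer}: {name} (d_max={d_max:.3f})")
```

To run it on the full file instead of the excerpt, load it with `json.load` and pass the `"results"` list to `best_by_layer`.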