TensionLM-117M-TS-Reasoner-v3
This is the CPU-only TS reasoning layer built around the frozen TensionLM-117M-Reasoning-v2 release.
It is not a new dense checkpoint. It is a small auditable reasoning shell:
- parse the prompt into a tiny constraint graph,
- solve/verify arithmetic, transitivity, or small Python facts on CPU,
- return the first answer directly,
- fall back to the frozen TensionLM model only when no rule matches.
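The steps above can be sketched as a rule-first dispatcher. This is a minimal illustration, not the contents of cpu_reasoner.py: the function names, the regular expressions, and the `fallback` callable (a stand-in for the frozen TensionLM model) are all assumptions made for the example.

```python
import re
from typing import Callable, Optional, Tuple


def solve_arithmetic(prompt: str) -> Optional[str]:
    """Hypothetical rule: handle prompts such as '46 plus 9 equals'."""
    m = re.search(r"(-?\d+)\s+(plus|minus|times)\s+(-?\d+)\s+equals", prompt)
    if not m:
        return None
    a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
    return str({"plus": a + b, "minus": a - b, "times": a * b}[op])


def solve_transitivity(prompt: str) -> Optional[str]:
    """Hypothetical rule: build a small 'leads to' graph and follow it."""
    text = prompt.strip()
    query = re.search(r"then\s+(\w+)\s+leads to\s*$", text)
    if not query:
        return None
    start = query.group(1)
    # Premises are everything before the final "then X leads to" query.
    edges = dict(re.findall(r"(\w+)\s+leads to\s+(\w+)", text[: query.start()]))
    node, seen = start, set()
    while node in edges and node not in seen:  # stop on cycles
        seen.add(node)
        node = edges[node]
    return None if node == start else node


def solve_python_fact(prompt: str) -> Optional[str]:
    """Hypothetical rule: last element of list(range(n)) for n > 0."""
    m = re.search(r"list\(range\((\d+)\)\)\s+ends with", prompt)
    if not m or int(m.group(1)) == 0:
        return None
    return str(int(m.group(1)) - 1)


RULES = [solve_arithmetic, solve_transitivity, solve_python_fact]


def reason(prompt: str, fallback: Callable[[str], str]) -> Tuple[str, str]:
    """Return (answer, source). The fallback runs only when no rule matches."""
    for rule in RULES:
        answer = rule(prompt)
        if answer is not None:
            return answer, rule.__name__
    return fallback(prompt), "fallback"
```

Returning the name of the rule that fired alongside the answer keeps each result traceable to a specific solver, which is one way to make such a layer auditable.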
Held-out TAC v2 result
Benchmark: heldout_formal_tac_v2, 120 prompts, balanced as 40/40/40
arithmetic/code/transitivity.
| System | Prefix match | Substring match |
|---|---|---|
| GPT-2 124M | 3/120 | 5/120 |
| Base TensionLM 117M | 7/120 | 11/120 |
| TensionLM-117M-Reasoning-v2 | 20/120 | 21/120 |
| TS-Reasoner-v3 CPU layer | 120/120 | 120/120 |
This is a system score, not a raw language-model score. Most of the jump comes from moving reasoning into explicit TS graph/verifier operations rather than trying to retrain all 117M parameters without a GPU.
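The two table columns are not defined in this card. One plausible reading, assumed here purely for illustration, is that a prefix hit means the output starts with the expected answer and a substring hit means the expected answer appears anywhere in the output. Under that assumption the counts could be computed as below; the helper names and the normalization are hypothetical and may differ from eval_cpu_reasoner.py.

```python
from typing import Iterable, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def prefix_match(output: str, expected: str) -> bool:
    """Assumed definition: the output begins with the expected answer."""
    return normalize(output).startswith(normalize(expected))


def substring_match(output: str, expected: str) -> bool:
    """Assumed definition: the expected answer occurs anywhere in the output."""
    return normalize(expected) in normalize(output)


def score(pairs: Iterable[Tuple[str, str]]) -> Tuple[int, int, int]:
    """Return (prefix_hits, substring_hits, total) over (output, expected) pairs."""
    prefix_hits = substring_hits = total = 0
    for output, expected in pairs:
        total += 1
        prefix_hits += prefix_match(output, expected)
        substring_hits += substring_match(output, expected)
    return prefix_hits, substring_hits, total
```

With these definitions every prefix hit is also a substring hit, which is consistent with the substring count being at least as large as the prefix count in each row of the table.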
Usage
python inference.py --prompt "If red leads to blue and blue leads to green, then red leads to" --category transitivity
python inference.py --prompt "46 plus 9 equals" --category arithmetic
python inference.py --prompt "In Python, list(range(8)) ends with" --category code_reasoning
Files
- cpu_reasoner.py: auditable CPU solver/verifier layer.
- inference.py: minimal CLI.
- eval_cpu_reasoner.py: formal-eval-compatible evaluator.
- eval/ts_reasoner_v3_heldout_tac_v2_seed42.json: full eval receipt.
- config.json: release metadata and fallback model reference.
Limitations
This artifact is intentionally narrow. It handles short formal prompts that match the implemented arithmetic, transitivity, and Python-fact rules. It is not a chat assistant and it is not evidence that the raw frozen model alone solved all 120 prompts. The point is the no-GPU path: freeze the language substrate, then move capability through inspectable graph/verifier machinery.