TensionLM-117M-TS-Reasoner-v4
This is the CPU-only graph/program interpreter layer for the frozen TensionLM-117M-Reasoning-v2 substrate.
v3 showed that TAC v2 was too easy: direct rules alone solve it (120/120), yet the same rules score 0/120 on TAC v3. v4 raises the mechanism:
- parse transitivity prompts into explicit edge graphs and follow the active chain,
- parse arithmetic word problems into operation traces,
- evaluate constrained Python/data-flow snippets with a safe interpreter,
- return the first answer directly, with a trace of the rule/graph used.
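As a rough illustration of the first bullet, the sketch below parses "X forwards to Y" statements into an edge map and follows the chain from a given start node while recording a trace. It is a minimal sketch, not the code in `cpu_reasoner.py`: the regex `EDGE_RE` and the helpers `parse_edges` and `follow_chain` are hypothetical names, and the real parser may accept other relation phrasings.

```python
import re

# Assumption: edges are phrased as "<src> forwards to <dst>".
# The actual interpreter may recognise more relation wordings.
EDGE_RE = re.compile(r"(\w+) forwards to (\w+)")


def parse_edges(prompt):
    """Return a dict mapping each source node to its first stated target."""
    edges = {}
    for src, dst in EDGE_RE.findall(prompt):
        edges.setdefault(src.lower(), dst.lower())
    return edges


def follow_chain(edges, start):
    """Walk from start until a node has no outgoing edge or a cycle appears."""
    trace = [start]
    seen = {start}
    node = start
    while node in edges:
        node = edges[node]
        if node in seen:  # stop rather than loop forever on a cycle
            break
        seen.add(node)
        trace.append(node)
    return trace[-1], trace


prompt = (
    "A routing table says alpha forwards to beta; beta forwards to gamma; "
    "gamma forwards to delta. A separate note says epsilon forwards to beta. "
    "Following only the chain that starts at alpha, the final stop is"
)
answer, trace = follow_chain(parse_edges(prompt), "alpha")
print(answer, trace)
```

The distractor edge from epsilon is parsed into the graph but never visited, because the walk only follows edges reachable from the start node.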
This is not a new dense LLM checkpoint. It is the no-GPU path: freeze the language substrate and move reasoning into inspectable CPU graph/verifier machinery.
Eval receipts
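Scores are correct answers out of 120 per benchmark, seed 42. Model rows are raw language-model scores; TS-Reasoner rows are system scores.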
| System | TAC v2 Prefix | TAC v3 Prefix |
|---|---|---|
| GPT-2 124M | 3/120 | 0/120 |
| Base TensionLM 117M | 7/120 | 1/120 |
| TensionLM-117M-Reasoning-v2 | 20/120 | 2/120 |
| TS-Reasoner-v3 | 120/120 | 0/120 |
| TS-Reasoner-v4 | 120/120 | 120/120 |
TAC v3 contains multi-hop transitivity with distractors, multi-operation word problems, and executable Python/data-flow snippets. The score is a system score, not a raw language-model score.
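To make the multi-operation word problem family concrete, here is a minimal sketch of turning such a prompt into an operation trace. The patterns `START_RE` and `OP_RE`, the `OPS` table, and the function `run_trace` are assumptions made for illustration and cover only add, subtract, and multiply; the released interpreter may parse these prompts differently.

```python
import operator
import re

# Assumption: the prompt states "starts at <n>" followed by verb/number pairs.
START_RE = re.compile(r"starts at (-?\d+)")
OP_RE = re.compile(r"\b(add|subtract|multiply)\b[^0-9-]*(-?\d+)", re.IGNORECASE)

OPS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
}


def run_trace(prompt):
    """Apply each stated operation in order and record every step."""
    start = START_RE.search(prompt)
    if start is None:
        raise ValueError("no starting value found")
    value = int(start.group(1))
    trace = [("start", value, value)]
    for name, operand in OP_RE.findall(prompt):
        name = name.lower()
        value = OPS[name](value, int(operand))
        trace.append((name, int(operand), value))  # (op, operand, running value)
    return value, trace


result, steps = run_trace(
    "A value starts at 36. Add 8, multiply the result by 5, "
    "then subtract 1. The final value is"
)
print(result)  # 219
for step in steps:
    print(step)
```

Each trace entry keeps the operation, its operand, and the running value, so an incorrect answer can be traced to the step that produced it.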
Usage
python inference.py --prompt "A routing table says alpha forwards to beta; beta forwards to gamma; gamma forwards to delta. A separate note says epsilon forwards to beta. Following only the chain that starts at alpha, the final stop is" --category transitivity --show_trace
python inference.py --prompt "A value starts at 36. Add 8, multiply the result by 5, then subtract 1. The final value is" --category arithmetic --show_trace
python inference.py --prompt "Given xs = [14, 3, 17], the expression xs[0] + xs[-1] evaluates to" --category code_reasoning --show_trace
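For the `code_reasoning` category, one common way to evaluate a snippet without calling `eval()` is to walk the parsed syntax tree and allow only a small set of node types. The sketch below shows that idea for integer literals, lists, names, indexing, and basic arithmetic. The function `safe_eval` and the whitelist `BIN_OPS` are hypothetical and not taken from `cpu_reasoner.py`; the subscript handling assumes Python 3.9 or later.

```python
import ast
import operator

# Whitelisted binary operators; anything else is rejected.
BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


def safe_eval(expr, env):
    """Evaluate a small expression subset by walking the AST."""

    def visit(node):
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.Name):
            return env[node.id]  # only names supplied by the caller
        if isinstance(node, ast.List):
            return [visit(element) for element in node.elts]
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -visit(node.operand)
        if isinstance(node, ast.BinOp) and type(node.op) in BIN_OPS:
            return BIN_OPS[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.Subscript):
            # Python 3.9+: node.slice is the index expression itself.
            return visit(node.value)[visit(node.slice)]
        raise ValueError(f"unsupported syntax: {type(node).__name__}")

    return visit(ast.parse(expr, mode="eval"))


env = {"xs": safe_eval("[14, 3, 17]", {})}
value = safe_eval("xs[0] + xs[-1]", env)
print(value)  # 31
```

Function calls, attribute access, and imports fall through to the final `raise`, so an input such as `__import__('os')` is rejected instead of executed.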
Files
- cpu_reasoner.py - auditable CPU graph/program interpreter.
- inference.py - minimal CLI.
- eval_cpu_reasoner.py - formal-eval-compatible evaluator.
- eval/heldout_formal_tac_v2.json and eval/heldout_formal_tac_v3.json - benchmarks.
- eval/ts_reasoner_v4_heldout_tac_v*_seed42.json - full eval receipts.
- config.json - release metadata and fallback model reference.
Limitations
This artifact is narrow by design. It handles only the formal prompt families the interpreter can parse. It is not a chat assistant, not an autonomous theorem prover, and not evidence that the frozen raw model alone solved TAC v3. The value is architectural: without a GPU, capability is added through inspectable constraint-graph and verifier operations rather than dense retraining.