TensionLM-117M-TS-Reasoner-v4

This is the CPU-only graph/program interpreter layer for the frozen TensionLM-117M-Reasoning-v2 substrate.

v3 showed that TAC v2 can be solved by direct rules alone. v4 replaces those rules with a stronger mechanism:

parse transitivity prompts into explicit edge graphs and follow the active chain,
parse arithmetic word problems into operation traces,
evaluate constrained Python/data-flow snippets with a safe interpreter,
return the first answer directly, with a trace of the rule/graph used.

This is not a new dense LLM checkpoint. It is the no-GPU path: freeze the language substrate and move reasoning into inspectable CPU graph/verifier machinery.

Eval receipts

Raw-model and system scores (correct answers out of 120), seed 42.

System	TAC v2 Prefix	TAC v3 Prefix
GPT-2 124M	3/120	0/120
Base TensionLM 117M	7/120	1/120
TensionLM-117M-Reasoning-v2	20/120	2/120
TS-Reasoner-v3	120/120	0/120
TS-Reasoner-v4	120/120	120/120

TAC v3 contains multi-hop transitivity with distractors, multi-operation word problems, and executable Python/data-flow snippets. The TS-Reasoner scores are system scores, not raw language-model scores.

Usage

python inference.py --prompt "A routing table says alpha forwards to beta; beta forwards to gamma; gamma forwards to delta. A separate note says epsilon forwards to beta. Following only the chain that starts at alpha, the final stop is" --category transitivity --show_trace
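As an illustration of what "parse into an edge graph and follow the active chain" can mean for the prompt above, here is a minimal sketch. It is not the code in cpu_reasoner.py: the function names `parse_edges` and `follow_chain`, the regular expression, and the assumption that every node has at most one outgoing edge are all hypothetical choices made for this example.

```python
import re

def parse_edges(prompt):
    """Extract every 'X forwards to Y' pair as a directed edge X -> Y.

    Assumption for this sketch: each node has at most one outgoing edge,
    so a plain dict is enough to hold the graph.
    """
    return dict(re.findall(r"(\w+) forwards to (\w+)", prompt))

def follow_chain(edges, start):
    """Walk the graph from `start` until a node has no outgoing edge."""
    path = [start]
    seen = {start}
    while path[-1] in edges:
        nxt = edges[path[-1]]
        if nxt in seen:  # stop on a cycle instead of looping forever
            break
        path.append(nxt)
        seen.add(nxt)
    return path

prompt = (
    "A routing table says alpha forwards to beta; beta forwards to gamma; "
    "gamma forwards to delta. A separate note says epsilon forwards to beta. "
    "Following only the chain that starts at alpha, the final stop is"
)
edges = parse_edges(prompt)
path = follow_chain(edges, "alpha")
print(" -> ".join(path))  # the trace
print(path[-1])           # the answer
```

The distractor edge from epsilon is parsed into the graph but never visited, because the walk only follows edges reachable from the stated start node.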

python inference.py --prompt "A value starts at 36. Add 8, multiply the result by 5, then subtract 1. The final value is" --category arithmetic --show_trace
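The arithmetic prompt above could be turned into an operation trace along the following lines. This is a hedged sketch rather than the released parser: `trace_arithmetic`, the three-verb vocabulary, and the regular expressions are assumptions for illustration, and only non-negative integer operands are handled.

```python
import operator
import re

# Hypothetical verb vocabulary for this sketch.
OPS = {"add": operator.add, "subtract": operator.sub, "multiply": operator.mul}

def trace_arithmetic(prompt):
    """Return (final_value, trace) for a 'starts at N. Add a, multiply ... by b' prompt."""
    start = int(re.search(r"starts at (\d+)", prompt).group(1))
    # Each step is a verb followed (after optional non-digit words) by its operand.
    steps = re.findall(r"(add|subtract|multiply)\D*?(\d+)", prompt, flags=re.IGNORECASE)
    value = start
    trace = [("start", start, value)]
    for verb, operand in steps:
        verb = verb.lower()
        value = OPS[verb](value, int(operand))
        trace.append((verb, int(operand), value))
    return value, trace

prompt = "A value starts at 36. Add 8, multiply the result by 5, then subtract 1. The final value is"
answer, trace = trace_arithmetic(prompt)
for step in trace:
    print(step)
print(answer)
```

Keeping the running value in each trace entry makes every intermediate step checkable by hand: 36, then 44, then 220, then 219.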

python inference.py --prompt "Given xs = [14, 3, 17], the expression xs[0] + xs[-1] evaluates to" --category code_reasoning --show_trace
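One common way to evaluate constrained snippets like the one above without calling `eval` is to walk the parsed syntax tree and allow only a whitelist of node types. The sketch below shows that technique; it is an assumption about how such an interpreter could look, not a copy of cpu_reasoner.py. The name `safe_eval` and the chosen whitelist are hypothetical, and the subscript handling assumes Python 3.9 or later, where `ast.Subscript.slice` is a plain expression node.

```python
import ast
import operator

# Whitelisted binary operators (an illustrative choice for this sketch).
BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}

def safe_eval(node, env):
    """Evaluate a small expression tree; reject anything outside the whitelist."""
    if isinstance(node, ast.Expression):
        return safe_eval(node.body, env)
    if isinstance(node, ast.Constant) and isinstance(node.value, int):
        return node.value
    if isinstance(node, ast.Name):
        return env[node.id]  # only names bound in env are visible
    if isinstance(node, ast.List):
        return [safe_eval(elt, env) for elt in node.elts]
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -safe_eval(node.operand, env)
    if isinstance(node, ast.BinOp) and type(node.op) in BIN_OPS:
        left = safe_eval(node.left, env)
        right = safe_eval(node.right, env)
        return BIN_OPS[type(node.op)](left, right)
    if isinstance(node, ast.Subscript):
        return safe_eval(node.value, env)[safe_eval(node.slice, env)]
    raise ValueError(f"unsupported syntax: {type(node).__name__}")

# Bind xs from the prompt, then evaluate the expression against that binding.
env = {"xs": safe_eval(ast.parse("[14, 3, 17]", mode="eval"), {})}
result = safe_eval(ast.parse("xs[0] + xs[-1]", mode="eval"), env)
print(result)
```

Function calls, attribute access, and imports all fall through to the final `ValueError`, so an input such as `__import__('os')` is rejected rather than executed.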

Files

cpu_reasoner.py - auditable CPU graph/program interpreter.
inference.py - minimal CLI.
eval_cpu_reasoner.py - formal-eval-compatible evaluator.
eval/heldout_formal_tac_v2.json and eval/heldout_formal_tac_v3.json - benchmarks.
eval/ts_reasoner_v4_heldout_tac_v*_seed42.json - full eval receipts.
config.json - release metadata and fallback model reference.

Limitations

This artifact is narrow by design. It handles only the formal prompt families the interpreter can parse. It is not a chat assistant, not an autonomous theorem prover, and not evidence that the frozen raw model alone solved TAC v3. The value is architectural: without a GPU, capability is added through inspectable constraint-graph and verifier operations rather than dense retraining.
