TensionLM-117M-TS-Reasoner-v4

This is the CPU-only graph/program interpreter layer for the frozen TensionLM-117M-Reasoning-v2 substrate.

v3 showed that TAC v2 could be solved with direct rules alone, which made it too easy a benchmark. v4 strengthens the mechanism:

  1. parse transitivity prompts into explicit edge graphs and follow the active chain,
  2. parse arithmetic word problems into operation traces,
  3. evaluate constrained Python/data-flow snippets with a safe interpreter,
  4. return the first answer directly, with a trace of the rule/graph used.

This is not a new dense LLM checkpoint. It is the no-GPU path: freeze the language substrate and move reasoning into inspectable CPU graph/verifier machinery.

Eval receipts

Raw/system scores, seed 42.

System                        TAC v2 Prefix   TAC v3 Prefix
GPT-2 124M                    3/120           0/120
Base TensionLM 117M           7/120           1/120
TensionLM-117M-Reasoning-v2   20/120          2/120
TS-Reasoner-v3                120/120         0/120
TS-Reasoner-v4                120/120         120/120

TAC v3 contains multi-hop transitivity with distractors, multi-operation word problems, and executable Python/data-flow snippets. The score is a system score, not a raw language-model score.

Usage

python inference.py --prompt "A routing table says alpha forwards to beta; beta forwards to gamma; gamma forwards to delta. A separate note says epsilon forwards to beta. Following only the chain that starts at alpha, the final stop is" --category transitivity --show_trace
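
To make the transitivity step concrete, here is a minimal sketch of parsing "X forwards to Y" statements into an edge map and following only the chain that starts at a given node. It is an illustration under stated assumptions, not the code in cpu_reasoner.py: the function name `follow_chain`, the single "forwards to" phrasing, and the one-outgoing-edge-per-node assumption are all hypothetical.

```python
import re

def follow_chain(prompt, start):
    """Build an edge map from 'X forwards to Y' statements and walk from `start`.

    Hypothetical helper: assumes each node has at most one outgoing edge.
    """
    edges = dict(re.findall(r"(\w+) forwards to (\w+)", prompt))
    path = [start]
    seen = {start}
    while path[-1] in edges:
        nxt = edges[path[-1]]
        if nxt in seen:  # stop on a cycle instead of looping forever
            break
        path.append(nxt)
        seen.add(nxt)
    return path[-1], path

prompt = (
    "A routing table says alpha forwards to beta; beta forwards to gamma; "
    "gamma forwards to delta. A separate note says epsilon forwards to beta. "
    "Following only the chain that starts at alpha, the final stop is"
)
answer, trace = follow_chain(prompt, "alpha")
print(answer)  # delta
print(trace)   # ['alpha', 'beta', 'gamma', 'delta']
```

The distractor edge from epsilon is parsed into the map but never visited, because the walk only follows edges reachable from the start node. The returned path doubles as the trace.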

python inference.py --prompt "A value starts at 36. Add 8, multiply the result by 5, then subtract 1. The final value is" --category arithmetic --show_trace
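
An operation trace for a prompt like this one can be sketched as follows. This is a hypothetical illustration, not the parser in cpu_reasoner.py: the name `run_trace`, the three supported verbs, and the "starts at N" pattern are assumptions chosen to fit this one example.

```python
import operator
import re

# Assumed verb vocabulary; a real parser would likely cover more phrasings.
OPS = {"add": operator.add, "subtract": operator.sub, "multiply": operator.mul}

# A verb, optional filler words ("the result by"), then an integer operand.
STEP = re.compile(r"\b(add|subtract|multiply)\b[^.,\d]*?(-?\d+)", re.IGNORECASE)

def run_trace(prompt):
    """Apply each parsed operation in order and record every intermediate value."""
    value = int(re.search(r"starts at (-?\d+)", prompt).group(1))
    trace = [f"start {value}"]
    for verb, operand in STEP.findall(prompt):
        verb = verb.lower()
        value = OPS[verb](value, int(operand))
        trace.append(f"{verb} {operand} -> {value}")
    return value, trace

prompt = (
    "A value starts at 36. Add 8, multiply the result by 5, "
    "then subtract 1. The final value is"
)
answer, trace = run_trace(prompt)
print(answer)  # 219
print(trace)   # ['start 36', 'add 8 -> 44', 'multiply 5 -> 220', 'subtract 1 -> 219']
```

Keeping every intermediate value in the trace is what makes the answer auditable: each step can be checked independently of the final result.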

python inference.py --prompt "Given xs = [14, 3, 17], the expression xs[0] + xs[-1] evaluates to" --category code_reasoning --show_trace
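
One common way to evaluate constrained snippets safely is to parse them with Python's `ast` module and walk only a whitelist of node types, rejecting everything else. The sketch below shows that technique for the example above. It is an assumption about how such an interpreter could look, not the implementation in cpu_reasoner.py; `safe_eval` and the chosen whitelist are hypothetical, and the subscript handling assumes Python 3.9 or later.

```python
import ast
import operator

# Whitelisted binary operators (an assumed, deliberately small set).
BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}

def safe_eval(node, env):
    """Evaluate a restricted expression tree; raise ValueError on anything else."""
    if isinstance(node, ast.Expression):
        return safe_eval(node.body, env)
    if isinstance(node, ast.Constant) and type(node.value) is int:
        return node.value
    if isinstance(node, ast.Name) and node.id in env:
        return env[node.id]
    if isinstance(node, ast.List):
        return [safe_eval(elt, env) for elt in node.elts]
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -safe_eval(node.operand, env)
    if isinstance(node, ast.BinOp) and type(node.op) in BIN_OPS:
        left = safe_eval(node.left, env)
        right = safe_eval(node.right, env)
        return BIN_OPS[type(node.op)](left, right)
    if isinstance(node, ast.Subscript):
        # Python 3.9+: node.slice is the index expression itself.
        return safe_eval(node.value, env)[safe_eval(node.slice, env)]
    raise ValueError(f"unsupported syntax: {type(node).__name__}")

# Bind xs from the prompt, then evaluate the expression against that binding.
env = {"xs": safe_eval(ast.parse("[14, 3, 17]", mode="eval"), {})}
result = safe_eval(ast.parse("xs[0] + xs[-1]", mode="eval"), env)
print(result)  # 31
```

Function calls, attribute access, and imports never reach Python's own evaluator: any node type outside the whitelist, such as `ast.Call`, falls through to the final `ValueError`.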

Files

  • cpu_reasoner.py - auditable CPU graph/program interpreter.
  • inference.py - minimal CLI.
  • eval_cpu_reasoner.py - formal-eval-compatible evaluator.
  • eval/heldout_formal_tac_v2.json and eval/heldout_formal_tac_v3.json - benchmarks.
  • eval/ts_reasoner_v4_heldout_tac_v*_seed42.json - full eval receipts.
  • config.json - release metadata and fallback model reference.

Limitations

This artifact is narrow by design. It handles only the formal prompt families the interpreter can parse. It is not a chat assistant, not an autonomous theorem prover, and not evidence that the frozen raw model alone solved TAC v3. The value is architectural: without a GPU, capability moves through inspectable constraint-graph and verifier operations rather than dense retraining.
