cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled

A fully merged, fine-tuned MoE reasoning model (about 60 GB on disk), distilled from Claude Sonnet 4.6 and optimized for Apple Silicon with MLX. Fast, capable, and runs entirely on-device.

Recommended temperature: 0.95 — this model was trained on complex reasoning traces and benefits from slightly higher temperature to fully activate its chain-of-thought behavior. Values below 0.7 tend to flatten reasoning diversity; values above 1.2 may introduce incoherence.
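As a rough illustration of why temperature changes sampling diversity, the sketch below applies temperature scaling to a made-up set of logits. The logit values and the `softmax_with_temperature` helper are hypothetical and are not taken from the model; only the three temperatures come from the recommendation above.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return sampling probabilities for the given logits at one temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate next tokens.
logits = [2.0, 1.0, 0.5]

for t in (0.7, 0.95, 1.2):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top token, while higher ones spread it across alternatives, which is the trade-off the 0.7 and 1.2 bounds describe.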


Model Summary

cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled is a fully merged fine-tune of gpt-oss-120b-heretic-mxfp4-q8-hi-mlx, trained using a custom BAdam (Block-wise Adam) optimizer with LoRA adapters on Apple MLX. The fine-tuning uses distilled reasoning traces from Claude Sonnet 4.6, filtered to difficulty=complex samples only.

This is a complete model upload — no separate adapter files or base model are needed. Download and run directly.

The model reasons step-by-step through hard problems using the Harmony channel protocol (analysis for internal chain-of-thought, final for the delivered response), and delivers this at surprisingly high throughput on M-series hardware — particularly the M2 Ultra.


Highlights

  • 🚀 Speed-first on Apple Silicon — sustained 28–39 tok/s on M2 Ultra (192 GB unified memory) for a 120B MoE model. This is the defining characteristic of this release.
  • 🌡️ Recommended temperature: 0.95 — unlocks the full depth of the model's reasoning traces.
  • 🧠 Strong reasoning — 12/14 PASS on a custom hard benchmark spanning math olympiad, systems coding, logic, and scientific computing. Average keyword accuracy: 96.4%.
  • 🍎 Fully on-device — no cloud API, no GPU cluster. Pure MLX inference on Mac.
  • 📦 Full model — merged weights included. No base model or adapter setup required.
  • 📐 Harmony channel format: separate analysis (chain-of-thought) and final (response) channels, making the reasoning process explicit and inspectable.

Quick Start

Using LM Studio with OpenAI Codex on Mac (MLX Models)

A step-by-step guide to running cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled locally on Apple Silicon Mac and connecting it to OpenAI Codex CLI.


Prerequisites

  • Apple Silicon Mac (M1 / M2 / M3 / M4)
  • macOS 13 or later
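  • At least 64 GB unified memory (the weights are ~60 GB; the M2 Ultra run reported later in this card peaked at ~88 GB during inference, so more headroom is advisable)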
  • Homebrew installed
  • Node.js 18+ (for Codex CLI)

Part 1 — Install LM Studio

1.1 Download and install the app

Go to https://lmstudio.ai/download and download the macOS (Apple Silicon) installer.
Drag LM Studio.app into your /Applications folder and open it at least once — this bootstraps the lms CLI.

1.2 Add lms to your PATH

Open a terminal and run:

npx lmstudio install-cli

Open a new terminal window, then verify:

lms --version

Part 2 — Download the Model

You have three options: download via the LM Studio app GUI, use the CLI, or link a model folder you already have on disk.

Option A — GUI (easiest)

  1. Open LM Studio.
  2. Press ⌘ + Shift + M to open the model search.
  3. Search for gpt-oss-120b-Sonnet-Reasoning-Distilled.
  4. Select the MLX variant and click Download.

Option B — CLI

lms get cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled --mlx

Option C — Use a locally downloaded model

If you already have the model folder on disk, create a symlink into LM Studio's model directory:

# Create the target directory
mkdir -p ~/Documents/LM\ Studio/models/cloudyu/

# Symlink (no file copying, saves disk space)
ln -s /path/to/your/gpt-oss-120b-Sonnet-Reasoning-Distilled \
  ~/Documents/LM\ Studio/models/cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled

Verify LM Studio detected the model:

lms ls

You should see something like:

LLM                                        PARAMS    ARCH       SIZE        DEVICE
gpt-oss-120b-sonnet-reasoning-distilled    120B      gpt_oss    63.42 GB    Local

Part 3 — Load the Model and Start the Server

3.1 (Optional) Estimate memory usage first

lms load --estimate-only gpt-oss-120b-sonnet-reasoning-distilled

3.2 Load the model

lms load gpt-oss-120b-sonnet-reasoning-distilled \
  --context-length 32768 \
  --gpu max
  • --context-length 32768 — Codex needs a large context window.
  • --gpu max — offloads all layers to Apple Metal GPU (recommended for MLX models).

Wait for: Model loaded successfully.

3.3 Start the local server

lms server start --port 1234

3.4 Verify the API is working

curl http://127.0.0.1:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer lm-studio" \
  -d '{
    "model": "gpt-oss-120b-sonnet-reasoning-distilled",
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 50
  }'

You should receive a JSON response with the model's reply.
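The same check can be scripted. The sketch below mirrors the curl call using only the Python standard library; the helper names `build_chat_request` and `ask` are illustrative, and the `temperature` field is an assumption added to match the recommended setting rather than part of the curl example.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:1234/v1"  # port from `lms server start --port 1234`
MODEL = "gpt-oss-120b-sonnet-reasoning-distilled"

def build_chat_request(prompt, max_tokens=50, temperature=0.95):
    """Build (but do not send) a chat-completions request for the local server."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,  # assumption: not in the curl example
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer lm-studio",
        },
        method="POST",
    )

def ask(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask("hello")` should return the reply text. A connection error usually means the server from step 3.3 is not up.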


Part 4 — Install and Configure OpenAI Codex CLI

4.1 Install Codex

npm install -g @openai/codex

Verify:

codex --version

4.2 Configure Codex

Create (or overwrite) the Codex config file. The mkdir ensures ~/.codex exists before the file is written:

mkdir -p ~/.codex
cat > ~/.codex/config.toml << 'EOF'
# Set this profile as default — note: the key is "profile", not "default_profile"
profile = "gpt-oss-local"

[model_providers.my-lmstudio]
name = "LM Studio Local"
base_url = "http://127.0.0.1:1234/v1"
api_key = "lm-studio"

[profiles.gpt-oss-local]
model_provider = "my-lmstudio"
model = "gpt-oss-120b-sonnet-reasoning-distilled"
context_window = 32000
wire_api = "responses"

[projects."/Users/YOUR_USERNAME/your-project"]
trust_level = "trusted"
EOF

Important notes:

  • The correct key for a default profile is profile, not default_profile.
  • my-lmstudio is a custom name. Do not use reserved names: openai, ollama, or lmstudio.
  • Replace /Users/YOUR_USERNAME/your-project with your actual project path.
  • wire_api = "responses" tells Codex to use the /v1/responses endpoint, which LM Studio supports natively.

4.3 Run Codex

codex

You should see:

╭──────────────────────────────────────────────────────────╮
│ >_ OpenAI Codex (v0.128.0)                               │
│                                                          │
│ model:     gpt-oss-120b-sonnet-reasoning-distilled       │
│ directory: ~/your-project                                │
╰──────────────────────────────────────────────────────────╯

If you see gpt-5.5 in the model field, your profile is not loading. Run with an explicit flag instead:

codex --profile gpt-oss-local

Part 5 — Everyday Workflow

Each time you start a new terminal session, run these two commands before launching Codex:

# 1. Load the model (skip if already loaded)
lms load gpt-oss-120b-sonnet-reasoning-distilled --context-length 32768 --gpu max

# 2. Start the server (skip if already running)
lms server start --port 1234

# 3. Launch Codex
codex

To check whether the model is already loaded:

lms ps

To stop the server:

lms server stop

Troubleshooting

Problem | Solution
lms command not found | Run npx lmstudio install-cli, then open a new terminal
Model not detected by lms ls | Check that your model folder is inside ~/Documents/LM Studio/models/<author>/
Codex shows gpt-5.5 as model | Use codex --profile gpt-oss-local, or verify the profile = key in config.toml
Codex request times out | Make sure proxy env vars are unset: unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
Error: model_providers contains reserved built-in provider IDs | Rename your provider; avoid openai, ollama, lmstudio
Out of memory when loading | Reduce --context-length (e.g. 16384) or close other apps
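For the timeout case, it can help to see which proxy variables are actually set before unsetting them. This is a small sketch; the helper name `active_proxy_vars` is hypothetical.

```python
import os

PROXY_VARS = ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY")

def active_proxy_vars(environ=None):
    """Return the proxy-related variables that are set and non-empty."""
    environ = os.environ if environ is None else environ
    return {name: environ[name] for name in PROXY_VARS if environ.get(name)}

# Show what the current shell exported, if anything.
print(active_proxy_vars())
```

If the printed dictionary is not empty, requests to 127.0.0.1 may be routed through the proxy, which is a likely cause of the timeout described in the table.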


Using LM Studio with OpenAI Codex on Mac (MLX Models)

Using cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled as the example, this guide walks through running a large model locally on an Apple Silicon Mac and connecting it to the OpenAI Codex CLI.


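Prerequisites

  • Apple Silicon Mac (M1 / M2 / M3 / M4)
  • macOS 13 or later
  • At least 64 GB unified memory (the 120B model needs about 60 GB for weights; the M2 Ultra run reported later in this card peaked at ~88 GB during inference, so more headroom is advisable)
  • Homebrew installed
  • Node.js 18+ (for the Codex CLI)

Step 1: Install LM Studio

1.1 Download and install the app

Go to https://lmstudio.ai/download and download the macOS (Apple Silicon) installer.
Drag LM Studio.app into the /Applications folder and open it at least once; this initializes the lms command-line tool.

1.2 Add lms to your PATH

Open a terminal and run: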

npx lmstudio install-cli

Open a new terminal window and verify the installation:

lms --version

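Step 2: Download the Model

There are three ways; pick the one that fits your situation:

Option A: GUI (easiest)

  1. Open LM Studio.
  2. Press ⌘ + Shift + M to open the model search.
  3. Search for gpt-oss-120b-Sonnet-Reasoning-Distilled.
  4. Select the MLX format and click Download.

Option B: Download from the command line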

lms get cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled --mlx

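Option C: Use a model already on disk

If the model folder is already on disk, import it with a symlink; no files need to be copied.

# Create the directory
mkdir -p ~/Documents/LM\ Studio/models/cloudyu/

# Create the symlink (uses no extra disk space)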
ln -s /path/to/your/gpt-oss-120b-Sonnet-Reasoning-Distilled \
  ~/Documents/LM\ Studio/models/cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled

Verify that LM Studio has detected the model:

lms ls

Normal output looks like this:

LLM                                        PARAMS    ARCH       SIZE        DEVICE
gpt-oss-120b-sonnet-reasoning-distilled    120B      gpt_oss    63.42 GB    Local

Step 3: Load the Model and Start the Server

3.1 (Optional) Estimate memory before loading

lms load --estimate-only gpt-oss-120b-sonnet-reasoning-distilled

3.2 Load the model

lms load gpt-oss-120b-sonnet-reasoning-distilled \
  --context-length 32768 \
  --gpu max
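  • --context-length 32768: Codex needs a large context window.
  • --gpu max: offloads all layers to the Apple Metal GPU; recommended for MLX models.

Wait until "Model loaded successfully" appears; loading is then complete.

3.3 Start the local server

lms server start --port 1234

3.4 Verify that the API works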

curl http://127.0.0.1:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer lm-studio" \
  -d '{
    "model": "gpt-oss-120b-sonnet-reasoning-distilled",
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 50
  }'

A JSON response containing the model's reply means it worked.


Step 4: Install and Configure the OpenAI Codex CLI

4.1 Install Codex

npm install -g @openai/codex

Verify the installation:

codex --version

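4.2 Configure Codex

Create (or overwrite) the Codex config file. The mkdir ensures ~/.codex exists before the file is written:

mkdir -p ~/.codex
cat > ~/.codex/config.toml << 'EOF'
# Set the default profile; note that the key is "profile", not "default_profile"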
profile = "gpt-oss-local"

[model_providers.my-lmstudio]
name = "LM Studio Local"
base_url = "http://127.0.0.1:1234/v1"
api_key = "lm-studio"

[profiles.gpt-oss-local]
model_provider = "my-lmstudio"
model = "gpt-oss-120b-sonnet-reasoning-distilled"
context_window = 32000
wire_api = "responses"

[projects."/Users/YOUR_USERNAME/your-project-path"]
trust_level = "trusted"
EOF

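Important notes:

  • The key for the default profile is profile, not default_profile.
  • my-lmstudio is a custom name; reserved names cannot be used: openai, ollama, lmstudio.
  • Replace /Users/YOUR_USERNAME/your-project-path with your actual project directory.
  • wire_api = "responses" tells Codex to use the /v1/responses endpoint, which LM Studio supports natively.

4.3 Launch Codex

codex

After a normal start you should see: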

╭──────────────────────────────────────────────────────────╮
│ >_ OpenAI Codex (v0.128.0)                               │
│                                                          │
│ model:     gpt-oss-120b-sonnet-reasoning-distilled       │
│ directory: ~/your-project                                │
╰──────────────────────────────────────────────────────────╯

If the model field shows gpt-5.5, the profile did not take effect; specify it manually:

codex --profile gpt-oss-local

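Step 5: Everyday Workflow

Each time you open a new terminal, run the following in order:

# 1. Load the model (skip if already loaded)
lms load gpt-oss-120b-sonnet-reasoning-distilled --context-length 32768 --gpu max

# 2. Start the server (skip if already running)
lms server start --port 1234

# 3. Launch Codex
codex

To check whether the model is loaded:

lms ps

To stop the server: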

lms server stop

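Troubleshooting

Problem | Solution
lms command not found | Run npx lmstudio install-cli, then reopen the terminal
lms ls does not show the model | Confirm the model folder is under ~/Documents/LM Studio/models/<author>/
Codex shows gpt-5.5 | Use codex --profile gpt-oss-local, or check the profile = key in config.toml
Codex requests time out | Clear the proxy environment variables: unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
"reserved built-in provider IDs" error | Rename the custom provider; avoid openai, ollama, lmstudio
Out of memory while loading the model | Reduce --context-length (for example to 16384), or close other memory-hungry apps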

Command Line

mlx_lm.chat --model cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled --max-tokens 10000 --temp 0.95

OpenAI-Compatible Local Server

mlx-openai-server launch --model-path cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled --model-type lm --host 127.0.0.1 --temperature 0.95

Python

# Requirements: mlx-tune + Apple MLX
# pip install mlx-tune

from mlx_tune import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled",
    max_seq_length=4096,
    load_in_4bit=True,
)

prompt = """<|start|>system<|message|>You are a helpful assistant.
Reasoning: high
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|>
<|start|>user<|message|>Explain the time complexity of Dijkstra's algorithm and when to prefer A* instead.<|end|>
<|start|>assistant<|channel|>analysis<|message|>"""

outputs = model.generate(
    prompt,
    max_new_tokens=2048,
    temperature=0.95,   # ← recommended
    top_p=0.9,
)
print(outputs)

Prompt Format (Harmony)

This model uses the gpt-oss Harmony channel protocol. Every assistant message must declare its channel:

<|start|>system<|message|>{system prompt}
Reasoning: high
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|>
<|start|>user<|message|>{your question}<|end|>
<|start|>assistant<|channel|>analysis<|message|>{internal chain-of-thought — generated by model}<|end|>
<|start|>assistant<|channel|>final<|message|>{final answer — generated by model}<|return|>
  • The analysis channel contains the model's internal reasoning. You can display or hide it depending on your use case.
  • The final channel contains the deliverable response.
  • Always prompt the model to begin its reply with <|start|>assistant<|channel|>analysis<|message|> to elicit chain-of-thought before the final answer.
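The template can be assembled and parsed with a few lines of string handling. The sketch below is a minimal illustration built only from the tokens shown above; the helper names `build_harmony_prompt` and `split_channels` are hypothetical, and the parser ignores any control tokens the template above does not show.

```python
import re

def build_harmony_prompt(system_prompt, question, reasoning="high"):
    """Assemble a prompt that ends at the start of the analysis channel."""
    return (
        f"<|start|>system<|message|>{system_prompt}\n"
        f"Reasoning: {reasoning}\n"
        "# Valid channels: analysis, commentary, final. "
        "Channel must be included for every message.<|end|>\n"
        f"<|start|>user<|message|>{question}<|end|>\n"
        "<|start|>assistant<|channel|>analysis<|message|>"
    )

# Matches "<|channel|>NAME<|message|>TEXT" up to the next control token or end of text.
_CHANNEL_RE = re.compile(
    r"<\|channel\|>(\w+)<\|message\|>(.*?)(?=<\|end\|>|<\|return\|>|<\|start\|>|\Z)",
    re.DOTALL,
)

def split_channels(text):
    """Map channel name -> message text for each assistant message found in `text`."""
    return {name: body.strip() for name, body in _CHANNEL_RE.findall(text)}
```

Because the prompt already ends with the analysis header, a raw completion usually starts mid-message. Prepend `<|channel|>analysis<|message|>` to the completion before calling `split_channels`, then show or hide the `analysis` entry as needed.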

Training Details

Parameter | Value
Base model | gpt-oss-120b-heretic-mxfp4-q8-hi-mlx
Architecture | MoE (Mixture of Experts), 120B params
Quantization | mxfp4 weights + q8 hi-precision activations
Framework | Apple MLX + mlx-tune
Optimizer | BAdam: BlockOptimizer wrapping AdamW
LoRA rank / alpha | r=16, α=32
LoRA dropout | 0.05
LoRA target modules | q/k/v/o/gate/up/down projections (SwitchLinear expert layers)
Router layers | Full-precision, unfrozen, trained directly (not LoRA)
Learning rate | 1e-5 (cosine decay)
Effective batch size | 2 × grad accum 8 = 16
Max steps | 1000
BAdam switch mode | parallel (head + tail dual-pointer)
BAdam switch every | 10 steps
Max sequence length | 4096 tokens
Peak memory during training | ~88.2 GB unified memory
Training hardware | Apple M2 Ultra, 192 GB
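To make the learning-rate row concrete, here is a sketch of plain cosine decay from 1e-5 over 1000 steps. The absence of warmup and the decay floor of 0 are assumptions; the card does not state either, and `cosine_lr` is an illustrative name.

```python
import math

BASE_LR = 1e-5      # learning rate from the training details
MAX_STEPS = 1000    # max steps from the training details

def cosine_lr(step, base_lr=BASE_LR, max_steps=MAX_STEPS, min_lr=0.0):
    """Learning rate at `step` under cosine decay (assumed: no warmup, floor of min_lr)."""
    progress = min(max(step, 0), max_steps) / max_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 250, 500, 750, 1000):
    print(step, f"{cosine_lr(step):.2e}")
```

Under these assumptions the rate starts at the full 1e-5, reaches half of it at step 500, and approaches zero at step 1000.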

Dataset

  • Source: Roman1111111/claude-sonnet-4.6-100000X-filtered
  • Filter: difficulty == "complex" only → 36,444 samples
  • Split: 90% train (32,797) / 10% validation (3,644), seed=42
  • Format: OpenAI-style messages column; assistant turns include a separate reasoning field containing the Claude Sonnet 4.6 chain-of-thought
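The filter and split can be sketched in plain Python. This is an illustration on synthetic rows, not the author's preprocessing script: the shuffle-then-slice method and the `filter_and_split` helper are assumptions, and only the `difficulty` field name, the 90/10 ratio, and `seed=42` come from the list above.

```python
import random

def filter_and_split(rows, seed=42, val_fraction=0.10):
    """Keep difficulty == 'complex' rows, shuffle with a fixed seed, split train/validation."""
    complex_rows = [row for row in rows if row.get("difficulty") == "complex"]
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    rng.shuffle(complex_rows)
    n_val = int(len(complex_rows) * val_fraction)
    return complex_rows[n_val:], complex_rows[:n_val]

# Tiny synthetic stand-in for the real dataset rows.
rows = [{"id": i, "difficulty": "complex" if i % 3 else "easy"} for i in range(30)]
train, val = filter_and_split(rows)
print(len(train), len(val))
```

Fixing the seed makes the split reproducible, so validation loss can be compared across runs on the same held-out samples.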

BAdam — Block-wise Coordinate Descent

Standard full fine-tuning of a 120B model on a single Mac is memory-prohibitive. BAdam solves this with block coordinate descent:

  • All 36 Transformer layers are partitioned into individual blocks.
  • At each optimizer step, only the active block's gradients are applied; all others are zeroed.
  • AdamW moment states are lazily initialized — inactive blocks never accumulate optimizer state, keeping peak memory comparable to LoRA alone.
  • parallel switch mode activates head + tail blocks simultaneously, with dual pointers advancing inward each cycle. This doubles layer coverage speed at ~2× the memory cost of single-block mode.
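The head + tail schedule described above can be sketched as a function from training step to active layer indices. This is an interpretation, not the author's BlockOptimizer: the wrap-around after the pointers meet is an assumption, and plain floats stand in for gradient tensors.

```python
NUM_LAYERS = 36      # layer count from the description above
SWITCH_EVERY = 10    # "BAdam switch every 10 steps" in the training details

def active_blocks(step, num_layers=NUM_LAYERS, switch_every=SWITCH_EVERY):
    """Return the (head, tail) layer indices that are trainable at `step`."""
    pairs_per_sweep = (num_layers + 1) // 2  # pointer positions before they meet
    position = (step // switch_every) % pairs_per_sweep
    head = position
    tail = num_layers - 1 - position
    return head, tail

def mask_gradients(grads_by_layer, step):
    """Zero every layer's gradient except the two active blocks."""
    head, tail = active_blocks(step, num_layers=len(grads_by_layer))
    return [g if i in (head, tail) else 0.0 for i, g in enumerate(grads_by_layer)]
```

Under these assumptions one sweep over all 36 layers takes 18 × 10 = 180 steps, so a 1000-step run would visit each head/tail pair five or six times.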

Benchmark Results

Custom hard benchmark — 14 tasks across math, coding, logic, and science. 60s execution timeout per task. Auto-graded with keyword matching + live code execution.

Evaluated on the 1000-step checkpoint (this released model).
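The two grading signals can be approximated as follows. This is a sketch of the general technique, not the actual benchmark harness: the keyword lists, the way the signals combine into a verdict, and the helper names are all unknown or hypothetical. Only the 60-second timeout comes from the description above.

```python
import subprocess
import sys

def keyword_accuracy(answer, keywords):
    """Fraction of expected keywords that appear in the answer (case-insensitive)."""
    if not keywords:
        return 1.0
    text = answer.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)

def run_code(source, timeout_s=60):
    """Run generated Python in a subprocess; True if it exits cleanly within the timeout."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

Running generated code in a subprocess keeps a crash or hang in the candidate solution from taking down the grader, and a failed assertion surfaces as a non-zero exit code.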

ID | Task | Verdict | KW% | Time | Tok/s
math_01 | AMC 2025: n-Norwegian Number | ✅ PASS | 50% | 56.4s | 39
math_02 | Euler Totient Sum: last 6 digits | ✅ PASS | 100% | 16.2s | 35
math_03 | Lattice Paths Avoiding the Antidiagonal | ✅ PASS | 100% | 16.6s | 34
math_04 | Segmented Sieve in [10¹², 10¹²+10⁶] | ✅ PASS | 100% | 25.7s | 35
code_01 | Median of Two Sorted Arrays | ✅ PASS | 100% | 25.0s | 35
code_02 | Thread-Safe LRU Cache with TTL | ✅ PASS | 100% | 38.9s | 36
code_03 | Persistent Segment Tree: K-th Query | ✅ PASS | 100% | 45.9s | 35
code_04 | Multi-Head Attention + RoPE (NumPy) | ✅ PASS | 100% | 38.0s | 33
code_05 | Dijkstra vs A* on Large Random Graph | ✅ PASS | 100% | 26.1s | 34
logic_01 | Knights & Knaves: Exhaustive SAT | ✅ PASS | 100% | 16.0s | 36
logic_02 | Verify Three Mathematical Claims | ✅ PASS | 100% | 13.2s | 34
sci_01 | Figure-8 Three-Body Orbit: Energy Conservation | ❌ FAIL | 100% | 29.4s | 33
sci_02 | Metropolis-Hastings vs HMC Comparison | ❌ FAIL | 100% | 38.3s | 33
sci_03 | Optimizer Comparison on Rosenbrock | ✅ PASS | 100% | 23.8s | 28

Overall: 12/14 PASS  |  Avg KW: 96.4%  |  Total benchmark time: 409s
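The summary line can be recomputed from the per-task rows. The snippet below copies the verdict, keyword, time, and throughput columns from the results table and derives the aggregates. It is a consistency check, not part of the benchmark itself.

```python
# (id, passed, keyword %, seconds, tok/s) copied from the results table.
RESULTS = [
    ("math_01", True, 50, 56.4, 39),
    ("math_02", True, 100, 16.2, 35),
    ("math_03", True, 100, 16.6, 34),
    ("math_04", True, 100, 25.7, 35),
    ("code_01", True, 100, 25.0, 35),
    ("code_02", True, 100, 38.9, 36),
    ("code_03", True, 100, 45.9, 35),
    ("code_04", True, 100, 38.0, 33),
    ("code_05", True, 100, 26.1, 34),
    ("logic_01", True, 100, 16.0, 36),
    ("logic_02", True, 100, 13.2, 34),
    ("sci_01", False, 100, 29.4, 33),
    ("sci_02", False, 100, 38.3, 33),
    ("sci_03", True, 100, 23.8, 28),
]

passed = sum(1 for r in RESULTS if r[1])
avg_kw = sum(r[2] for r in RESULTS) / len(RESULTS)
total_time = sum(r[3] for r in RESULTS)
avg_tps = sum(r[4] for r in RESULTS) / len(RESULTS)

print(f"{passed}/{len(RESULTS)} PASS | avg KW {avg_kw:.1f}% | {total_time:.1f}s | {avg_tps:.1f} tok/s")
```

The recomputed values agree with the reported 12/14, 96.4%, and roughly 409 s, and the mean throughput matches the ~34 tok/s benchmark average quoted for the M2 Ultra.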

The two failures (sci_01, sci_02) both scored 100% on keyword accuracy: the responses covered the expected concepts, but the generated code hit numerical-precision or runtime-assertion edge cases. This is consistent with an early 1000-step checkpoint; further training is expected to close these gaps.

Inference Speed on M2 Ultra

Metric | Value
Hardware | Apple M2 Ultra, 192 GB unified memory
Peak memory (inference) | ~88 GB
Throughput | 28–39 tok/s depending on context length
Benchmark average | ~34 tok/s

Running a 120B MoE model at 34 tok/s entirely on a single Mac — no cloud, no quantization compromise in output quality — is the core reason to use this model.

More details about the test

Limitations

  • Apple Silicon only — quantized and optimized for MLX. Not compatible with CUDA/ROCm without re-quantization.
  • 1000-step checkpoint — an early fine-tune. Scientific computing tasks with strict numerical tolerances may still have occasional failures.
  • No RLHF — trained on supervised distillation data only. Safety alignment is not as strict as commercial instruction-tuned models.
  • Harmony prompt format required — this model expects the <|channel|> protocol. Standard ChatML or Alpaca-style prompts will produce degraded results.

Citation / Acknowledgements



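cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled (translated from the Chinese description)

A complete MoE reasoning model distilled from Claude Sonnet 4.6 and optimized for Apple Silicon. Fast, capable, and fully local.

Recommended temperature: 0.95. The model was trained on complex reasoning traces, and a slightly higher temperature brings out the diversity and depth of its chain-of-thought. Values below 0.7 compress the reasoning; values above 1.2 may hurt output coherence.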


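Model Summary

cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled is a fully merged fine-tune of gpt-oss-120b-heretic-mxfp4-q8-hi-mlx, trained on the Apple MLX framework using a custom BAdam (block-coordinate Adam) optimizer together with LoRA. The training data consists of reasoning traces distilled from Claude Sonnet 4.6, keeping only difficulty=complex samples.

This is a complete model upload: no base model or adapter files need to be downloaded separately, and it runs directly after download.

The model reasons step by step using the Harmony channel protocol (analysis carries the internal chain-of-thought, final carries the final reply) and achieves strong inference speed on M-series chips, the M2 Ultra in particular.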


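Highlights

  • 🚀 Top speed on Apple Silicon: a 120B MoE model sustains 28–39 tok/s on an M2 Ultra (192 GB unified memory); this is the model's most distinctive feature.
  • 🌡️ Recommended temperature: 0.95, which brings out the depth and diversity of the model's chain-of-thought reasoning.
  • 🧠 Strong reasoning: 12/14 passed on a custom hard test set, with a 96.4% average keyword hit rate, covering competition math, systems programming, logical reasoning, and scientific computing.
  • 🍎 Fully local: no cloud API, no GPU cluster; pure MLX inference on a Mac.
  • 📦 Complete model: merged weights, ready to use after download, with no adapter configuration or base model needed.
  • 📐 Harmony channel format: the analysis (chain-of-thought) and final (final reply) channels are separated, so the reasoning process is transparent.

Quick Start

# Requirements: mlx-tune + Apple MLX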
# pip install mlx-tune

from mlx_tune import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled",
    max_seq_length=4096,
    load_in_4bit=True,
)

prompt = """<|start|>system<|message|>You are a helpful assistant.
Reasoning: high
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|>
<|start|>user<|message|>Explain the time complexity of Dijkstra's algorithm and when to prefer A* instead.<|end|>
<|start|>assistant<|channel|>analysis<|message|>"""

outputs = model.generate(
    prompt,
    max_new_tokens=2048,
    temperature=0.95,   # ← recommended
    top_p=0.9,
)
print(outputs)

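Prompt Format (Harmony)

This model uses the gpt-oss Harmony channel protocol; every assistant message must declare its channel:

<|start|>system<|message|>{system prompt}
Reasoning: high
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|>
<|start|>user<|message|>{your question}<|end|>
<|start|>assistant<|channel|>analysis<|message|>{internal chain-of-thought, generated by the model}<|end|>
<|start|>assistant<|channel|>final<|message|>{final reply, generated by the model}<|return|>

  • The analysis channel contains the model's internal reasoning; show or hide it as needed.
  • The final channel contains the final output.
  • Always lead with <|start|>assistant<|channel|>analysis<|message|> so the model produces its chain-of-thought before the final answer.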

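Training Configuration

Parameter | Value
Base model | gpt-oss-120b-heretic-mxfp4-q8-hi-mlx
Architecture | MoE (Mixture of Experts), 120B params
Quantization | mxfp4 weights + q8 high-precision activations
Training framework | Apple MLX + mlx-tune
Optimizer | BAdam: BlockOptimizer wrapping AdamW
LoRA rank / alpha | r=16, α=32
LoRA dropout | 0.05
LoRA target modules | q/k/v/o/gate/up/down projections (SwitchLinear expert layers)
Router layers | Full precision, unfrozen, fine-tuned directly (not via LoRA)
Learning rate | 1e-5 (cosine decay)
Effective batch size | 2 × gradient accumulation 8 = 16
Training steps | 1000
BAdam switch mode | parallel (head + tail dual pointers)
BAdam switch frequency | every 10 steps
Max sequence length | 4096 tokens
Peak training memory | ~88.2 GB unified memory
Training hardware | Apple M2 Ultra, 192 GB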

Dataset

  • Source: Roman1111111/claude-sonnet-4.6-100000X-filtered
  • Filter: keep only difficulty == "complex", 36,444 samples
  • Split: 90% train (32,797) / 10% validation (3,644), seed=42
  • Format: OpenAI-style messages column; assistant messages include a separate reasoning field (the Claude Sonnet 4.6 chain-of-thought)

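BAdam: How Block Coordinate Descent Works

Full-parameter fine-tuning of a 120B model on a single Mac has a prohibitive memory cost. BAdam addresses this with block coordinate descent:

  • Each of the 36 Transformer layers forms its own block.
  • At each optimizer step only the active block's gradients are applied; gradients of the other blocks are zeroed.
  • AdamW moment states are lazily initialized: inactive blocks accumulate no optimizer state, so peak memory is comparable to plain LoRA.
  • parallel switch mode activates the head and tail blocks at the same time, with the two pointers moving toward the middle, trading roughly 2× the memory for faster layer coverage.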

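Benchmark Results

A custom hard test set of 14 problems, with a 60-second execution timeout per problem, auto-graded by keyword matching plus code execution. Evaluated on the 1000-step full model in this release.

ID | Task | Verdict | Keyword hit | Time | Speed
math_01 | AMC 2025: n-Norwegian Number | ✅ PASS | 50% | 56.4s | 39 tok/s
math_02 | Euler Totient Sum: last 6 digits | ✅ PASS | 100% | 16.2s | 35 tok/s
math_03 | Lattice paths avoiding the antidiagonal | ✅ PASS | 100% | 16.6s | 34 tok/s
math_04 | Segmented sieve in [10¹², 10¹²+10⁶] | ✅ PASS | 100% | 25.7s | 35 tok/s
code_01 | Median of two sorted arrays | ✅ PASS | 100% | 25.0s | 35 tok/s
code_02 | Thread-safe LRU cache with TTL | ✅ PASS | 100% | 38.9s | 36 tok/s
code_03 | Persistent segment tree: k-th smallest query | ✅ PASS | 100% | 45.9s | 35 tok/s
code_04 | Multi-head attention + RoPE (NumPy) | ✅ PASS | 100% | 38.0s | 33 tok/s
code_05 | Dijkstra vs A* on a large random graph | ✅ PASS | 100% | 26.1s | 34 tok/s
logic_01 | Knights and knaves: exhaustive SAT | ✅ PASS | 100% | 16.0s | 36 tok/s
logic_02 | Verify three mathematical claims | ✅ PASS | 100% | 13.2s | 34 tok/s
sci_01 | Figure-8 three-body orbit: energy-conserving ODE | ❌ FAIL | 100% | 29.4s | 33 tok/s
sci_02 | Metropolis-Hastings vs HMC | ❌ FAIL | 100% | 38.3s | 33 tok/s
sci_03 | Optimizer comparison on Rosenbrock | ✅ PASS | 100% | 23.8s | 28 tok/s

Summary: 12/14 passed  |  Average keyword hit rate 96.4%  |  Total test time 409 s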

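Both failed problems (sci_01, sci_02) scored 100% on keyword hits: the model fully understood the problems, but the generated code had numerical-precision or runtime-assertion edge-case issues, which is expected for an early 1000-step checkpoint.

Inference Speed on M2 Ultra

Metric | Value
Hardware | Apple M2 Ultra, 192 GB unified memory
Peak memory (inference) | ~88 GB
Inference speed | 28–39 tok/s (varies with context length)
Test set average | ~34 tok/s

Running a 120B MoE model locally on a single Mac at 34 tok/s, with no cloud and no quantization compromise, is the core reason to choose this model.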


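Limitations

  • Apple Silicon only: quantized and optimized for MLX, not compatible with CUDA/ROCm; running on other platforms requires re-quantization.
  • Early 1000-step checkpoint: scientific computing tasks with strict numerical-precision requirements may still fail on edge cases.
  • No RLHF: supervised distillation fine-tuning only; safety alignment is weaker than in commercial instruction-tuned models, so use with care.
  • Harmony prompt format required: the model expects the <|channel|> protocol; standard ChatML or Alpaca formats will degrade output quality.

Citation and Acknowledgements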
