cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled

A fully merged, fine-tuned MoE reasoning model (~60 GB), distilled from Claude Sonnet 4.6 and optimized for Apple Silicon with MLX. Fast, capable, and runs entirely on-device.

Recommended temperature: 0.95 — this model was trained on complex reasoning traces and benefits from slightly higher temperature to fully activate its chain-of-thought behavior. Values below 0.7 tend to flatten reasoning diversity; values above 1.2 may introduce incoherence.
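To make the effect of temperature concrete, here is a minimal sketch of temperature-scaled softmax on a toy set of logits. The logits are invented for illustration and are not taken from this model; the sketch only shows that lower temperatures concentrate probability on the top token while higher ones flatten the distribution.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return sampling probabilities for `logits` scaled by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens (illustration only).
toy_logits = [2.0, 1.0, 0.0]
for t in (0.7, 0.95, 1.2):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

The probability of the top token shrinks as the temperature rises, which is the mechanism behind the recommended range.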

Model Summary

cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled is a fully merged fine-tune of gpt-oss-120b-heretic-mxfp4-q8-hi-mlx, trained using a custom BAdam (Block-wise Adam) optimizer with LoRA adapters on Apple MLX. The fine-tuning uses distilled reasoning traces from Claude Sonnet 4.6, filtered to difficulty=complex samples only.

This is a complete model upload — no separate adapter files or base model are needed. Download and run directly.

The model reasons step by step through hard problems using the Harmony channel protocol (analysis for internal chain-of-thought, final for the delivered response), and sustains 28–39 tok/s on an M2 Ultra with 192 GB of unified memory.

Highlights

🚀 Speed-first on Apple Silicon — sustained 28–39 tok/s on M2 Ultra (192 GB unified memory) for a 120B MoE model. This is the defining characteristic of this release.
🌡️ Recommended temperature: 0.95 — unlocks the full depth of the model's reasoning traces.
🧠 Strong reasoning — 12/14 PASS on a custom hard benchmark spanning math olympiad, systems coding, logic, and scientific computing. Average keyword accuracy: 96.4%.
🍎 Fully on-device — no cloud API, no GPU cluster. Pure MLX inference on Mac.
📦 Full model — merged weights included. No base model or adapter setup required.
📐 Harmony channel format — analysis (chain-of-thought) and final (response) channel separation, making the reasoning process explicit and inspectable.

Quick Start

Using LM Studio with OpenAI Codex on Mac (MLX Models)

A step-by-step guide to running cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled locally on Apple Silicon Mac and connecting it to OpenAI Codex CLI.

Prerequisites

Apple Silicon Mac (M1 / M2 / M3 / M4)
macOS 13 or later
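At least 64 GB unified memory (the weights alone occupy ~60 GB; the M2 Ultra measurements later in this card report ~88 GB peak during inference, so extra headroom helps)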
Homebrew installed
Node.js 18+ (for Codex CLI)

Part 1 — Install LM Studio

1.1 Download and install the app

Go to https://lmstudio.ai/download and download the macOS (Apple Silicon) installer.
Drag LM Studio.app into your /Applications folder and open it at least once — this bootstraps the lms CLI.

1.2 Add `lms` to your PATH

Open a terminal and run:

npx lmstudio install-cli

Open a new terminal window, then verify:

lms --version

Part 2 — Download the Model

You have three options: download via the LM Studio app GUI, use the CLI, or link a model folder you already have on disk.

Option A — GUI (easiest)

Open LM Studio.
Press ⌘ + Shift + M to open the model search.
Search for gpt-oss-120b-Sonnet-Reasoning-Distilled.
Select the MLX variant and click Download.

Option B — CLI

lms get cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled --mlx

Option C — Use a locally downloaded model

If you already have the model folder on disk, create a symlink into LM Studio's model directory:

# Create the target directory
mkdir -p ~/Documents/LM\ Studio/models/cloudyu/

# Symlink (no file copying, saves disk space)
ln -s /path/to/your/gpt-oss-120b-Sonnet-Reasoning-Distilled \
  ~/Documents/LM\ Studio/models/cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled

Verify LM Studio detected the model:

lms ls

You should see something like:

LLM                                        PARAMS    ARCH       SIZE        DEVICE
gpt-oss-120b-sonnet-reasoning-distilled    120B      gpt_oss    63.42 GB    Local
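The symlink step from Option C can also be scripted. The sketch below is one way to do it with `pathlib`; the function name `link_model` is hypothetical, and the models directory is the one shown in Option C, which may differ on your installation.

```python
from pathlib import Path

def link_model(source_dir, models_root, author, name):
    """Symlink `source_dir` into `<models_root>/<author>/<name>` and return the link path."""
    source = Path(source_dir).expanduser().resolve()
    if not source.is_dir():
        raise FileNotFoundError(f"model folder not found: {source}")
    author_dir = Path(models_root).expanduser() / author
    author_dir.mkdir(parents=True, exist_ok=True)  # same as `mkdir -p`
    link = author_dir / name
    if link.is_symlink() or link.exists():
        return link  # leave an existing entry untouched
    link.symlink_to(source, target_is_directory=True)  # same as `ln -s`
    return link
```

For example, `link_model("/path/to/your/gpt-oss-120b-Sonnet-Reasoning-Distilled", "~/Documents/LM Studio/models", "cloudyu", "gpt-oss-120b-Sonnet-Reasoning-Distilled")` mirrors the two shell commands above; `lms ls` should then list the model.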

Part 3 — Load the Model and Start the Server

3.1 (Optional) Estimate memory usage first

lms load --estimate-only gpt-oss-120b-sonnet-reasoning-distilled

3.2 Load the model

lms load gpt-oss-120b-sonnet-reasoning-distilled \
  --context-length 32768 \
  --gpu max

--context-length 32768 — Codex needs a large context window.
--gpu max — offloads all layers to Apple Metal GPU (recommended for MLX models).

Wait for: Model loaded successfully.

3.3 Start the local server

lms server start --port 1234

3.4 Verify the API is working

curl http://127.0.0.1:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer lm-studio" \
  -d '{
    "model": "gpt-oss-120b-sonnet-reasoning-distilled",
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 50
  }'

You should receive a JSON response with the model's reply.
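The same check can be made from Python using only the standard library. This is a sketch under two assumptions: the response follows the usual OpenAI chat-completions shape (`choices[0].message.content`), and the helper names `build_chat_request` and `ask` are made up for this example. The `temperature` field is added here to apply the recommended 0.95; the curl call above does not set it.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:1234/v1"  # port from `lms server start --port 1234`
MODEL = "gpt-oss-120b-sonnet-reasoning-distilled"

def build_chat_request(prompt, temperature=0.95, max_tokens=50):
    """Build (without sending) a POST request for /v1/chat/completions."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer lm-studio",
        },
        method="POST",
    )

def ask(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    # Assumes the OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("hello"))` should print the model's reply.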

Part 4 — Install and Configure OpenAI Codex CLI

4.1 Install Codex

npm install -g @openai/codex

Verify:

codex --version

4.2 Configure Codex

Create (or overwrite) the Codex config file:

cat > ~/.codex/config.toml << 'EOF'
# Set this profile as default — note: the key is "profile", not "default_profile"
profile = "gpt-oss-local"

[model_providers.my-lmstudio]
name = "LM Studio Local"
base_url = "http://127.0.0.1:1234/v1"
api_key = "lm-studio"

[profiles.gpt-oss-local]
model_provider = "my-lmstudio"
model = "gpt-oss-120b-sonnet-reasoning-distilled"
context_window = 32000
wire_api = "responses"

[projects."/Users/YOUR_USERNAME/your-project"]
trust_level = "trusted"
EOF

Important notes:

The correct key for a default profile is profile, not default_profile.

my-lmstudio is a custom name. Do not use reserved names: openai, ollama, or lmstudio.

Replace /Users/YOUR_USERNAME/your-project with your actual project path.

wire_api = "responses" tells Codex to use the /v1/responses endpoint, which LM Studio supports natively.

4.3 Run Codex

codex

You should see:

╭──────────────────────────────────────────────────────────╮
│ >_ OpenAI Codex (v0.128.0)                               │
│                                                          │
│ model:     gpt-oss-120b-sonnet-reasoning-distilled       │
│ directory: ~/your-project                                │
╰──────────────────────────────────────────────────────────╯

If you see gpt-5.5 in the model field, your profile is not loading. Run with an explicit flag instead:

codex --profile gpt-oss-local

Part 5 — Everyday Workflow

Each time you start a new terminal session, run these two commands before launching Codex:

# 1. Load the model (skip if already loaded)
lms load gpt-oss-120b-sonnet-reasoning-distilled --context-length 32768 --gpu max

# 2. Start the server (skip if already running)
lms server start --port 1234

# 3. Launch Codex
codex

To check whether the model is already loaded:

lms ps

To stop the server:

lms server stop
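To decide whether step 2 can be skipped, a quick TCP probe of the server port is enough. This sketch only tests that something is listening on port 1234; it does not confirm that the listener is LM Studio or that the model is loaded (`lms ps` covers the latter). The function name is made up for this example.

```python
import socket

def server_is_listening(host="127.0.0.1", port=1234, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if server_is_listening():
    print("Port 1234 is open; `lms server start` can be skipped.")
else:
    print("Nothing is listening on port 1234; run `lms server start --port 1234`.")
```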

Troubleshooting

Problem	Solution
`lms` command not found	Run `npx lmstudio install-cli`, then open a new terminal
Model not detected by `lms ls`	Check that your model folder is inside `~/Documents/LM Studio/models/<author>/`
Codex shows `gpt-5.5` as model	Use `codex --profile gpt-oss-local` or verify `profile =` in config.toml
Codex request times out	Make sure proxy env vars are unset: `unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY`
`model_providers contains reserved built-in provider IDs`	Rename your provider — avoid `openai`, `ollama`, `lmstudio`
Out of memory when loading	Reduce `--context-length` (e.g. `16384`) or close other apps
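For the timeout row, it can help to see which proxy variables are actually set before unsetting them. A small sketch, with a hypothetical helper name:

```python
import os

PROXY_VARS = ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY")

def active_proxy_vars(environ=None):
    """Return the proxy variables that are set to a non-empty value."""
    env = os.environ if environ is None else environ
    return {name: env[name] for name in PROXY_VARS if env.get(name)}

found = active_proxy_vars()
if found:
    print("Proxy variables set:", ", ".join(sorted(found)))
else:
    print("No proxy variables set.")
```

Any variable it reports can be cleared in the current shell with the `unset` command from the table.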

在 Mac 上使用 LM Studio 配合 OpenAI Codex（MLX 模型）

本文以 cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled 为例，手把手介绍如何在 Apple Silicon Mac 上本地运行大模型并接入 OpenAI Codex CLI。

前提条件

Apple Silicon Mac（M1 / M2 / M3 / M4）
macOS 13 或更高版本
至少 64 GB 统一内存（120B 模型约需 60 GB）
已安装 Homebrew
Node.js 18+（用于 Codex CLI）

第一步 — 安装 LM Studio

1.1 下载并安装 App

前往 https://lmstudio.ai/download，下载 macOS（Apple Silicon）版本安装包。
将 LM Studio.app 拖入 /Applications 文件夹，并至少打开一次，这一步会初始化 lms 命令行工具。

1.2 将 `lms` 加入 PATH

打开终端，运行：

npx lmstudio install-cli

重新打开一个新终端窗口，验证安装：

lms --version

第二步 — 下载模型

有三种方式，根据你的情况选择：

方式 A — 图形界面（最简单）

打开 LM Studio。
按 ⌘ + Shift + M 打开模型搜索。
搜索 gpt-oss-120b-Sonnet-Reasoning-Distilled。
选择 MLX 格式，点击 Download。

方式 B — 命令行下载

lms get cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled --mlx

方式 C — 使用本地已有模型

如果模型文件夹已经在磁盘上，用软链接导入，无需复制文件：

# 创建目录
mkdir -p ~/Documents/LM\ Studio/models/cloudyu/

# 创建软链接（不占用额外磁盘空间）
ln -s /path/to/your/gpt-oss-120b-Sonnet-Reasoning-Distilled \
  ~/Documents/LM\ Studio/models/cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled

验证 LM Studio 已识别到模型：

lms ls

正常输出如下：

LLM                                        PARAMS    ARCH       SIZE        DEVICE
gpt-oss-120b-sonnet-reasoning-distilled    120B      gpt_oss    63.42 GB    Local

第三步 — 加载模型并启动服务器

3.1 （可选）加载前预估内存

lms load --estimate-only gpt-oss-120b-sonnet-reasoning-distilled

3.2 加载模型

lms load gpt-oss-120b-sonnet-reasoning-distilled \
  --context-length 32768 \
  --gpu max

--context-length 32768：Codex 需要较大的上下文窗口。
--gpu max：将所有层卸载到 Apple Metal GPU，MLX 模型推荐此设置。

等待出现 Model loaded successfully 即加载完成。

3.3 启动本地服务器

lms server start --port 1234

3.4 验证 API 是否正常

curl http://127.0.0.1:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer lm-studio" \
  -d '{
    "model": "gpt-oss-120b-sonnet-reasoning-distilled",
    "messages": [{"role": "user", "content": "你好"}],
    "max_tokens": 50
  }'

收到包含模型回复的 JSON 响应即为成功。

第四步 — 安装并配置 OpenAI Codex CLI

4.1 安装 Codex

npm install -g @openai/codex

验证安装：

codex --version

4.2 配置 Codex

创建（或覆盖）Codex 配置文件：

cat > ~/.codex/config.toml << 'EOF'
# 设置默认 profile，注意字段名是 "profile" 而不是 "default_profile"
profile = "gpt-oss-local"

[model_providers.my-lmstudio]
name = "LM Studio Local"
base_url = "http://127.0.0.1:1234/v1"
api_key = "lm-studio"

[profiles.gpt-oss-local]
model_provider = "my-lmstudio"
model = "gpt-oss-120b-sonnet-reasoning-distilled"
context_window = 32000
wire_api = "responses"

[projects."/Users/你的用户名/你的项目路径"]
trust_level = "trusted"
EOF

重要说明：

默认 profile 的字段名是 profile，不是 default_profile。

my-lmstudio 是自定义名称，不能使用保留名称：openai、ollama、lmstudio。

将 /Users/你的用户名/你的项目路径 替换为你实际的项目目录。

wire_api = "responses" 指定 Codex 使用 /v1/responses 接口，LM Studio 原生支持此接口。

4.3 启动 Codex

codex

正常启动后应显示：

╭──────────────────────────────────────────────────────────╮
│ >_ OpenAI Codex (v0.128.0)                               │
│                                                          │
│ model:     gpt-oss-120b-sonnet-reasoning-distilled       │
│ directory: ~/你的项目                                     │
╰──────────────────────────────────────────────────────────╯

如果 model 显示的是 gpt-5.5，说明 profile 没有生效，可以手动指定：

codex --profile gpt-oss-local

第五步 — 日常使用流程

每次打开新终端后，按以下顺序执行：

# 1. 加载模型（已加载则跳过）
lms load gpt-oss-120b-sonnet-reasoning-distilled --context-length 32768 --gpu max

# 2. 启动服务器（已运行则跳过）
lms server start --port 1234

# 3. 启动 Codex
codex

检查模型是否已加载：

lms ps

停止服务器：

lms server stop

常见问题排查

问题	解决方法
`lms` 命令找不到	运行 `npx lmstudio install-cli`，然后重新打开终端
`lms ls` 看不到模型	确认模型文件夹位于 `~/Documents/LM Studio/models/<作者名>/` 下
Codex 显示 `gpt-5.5`	使用 `codex --profile gpt-oss-local` 或检查 config.toml 里 `profile =` 字段
Codex 请求超时	清除代理环境变量：`unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY`
`reserved built-in provider IDs` 错误	重命名自定义 provider，避免使用 `openai`、`ollama`、`lmstudio`
加载模型时内存不足	减小 `--context-length`（如改为 `16384`），或关闭其他占内存的应用

Command Line

mlx_lm.chat --model cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled --max-tokens 10000 --temp 0.95

OpenAI Local Service

mlx-openai-server launch --model-path cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled --model-type lm --host 127.0.0.1 --temperature 0.95

Python

# Requirements: mlx-tune + Apple MLX
# pip install mlx-tune

from mlx_tune import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled",
    max_seq_length=4096,
    load_in_4bit=True,
)

prompt = """<|start|>system<|message|>You are a helpful assistant.
Reasoning: high
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|>
<|start|>user<|message|>Explain the time complexity of Dijkstra's algorithm and when to prefer A* instead.<|end|>
<|start|>assistant<|channel|>analysis<|message|>"""

outputs = model.generate(
    prompt,
    max_new_tokens=2048,
    temperature=0.95,   # ← recommended
    top_p=0.9,
)
print(outputs)

Prompt Format (Harmony)

This model uses the gpt-oss Harmony channel protocol. Every assistant message must declare its channel:

<|start|>system<|message|>{system prompt}
Reasoning: high
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|>
<|start|>user<|message|>{your question}<|end|>
<|start|>assistant<|channel|>analysis<|message|>{internal chain-of-thought — generated by model}<|end|>
<|start|>assistant<|channel|>final<|message|>{final answer — generated by model}<|return|>

The analysis channel contains the model's internal reasoning. You can display or hide it depending on your use case.
The final channel contains the deliverable response.
Always prompt the model to begin its reply with <|start|>assistant<|channel|>analysis<|message|> to elicit chain-of-thought before the final answer.
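The template above can be assembled and parsed with plain string handling. The following is a minimal sketch, not an official Harmony parser: `build_prompt` and `split_channels` are names invented here, and the regular expression assumes each assistant message ends with `<|end|>`, `<|return|>`, or the end of the text.

```python
import re

SYSTEM_TEMPLATE = (
    "<|start|>system<|message|>{system}\n"
    "Reasoning: high\n"
    "# Valid channels: analysis, commentary, final. "
    "Channel must be included for every message.<|end|>\n"
)

def build_prompt(system, question):
    """Assemble a prompt that ends at the start of the analysis channel."""
    return (
        SYSTEM_TEMPLATE.format(system=system)
        + f"<|start|>user<|message|>{question}<|end|>\n"
        + "<|start|>assistant<|channel|>analysis<|message|>"
    )

# One channel-tagged message, terminated by <|end|>, <|return|> or end of text.
_MESSAGE = re.compile(
    r"<\|channel\|>(\w+)<\|message\|>(.*?)(?=<\|end\|>|<\|return\|>|\Z)",
    re.DOTALL,
)

def split_channels(raw_text):
    """Return {channel: text} for each channel-tagged message in `raw_text`."""
    return {name: body.strip() for name, body in _MESSAGE.findall(raw_text)}

# Hand-written example output (not generated by the model).
raw = (
    "<|start|>assistant<|channel|>analysis<|message|>Compare heuristics.<|end|>\n"
    "<|start|>assistant<|channel|>final<|message|>Use A* with an admissible heuristic.<|return|>"
)
print(split_channels(raw)["final"])
```

Because the prompt already ends with the analysis header, prepend that header to the generated text before calling `split_channels` if your runtime returns only the completion.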

Training Details

Parameter	Value
Base model	`gpt-oss-120b-heretic-mxfp4-q8-hi-mlx`
Architecture	MoE (Mixture of Experts), 120B params
Quantization	mxfp4 weights + q8 hi-precision activations
Framework	Apple MLX + mlx-tune
Optimizer	BAdam — BlockOptimizer wrapping AdamW
LoRA rank / alpha	r=16, α=32
LoRA dropout	0.05
LoRA target modules	q/k/v/o/gate/up/down projections (SwitchLinear expert layers)
Router layers	Full-precision unfrozen, trained directly (not LoRA)
Learning rate	1e-5 (cosine decay)
Effective batch size	2 × grad accum 8 = 16
Max steps	1000
BAdam switch mode	`parallel` (head + tail dual-pointer)
BAdam switch every	10 steps
Max sequence length	4096 tokens
Peak memory during training	~88.2 GB unified memory
Training hardware	Apple M2 Ultra, 192 GB
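Two rows of the table reduce to simple arithmetic: the effective batch size and the cosine learning-rate decay. The sketch below assumes a plain cosine schedule with no warmup that decays to zero at the final step; the table does not state either detail, so treat the curve as illustrative.

```python
import math

BASE_LR = 1e-5      # from the table
MAX_STEPS = 1000    # from the table
MICRO_BATCH = 2     # from the table
GRAD_ACCUM = 8      # from the table

def cosine_lr(step, base_lr=BASE_LR, max_steps=MAX_STEPS, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at max_steps (no warmup assumed)."""
    progress = min(max(step, 0), max_steps) / max_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

effective_batch = MICRO_BATCH * GRAD_ACCUM  # 2 * 8 = 16 sequences per optimizer step

for step in (0, 250, 500, 750, 1000):
    print(step, f"{cosine_lr(step):.2e}")
```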

Dataset

Source: Roman1111111/claude-sonnet-4.6-100000X-filtered
Filter: difficulty == "complex" only → 36,444 samples
Split: 90% train (32,797) / 10% validation (3,644), seed=42
Format: OpenAI-style messages column; assistant turns include a separate reasoning field containing the Claude Sonnet 4.6 chain-of-thought
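The filter and split described above can be sketched as follows. This is not the author's preprocessing script: the shuffle-then-cut approach and the `difficulty` field on each row are assumptions, and the toy rows are invented. The exact counts listed above depend on the splitting tool and its rounding, which this sketch does not try to reproduce.

```python
import random

def filter_and_split(rows, seed=42, train_fraction=0.9):
    """Keep difficulty == "complex" rows, shuffle with `seed`, and split train/validation."""
    complex_rows = [r for r in rows if r.get("difficulty") == "complex"]
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    rng.shuffle(complex_rows)
    cut = int(len(complex_rows) * train_fraction)
    return complex_rows[:cut], complex_rows[cut:]

# Toy rows standing in for the real dataset (field values are invented).
toy = [{"id": i, "difficulty": "complex" if i % 2 == 0 else "simple"} for i in range(40)]
train, val = filter_and_split(toy)
print(len(train), len(val))  # 18 2
```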

BAdam — Block-wise Coordinate Descent

Standard full fine-tuning of a 120B model on a single Mac is memory-prohibitive. BAdam solves this with block coordinate descent:

All 36 Transformer layers are partitioned into individual blocks.
At each optimizer step, only the active block's gradients are applied; all others are zeroed.
AdamW moment states are lazily initialized — inactive blocks never accumulate optimizer state, keeping peak memory comparable to LoRA alone.
parallel switch mode activates head + tail blocks simultaneously, with dual pointers advancing inward each cycle. This doubles layer coverage speed at ~2× the memory cost of single-block mode.
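The block schedule described above can be illustrated with a few lines of index arithmetic. This is a sketch of the idea, not the author's BlockOptimizer: it assumes the head and tail pointers restart from the outermost layers once they meet, and it uses plain floats in place of gradient tensors.

```python
NUM_BLOCKS = 36      # one block per Transformer layer
SWITCH_EVERY = 10    # steps between block switches

def active_blocks(step, num_blocks=NUM_BLOCKS, switch_every=SWITCH_EVERY):
    """Return the (head, tail) block indices trained at `step` in parallel mode."""
    phases = (num_blocks + 1) // 2          # pointer positions before they meet
    phase = (step // switch_every) % phases  # assumed wrap-around after meeting
    head = phase
    tail = num_blocks - 1 - phase
    return head, tail

def zero_inactive(grads_by_block, step):
    """Zero the gradient of every block that is not active at `step`."""
    keep = set(active_blocks(step, num_blocks=len(grads_by_block)))
    return [g if i in keep else 0.0 for i, g in enumerate(grads_by_block)]

for step in (0, 10, 170, 180):
    print(step, active_blocks(step))
```

Under these assumptions one full sweep over all 36 layers takes 18 switches, or 180 steps, so a 1000-step run covers each layer several times.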

Benchmark Results

Custom hard benchmark — 14 tasks across math, coding, logic, and science. 60s execution timeout per task. Auto-graded with keyword matching + live code execution.

Evaluated on the 1000-step checkpoint (this released model).
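The grading approach described above, keyword matching plus code execution under a timeout, could look roughly like this. The actual benchmark harness is not published in this card, so the function names, the case-insensitive substring match, and the exit-status criterion are all assumptions.

```python
import subprocess
import sys

def keyword_accuracy(answer, keywords):
    """Fraction of `keywords` that appear in `answer` (case-insensitive)."""
    if not keywords:
        return 1.0
    text = answer.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)

def runs_cleanly(code, timeout_s=60):
    """Run `code` in a fresh interpreter; True if it exits with status 0 in time."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

print(keyword_accuracy("Uses a min-heap; O((V+E) log V).", ["heap", "log V", "A*"]))
print(runs_cleanly("assert sum(range(5)) == 10"))
```

Model-generated code is untrusted, so a real harness should run it in a sandbox rather than directly on the host as done here.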

ID	Task	Verdict	KW%	Time	Tok/s
math_01	AMC 2025 — n-Norwegian Number	✅ PASS	50%	56.4s	39
math_02	Euler Totient Sum — last 6 digits	✅ PASS	100%	16.2s	35
math_03	Lattice Paths Avoiding the Antidiagonal	✅ PASS	100%	16.6s	34
math_04	Segmented Sieve in [10¹², 10¹²+10⁶]	✅ PASS	100%	25.7s	35
code_01	Median of Two Sorted Arrays	✅ PASS	100%	25.0s	35
code_02	Thread-Safe LRU Cache with TTL	✅ PASS	100%	38.9s	36
code_03	Persistent Segment Tree — K-th Query	✅ PASS	100%	45.9s	35
code_04	Multi-Head Attention + RoPE (NumPy)	✅ PASS	100%	38.0s	33
code_05	Dijkstra vs A* on Large Random Graph	✅ PASS	100%	26.1s	34
logic_01	Knights & Knaves — Exhaustive SAT	✅ PASS	100%	16.0s	36
logic_02	Verify Three Mathematical Claims	✅ PASS	100%	13.2s	34
sci_01	Figure-8 Three-Body Orbit — Energy Conservation	❌ FAIL	100%	29.4s	33
sci_02	Metropolis-Hastings vs HMC Comparison	❌ FAIL	100%	38.3s	33
sci_03	Optimizer Comparison on Rosenbrock	✅ PASS	100%	23.8s	28

Overall: 12/14 PASS | Avg KW: 96.4% | Total benchmark time: 409s

The two failures (sci_01, sci_02) both scored 100% on keyword accuracy: the responses covered every expected keyword, but the generated code failed on numerical precision or runtime assertion edge cases. This is consistent with an early 1000-step checkpoint; further training is expected to close these gaps.
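The summary figures can be recomputed from the per-task rows. The script below copies the values from the table above and derives the pass count, average keyword accuracy, total time, and mean throughput.

```python
# (task id, passed, keyword %, seconds, tokens/s) copied from the table above
RESULTS = [
    ("math_01", True, 50, 56.4, 39),
    ("math_02", True, 100, 16.2, 35),
    ("math_03", True, 100, 16.6, 34),
    ("math_04", True, 100, 25.7, 35),
    ("code_01", True, 100, 25.0, 35),
    ("code_02", True, 100, 38.9, 36),
    ("code_03", True, 100, 45.9, 35),
    ("code_04", True, 100, 38.0, 33),
    ("code_05", True, 100, 26.1, 34),
    ("logic_01", True, 100, 16.0, 36),
    ("logic_02", True, 100, 13.2, 34),
    ("sci_01", False, 100, 29.4, 33),
    ("sci_02", False, 100, 38.3, 33),
    ("sci_03", True, 100, 23.8, 28),
]

passed = sum(1 for r in RESULTS if r[1])
avg_kw = sum(r[2] for r in RESULTS) / len(RESULTS)
total_time = sum(r[3] for r in RESULTS)
avg_speed = sum(r[4] for r in RESULTS) / len(RESULTS)

print(f"{passed}/{len(RESULTS)} PASS")
print(f"avg KW {avg_kw:.1f}%")
print(f"total time {total_time:.1f}s")
print(f"avg speed {avg_speed:.1f} tok/s")
```

The results agree with the reported 12/14, 96.4%, roughly 409 s, and about 34 tok/s.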

Inference Speed on M2 Ultra

Metric	Value
Hardware	Apple M2 Ultra, 192 GB unified memory
Peak memory (inference)	~88 GB
Throughput	28–39 tok/s depending on context length
Benchmark average	~34 tok/s

Running a 120B MoE model at 34 tok/s entirely on a single Mac — no cloud, no quantization compromise in output quality — is the core reason to use this model.

More details about the test

Limitations

Apple Silicon only — quantized and optimized for MLX. Not compatible with CUDA/ROCm without re-quantization.
1000-step checkpoint — an early fine-tune. Scientific computing tasks with strict numerical tolerances may still have occasional failures.
No RLHF — trained on supervised distillation data only. Safety alignment is not as strict as commercial instruction-tuned models.
Harmony prompt format required — this model expects the <|channel|> protocol. Standard ChatML or Alpaca-style prompts will produce degraded results.

Citation / Acknowledgements

Base model: gpt-oss-120b-heretic
Training dataset: Roman1111111/claude-sonnet-4.6-100000X-filtered
Optimizer: BAdam — Block Coordinate Descent for Large Language Model Fine-tuning
Framework: Apple MLX · mlx-tune
Reasoning traces distilled from: Claude Sonnet 4.6 (Anthropic)

cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled（中文说明）

基于 Claude Sonnet 4.6 蒸馏、专为 Apple Silicon 优化的完整 MoE 推理模型。快速、强大、完全本地运行。

推荐温度：0.95 — 本模型基于复杂推理轨迹训练，稍高的温度能充分激活思维链的多样性与深度。低于 0.7 会压缩推理过程；高于 1.2 可能影响输出连贯性。

模型简介

cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled 是 gpt-oss-120b-heretic-mxfp4-q8-hi-mlx 的完整合并微调版本，使用自定义 BAdam（块坐标 Adam）优化器配合 LoRA，基于 Apple MLX 框架训练。训练数据来源于 Claude Sonnet 4.6 的蒸馏推理轨迹，仅保留 difficulty=complex 样本。

这是完整模型上传，无需额外下载基础模型或 adapter 文件，下载即可直接运行。

模型使用 Harmony channel 协议进行逐步推理（analysis 承载内部思维链，final 承载最终回复），并在 M 系芯片——尤其是 M2 Ultra 上——实现了出色的推理速度。

核心优势

🚀 Apple Silicon 极致速度 — 120B MoE 模型在 M2 Ultra（192 GB 统一内存）上维持 28–39 tok/s，这是本模型最突出的特性。
🌡️ 推荐温度：0.95 — 充分释放模型思维链推理的深度与多样性。
🧠 强推理能力 — 自定义高难度测试集 **12/14 通过，平均关键词命中率 96.4%**，覆盖数学竞赛、系统编程、逻辑推理、科学计算。
🍎 完全本地运行 — 无需云端 API，无需 GPU 集群，纯 MLX 在 Mac 上推理。
📦 完整模型 — 已合并权重，直接下载使用，无需配置 adapter 或准备基础模型。
📐 Harmony channel 格式 — analysis（思维链）与 final（最终回复）channel 分离，推理过程透明可见。

快速开始

# 依赖：mlx-tune + Apple MLX
# pip install mlx-tune

from mlx_tune import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled",
    max_seq_length=4096,
    load_in_4bit=True,
)

prompt = """<|start|>system<|message|>You are a helpful assistant.
Reasoning: high
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|>
<|start|>user<|message|>请解释 Dijkstra 算法的时间复杂度，以及何时应该优先选用 A*。<|end|>
<|start|>assistant<|channel|>analysis<|message|>"""

outputs = model.generate(
    prompt,
    max_new_tokens=2048,
    temperature=0.95,   # ← 推荐值
    top_p=0.9,
)
print(outputs)

提示词格式（Harmony）

本模型使用 gpt-oss Harmony channel 协议，每条消息必须声明所属 channel：

<|start|>system<|message|>{系统提示}
Reasoning: high
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|>
<|start|>user<|message|>{你的问题}<|end|>
<|start|>assistant<|channel|>analysis<|message|>{内部思维链 — 由模型生成}<|end|>
<|start|>assistant<|channel|>final<|message|>{最终回复 — 由模型生成}<|return|>

analysis channel 包含模型内部推理过程，可根据需要展示或隐藏。
final channel 包含最终输出内容。
始终以 <|start|>assistant<|channel|>analysis<|message|> 引导模型先输出思维链，再给出最终答案。

训练配置

参数	值
基础模型	`gpt-oss-120b-heretic-mxfp4-q8-hi-mlx`
模型架构	MoE（混合专家），120B 参数
量化方式	mxfp4 权重 + q8 高精度激活值
训练框架	Apple MLX + mlx-tune
优化器	BAdam — BlockOptimizer 包装 AdamW
LoRA rank / alpha	r=16, α=32
LoRA dropout	0.05
LoRA 目标模块	q/k/v/o/gate/up/down 投影层（SwitchLinear expert 层）
Router 层	全精度解冻直接微调（不走 LoRA）
学习率	1e-5（余弦衰减）
等效批次大小	2 × 梯度累积 8 = 16
训练步数	1000 步
BAdam 切换模式	`parallel`（头尾双指针并行）
BAdam 切换频率	每 10 步切换一次
最大序列长度	4096 tokens
训练峰值内存	~88.2 GB 统一内存
训练硬件	Apple M2 Ultra，192 GB

数据集

来源： Roman1111111/claude-sonnet-4.6-100000X-filtered
过滤： 仅保留 difficulty == "complex" → 36,444 条
划分： 90% 训练（32,797 条）/ 10% 验证（3,644 条），seed=42
格式： OpenAI 风格 messages 列，assistant 消息含独立 reasoning 字段（Claude Sonnet 4.6 思维链）

BAdam — 块坐标下降原理

在单台 Mac 上对 120B 模型做全参数微调显存开销极大。BAdam 通过块坐标下降解决：

36 个 Transformer 层各自划分为一个块。
每次优化步只应用当前活跃块的梯度，其余块梯度归零。
AdamW 矩状态惰性初始化——非活跃块不积累优化器状态，峰值内存与纯 LoRA 相当。
parallel 切换模式同时激活头尾两个块，双指针向中间推进，以约 2 倍内存换取更快的层覆盖速度。

测试结果

自定义高难度测试集，共 14 道题，每题执行超时 60 秒，关键词匹配 + 代码执行自动评分。使用本次发布的 1000 步完整模型 评估。

编号	题目	结果	关键词命中	耗时	速度
math_01	AMC 2025 — n-Norwegian 数	✅ 通过	50%	56.4s	39 tok/s
math_02	Euler Totient 求和——后 6 位	✅ 通过	100%	16.2s	35 tok/s
math_03	格路径绕过反对角线	✅ 通过	100%	16.6s	34 tok/s
math_04	[10¹², 10¹²+10⁶] 分段筛法	✅ 通过	100%	25.7s	35 tok/s
code_01	两个有序数组的中位数	✅ 通过	100%	25.0s	35 tok/s
code_02	带 TTL 的线程安全 LRU 缓存	✅ 通过	100%	38.9s	36 tok/s
code_03	持久化线段树 — 第 K 小查询	✅ 通过	100%	45.9s	35 tok/s
code_04	多头注意力 + RoPE（NumPy）	✅ 通过	100%	38.0s	33 tok/s
code_05	Dijkstra vs A* 大随机图对比	✅ 通过	100%	26.1s	34 tok/s
logic_01	骑士与骗子 — 穷举 SAT	✅ 通过	100%	16.0s	36 tok/s
logic_02	验证三个数学命题	✅ 通过	100%	13.2s	34 tok/s
sci_01	八字三体轨道——能量守恒 ODE	❌ 未通过	100%	29.4s	33 tok/s
sci_02	Metropolis-Hastings vs HMC	❌ 未通过	100%	38.3s	33 tok/s
sci_03	Rosenbrock 优化器对比	✅ 通过	100%	23.8s	28 tok/s

总结：12/14 通过 | 平均关键词命中率 96.4% | 测试总耗时 409 秒

两道未通过的题（sci_01、sci_02）关键词命中率均为 100%——模型完全理解了题意，但生成代码存在数值精度或运行时断言边界问题，这是 1000 步早期 checkpoint 的预期表现。

M2 Ultra 推理速度

指标	数值
硬件	Apple M2 Ultra，192 GB 统一内存
峰值内存（推理）	~88 GB
推理速度	28–39 tok/s（因上下文长度而异）
测试集平均	~34 tok/s

单台 Mac 本地运行 120B MoE 模型达到 34 tok/s，无需云端、无需量化妥协——这是选择本模型的核心理由。

局限性

仅支持 Apple Silicon — 针对 MLX 量化优化，不兼容 CUDA/ROCm，如需其他平台运行需重新量化。
1000 步早期 checkpoint — 科学计算类任务在数值精度严苛的场景下仍可能出现边界失败。
无 RLHF — 纯监督蒸馏微调，安全对齐强度弱于商业指令模型，使用时请注意。
必须使用 Harmony 提示格式 — 模型期待 <|channel|> 协议，标准 ChatML 或 Alpaca 格式会导致输出质量下降。

引用与致谢

基础模型：gpt-oss-120b-heretic
训练数据集：Roman1111111/claude-sonnet-4.6-100000X-filtered
优化器：BAdam — Block Coordinate Descent for Large Language Model Fine-tuning
训练框架：Apple MLX · mlx-tune
推理轨迹蒸馏自：Claude Sonnet 4.6（Anthropic）
