How to use cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled with MLX:
# Download the model from the Hub
pip install "huggingface_hub[hf_xet]"
huggingface-cli download --local-dir gpt-oss-120b-Sonnet-Reasoning-Distilled cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled
cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled
A fully fine-tuned MoE reasoning model, about 60 GB in size, distilled from Claude Sonnet 4.6 and optimized for Apple Silicon with MLX. Fast, capable, and runs entirely on-device.
Recommended temperature: 0.95. This model was trained on complex reasoning traces and benefits from a slightly higher temperature to fully activate its chain-of-thought behavior. Values below 0.7 tend to flatten reasoning diversity; values above 1.2 may introduce incoherence.
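To make the temperature advice concrete, here is a minimal sketch of temperature-scaled softmax in plain Python. The logits are made-up illustrative values, not taken from this model; the point is only that lower temperatures concentrate probability on the top token while higher ones spread it out.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return token probabilities after dividing the logits by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens (illustrative values only).
logits = [2.0, 1.0, 0.5]
for t in (0.7, 0.95, 1.2):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The probability of the top token shrinks as the temperature rises from 0.7 to 1.2, which is the mechanism behind the trade-off between reasoning diversity and coherence described above.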
Model Summary
cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled is a fully merged fine-tune of gpt-oss-120b-heretic-mxfp4-q8-hi-mlx, trained using a custom BAdam (Block-wise Adam) optimizer with LoRA adapters on Apple MLX. The fine-tuning uses distilled reasoning traces from Claude Sonnet 4.6, filtered to difficulty=complex samples only.
This is a complete model upload — no separate adapter files or base model are needed. Download and run directly.
The model reasons step-by-step through hard problems using the Harmony channel protocol (analysis for internal chain-of-thought, final for the delivered response), and delivers this at surprisingly high throughput on M-series hardware — particularly the M2 Ultra.
Highlights
- 🚀 Speed-first on Apple Silicon: sustained 28–39 tok/s on M2 Ultra (192 GB unified memory) for a 120B MoE model. This is the defining characteristic of this release.
- 🌡️ Recommended temperature: 0.95, which unlocks the full depth of the model's reasoning traces.
- 🧠 Strong reasoning: 12/14 PASS on a custom hard benchmark spanning math olympiad, systems coding, logic, and scientific computing. Average keyword accuracy: 96.4%.
- 🍎 Fully on-device: no cloud API, no GPU cluster. Pure MLX inference on Mac.
- 📦 Full model: merged weights included. No base model or adapter setup required.
- 📐 Harmony channel format: analysis (chain-of-thought) and final (response) channel separation, making the reasoning process explicit and inspectable.
Quick Start
Using LM Studio with OpenAI Codex on Mac (MLX Models)
A step-by-step guide to running cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled locally on Apple Silicon Mac and connecting it to OpenAI Codex CLI.
Prerequisites
- Apple Silicon Mac (M1 / M2 / M3 / M4)
- macOS 13 or later
- At least 64 GB unified memory (the 120B model requires ~60 GB)
- Homebrew installed
- Node.js 18+ (for Codex CLI)
Part 1 — Install LM Studio
1.1 Download and install the app
Go to https://lmstudio.ai/download and download the macOS (Apple Silicon) installer.
Drag LM Studio.app into your /Applications folder and open it at least once — this bootstraps the lms CLI.
1.2 Add lms to your PATH
Open a terminal and run:
npx lmstudio install-cli
Open a new terminal window, then verify:
lms --version
Part 2 — Download the Model
You have three options: download via the LM Studio app GUI, use the CLI, or link a model folder you have already downloaded.
Option A — GUI (easiest)
- Open LM Studio.
- Press ⌘ + Shift + M to open the model search.
- Search for gpt-oss-120b-Sonnet-Reasoning-Distilled.
- Select the MLX variant and click Download.
Option B — CLI
lms get cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled --mlx
Option C — Use a locally downloaded model
If you already have the model folder on disk, create a symlink into LM Studio's model directory:
# Create the target directory
mkdir -p ~/Documents/LM\ Studio/models/cloudyu/
# Symlink (no file copying, saves disk space)
ln -s /path/to/your/gpt-oss-120b-Sonnet-Reasoning-Distilled \
~/Documents/LM\ Studio/models/cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled
Verify LM Studio detected the model:
lms ls
You should see something like:
LLM PARAMS ARCH SIZE DEVICE
gpt-oss-120b-sonnet-reasoning-distilled 120B gpt_oss 63.42 GB Local
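The symlink step in Option C can also be scripted. The sketch below mirrors the two shell commands with pathlib; the function name `link_model` and its `author` parameter are hypothetical helpers introduced here, and the models directory is the one shown in the commands above.

```python
from pathlib import Path

def link_model(source: Path, models_root: Path, author: str = "cloudyu") -> Path:
    """Symlink an existing model folder into <models_root>/<author>/<folder name>."""
    source = source.expanduser().resolve()
    if not source.is_dir():
        raise FileNotFoundError(f"model folder not found: {source}")
    target_dir = models_root.expanduser() / author
    target_dir.mkdir(parents=True, exist_ok=True)    # same effect as mkdir -p
    link = target_dir / source.name
    if link.is_symlink() or link.exists():
        return link                                   # leave an existing entry alone
    link.symlink_to(source, target_is_directory=True)
    return link
```

A call such as `link_model(Path("/path/to/your/gpt-oss-120b-Sonnet-Reasoning-Distilled"), Path("~/Documents/LM Studio/models"))` would create the same link as the `ln -s` command, after which `lms ls` should list the model.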
Part 3 — Load the Model and Start the Server
3.1 (Optional) Estimate memory usage first
lms load --estimate-only gpt-oss-120b-sonnet-reasoning-distilled
3.2 Load the model
lms load gpt-oss-120b-sonnet-reasoning-distilled \
--context-length 32768 \
--gpu max
- --context-length 32768: Codex needs a large context window.
- --gpu max: offloads all layers to the Apple Metal GPU (recommended for MLX models).
Wait for: Model loaded successfully.
3.3 Start the local server
lms server start --port 1234
3.4 Verify the API is working
curl http://127.0.0.1:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer lm-studio" \
-d '{
"model": "gpt-oss-120b-sonnet-reasoning-distilled",
"messages": [{"role": "user", "content": "hello"}],
"max_tokens": 50
}'
You should receive a JSON response with the model's reply.
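The same check can be made from Python with only the standard library. This is a sketch that rebuilds the curl request above; the helper names `build_chat_request` and `send_chat` are made up for illustration, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` shape.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:1234/v1"                 # server started in step 3.3
MODEL = "gpt-oss-120b-sonnet-reasoning-distilled"

def build_chat_request(prompt: str, max_tokens: int = 50) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer lm-studio",
        },
        method="POST",
    )

def send_chat(prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the reply text (requires the server to be running)."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    # Assumption: the response follows the OpenAI chat-completions layout.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(send_chat("hello"))` should print the model's reply.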
Part 4 — Install and Configure OpenAI Codex CLI
4.1 Install Codex
npm install -g @openai/codex
Verify:
codex --version
4.2 Configure Codex
Create (or overwrite) the Codex config file:
# Make sure the config directory exists
mkdir -p ~/.codex
cat > ~/.codex/config.toml << 'EOF'
# Set this profile as default — note: the key is "profile", not "default_profile"
profile = "gpt-oss-local"
[model_providers.my-lmstudio]
name = "LM Studio Local"
base_url = "http://127.0.0.1:1234/v1"
api_key = "lm-studio"
[profiles.gpt-oss-local]
model_provider = "my-lmstudio"
model = "gpt-oss-120b-sonnet-reasoning-distilled"
context_window = 32000
wire_api = "responses"
[projects."/Users/YOUR_USERNAME/your-project"]
trust_level = "trusted"
EOF
Important notes:
- The correct key for a default profile is profile, not default_profile.
- my-lmstudio is a custom name. Do not use the reserved names openai, ollama, or lmstudio.
- Replace /Users/YOUR_USERNAME/your-project with your actual project path.
- wire_api = "responses" tells Codex to use the /v1/responses endpoint, which LM Studio supports natively.
4.3 Run Codex
codex
You should see:
╭──────────────────────────────────────────────────────────╮
│ >_ OpenAI Codex (v0.128.0) │
│ │
│ model: gpt-oss-120b-sonnet-reasoning-distilled │
│ directory: ~/your-project │
╰──────────────────────────────────────────────────────────╯
If you see gpt-5.5 in the model field, your profile is not loading. Run with an explicit flag instead:
codex --profile gpt-oss-local
Part 5 — Everyday Workflow
Each time you start a new terminal session, run the first two commands below before launching Codex:
# 1. Load the model (skip if already loaded)
lms load gpt-oss-120b-sonnet-reasoning-distilled --context-length 32768 --gpu max
# 2. Start the server (skip if already running)
lms server start --port 1234
# 3. Launch Codex
codex
To check whether the model is already loaded:
lms ps
To stop the server:
lms server stop
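Because steps 1 and 2 can be skipped when the server is already up, it can be handy to test the port first. The following is a minimal sketch, assuming the default host and port used in this guide; it only confirms that something accepts TCP connections on the port, not that the model is loaded (use `lms ps` for that).

```python
import socket

def server_is_listening(host: str = "127.0.0.1", port: int = 1234, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

If `server_is_listening()` returns False, run `lms server start --port 1234` before launching Codex.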
Troubleshooting
| Problem | Solution |
|---|---|
| lms command not found | Run npx lmstudio install-cli, then open a new terminal |
| Model not detected by lms ls | Check that your model folder is inside ~/Documents/LM Studio/models/<author>/ |
| Codex shows gpt-5.5 as model | Use codex --profile gpt-oss-local or verify profile = in config.toml |
| Codex request times out | Make sure proxy env vars are unset: unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY |
| model_providers contains reserved built-in provider IDs | Rename your provider; avoid openai, ollama, lmstudio |
| Out of memory when loading | Reduce --context-length (e.g. 16384) or close other apps |
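Two of the rows above (a missing lms command and leftover proxy variables) can be detected automatically. This is a small preflight sketch; `preflight` is a hypothetical helper, and its `env` and `which` parameters exist only so the check can be exercised without touching the real environment.

```python
import os
import shutil

PROXY_VARS = ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY")

def preflight(env=None, which=shutil.which) -> list[str]:
    """Return hints for two table rows: missing lms and leftover proxy variables."""
    env = os.environ if env is None else env
    hints = []
    if which("lms") is None:
        hints.append("lms not on PATH: run npx lmstudio install-cli, then open a new terminal")
    active = [name for name in PROXY_VARS if env.get(name)]
    if active:
        hints.append("proxy variables set: unset " + " ".join(active))
    return hints
```

Calling `preflight()` with no arguments inspects the current shell environment and returns an empty list when neither problem is present.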
Command line
mlx_lm.chat --model cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled --max-tokens 10000 --temp 0.95
OpenAI Local Service
mlx-openai-server launch --model-path cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled --model-type lm --host 127.0.0.1 --temperature 0.95
Python
# Requirements: mlx-tune + Apple MLX
# pip install mlx-tune
from mlx_tune import FastLanguageModel

# Load the merged model by its Hub repo ID
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Harmony prompt that ends at the start of the analysis channel
prompt = """<|start|>system<|message|>You are a helpful assistant.
Reasoning: high
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|>
<|start|>user<|message|>Explain the time complexity of Dijkstra's algorithm and when to prefer A* instead.<|end|>
<|start|>assistant<|channel|>analysis<|message|>"""

# Generate with the recommended sampling temperature
outputs = model.generate(
    prompt,
    max_new_tokens=2048,
    temperature=0.95,  # ← recommended
    top_p=0.9,
)
print(outputs)
Prompt Format (Harmony)
This model uses the gpt-oss Harmony channel protocol. Every assistant message must declare its channel:
<|start|>system<|message|>{system prompt}
Reasoning: high
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|>
<|start|>user<|message|>{your question}<|end|>
<|start|>assistant<|channel|>analysis<|message|>{internal chain-of-thought — generated by model}<|end|>
<|start|>assistant<|channel|>final<|message|>{final answer — generated by model}<|return|>
- The analysis channel contains the model's internal reasoning. You can display or hide it depending on your use case.
- The final channel contains the deliverable response.
- Always prompt the model to begin its reply with <|start|>assistant<|channel|>analysis<|message|> to elicit chain-of-thought before the final answer.
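The template and the three rules above can be wrapped in two small helpers: one that assembles the prompt and one that separates the analysis and final channels in the raw output. This is a sketch with hypothetical function names; it assumes the runtime returns the control tokens as literal text, which may not hold for every decoder setting.

```python
import re

CHANNEL_RULE = ("# Valid channels: analysis, commentary, final. "
                "Channel must be included for every message.")

def build_harmony_prompt(system: str, user: str, reasoning: str = "high") -> str:
    """Assemble the template above, ending where the analysis channel begins."""
    return (
        f"<|start|>system<|message|>{system}\n"
        f"Reasoning: {reasoning}\n"
        f"{CHANNEL_RULE}<|end|>\n"
        f"<|start|>user<|message|>{user}<|end|>\n"
        "<|start|>assistant<|channel|>analysis<|message|>"
    )

# Channel name, then message text up to the next control token or end of string.
_MESSAGE = re.compile(
    r"<\|channel\|>(\w+)<\|message\|>(.*?)(?=<\|end\|>|<\|return\|>|<\|start\|>|\Z)",
    re.DOTALL,
)

def split_channels(completion: str) -> dict[str, str]:
    """Split raw generated text into {channel: text}."""
    # The prompt already opened the analysis channel, so restore that header first.
    text = "<|channel|>analysis<|message|>" + completion
    channels: dict[str, str] = {}
    for name, body in _MESSAGE.findall(text):
        channels[name] = (channels.get(name, "") + body).strip()
    return channels
```

An application that hides the reasoning would show only `split_channels(output).get("final")`; one that exposes it can render the `analysis` entry separately.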
Training Details
| Parameter | Value |
|---|---|
| Base model | gpt-oss-120b-heretic-mxfp4-q8-hi-mlx |
| Architecture | MoE (Mixture of Experts), 120B params |
| Quantization | mxfp4 weights + q8 hi-precision activations |
| Framework | Apple MLX + mlx-tune |
| Optimizer | BAdam — BlockOptimizer wrapping AdamW |
| LoRA rank / alpha | r=16, α=32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q/k/v/o/gate/up/down projections (SwitchLinear expert layers) |
| Router layers | Full-precision unfrozen, trained directly (not LoRA) |
| Learning rate | 1e-5 (cosine decay) |
| Effective batch size | 2 × grad accum 8 = 16 |
| Max steps | 1000 |
| BAdam switch mode | parallel (head + tail dual-pointer) |
| BAdam switch every | 10 steps |
| Max sequence length | 4096 tokens |
| Peak memory during training | ~88.2 GB unified memory |
| Training hardware | Apple M2 Ultra, 192 GB |
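A few of the numbers in the table can be combined into a rough picture of the run. The sketch below assumes that one step means one optimizer update over the effective batch, and that the cosine schedule decays from the base rate to zero with no warmup; neither detail is stated in the table, so treat both as assumptions.

```python
import math

MICRO_BATCH = 2
GRAD_ACCUM = 8
MAX_STEPS = 1000
BASE_LR = 1e-5
TRAIN_SAMPLES = 32_797      # training split size from the Dataset section

effective_batch = MICRO_BATCH * GRAD_ACCUM
samples_seen = effective_batch * MAX_STEPS      # assumes 1 step = 1 optimizer update
epochs = samples_seen / TRAIN_SAMPLES

def cosine_lr(step: int, base_lr: float = BASE_LR, total_steps: int = MAX_STEPS) -> float:
    """Plain cosine decay from base_lr to 0 (no warmup, no floor: both assumptions)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

lora_scale = 32 / 16        # alpha / r, the usual LoRA scaling factor

print(effective_batch, samples_seen, round(epochs, 2), lora_scale)
print([cosine_lr(s) for s in (0, 500, 1000)])
```

Under these assumptions the run covers roughly half of the training split, which fits the description of this release as an early checkpoint.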
Dataset
- Source: Roman1111111/claude-sonnet-4.6-100000X-filtered
- Filter: difficulty == "complex" only → 36,444 samples
- Split: 90% train (32,797) / 10% validation (3,644), seed=42
- Format: OpenAI-style messages column; assistant turns include a separate reasoning field containing the Claude Sonnet 4.6 chain-of-thought
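The filter and split described above can be sketched in a few lines of plain Python. The toy records below only imitate the described columns (`difficulty`, `messages`, and a `reasoning` field on assistant turns); the shuffle-then-slice procedure is an assumption, since the exact splitting code is not given.

```python
import random

def filter_and_split(records, val_fraction=0.10, seed=42):
    """Keep difficulty == "complex" rows, shuffle with a fixed seed, then split."""
    complex_rows = [r for r in records if r.get("difficulty") == "complex"]
    rng = random.Random(seed)
    rng.shuffle(complex_rows)
    n_val = int(len(complex_rows) * val_fraction)
    return complex_rows[n_val:], complex_rows[:n_val]

# Toy records standing in for the real dataset rows (field values are made up).
toy = [
    {
        "difficulty": "complex" if i % 3 else "simple",
        "messages": [
            {"role": "user", "content": f"q{i}"},
            {"role": "assistant", "content": f"a{i}", "reasoning": f"r{i}"},
        ],
    }
    for i in range(30)
]
train, val = filter_and_split(toy)
print(len(train), len(val))
```

Fixing the seed makes the split reproducible, so the same validation rows are held out on every run.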
BAdam — Block-wise Coordinate Descent
Standard full fine-tuning of a 120B model on a single Mac is memory-prohibitive. BAdam solves this with block coordinate descent:
- All 36 Transformer layers are partitioned into individual blocks.
- At each optimizer step, only the active block's gradients are applied; all others are zeroed.
- AdamW moment states are lazily initialized — inactive blocks never accumulate optimizer state, keeping peak memory comparable to LoRA alone.
- parallel switch mode activates head + tail blocks simultaneously, with dual pointers advancing inward each cycle. This doubles layer coverage speed at ~2× the memory cost of single-block mode.
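The block schedule described in these bullets can be sketched as a pure function of the step counter. This is an illustration under assumptions, not the training code: the layer count and switch interval come from this section and the table above, while the wrap-around once the pointers meet, and the use of plain floats in place of gradient arrays, are guesses made for the example.

```python
def active_blocks(step: int, n_layers: int = 36, switch_every: int = 10) -> tuple[int, int]:
    """Return the (head, tail) layer indices that are trainable at this step."""
    phase = step // switch_every          # how many switches have happened so far
    half = (n_layers + 1) // 2            # the pointers meet after this many phases
    offset = phase % half                 # assumption: restart from the ends after meeting
    return offset, n_layers - 1 - offset

def mask_gradients(grads_by_layer: dict[int, float], step: int) -> dict[int, float]:
    """Zero every layer's gradient except the two active blocks."""
    head, tail = active_blocks(step, n_layers=len(grads_by_layer))
    return {i: (g if i in (head, tail) else 0.0) for i, g in grads_by_layer.items()}

# One full inward sweep over 36 layers takes 18 phases of 10 steps each.
for step in (0, 10, 170, 180):
    print(step, active_blocks(step))
```

Under this schedule a full sweep takes 180 steps, so a 1000-step run would visit each layer pair between five and six times.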
Benchmark Results
Custom hard benchmark — 14 tasks across math, coding, logic, and science. 60s execution timeout per task. Auto-graded with keyword matching + live code execution.
Evaluated on the 1000-step checkpoint (this released model).
| ID | Task | Verdict | KW% | Time | Tok/s |
|---|---|---|---|---|---|
| math_01 | AMC 2025 — n-Norwegian Number | ✅ PASS | 50% | 56.4s | 39 |
| math_02 | Euler Totient Sum — last 6 digits | ✅ PASS | 100% | 16.2s | 35 |
| math_03 | Lattice Paths Avoiding the Antidiagonal | ✅ PASS | 100% | 16.6s | 34 |
| math_04 | Segmented Sieve in [10¹², 10¹²+10⁶] | ✅ PASS | 100% | 25.7s | 35 |
| code_01 | Median of Two Sorted Arrays | ✅ PASS | 100% | 25.0s | 35 |
| code_02 | Thread-Safe LRU Cache with TTL | ✅ PASS | 100% | 38.9s | 36 |
| code_03 | Persistent Segment Tree — K-th Query | ✅ PASS | 100% | 45.9s | 35 |
| code_04 | Multi-Head Attention + RoPE (NumPy) | ✅ PASS | 100% | 38.0s | 33 |
| code_05 | Dijkstra vs A* on Large Random Graph | ✅ PASS | 100% | 26.1s | 34 |
| logic_01 | Knights & Knaves — Exhaustive SAT | ✅ PASS | 100% | 16.0s | 36 |
| logic_02 | Verify Three Mathematical Claims | ✅ PASS | 100% | 13.2s | 34 |
| sci_01 | Figure-8 Three-Body Orbit — Energy Conservation | ❌ FAIL | 100% | 29.4s | 33 |
| sci_02 | Metropolis-Hastings vs HMC Comparison | ❌ FAIL | 100% | 38.3s | 33 |
| sci_03 | Optimizer Comparison on Rosenbrock | ✅ PASS | 100% | 23.8s | 28 |
Overall: 12/14 PASS | Avg KW: 96.4% | Total benchmark time: 409s
The two failures (sci_01, sci_02) both scored 100% on keyword accuracy, which suggests the model identified the right concepts but produced code that failed on numerical precision or runtime assertion edge cases. This is consistent with an early 1000-step checkpoint; further training is expected to close these gaps.
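The summary line can be recomputed directly from the per-task rows. The snippet below copies the table values and derives the pass count, the keyword average, the total time, and the mean throughput; nothing here is new data.

```python
# (task id, passed, keyword %, seconds, tok/s) copied from the table above
RESULTS = [
    ("math_01", True, 50, 56.4, 39),
    ("math_02", True, 100, 16.2, 35),
    ("math_03", True, 100, 16.6, 34),
    ("math_04", True, 100, 25.7, 35),
    ("code_01", True, 100, 25.0, 35),
    ("code_02", True, 100, 38.9, 36),
    ("code_03", True, 100, 45.9, 35),
    ("code_04", True, 100, 38.0, 33),
    ("code_05", True, 100, 26.1, 34),
    ("logic_01", True, 100, 16.0, 36),
    ("logic_02", True, 100, 13.2, 34),
    ("sci_01", False, 100, 29.4, 33),
    ("sci_02", False, 100, 38.3, 33),
    ("sci_03", True, 100, 23.8, 28),
]

passed = sum(1 for r in RESULTS if r[1])
avg_kw = sum(r[2] for r in RESULTS) / len(RESULTS)
total_time = sum(r[3] for r in RESULTS)
avg_speed = sum(r[4] for r in RESULTS) / len(RESULTS)

print(f"{passed}/{len(RESULTS)} PASS, avg KW {avg_kw:.1f}%, "
      f"{total_time:.1f}s total, {avg_speed:.1f} tok/s mean")
```

The recomputed figures agree with the stated summary: 12 of 14 tasks pass, the keyword average is 96.4%, and the task times sum to 409.5 s. The single 50% keyword score (math_01) accounts for the whole gap from 100%.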
Inference Speed on M2 Ultra
| Metric | Value |
|---|---|
| Hardware | Apple M2 Ultra, 192 GB unified memory |
| Peak memory (inference) | ~88 GB |
| Throughput | 28–39 tok/s depending on context length |
| Benchmark average | ~34 tok/s |
Running a 120B MoE model at 34 tok/s entirely on a single Mac — no cloud, no quantization compromise in output quality — is the core reason to use this model.
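For planning purposes, the throughput figures translate into wall-clock time as follows. This is simple arithmetic that ignores prompt processing time; the token budgets are the max_new_tokens=2048 and --max-tokens 10000 values used in the examples earlier in this card.

```python
def generation_seconds(n_tokens: int, tok_per_s: float) -> float:
    """Time to generate n_tokens at a steady decode rate (prompt processing ignored)."""
    return n_tokens / tok_per_s

for n_tokens in (2048, 10000):        # token budgets from the earlier examples
    for rate in (28, 34, 39):         # low, average, and high throughput from the table
        print(n_tokens, rate, round(generation_seconds(n_tokens, rate), 1))
```

At the benchmark average, a full 2048-token response takes about a minute, so long reasoning traces are usable interactively but not instantaneous.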
More details about the test
Limitations
- Apple Silicon only: quantized and optimized for MLX. Not compatible with CUDA/ROCm without re-quantization.
- 1000-step checkpoint: an early fine-tune. Scientific computing tasks with strict numerical tolerances may still have occasional failures.
- No RLHF: trained on supervised distillation data only. Safety alignment is not as strict as commercial instruction-tuned models.
- Harmony prompt format required: this model expects the <|channel|> protocol. Standard ChatML or Alpaca-style prompts will produce degraded results.
Citation / Acknowledgements
- Base model: gpt-oss-120b-heretic
- Training dataset: Roman1111111/claude-sonnet-4.6-100000X-filtered
- Optimizer: BAdam: Block Coordinate Descent for Large Language Model Fine-tuning
- Framework: Apple MLX · mlx-tune
- Reasoning traces distilled from: Claude Sonnet 4.6 (Anthropic)