hmahadik committed on
Commit 2a22670 · verified · 1 Parent(s): eef4acc

v10: unified set_lights, named-args output, 6 tools


- Replace 3 LED tools (set_status_led / blink_status_led / set_neopixel_effect) with one hardware-agnostic set_lights(color?, effect?, state?). Dispatcher resolves to HAT 3-LED indicators or WLED strip/ring at runtime.
- Switch tool-call output to named-args format per Mercedes-Benz Octopus v2 (arXiv 2501.02342): <tool_0>(color="red", state="on")<end>. Optional args are omitted when the user didn't imply them.
- Tool surface: set_lights, play_buzzer, set_alarm, cancel_alarm, get_system_status, respond.
- Training: 5,222 train / 920 eval. Final eval loss 0.046, mean token acc 97.9%. Held-out smoke 35/36 (97.2%).
- On-device (Coralboard, 2-core A55, Q5_K_M): cold prefill 0.48 s, decode ~9.7 tok/s, end-to-end first turn ~3.4 s.

Files changed (6)
  1. .gitattributes +1 -0
  2. Modelfile +7 -12
  3. README.md +159 -125
  4. functiongemma-physical-ai-v10-Q5_K_M.gguf +3 -0
  5. token_map.json +17 -31
  6. tools.json +15 -71
.gitattributes CHANGED
@@ -43,3 +43,4 @@ functiongemma-physical-ai-v7-Q5_K_M.gguf filter=lfs diff=lfs merge=lfs -text
43
  moonshine/decoder.vmfb filter=lfs diff=lfs merge=lfs -text
44
  moonshine/decoder_with_past.vmfb filter=lfs diff=lfs merge=lfs -text
45
  functiongemma-physical-ai-v9-Q5_K_M.gguf filter=lfs diff=lfs merge=lfs -text
 
 
43
  moonshine/decoder.vmfb filter=lfs diff=lfs merge=lfs -text
44
  moonshine/decoder_with_past.vmfb filter=lfs diff=lfs merge=lfs -text
45
  functiongemma-physical-ai-v9-Q5_K_M.gguf filter=lfs diff=lfs merge=lfs -text
46
+ functiongemma-physical-ai-v10-Q5_K_M.gguf filter=lfs diff=lfs merge=lfs -text
Modelfile CHANGED
@@ -1,20 +1,15 @@
1
- # FunctionGemma 270M Physical AI v7, function-token format
2
- # Function tokens (<tool_N>) + <end> terminator. ~8-15 output tokens per call.
3
- # Optimized for CPU decode on small Cortex-A55 / similar edge targets.
4
-
5
- FROM ./functiongemma-physical-ai-v7-Q5_K_M.gguf
6
 
7
  PARAMETER temperature 0
8
  PARAMETER top_p 1
9
  PARAMETER num_ctx 1024
10
  PARAMETER num_predict 80
11
 
12
- # Stop on the turn-level markers ONLY, not on <end>. Multi-tool sequences
13
- # emit <tool_A>(args)<end><tool_B>(args)<end>, and stopping at the first
14
- # <end> truncates the second call. <end_of_turn> + <eos> are the right
15
- # stops for both single- and multi-tool output.
16
  PARAMETER stop "<end_of_turn>"
17
  PARAMETER stop "<eos>"
18
-
19
- # Use base model's chat template — training data is in messages+tools form,
20
- # the tokenizer's chat_template.jinja already handles it.
 
1
+ # Coral FunctionGemma v10: 6-tool schema, Octopus v2, named-args format
2
+ # set_lights unified (color/effect/state). respond = <tool_5>.
3
+ FROM ./functiongemma-physical-ai-v10-Q5_K_M.gguf
 
 
4
 
5
  PARAMETER temperature 0
6
  PARAMETER top_p 1
7
  PARAMETER num_ctx 1024
8
  PARAMETER num_predict 80
9
 
10
+ # Do NOT stop on <end>: that terminator marks the end of one tool call,
11
+ # but multi-tool sequences emit `<tool_A>(args)<end><tool_B>(args)<end>` and
12
+ # stopping at the first <end> truncates legitimate multi-tool output.
13
+ # <end_of_turn> + <eos> mark turn-level completion.
14
  PARAMETER stop "<end_of_turn>"
15
  PARAMETER stop "<eos>"
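
Callers that talk to Ollama's HTTP API without the shipped Modelfile can pass the same stop and decode settings per request. The following is a minimal sketch, assuming Ollama's `/api/generate` endpoint with its `raw` and `options` fields; `build_generate_payload` is a hypothetical helper name, not something this repo ships.

```python
import json

# Turn-level stop markers from the Modelfile. "<end>" is deliberately absent:
# it terminates a single tool call, not the whole turn.
STOP_TOKENS = ["<end_of_turn>", "<eos>"]


def build_generate_payload(user_text: str, model: str = "functiongemma-physical-ai") -> dict:
    """Build a request body for Ollama's /api/generate (hypothetical helper)."""
    prompt = f"<start_of_turn>user\n{user_text}<end_of_turn>\n<start_of_turn>model\n"
    return {
        "model": model,
        "prompt": prompt,
        "raw": True,      # the prompt is already formatted; skip the chat template
        "stream": False,
        "options": {
            "temperature": 0,
            "top_p": 1,
            "num_ctx": 1024,
            "num_predict": 80,
            "stop": STOP_TOKENS,
        },
    }


payload = build_generate_payload("turn the lights red")
print(json.dumps(payload, indent=2))
```

Keeping the stop list in one constant makes it harder to add `<end>` by accident in a single call site.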
 
 
 
README.md CHANGED
@@ -17,77 +17,85 @@ pipeline_tag: text-generation
17
  inference: false
18
  ---
19
 
20
- # FunctionGemma 270M — Physical AI (v9, Octopus v2)
21
 
22
  Fine-tuned [`google/functiongemma-270m-it`](https://huggingface.co/google/functiongemma-270m-it)
23
  for voice-controlled physical-AI / household-IoT actions on a Synaptics
24
  SL2619 "Coral" edge board (Google IO 2026 demo).
25
 
26
- | Revision | File | Tool count | Headline result |
27
- |----------|------|-----------:|-----------------|
28
- | **v9 (current)** | [`functiongemma-physical-ai-v9-Q5_K_M.gguf`](./functiongemma-physical-ai-v9-Q5_K_M.gguf) | 8 | 30/30 (100 %) routing on held-out smoke prompts; **0.55 s cold prefill** on the 2-core A55 (vs ~57 s for v7's schema-in-prompt build — **105× faster**). |
29
- | v7 (legacy) | [`functiongemma-physical-ai-v7-Q5_K_M.gguf`](./functiongemma-physical-ai-v7-Q5_K_M.gguf) | 10 | 86.8 % overall on a 250-row eval; schema-in-prompt build. |
30
- | v6 (legacy) | [`functiongemma-physical-ai-v6-Q5_K_M.gguf`](./functiongemma-physical-ai-v6-Q5_K_M.gguf) | 11 | Camera + vision dropped from earlier revs. Schema-in-prompt build. |
31
- | v4c (legacy) | [`functiongemma-physical-ai-Q4_K_M.gguf`](./functiongemma-physical-ai-Q4_K_M.gguf) | 13 | Earliest published checkpoint. |
32
 
33
- Schema ships as [`tools.json`](./tools.json) (8 tools, v9). Token-to-tool
34
- mapping is in [`token_map.json`](./token_map.json).
35
 
36
- ## What changed in v9
37
 
38
- v9 is a structural rewrite of the training pipeline, not just a dataset
39
- refresh. Earlier revisions used the upstream FunctionGemma prompt format,
40
- which injects the full tool schema (~1088 tokens) into every request as a
41
- `<start_function_declaration>` developer turn. On a 2-core Cortex-A55 that
42
- prefill cost ~57 s on the first turn, which is incompatible with a sub-second
43
- voice UX.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
44
 
45
- v9 follows [Octopus v2](https://arxiv.org/abs/2404.01744) end to end:
 
 
 
46
 
47
- | | Pre-v9 (schema-in-prompt) | v9 (Octopus v2) |
48
- |---|---|---|
49
- | Prompt format | `<start_of_turn>developer\n<start_function_declaration>...{schema}...<end_function_declaration>\n<end_of_turn>\n<start_of_turn>user\n{q}<end_of_turn>\n<start_of_turn>model\n` | `<start_of_turn>user\n{q}<end_of_turn>\n<start_of_turn>model\n` |
50
- | Tokens per prompt | ~1088 | ~13 |
51
- | Cold prefill on SL2619 (2-core A55) | ~57 s | **~0.55 s** |
52
- | Tool routing | learned from in-context schema | learned from `<tool_N>` token weights |
53
- | Training data shape | `{tools, messages: [dev, user, asst]}` with schema embedded | `{input, output, split}` — flat |
54
 
55
- The schema in `tools.json` is still the source of truth for dispatcher
56
- arg validation and is embedded in the GGUF metadata for schema-drift
57
- checks, but it is **not** loaded into the inference prompt.
 
 
 
58
 
59
- ## Tool surface (v9, 8 tools)
60
 
61
- | Token | Name | Args | Purpose |
62
- |---|---|---|---|
63
- | `<tool_0>` | `set_status_led` | `led`, `state`, `brightness?` | On/off one or all HAT status LEDs |
64
- | `<tool_1>` | `blink_status_led` | `led`, `count?`, `speed?` | Discrete blink |
65
- | `<tool_2>` | `set_neopixel_effect` | `effect`, `color?`, `palette?`, `speed?`, `intensity?` | Animated effect on the ring |
66
- | `<tool_3>` | `play_buzzer` | `pattern` | `beep`, `double_beep`, `chirp`, `siren`, `alarm`, `success`, `error` |
67
- | `<tool_4>` | `set_alarm` | `duration` or `time`, `label?` | Timer |
68
- | `<tool_5>` | `cancel_alarm` | `label?` | Cancel one or all |
69
- | `<tool_6>` | `get_system_status` | `metric` | `cpu`, `memory`, `temperature`, `npu`, or `all` |
70
- | `<tool_7>` | `respond` | `message` | Natural-language fallback when no tool fits |
71
-
72
- Surface routing keyword: `set_neopixel_effect` requires the literal
73
- substring `neopixels` in the user input. LED-vs-ring ambiguous prompts
74
- ("turn off the lights") route to `respond()` asking the user to
75
- disambiguate.
76
-
77
- ## Output format: functional tokens
78
-
79
- Tool calls emit as **functional tokens**: each tool name compiles to a
80
- single special-vocabulary token (`<tool_0>` … `<tool_7>`) plus a single
81
- `<end>` terminator. A complete call decodes in 8–15 output tokens, vs
82
- ~30–80 for the upstream native FunctionGemma
83
- `<start_function_call>call:NAME{...}<end_function_call>` syntax. On a
84
- 2-core Cortex-A55 this is the difference between sub-second and 2–5 s
85
- voice-UX latency.
86
-
87
- Sample output: `<tool_0>("red","on")<end>` for `set_status_led(led="red", state="on")`.
88
-
89
- > ⚠️ Inference servers MUST stop generation on `<end_of_turn>` (or `<eos>`),
90
- > NOT on `<end>`. The v9 model can emit multi-tool sequences
91
  > `<tool_A>(args)<end><tool_B>(args)<end>`, so stopping at the first
92
  > `<end>` truncates legitimate multi-tool output.
93
 
@@ -95,20 +103,21 @@ Sample output: `<tool_0>("red","on")<end>` for `set_status_led(led="red", state=
95
 
96
  ```bash
97
  hf download BrinqAI/functiongemma-270m-physical-ai \
98
- functiongemma-physical-ai-v9-Q5_K_M.gguf Modelfile tools.json token_map.json \
99
  --local-dir ./fg-physical-ai
100
 
101
  cd fg-physical-ai
102
  ollama create functiongemma-physical-ai -f Modelfile
103
  ```
104
 
105
- The shipped `Modelfile` bakes in the stop tokens (`<end_of_turn>`, `<eos>`)
106
- and decode parameters (`temperature=0`, `num_ctx=1024`, `num_predict=80`).
 
107
 
108
  ## Calling the model
109
 
110
- The v9 model expects a **bare user turn** — no schema, no tools list. Send
111
- to Ollama with `raw=true`:
112
 
113
  ```python
114
  import json
@@ -120,6 +129,8 @@ MODEL = "functiongemma-physical-ai"
120
 
121
  reverse_token_map = json.load(open("token_map.json"))["reverse"]
122
 
 
 
123
 
124
  def build_prompt(user_text: str) -> str:
125
  return (
@@ -150,18 +161,19 @@ def call_model(user_text: str) -> str:
150
  return json.loads(resp.read())["response"]
151
 
152
 
153
- def parse_call(raw: str) -> tuple[str | None, str]:
154
- """Return (tool_name, raw_args_string). tool_name is None on parse fail."""
155
  m = re.match(r"\s*(<tool_\d+>)\((.*?)\)<end>", raw)
156
  if not m:
157
- return None, ""
158
- tok, args = m.group(1), m.group(2)
159
- return reverse_token_map.get(tok), args
 
160
 
161
 
162
- raw = call_model("turn the red LED on")
163
- print(raw) # e.g. '<tool_0>("red","on")<end>'
164
- print(parse_call(raw)) # ('set_status_led', '"red","on"')
165
  ```
166
 
167
  For `llama-cpp-python` directly, use `detokenize(..., special=True)` so
@@ -170,43 +182,46 @@ stripped.
170
 
171
  ## Training data
172
 
173
- v9's training data was generated from Haiku-authored phrasing templates
174
  crossed with deterministic entity pools, then lightly augmented with
175
  Moonshine-flavored ASR noise (dropped function words, lowercased traces,
176
- filler-word prepends). The shape matches Brinq's SmartPanel v15 trainer:
177
- flat `{input, output, split}` records, no tools / messages array.
178
 
179
- | | v9 |
180
  |---|---|
181
- | Train rows | 6,127 |
182
- | Eval rows | 1,339 |
183
- | Tools | 8 |
184
- | Multi-tool fraction | low (single-tool emphasis; multi-tool routines composed at dispatch time) |
185
- | Augmentation | Moonshine-sim noise on ~30 % of records |
186
-
187
- Per-tool train counts (range 217–1,199; cancel_alarm + play_buzzer are the
188
- narrowest classes because their natural phrasing variation is smaller —
189
- not a coverage gap).
 
 
 
190
 
191
  ## Methodology
192
 
193
- Direct port of the SmartPanel v15 trainer:
194
-
195
  - **Full bf16 fine-tune** (no LoRA).
196
- - **Functional tokens**: `<tool_0>` … `<tool_7>` + `<end>` added as
197
- `additional_special_tokens`; new embeddings **mean-initialized** from the
198
- existing input embedding matrix (random init under-converges on small
199
- datasets).
200
  - **Completion-only loss mask**: hand-rolled — labels before
201
- `<start_of_turn>model\n` are masked to `-100`. The model learns only from
202
- the assistant turn, not the user prompt.
203
- - **5 epochs**, lr `3e-5`, cosine schedule, 0.1 warmup, weight decay 0.01.
204
- - **Effective batch = 16** (`per_device_train_batch_size=8 ×
205
- gradient_accumulation_steps=2`).
206
- - **`max_length=256`**: the trained prompt format is ~13 tokens and the
207
- assistant turn fits comfortably under 64 tokens, including respond()
208
- messages.
209
- - bf16, gradient checkpointing, `adamw_torch_fused`, `metric_for_best_model="eval_loss"` + `load_best_model_at_end=True`.
 
 
210
  - Training wallclock: **5 min on a single H100** (~15–20 min on a 4090).
211
 
212
  ### Citation
@@ -219,55 +234,73 @@ Direct port of the SmartPanel v15 trainer:
219
  year = {2024},
220
  url = {https://arxiv.org/abs/2404.01744}
221
  }
 
 
 
 
 
 
 
222
  ```
223
 
224
  ## Results
225
 
226
  ### Training metrics (final epoch)
227
 
228
- | | v9 |
229
  |---|---|
230
- | Final train loss | 0.555 |
231
- | Final eval loss | **0.090** |
232
  | Mean token accuracy (eval) | **97.9 %** |
233
 
234
- ### Held-out smoke test (post-train, 30 prompts spanning all 8 tools)
235
 
236
- | | v9 |
237
  |---|---|
238
- | Smoke-test routing accuracy | **30 / 30 (100 %)** |
239
 
240
- The 30-prompt suite covers single-tool happy paths for every tool plus
241
- the failure modes that broke v8: ambiguous LED prompts ("turn off the
242
- lights"), effect-name without `neopixels` keyword ("do the aurora"),
243
- unsupported features ("play a tone at 2000 hz"), and out-of-scope
244
- appliances ("turn on the TV"). All 8 of those route to `respond()` with a
245
- helpful explanation.
246
 
247
  ### On-device benchmark (Coralboard, 2-core Cortex-A55 @ 2 GHz, Q5_K_M GGUF)
248
 
249
- | | v7 (schema-in-prompt, 10 tools) | v9 (Octopus v2, 8 tools) |
250
- |---|---|---|
251
- | Prompt tokens | ~1088 | ~13 |
252
- | **Cold prefill (turn 1)** | **57.3 s** | **0.55 s** (105× faster) |
253
- | Warm prefill (turn 2+) | ~3 s | ~0.4 s |
254
- | Decode for a typical call | 0.5–1.2 s | 0.5–1.2 s |
255
- | End-to-end first-turn (model load 2.3 s + prefill + decode) | ~62 s | ~3 s |
256
- | Routing on a 29-prompt board bench | n/a (not directly comparable) | **29 / 29 (100 %)** |
 
 
 
 
 
 
257
 
258
  ## Files
259
 
260
  ```
261
- functiongemma-physical-ai-v9-Q5_K_M.gguf # ~248 MB, v9 GGUF Q5_K_M weights (Ollama / llama.cpp)
262
  Modelfile # Ollama Modelfile (functional-token format)
263
- tools.json # 8-tool schema (v9, canonical mobile-actions format)
264
- token_map.json # functional-token <-> tool-name map (v9)
265
  README.md # this file
266
  ```
267
 
268
- Legacy v6/v7 GGUFs are kept in repo history for reproducibility but should
269
- not be used for new deployments — they require the schema-in-prompt
270
- inference wrapper and pay the ~57 s cold-prefill cost.
 
 
 
 
 
271
 
272
  ## License
273
 
@@ -279,6 +312,7 @@ By using this model you agree to those terms. Base model:
279
 
280
  - Base model: <https://huggingface.co/google/functiongemma-270m-it>
281
  - Octopus v2 paper: <https://arxiv.org/abs/2404.01744>
 
282
  - Hardware demo + integration code (Synaptics Coralboard, Grinn HAT,
283
  WLED-over-USB-CDC, full PyQt UI):
284
  <https://github.com/synaptics-astra-demos/sl2610-examples> →
 
17
  inference: false
18
  ---
19
 
20
+ # FunctionGemma 270M — Physical AI (v10, Octopus v2)
21
 
22
  Fine-tuned [`google/functiongemma-270m-it`](https://huggingface.co/google/functiongemma-270m-it)
23
  for voice-controlled physical-AI / household-IoT actions on a Synaptics
24
  SL2619 "Coral" edge board (Google IO 2026 demo).
25
 
26
+ **Current revision:** [`functiongemma-physical-ai-v10-Q5_K_M.gguf`](./functiongemma-physical-ai-v10-Q5_K_M.gguf)
27
+ — 6 tools, ~248 MB Q5_K_M, ~0.48 s cold prefill on the 2-core
28
+ Cortex-A55, 97.9 % mean token accuracy on eval.
 
 
 
29
 
30
+ Schema ships as [`tools.json`](./tools.json). Token-to-tool mapping is
31
+ in [`token_map.json`](./token_map.json).
32
 
33
+ ## Tool surface (6 tools)
34
 
35
+ | Token | Name | Args | Purpose |
36
+ |---|---|---|---|
37
+ | `<tool_0>` | `set_lights` | `color?`, `effect?`, `state?` | Drive whatever lights are connected — HAT 3-LED indicators or a WLED-driven addressable strip / ring. All three args optional; the model emits only what the user implied. |
38
+ | `<tool_1>` | `play_buzzer` | `pattern` | Named pattern on the piezo buzzer: `beep`, `double_beep`, `chirp`, `siren`, `alarm`, `success`, `error`. |
39
+ | `<tool_2>` | `set_alarm` | `duration` or `time`, `label?` | Schedule an alarm. Fires the buzzer plus a visible flash. |
40
+ | `<tool_3>` | `cancel_alarm` | `label?` | Cancel one alarm by label, or all if no label given. |
41
+ | `<tool_4>` | `get_system_status` | `metric` | `cpu`, `memory`, `temperature`, `npu`, or `all`. |
42
+ | `<tool_5>` | `respond` | `message` | Natural-language reply when no physical-action tool fits, or when the request is ambiguous and the model needs to ask for clarification. |
43
+
44
+ The model is **hardware-agnostic** for lighting: it parses user intent
45
+ into semantic args (`color`, `effect`, `state`) and leaves the dispatcher
46
+ to map those onto whatever LED hardware is detected at launch — the
47
+ HAT's three indicator LEDs, a WLED-driven strip, or a Neopixel ring. The
48
+ user vocabulary is hardware-agnostic too: "lights", "LEDs", "strip",
49
+ "indicators" all refer to whatever is wired up.
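
The dispatcher itself is not part of this commit, so the following is only a sketch of how semantic args might be resolved onto a detected backend. `RecordingBackend`, its `supports_effects` flag, and the "effect degrades to a solid color on indicator LEDs" fallback are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingBackend:
    """Stand-in for a real LED driver; it only records the commands it receives."""
    name: str
    supports_effects: bool
    log: list = field(default_factory=list)

    def apply(self, color, effect, state):
        self.log.append({"color": color, "effect": effect, "state": state})


def set_lights(backend, color=None, effect=None, state=None):
    """Resolve the three optional semantic args onto whichever backend was detected."""
    if state == "off":
        backend.apply(None, None, "off")
        return
    # Assumed fallback: indicator LEDs cannot animate, so an effect request
    # degrades to a solid color instead of failing.
    if effect is not None and not backend.supports_effects:
        effect = None
    backend.apply(color, effect, "on")


hat = RecordingBackend("hat_leds", supports_effects=False)
strip = RecordingBackend("wled_strip", supports_effects=True)

set_lights(hat, color="red", effect="sparkle")
set_lights(strip, color="red", effect="sparkle")
set_lights(strip, state="off")
print(hat.log)
print(strip.log)
```

The point of the shape is that the model's output never names a device: the same `set_lights(color="red", effect="sparkle")` call is valid regardless of which backend receives it.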
50
+
51
+ ## Prompt format
52
+
53
+ The v10 model is trained
54
+ [Octopus v2](https://arxiv.org/abs/2404.01744) style: no schema, no
55
+ tools list, just a bare user turn.
56
 
57
+ ```
58
+ <start_of_turn>user
59
+ {user_text}<end_of_turn>
60
+ <start_of_turn>model
61
 
62
+ ```
 
 
 
 
 
 
63
 
64
+ Tool semantics live in the model weights (via the special functional
65
+ tokens `<tool_0>` … `<tool_5>` plus `<end>`), not in the prompt. The
66
+ `tools.json` schema in this repo is the dispatcher's arg-validation
67
+ contract and is embedded in the GGUF metadata for schema-drift checks,
68
+ but it is **not** loaded into the inference prompt. Typical prompts are
69
+ ~13 tokens.
70
 
71
+ ## Output format — functional tokens, named args
72
 
73
+ Tool calls emit as **functional tokens with named arguments**, per the
74
+ Mercedes-Benz Octopus v2 convention
75
+ ([arXiv 2501.02342](https://arxiv.org/abs/2501.02342)). Each tool name
76
+ compiles to a single special-vocabulary token (`<tool_0>` … `<tool_5>`);
77
+ arguments are written as `name="value"` pairs; a single `<end>` token
78
+ terminates the call. The model emits **only the args the user implied**;
79
+ absent args are simply not present.
80
+
81
+ Examples:
82
+
83
+ | User says | Model emits | Resolves to |
84
+ |---|---|---|
85
+ | `turn the lights red` | `<tool_0>(color="red")<end>` | `set_lights(color="red")` |
86
+ | `rainbow on the strip` | `<tool_0>(effect="rainbow")<end>` | `set_lights(effect="rainbow")` |
87
+ | `lights off` | `<tool_0>(state="off")<end>` | `set_lights(state="off")` |
88
+ | `red sparkle` | `<tool_0>(color="red", effect="sparkle")<end>` | `set_lights(color="red", effect="sparkle")` |
89
+ | `set an alarm in 5 minutes` | `<tool_2>(duration="5 minutes")<end>` | `set_alarm(duration="5 minutes")` |
90
+ | `cancel all alarms` | `<tool_3>()<end>` | `cancel_alarm()` |
91
+ | `what's the cpu` | `<tool_4>(metric="cpu")<end>` | `get_system_status(metric="cpu")` |
92
+ | `good morning` | `<tool_5>(message="Good morning. ...")<end>` | `respond(message="...")` |
93
+
94
+ A complete call decodes in roughly 8–20 output tokens, well inside the
95
+ sub-second voice-UX budget on a 2-core Cortex-A55.
96
+
97
+ > ⚠️ Inference servers MUST stop generation on `<end_of_turn>` (or
98
+ > `<eos>`), NOT on `<end>`. The model can emit multi-tool sequences
 
 
 
 
99
  > `<tool_A>(args)<end><tool_B>(args)<end>`, so stopping at the first
100
  > `<end>` truncates legitimate multi-tool output.
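
The truncation risk described in the warning can be illustrated with a parser that collects every call in a turn. This is a sketch rather than the repo's parser: `parse_calls` is a hypothetical name, `REVERSE` inlines two entries from `token_map.json`, and the truncated string assumes the server strips the stop sequence from its output.

```python
from __future__ import annotations

import re

CALL_RE = re.compile(r"(<tool_\d+>)\((.*?)\)<end>", re.DOTALL)
NAMED_ARG_RE = re.compile(r'(\w+)\s*=\s*"((?:[^"\\]|\\.)*)"')

# Subset of token_map.json's "reverse" table, inlined to keep the sketch self-contained.
REVERSE = {"<tool_0>": "set_lights", "<tool_1>": "play_buzzer"}


def parse_calls(raw: str) -> list[tuple[str | None, dict[str, str]]]:
    """Return every (tool_name, kwargs) pair found in one model turn."""
    calls = []
    for m in CALL_RE.finditer(raw):
        kwargs = dict(NAMED_ARG_RE.findall(m.group(2)))
        calls.append((REVERSE.get(m.group(1)), kwargs))
    return calls


full = '<tool_0>(color="red")<end><tool_1>(pattern="beep")<end>'
# What a server that stops on "<end>" would hand back (stop text stripped).
truncated = full.split("<end>")[0]

print(parse_calls(full))       # both calls are recovered
print(parse_calls(truncated))  # nothing parses: the terminator is gone
```

Under that assumption, stopping on `<end>` loses the second call and also leaves the first one without its terminator, so even a single-call parser would reject it.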
101
 
 
103
 
104
  ```bash
105
  hf download BrinqAI/functiongemma-270m-physical-ai \
106
+ functiongemma-physical-ai-v10-Q5_K_M.gguf Modelfile tools.json token_map.json \
107
  --local-dir ./fg-physical-ai
108
 
109
  cd fg-physical-ai
110
  ollama create functiongemma-physical-ai -f Modelfile
111
  ```
112
 
113
+ The shipped `Modelfile` bakes in the stop tokens (`<end_of_turn>`,
114
+ `<eos>`) and decode parameters (`temperature=0`, `num_ctx=1024`,
115
+ `num_predict=80`).
116
 
117
  ## Calling the model
118
 
119
+ Send a **bare user turn** — no schema, no tools list. With Ollama, use
120
+ `raw=true`:
121
 
122
  ```python
123
  import json
 
129
 
130
  reverse_token_map = json.load(open("token_map.json"))["reverse"]
131
 
132
+ NAMED_ARG_RE = re.compile(r'(\w+)\s*=\s*"((?:[^"\\]|\\.)*)"')
133
+
134
 
135
  def build_prompt(user_text: str) -> str:
136
  return (
 
161
  return json.loads(resp.read())["response"]
162
 
163
 
164
+ def parse_call(raw: str) -> tuple[str | None, dict[str, str]]:
165
+ """Return (tool_name, kwargs). tool_name is None on parse fail."""
166
  m = re.match(r"\s*(<tool_\d+>)\((.*?)\)<end>", raw)
167
  if not m:
168
+ return None, {}
169
+ tok, body = m.group(1), m.group(2)
170
+ kwargs = {k: v for k, v in NAMED_ARG_RE.findall(body)}
171
+ return reverse_token_map.get(tok), kwargs
172
 
173
 
174
+ raw = call_model("turn the lights red")
175
+ print(raw) # e.g. '<tool_0>(color="red")<end>'
176
+ print(parse_call(raw)) # ('set_lights', {'color': 'red'})
177
  ```
178
 
179
  For `llama-cpp-python` directly, use `detokenize(..., special=True)` so
 
182
 
183
  ## Training data
184
 
185
+ Training data was generated from Haiku-authored phrasing templates
186
  crossed with deterministic entity pools, then lightly augmented with
187
  Moonshine-flavored ASR noise (dropped function words, lowercased traces,
188
+ filler-word prepends). Each record is a flat `{input, output}` pair —
189
+ no tools / messages array, no chat template.
190
 
191
+ | | |
192
  |---|---|
193
+ | Train rows | 5,222 |
194
+ | Eval rows | 920 |
195
+ | Tools | 6 |
196
+ | Per-template entity expansion | color × effect × state pools for `set_lights`; pattern pool for `play_buzzer`; duration / time pools for `set_alarm`; metric pool for `get_system_status` |
197
+ | ASR-style augmentation | Moonshine-sim noise on a fraction of records (dropped articles, lowercased traces, filler prepends) |
198
+ | Multi-tool fraction | None — single-tool emphasis; multi-tool routines composed at dispatch time |
199
+
200
+ The `set_lights` tool also gets explicit **failure-mode rows** that
201
+ route bare ambiguous prompts to `respond()` — e.g. "rainbow" alone
202
+ ("Did you mean the lights? Try 'rainbow on the lights'."), "siren" alone
203
+ (prompts the user toward `play_buzzer`), and bare "on" / "off"
204
+ (asks what the user wants to act on).
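
The template-times-pool generation described above can be sketched as follows. The templates, the color pool, and the `drop_articles` perturbation are hypothetical stand-ins; the actual templates and noise model are not included in this commit.

```python
from itertools import product

# Hypothetical phrasing templates paired with their target output.
TEMPLATES = [
    ("turn the lights {color}", '<tool_0>(color="{color}")<end>'),
    ("make the strip {color}", '<tool_0>(color="{color}")<end>'),
]
COLORS = ["red", "green", "blue"]


def expand(templates, colors):
    """Cross every phrasing template with every entity to get flat records."""
    return [
        {"input": src.format(color=c), "output": dst.format(color=c)}
        for (src, dst), c in product(templates, colors)
    ]


def drop_articles(text: str) -> str:
    """One ASR-style perturbation: drop articles, as a noisy transcript might."""
    return " ".join(w for w in text.split() if w not in {"the", "a", "an"})


records = expand(TEMPLATES, COLORS)
# Noise touches only the input; the target output stays clean.
noisy = [{**r, "input": drop_articles(r["input"])} for r in records]
print(len(records), records[0], noisy[0])
```

Because the entity pools are deterministic, the same templates always expand to the same records, which keeps train and eval splits reproducible.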
205
 
206
  ## Methodology
207
 
 
 
208
  - **Full bf16 fine-tune** (no LoRA).
209
+ - **Functional tokens**: `<tool_0>` … `<tool_5>` + `<end>` added as
210
+ `additional_special_tokens`; new embeddings **mean-initialized** from
211
+ the existing input-embedding matrix (random init under-converges on
212
+ small datasets at this scale).
213
  - **Completion-only loss mask**: hand-rolled — labels before
214
+ `<start_of_turn>model\n` are masked to `-100`. The model learns only
215
+ from the assistant turn, not the user prompt.
216
+ - **5 epochs**, lr `3e-5`, cosine schedule, 0.1 warmup, weight decay
217
+ 0.01.
218
+ - **Effective batch = 16**
219
+ (`per_device_train_batch_size=8 × gradient_accumulation_steps=2`).
220
+ - **`max_length=256`**: the trained prompt format is ~13 tokens and
221
+ the assistant turn fits comfortably under 64 tokens, including
222
+ `respond()` messages.
223
+ - bf16, gradient checkpointing, `adamw_torch_fused`,
224
+ `metric_for_best_model="eval_loss"` + `load_best_model_at_end=True`.
225
  - Training wallclock: **5 min on a single H100** (~15–20 min on a 4090).
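
The completion-only loss mask from the list above can be sketched in plain Python. The trainer's actual code is not shown in this commit, so the function names and toy token ids here are illustrative, and the sketch assumes the marker tokens themselves are masked along with the user turn. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def find_after(haystack: list, needle: list) -> int:
    """Return the index just past the first occurrence of needle, or -1."""
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i + n
    return -1


def completion_only_labels(input_ids: list, marker_ids: list) -> list:
    """Copy input_ids to labels, masking everything up to and including the marker."""
    start = find_after(input_ids, marker_ids)
    if start < 0:
        # Marker missing: mask the whole row so it contributes no loss.
        return [IGNORE_INDEX] * len(input_ids)
    return [IGNORE_INDEX] * start + list(input_ids[start:])


# Toy ids: 1, 2 = user turn; 90, 91 = "<start_of_turn>model\n"; 7, 8, 9 = assistant turn.
ids = [1, 2, 90, 91, 7, 8, 9]
print(completion_only_labels(ids, [90, 91]))  # [-100, -100, -100, -100, 7, 8, 9]
```

Masking a row entirely when the marker is missing is a conservative choice: a malformed record then cannot teach the model to reproduce user text.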
226
 
227
  ### Citation
 
234
  year = {2024},
235
  url = {https://arxiv.org/abs/2404.01744}
236
  }
237
+
238
+ @article{merc2025octopusv2,
239
+ title = {Octopus v2 named-arg function calling},
240
+ journal = {arXiv preprint arXiv:2501.02342},
241
+ year = {2025},
242
+ url = {https://arxiv.org/abs/2501.02342}
243
+ }
244
  ```
245
 
246
  ## Results
247
 
248
  ### Training metrics (final epoch)
249
 
250
+ | | |
251
  |---|---|
252
+ | Final train loss | 0.493 |
253
+ | Final eval loss | **0.046** |
254
  | Mean token accuracy (eval) | **97.9 %** |
255
 
256
+ ### Held-out smoke test (post-train, 36 prompts spanning all 6 tools)
257
 
258
+ | | |
259
  |---|---|
260
+ | Smoke-test routing accuracy | **35 / 36 (97.2 %)** |
261
 
262
+ The 36-prompt suite covers single-tool happy paths for every tool plus
263
+ failure modes the model is expected to deflect: ambiguous color words
264
+ without a target ("make it red"), effect names without a target
265
+ ("rainbow"), unsupported features ("play a tone at 2000 hz"), and
266
+ out-of-scope appliances. Failure-mode prompts all route to `respond()`
267
+ with a helpful clarification message.
268
 
269
  ### On-device benchmark (Coralboard, 2-core Cortex-A55 @ 2 GHz, Q5_K_M GGUF)
270
 
271
+ Measured with `llama-cpp-python` 0.3.16, `n_ctx=1024`, `n_threads=2`,
272
+ CPU governor `performance`, 8 representative prompts spanning all 6
273
+ tools.
274
+
275
+ | | |
276
+ |---|---|
277
+ | Model load | 2.23 s |
278
+ | Prompt tokens | 11–16 (mean ~13) |
279
+ | **Cold prefill (turn 1)** | **0.48 s** |
280
+ | Warm prefill (turn 2+, avg) | 0.47 s |
281
+ | Decode rate | **~9.7 tok/s** |
282
+ | Decode time, typical tool call (3–8 output tokens) | 0.3–0.8 s |
283
+ | Decode time, `respond()` (~25 output tokens) | ~2.6 s |
284
+ | End-to-end first turn (model load + prefill + decode) | ~3.4 s |
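
The rows of the benchmark table are related by simple arithmetic, which the sketch below reproduces from the measured load time, cold prefill, and decode rate. The token counts passed in are illustrative inputs, not additional measurements.

```python
MODEL_LOAD_S = 2.23      # from the table
COLD_PREFILL_S = 0.48    # from the table
DECODE_TOK_PER_S = 9.7   # from the table


def decode_seconds(output_tokens: int) -> float:
    """Decode time at the measured steady-state rate."""
    return output_tokens / DECODE_TOK_PER_S


def first_turn_seconds(output_tokens: int) -> float:
    """Model load + cold prefill + decode for the first turn."""
    return MODEL_LOAD_S + COLD_PREFILL_S + decode_seconds(output_tokens)


for n in (3, 7, 8, 25):
    print(n, round(decode_seconds(n), 2), round(first_turn_seconds(n), 2))
```

At 9.7 tok/s, 3 to 8 output tokens take about 0.3 to 0.8 s and 25 tokens take about 2.6 s, matching the decode rows. A first turn of about 3.4 s corresponds to roughly 7 output tokens.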
285
 
286
  ## Files
287
 
288
  ```
289
+ functiongemma-physical-ai-v10-Q5_K_M.gguf # ~248 MB, Q5_K_M weights (Ollama / llama.cpp)
290
  Modelfile # Ollama Modelfile (functional-token format)
291
+ tools.json # 6-tool schema, canonical mobile-actions format
292
+ token_map.json # functional-token <-> tool-name map
293
  README.md # this file
294
  ```
295
 
296
+ Earlier checkpoint GGUFs from the project's development history
297
+ (`functiongemma-physical-ai-v9-Q5_K_M.gguf`,
298
+ `functiongemma-physical-ai-v7-Q5_K_M.gguf`,
299
+ `functiongemma-physical-ai-v6-Q5_K_M.gguf`,
300
+ `functiongemma-physical-ai-Q4_K_M.gguf`) remain in the repo for
301
+ reproducibility. They use different tool surfaces and (for v7 and
302
+ earlier) a different inference-prompt format; new deployments should use
303
+ the v10 file above.
304
 
305
  ## License
306
 
 
312
 
313
  - Base model: <https://huggingface.co/google/functiongemma-270m-it>
314
  - Octopus v2 paper: <https://arxiv.org/abs/2404.01744>
315
+ - Mercedes-Benz Octopus v2 (named-arg variant): <https://arxiv.org/abs/2501.02342>
316
  - Hardware demo + integration code (Synaptics Coralboard, Grinn HAT,
317
  WLED-over-USB-CDC, full PyQt UI):
318
  <https://github.com/synaptics-astra-demos/sl2610-examples> →
functiongemma-physical-ai-v10-Q5_K_M.gguf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:44eaf599b448df2a72d9384a8476ead654383bdb109f05c963ad8d1c06fce26c
3
+ size 260045600
token_map.json CHANGED
@@ -1,38 +1,24 @@
1
  {
2
- "version": "0.4.0",
3
- "description": "Compressed token map for FunctionGemma CPU inference on SL2619. v9: 8 tools (set_status_led, blink_status_led, set_neopixel_effect, play_buzzer, set_alarm, cancel_alarm, get_system_status, respond) + <end> terminator. Trained Octopus v2 style: functional tokens are the entire output vocabulary the model uses for routing; the tool schema (tools.json) is NOT loaded into the inference prompt. v9 drops the unused <tool_none> sentinel that v8's training pipeline reserved but never emitted.",
4
  "tokens": {
5
- "set_status_led": "<tool_0>",
6
- "blink_status_led": "<tool_1>",
7
- "set_neopixel_effect": "<tool_2>",
8
- "play_buzzer": "<tool_3>",
9
- "set_alarm": "<tool_4>",
10
- "cancel_alarm": "<tool_5>",
11
- "get_system_status": "<tool_6>",
12
- "respond": "<tool_7>"
13
  },
14
  "reverse": {
15
- "<tool_0>": "set_status_led",
16
- "<tool_1>": "blink_status_led",
17
- "<tool_2>": "set_neopixel_effect",
18
- "<tool_3>": "play_buzzer",
19
- "<tool_4>": "set_alarm",
20
- "<tool_5>": "cancel_alarm",
21
- "<tool_6>": "get_system_status",
22
- "<tool_7>": "respond"
23
  },
24
- "special_tokens": [
25
- "<tool_0>",
26
- "<tool_1>",
27
- "<tool_2>",
28
- "<tool_3>",
29
- "<tool_4>",
30
- "<tool_5>",
31
- "<tool_6>",
32
- "<tool_7>",
33
- "<end>"
34
- ],
35
- "output_format": "<tool_N>(\"arg1\",\"arg2\",...)<end>",
36
  "prompt_format": "<start_of_turn>user\\n{user_text}<end_of_turn>\\n<start_of_turn>model\\n",
37
- "notes": "Argument order is positional, per the canonical schema's required-first then optional declaration order. v9 trains Octopus v2 pure (no schema in prompt); see prompt_format. set_neopixel_effect routing keyword: 'neopixels' (literal substring, case-insensitive) required in user prompt; otherwise the model routes to respond() asking the user to disambiguate (HAT status LEDs vs. neopixel ring)."
38
  }
 
1
  {
2
+ "version": "0.5.0-draft",
3
+ "description": "DRAFT v10 compact token map. 6 tools + <end> terminator. Single unified set_lights tool with semantic named args (Mercedes-Benz Octopus v2 convention, arXiv 2501.02342). The model is hardware-agnostic; the dispatcher maps semantic args to whatever LED hardware is detected at launch.",
4
  "tokens": {
5
+ "set_lights": "<tool_0>",
6
+ "play_buzzer": "<tool_1>",
7
+ "set_alarm": "<tool_2>",
8
+ "cancel_alarm": "<tool_3>",
9
+ "get_system_status": "<tool_4>",
10
+ "respond": "<tool_5>"
 
 
11
  },
12
  "reverse": {
13
+ "<tool_0>": "set_lights",
14
+ "<tool_1>": "play_buzzer",
15
+ "<tool_2>": "set_alarm",
16
+ "<tool_3>": "cancel_alarm",
17
+ "<tool_4>": "get_system_status",
18
+ "<tool_5>": "respond"
 
 
19
  },
20
+ "special_tokens": ["<tool_0>", "<tool_1>", "<tool_2>", "<tool_3>", "<tool_4>", "<tool_5>", "<end>"],
21
+ "output_format": "<tool_N>(name1=\"value1\", name2=value2, ...)<end>",
 
 
 
 
 
 
 
 
 
 
22
  "prompt_format": "<start_of_turn>user\\n{user_text}<end_of_turn>\\n<start_of_turn>model\\n",
23
+ "notes": "v10 uses NAMED arguments per Mercedes-Benz Octopus v2 (arXiv 2501.02342), the published Octopus-v2 reference that demonstrates production multi-arg tool calls with optional args absent. The model emits only the args the user implied; absent args are simply not emitted. set_lights kept to 3 optional args (color/effect/state) for robustness on the 270M; brightness/count/speed were dropped after analysis; the dispatcher uses sensible defaults (full brightness, normal speed, ~3 repetitions for blink). Backwards-compat positional parsing is retained in compact_codec.py so the parser still handles v9 GGUF output during the transition."
24
  }
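
Since `tokens`, `reverse`, and `special_tokens` carry the same information three times, a small consistency check can catch drift between them. This is a sketch with a hypothetical `check_token_map` helper; the map below inlines the v10 values from the diff above.

```python
import json

TOKEN_MAP = json.loads("""
{
  "tokens": {
    "set_lights": "<tool_0>", "play_buzzer": "<tool_1>", "set_alarm": "<tool_2>",
    "cancel_alarm": "<tool_3>", "get_system_status": "<tool_4>", "respond": "<tool_5>"
  },
  "reverse": {
    "<tool_0>": "set_lights", "<tool_1>": "play_buzzer", "<tool_2>": "set_alarm",
    "<tool_3>": "cancel_alarm", "<tool_4>": "get_system_status", "<tool_5>": "respond"
  },
  "special_tokens": ["<tool_0>", "<tool_1>", "<tool_2>", "<tool_3>", "<tool_4>", "<tool_5>", "<end>"]
}
""")


def check_token_map(tm: dict) -> list:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    inverted = {tok: name for name, tok in tm["tokens"].items()}
    if inverted != tm["reverse"]:
        problems.append("reverse is not the exact inverse of tokens")
    expected_specials = set(tm["tokens"].values()) | {"<end>"}
    if set(tm["special_tokens"]) != expected_specials:
        problems.append("special_tokens does not match tool tokens plus <end>")
    return problems


print(check_token_map(TOKEN_MAP))  # []
```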
tools.json CHANGED
@@ -1,97 +1,41 @@
1
  {
2
- "version": "0.4.0",
3
- "description": "Physical AI tool schema for Coral Dev Board (SL2619) FunctionGemma demo. Canonical mobile-actions format. v9: same 8-tool surface as v8, but the model is trained Octopus v2 style: functional tokens (<tool_0>..<tool_7>) are emitted directly from a minimal user-only prompt with NO tool schema in the context. This shrinks the cold prefill from ~1088 prompt tokens (v7/v8 with the FunctionGemma developer turn) to ~13, taking on-board cold first-turn from ~57s to ~0.5s on a 2-core A55. The schema in this file is purely a developer/runtime contract (dispatcher arg validation, GGUF metadata, documentation); it is NOT injected into the inference prompt. Surface routing keyword: 'neopixels' (literal) for the ring; 'LED' / 'light' / 'red light' / 'green light' / 'blue light' for the HAT status LEDs.",
 
4
  "tools": [
5
  {
6
  "function": {
7
- "name": "set_status_led",
8
- "description": "Turn one or all of the HAT status LEDs on or off. The HAT has three individual LEDs (red, green, blue), each independently addressable. Invoke when the user mentions 'LED', 'LEDs', or 'the <color> light'.",
9
  "parameters": {
10
  "type": "OBJECT",
11
  "properties": {
12
- "led": {
13
- "type": "STRING",
14
- "description": "Which LED: 'red', 'green', 'blue', or 'all'."
15
- },
16
- "state": {
17
- "type": "STRING",
18
- "description": "'on' or 'off'."
19
- },
20
- "brightness": {
21
- "type": "INTEGER",
22
- "description": "Brightness 0-100. Optional, default 100. Ignored when state is 'off'."
23
- }
24
- },
25
- "required": ["led", "state"]
26
- }
27
- }
28
- },
29
- {
30
- "function": {
31
- "name": "blink_status_led",
32
- "description": "Blink one or all HAT status LEDs a given number of times. Invoke for 'blink/flash the <color> light' or 'blink the LEDs' style requests.",
33
- "parameters": {
34
- "type": "OBJECT",
35
- "properties": {
36
- "led": {
37
- "type": "STRING",
38
- "description": "Which LED: 'red', 'green', 'blue', or 'all'."
39
- },
40
- "count": {
41
- "type": "INTEGER",
42
- "description": "Number of blinks. Default 3."
43
- },
44
- "speed": {
45
- "type": "STRING",
46
- "description": "One of 'slow', 'normal', 'fast'. Default 'normal'."
47
- }
48
- },
49
- "required": ["led"]
50
- }
51
- }
52
- },
53
- {
54
- "function": {
55
- "name": "set_neopixel_effect",
56
- "description": "Play a visual effect on the neopixel ring (48-pixel WS2812B ring around the 7\" display, driven by WLED). Only invoke when the user explicitly mentions 'neopixels'. Use effect='off' to turn the ring off.",
57
- "parameters": {
58
- "type": "OBJECT",
59
- "properties": {
60
- "effect": {
61
- "type": "STRING",
62
- "description": "One of: 'solid', 'pulse', 'fade', 'chase', 'rainbow', 'sparkle', 'off', 'aurora', 'plasma', 'comet', 'twinkle', 'fireworks', 'police', 'heartbeat', 'loading', 'lightning', 'glitter', 'fire', 'sunrise'."
63
- },
64
  "color": {
65
  "type": "STRING",
66
- "description": "Color name (e.g. 'red', 'teal', 'amber', 'gold', 'violet') or 6-digit hex like '#FF8800'. Used by effects that take a primary color (solid, pulse, fade, chase, sparkle, comet). Ignored for rainbow and palette-driven effects."
67
- },
68
- "palette": {
69
- "type": "STRING",
70
- "description": "Color palette: 'auto', 'ocean', 'lava', 'forest', 'sunset', 'party', 'sherbet', 'c9', 'aurora', 'beach', 'fire', 'sakura', 'splash', 'pastel'. Most useful with aurora, plasma, fire, twinkle, comet."
71
  },
72
- "speed": {
73
  "type": "STRING",
74
- "description": "One of 'slow', 'normal', 'fast'. Default 'normal'."
75
  },
76
- "intensity": {
77
  "type": "STRING",
78
- "description": "One of 'low', 'medium', 'high'. Default 'medium'. Controls effect density / depth (sparkle density, fire height, comet tail length, aurora width)."
79
  }
80
- },
81
- "required": ["effect"]
82
  }
83
  }
84
  },
85
  {
86
  "function": {
87
  "name": "play_buzzer",
88
- "description": "Play a named pattern on the piezo buzzer on the HAT. The buzzer is binary GPIO on the Coral Dev Board (BUZZERn), so only timed patterns are supported — no tunable frequency.",
89
  "parameters": {
90
  "type": "OBJECT",
91
  "properties": {
92
  "pattern": {
93
  "type": "STRING",
94
- "description": "Named pattern: 'beep', 'double_beep', 'siren', 'chirp', 'alarm', 'success', 'error'."
95
  }
96
  },
97
  "required": ["pattern"]
@@ -101,7 +45,7 @@
101
  {
102
  "function": {
103
  "name": "set_alarm",
104
- "description": "Schedule an alarm to go off at a given time or after a duration. Alarm triggers buzzer + flashing LEDs.",
105
  "parameters": {
106
  "type": "OBJECT",
107
  "properties": {
@@ -154,7 +98,7 @@
154
  {
155
  "function": {
156
  "name": "respond",
157
- "description": "Reply to the user in natural language without taking any physical action. Use this when the request is out of scope (no matching tool), ambiguous (needs clarification — e.g. surface keyword missing for LED requests), purely conversational (greetings, thanks), or impossible on this device. Do NOT use this when any physical-action tool fits the request.",
158
  "parameters": {
159
  "type": "OBJECT",
160
  "properties": {
 
1
  {
2
+ "version": "0.5.0-draft",
3
+ "description": "DRAFT v10 schema. Single unified set_lights tool replaces v9's three LED tools (set_status_led, blink_status_led, set_neopixel_effect). The model is hardware-agnostic: it parses user intent into semantic args; the dispatcher maps those args to whichever LED hardware is detected at launch (HAT 3-LED indicators or WLED strip). User vocabulary is hardware-agnostic too: 'lights', 'LEDs', 'strip' all refer to whatever is connected. Named-args format per Mercedes-Benz Octopus v2 (arXiv 2501.02342); emitted calls look like <tool_0>(color=\"red\", state=\"on\")<end>. 6 tools (down from 8). set_lights kept to 3 args (color/effect/state) for robustness on the 270M; brightness/count/speed were dropped after analysis showed they appeared in zero of 5 observed voice failures and the dispatcher's defaults cover them.",
4
+ "format": "named",
5
  "tools": [
6
  {
7
  "function": {
8
+ "name": "set_lights",
9
+ "description": "Set the lights according to user intent. Emit only the args the user implied: color if they named one, effect if they named an animation/pattern, state if they said on/off. All three args are optional; the dispatcher fills sensible defaults (full brightness, normal speed, ~3 repetitions for blink) and approximates effects the connected hardware can't perform natively.",
10
  "parameters": {
11
  "type": "OBJECT",
12
  "properties": {
 
13
  "color": {
14
  "type": "STRING",
15
+ "description": "Color name. Common values: 'red', 'green', 'blue', 'white', 'yellow', 'purple', 'orange', 'pink', 'cyan'. HAT mode maps non-RGB colors to closest combo (e.g. yellow=red+green)."
 
 
 
 
16
  },
17
+ "effect": {
18
  "type": "STRING",
19
+ "description": "Named animation. Strip mode handles all natively. HAT mode approximates on 3 LEDs. Values: 'solid', 'blink', 'pulse', 'fade', 'rainbow', 'fire', 'plasma', 'aurora', 'police', 'fireworks', 'sparkle', 'twinkle', 'chase', 'comet', 'heartbeat', 'lightning', 'glitter', 'loading', 'sunrise', 'off'."
20
  },
21
+ "state": {
22
  "type": "STRING",
23
+ "description": "'on' or 'off'. Use when the user just toggles lights without specifying a color or effect."
24
  }
25
+ }
 
26
  }
27
  }
28
  },
29
  {
30
  "function": {
31
  "name": "play_buzzer",
32
+ "description": "Play a named pattern on the piezo buzzer on the HAT. Binary GPIO: only timed patterns, no tunable frequency.",
33
  "parameters": {
34
  "type": "OBJECT",
35
  "properties": {
36
  "pattern": {
37
  "type": "STRING",
38
+ "description": "One of: 'beep', 'double_beep', 'siren', 'chirp', 'alarm', 'success', 'error'."
39
  }
40
  },
41
  "required": ["pattern"]
 
45
  {
46
  "function": {
47
  "name": "set_alarm",
48
+ "description": "Schedule an alarm to go off at a given time or after a duration. Alarm triggers the buzzer plus a visible flash on whatever lights are connected.",
49
  "parameters": {
50
  "type": "OBJECT",
51
  "properties": {
 
98
  {
99
  "function": {
100
  "name": "respond",
101
+ "description": "Reply to the user in natural language without taking any physical action. Use when the request is out of scope, ambiguous, purely conversational, or impossible on this device. Do NOT use when set_lights or another physical-action tool fits the request.",
102
  "parameters": {
103
  "type": "OBJECT",
104
  "properties": {