← TinyGPT · docs · devlog · roadmap · speedup
source: docs/learn/tool-calling-frontier-parity.md · view on GitHub ↗

Tool-calling: how close can a Mac-local small model get to frontier?

What: the arc from “our 1.7B scores ~55% on tool-calling” to a frontier-validated metric, an honest size curve, and a distillation result that closes most of the gap. Why it matters here: this is the cost-compression thesis made concrete — reach frontier capability at a fraction of the cost, measured on a ruler we trust.

Recorded 2026-06-14. Companion principle: the frontier-ceiling gate in AGENTS.md (“Eval philosophy”). Distillation mechanics: distillation.md.

1. The eval was broken before the models were

The first “55%” came from scoring against hermes-fc gold with exact-string match. That metric is unwinnable:

Rule that came out of this: before any benchmark grades a Mac model, a frontier model must ace it (~100%). If frontier can’t, the eval is broken — fix or drop it. hermes-fc is now training-only, never a reported metric.

2. The ruler we trust: BFCL with AST matching

BFCL golds are verified groundable and multi-valued (each param lists acceptable values). We built a controlled harness (single-turn categories) and validated it: frontier = 124/125 (99.2%). The lone miss (parallel_9) is a doubly-underdetermined gold (batched array call ≡ parallel calls; "5:00 PM""5 pm") — accepted as passing.

Legitimate harness fixes made to reach frontier 100% (not to inflate small models): Python-syntax instruction (BFCL’s own convention), implicit-multiplication canonicalization (3*x3x — BFCL’s own gold 3x**2 is non-executable), recursive nested-dict matching, and a brace-matching parser.

Two parser bugs were hiding the local models’ real ability. The first regex discarded multi-call outputs that used one closing </tool_call> for several blocks (a model convention); the second couldn’t parse bare-JSON calls with nested arguments. Fixing them moved the distilled 1.7B’s parallel_multiple from a fake 8% to a real 60%, and the 4B’s simple_python from a fake 0% to 84%. Lesson: a lenient, well-tested parser is part of a fair eval.

3. The honest size curve (validated BFCL slice, n=25/category)

Modelsimplemultipleparallelpar_multlive_slive_mavg
Frontier (Claude)10010096100100~99
30B-A3B (≈3B active)9696~frontier
base-4B (stock)84969280886083
base-1.7B (stock)92962016724256
distilled-1.7B (hermes)80846860762465
FT-1.7B (ToolACE)80966864925676

Headlines: the 30B-A3B matches frontier on multi-call at ~3B active params (the cost-compression proof). The stock 1.7B is already frontier-level on single-call (92/96) — its only real weakness is multi-call decomposition.

4. Conclusion on distillation/SFT

Best result via distillation/SFT so far: avg 76 for the 1.7B (fine-tuned on 8,270 ToolACE examples, 42% multi-call, prompts identical to the eval). It closed most of the multi-call gap (parallel 20→68, parallel_multiple 16→64), and now beats the 4B on live_simple (92 vs 88). That’s ~2/3 of the way from base-1.7B (56) to base-4B (83).

But SFT-distillation plateaus short of frontier-parity, and it trades:

This matches the validated thesis (see distillation.md): distillation can match but not exceed its data/teacher ceiling. Remaining distillation levers we have NOT exhausted: (a) fix the data mix to recover single-call; (b) distill from our local 30B (already frontier-level on multi-call) instead of a generic dataset. Pure SFT’s cap for the 1.7B looks to be ~the 4B’s level, not frontier.

5. Where reinforcement learning comes in

To exceed the SFT ceiling you need RL — it optimizes a verifiable reward directly rather than imitating data. Status of the from-scratch MLX GRPO work (no MLX GRPO exists; built here):

Result (2026-06-14). GRPO on FT-1.7B (reward = graded AST match + over-emission penalty; KL to frozen FT-1.7B; held-out BFCL [25:] prompts; grad-accumulation for Qwen3’s 151k-vocab logits). Stable throughout (loss ~0, KL ≤ 0.025 — no blowup). It delivered a modest, targeted lift, exactly on the categories we aimed at:

simplemultipleparallelpar_multlive_slive_mavg
SFT (FT-1.7B)80966864925676
+GRPO80966868926478

The arc: base 56 → SFT 76 → GRPO 78. Conclusion: SFT does the heavy lifting; RL is a small targeted top-up (+8 live_multiple, +4 parallel_multiple). The 1.7B plateaus ~78 — short of the 4B and frontier on hard multi-call. Strong result (4B-competitive on 4/6) but not parity. → escalate to the 4B.

6. The 4B sweep — and the punchline: on a strong base, selection beats training

We surveyed the best small bases (June 2026) and ran the recipe. Validated BFCL slice:

Modelsimplemultparpar_mlive_slive_mavg
Frontier (Claude)10010096100100~99
Qwen3-4B-2507 bf16 — STOCK92969696885687.3
Hammer2.1-3b stock (FC-specialist*)961009288846086.7
Qwen3-4B-2507 + ToolRL-GRPO92969292925686.7
Qwen3.5-4B-8bit stock881008084927286.0
Qwen3-4B-2507 4-bit stock84969280886083.3
Qwen3-4B + function-masking SFT84928884726080.0

*Hammer trains on BFCL-like data → structured scores partly by-design.

Findings:

Verdict: best Mac-local 4B tool-caller = Qwen3-4B-Instruct-2507 @ bf16, STOCK — beats SOTA-4B (Hammer-4B 76), zero training, every training intervention made it worse. The meta-lesson: training is the lever for weak bases (1.7B 56→78); on a strong base the wins are base + precision selection. We proved this empirically rather than assuming it.

7. Closing the eval gaps (the honest headline)

Two fixes made the number trustworthy:

On the sound categories the 4B is ~93–94 — near-frontier. The residual gap is genuine capability on the hardest real-user args (live_multiple 64 vs frontier 92), not something more training of this base fixes.

8. Multi-turn / agentic — the cliff, measured (2026-06-14)

Single-turn (88.7) said nothing about holding a conversation. Built a stateful multi-turn harness reusing BFCL’s machinery (execute_multi_turn_func_call + multi_turn_checker + the involved_classes backends); scripts/bfcl_multiturn_eval.py.

The build lesson: the inference side is the eval. A hand-rolled text-transcript prompt under-elicited badly — the 30B-A3B scored 0/8 despite acing single-turn (96/96). The fix: drive the model with its native tool-calling chat template (tools= catalog + proper assistant/tool message roles — what Qwen3 was trained on) + BFCL’s own multi-turn behaviour prompt. That lifted the 30B from 0% → 50% — in BFCL’s expected ~50-70% range for multi_turn_base (the hardest category; even frontier doesn’t ace it).

The cliff (multi_turn_base, native-template harness, n=20 same examples):

Modelsingle-turnmulti-turndrop
Qwen3-30B-A3B (3B active)~9645%−51
Qwen3-4B-2507 bf1688.725%−64

So even the strong 30B more than halves on the hardest multi-turn; the 4B drops to 25%.

But “95-96% multi-turn” is unachievable on multi_turn_base for anyone — the BFCL-V4 leader sits at 75% overall and multi-turn runs lower; frontier caps ~50-70%. By our own frontier-ceiling rule, a 95% bar there is mis-calibrated. So we built a difficulty-graded, right-sized gate — deterministic single-backend (GorillaFileSystem) agentic tasks tuned so a strong model aces the easier tiers while the gap to a small model grows (scripts/make_multiturn_gates.py).

The capability gradient — frontier-validated (DeepSeek-V4-pro, true frontier):

Tiertask shapeDeepSeek-V4-pro30B-A3B (proxy)Qwen3-4B-2507 bf16
single-turnone call~99~9688.7
easy multi-turn1-2 calls100%100%94%
moderate3-4 calls + cd-nav100%100%86%
hard (canonical gate)5-7 calls, deep nesting, 4 turns100%83%58%
multi_turn_base (BFCL hardest)multi-backend, long45%25%

The hard tier is the sound, discriminating gate — a true frontier model (DeepSeek-V4-pro, via OpenAI function-calling) aces it 100%, while the 4B clearly cliffs to 58% (a 42-pt frontier-to-small gap). So make_multiturn_gates.py + the DeepSeek backend (bfcl_multiturn_deepseek.py) give a calibrated multi-turn ruler: frontier ~100%, and the 4B’s curve 94 → 86 → 58 maps exactly where it degrades as agentic complexity rises. (The 30B-A3B sits between at 83% on hard — a strong but not-frontier 3B-active proxy.)

The 4B is a capable simple-agent, not a poor agent — near-frontier on easy/moderate flows, cliffing only as depth/length/turns grow. We expected the climb lever to be multi-turn RL; in fact rejection-sampling distillation alone cleared it (§8.1). RL (GRPO) stays available as a further top-up, but wasn’t needed to reach frontier-parity on this gate.

(API note: BFCL func docs use "type":"dict"; OpenAI/DeepSeek require "object" — the DeepSeek backend normalizes BFCL’s type vocabulary to JSON-schema, and needs a curl User-Agent to clear Cloudflare. Key read from /tmp/deepseek_key / $DS_KEY_FILE, never committed.)

(Build note: the inference side IS the eval — a hand-rolled text transcript scored the 30B 0%; the native tool-calling chat template + proper roles fixed it. And echo-content tasks were flaky because small models over-call and a stray touch blanks the file — idempotent mkdir/mv/rm tasks make the gate clean. Both are real “agentic eval is subtle” lessons.)

8.1 Climbing the cliff — frontier-trajectory distillation (2026-06-16)

Goal: get the 4B from its 58% hard-tier cliff to the ~95% frontier level without stepping up to 8B. Two levers, stacked:

  1. Free first — a plan-then-execute system prompt (plan the full call sequence, act one step at a time, never repeat a succeeded call, stop when done). Stock 4B 58 → 75 on the 12-task gate (the harness SYS is now MT_SYS-overridable). A real +17, but brittle.
  2. The durable lever — RFT (rejection-sampling distillation). Recipe, all Mac-local:
    • Scale the task family: gen_multiturn_trajdata.py templates hundreds of deterministic, idempotent GorillaFileSystem agentic tasks (gold-validated via multi_turn_checker).
    • Teacher trajectories: run DeepSeek-V4-pro over 100 held-out tasks (bfcl_multiturn_deepseek.py --dump), keep only the 99 the checker passed (rejection sampling ⇒ clean labels).
    • Render in the student’s own format: render_sft_from_traj.py re-emits each trajectory through the 4B’s chat template (tools= + <tool_call> + tool roles) as mlx_lm text.
    • LoRA SFT: 16 layers, lr 1e-5, 4 epochs. --grad-checkpoint is mandatory — every example is ~3.1-3.7k tokens (the 18-tool catalog floors them) and the 151k-vocab logits OOM the backward pass without it. One command: distill_multiturn.sh.

The climb (held-out, zero train/eval content overlap — verified):

Qwen3-4B-2507hard gate (12)40-task held-out set
stock58%
+ plan prompt75%60%
+ distilled (99 frontier trajectories)100%95%

The 4B now matches DeepSeek-V4-pro (100%) on the frontier-validated hard gate — frontier-parity on multi-turn file-system agency, at 4B, locally, no 8B needed.

The tradeoff (honest): single-turn BFCL slipped ~87 → 83 avg (simple_python −8, parallel_multiple −16, multiple/parallel unchanged, live_multiple +4) — the classic specialization cost of a narrow SFT. Recoverable by mixing single-turn data into the SFT or fewer epochs if we want both skills; for a multi-turn agentic product (Pace) the trade is strongly positive.

Scope: proven on the GorillaFileSystem multi-turn domain (the gate’s backend). Generalization to other agentic backends (trading, ticketing, …) is untested — distilling a multi-backend mix is the obvious next step. The recipe — author verifiable tasks → frontier RFT → SFT in the student’s template — is domain-general.

Reproducible for free (gold behaviour-cloning ≡ frontier distillation). When the trajectories were lost to a /tmp wipe, we rebuilt the identical model with no teacher API: for verifiable tasks the gold ground-truth is the correct trajectory, so gold_to_sft_traj.py synthesizes SFT data by executing the gold per turn (free, fast, deterministic). It reproduced 100% hard / 95% hardgen exactly — but only after one non-obvious fix: the SFT data must demonstrate the turn-completion STOP signal (an assistant message emitting no tool calls after the work is done). Without it the model never learns to stop, over-calls at eval, and lands at 75% — the gap that a teacher’s trajectories close for free because they naturally end each turn with a no-call message. (Lesson with teeth for self-improvement: knowing when to stop is a learned behaviour, not a freebie — a ReST loop must reward it.)

8.2 Conclusive head-to-head — Pace incumbent (Gemma) vs the 4B (2026-06-16)

Pace ships Gemma; this is the deciding comparison. Same hard gate, same plan prompt, n=12. Gemma scored zero-shot via LM Studio (OpenAI function-calling); the distilled 4B is the §8.1 specialist; frontier + stock-4B are anchors. Reproducible via headtohead_multiturn.sh.

Modelparamshard-gate task-completion
DeepSeek-V4-pro (frontier anchor)100%
Qwen3-4B-2507 — distilled4B100%
Gemma-4-12b-qat12B83%
Qwen3-4B-2507 — stock (+plan prompt)4B75%
Gemma-3-12b12B33%

The distilled 4B matches frontier and beats both Gemma-12B variants at ⅓ the parameters — higher agentic accuracy and smaller/faster/less-RAM. For a multi-turn agentic app, the distilled 4B is the clear winner over the incumbent.

Honest framing: the 4B is specialized on this domain (GorillaFileSystem) via cheap frontier-distillation; Gemma is zero-shot. The claim is the project thesis — a cheaply specialized small model beats a larger general model on the target tasknot “4B > 12B in general.” Distilling Gemma the same way would likely lift it too. Caveats: Gemma-3-12b’s 33% partly reflects weaker tool-call formatting (Gemma-4-qat’s 83% shows the protocol is fine, so most of the gap is real capability); Pace’s exact production Gemma is still TBD (if Gemma-3, the upgrade is dramatic; if Gemma-4-qat, still +17pp at ⅓ size); decode tok/s + RAM (the 4B wins both structurally) and single-turn (distilled 4B ~83) are the remaining leaderboard columns.

8.3 Domain saturated at 4B + a free frontier backend (2026-06-16)

Pushed a longer-horizon veryhard tier (6-8 turns, 9-16 calls, heavy cd-navigation; new templates the 4B never trained on — gen_multiturn_trajdata.py … veryhard) to see if a harder gate would finally separate the distilled 4B from frontier and justify a bigger model:

Modelveryhard (12)
gpt-5.5 (true frontier)100%
Qwen3-4B-2507 — distilled100%
DeepSeek-V4-pro83%
Qwen3-4B-2507 — stock (+plan)25%

The distilled 4B aces it too — matching the strongest frontier (gpt-5.5) on longer unseen tasks, while stock collapses to 25%. Conclusion: the file-ops agentic domain is saturated at 4B — a harder file-ops gate won’t discriminate it, so a distilled 12B has no payoff here. The real open question is breadth (other BFCL backends — trading/ticketing/travel — that the 4B never trained on), not depth. (DeepSeek’s 83% < gpt-5.5’s 100% just means DeepSeek is a slightly less reliable frontier on fiddly 16-call navigation; the gate is sound — true frontier aces it.)

Free frontier backend (cost fix): validation + teacher trajectories now run on the Codex CLI (gpt-5.5), free under subscriptionscripts/bfcl_multiturn_codex.py drives it single-shot per step via codex exec --output-schema (forced JSON tool-calls), reusing the same BFCL executor + checker. Gotcha: OpenAI strict structured-output requires additionalProperties: false on every object and forbids free-form objects, so arguments is passed as a JSON string and parsed. This retires the paid DeepSeek API for routine frontier work.

8.4 Breadth — narrow distillation causes negative transfer (2026-06-16)

Saturation at file-ops raised the real question: does the specialist generalize? Tested on 52 held-out single-backend, non-filesystem BFCL multi_turn tasks (TradingBot 20, VehicleControlAPI 19, TravelAPI 13 — domains the 4B never trained on), same generic prompt, distilled vs stock:

4Bfile-ops (hard gate)out-of-domain breadth (52)
stock58%59.6%
distilled (file-ops only)100%42.3%

The file-ops distillation made the model worse everywhere else — 60% → 42%, a −17pt regression. Apples-to-apples (same tasks, same prompt), so this is real catastrophic forgetting / negative transfer, not noise. We bought depth (file-ops 58→100) at the cost of breadth. The distilled 4B is a file-ops specialist, not a better agent.

Implications:

8.5 The gold-cloning ceiling — why breadth needs interleaved trajectories (2026-06-16)

Tried the obvious fix for §8.4’s negative transfer: multi-backend gold-cloning — 248 tasks spanning file-ops + Trading/Vehicle/Travel/Ticket/Message/Twitter (clean split from the 52-task eval), gold-cloned and SFT’d. It made breadth worse:

4Bfile-ops (depth)breadth (52 out-of-domain)
stock58%59.6% ← still the best breadth
file-ops gold-distill100%42.3%
multi-backend gold-distill100%30.8%

Root cause (measured, not guessed): 52% of multi-backend turns have a call argument that comes from a tool result, not the user prompt. Example multi_turn_base_57: the gold lumps one turn as get_zipcode(...), get_zipcode(...), estimate_distance(cityA='69238', cityB='51479') — where 69238/51479 are the zipcodes the get_zipcode calls return. Behaviour-cloning that gold teaches the model to (a) emit all calls blind before seeing any result and (b) hallucinate the specific result values. More such data → more harm.

The law: gold-cloning ≡ frontier distillation only when call args are derivable from the user prompt (file-ops: names, paths — §8.1 worked for exactly this reason). For data-dependent agency, the thing to learn is the trajectory structurecall → read result → use result in the next call — and the gold does not contain that structure. Cloning concrete result-values is anti-learning.

What actually fixes breadth: interleaved trajectories that demonstrate reading a result before using it — either a frontier teacher (gpt-5.5 via the free Codex backend, which calls get_zipcode, reads 69238, then calls estimate_distance) or the model’s own rollouts in a ReST/RL loop (interleaving is intrinsic; the checker filters the correct ones). This is the decisive, evidence-backed motivation for self-improving-agents.md: the loop teaches the one thing gold-cloning structurally can’t.

Practical takeaway for Pace: today the stock 4B (60%) is the best multi-domain agent; the gold-distilled models win only on their narrow domain (file-ops 100%). Either route to the specialist on its domain, or train breadth with real interleaved trajectories — not gold-clones.

See also