File-ops hard gate
100% up from 58% stock 4B
Qwen3-4B File-Ops Distilled
This is the strongest model win in the repo: a Mac-built specialist reaches 100% on the file-ops hard gate, up from 58% for stock Qwen3-4B. It is also the clearest example of why routing is mandatory: out-of-domain breadth drops from 59.6% to 42.3% after tuning, so the specialist should only receive file-ops requests.
Headline Numbers
Heldout file-ops
95% hardgen heldout suite
Breadth after tuning
42.3% down from 59.6% stock
Artifact size
7.5GB local HF/MLX safetensors directory
Competitive Context
| System | Metric | Score | Size / Class | Comparable? | Readout |
|---|---|---|---|---|---|
| TinyGPT Qwen3-4B file-ops specialist | local file-ops hard gate | 100% | 4B, 7.5GB package | Direct | Domain specialist result; not a general BFCL leaderboard submission. |
| Stock Qwen3-4B | same local file-ops hard gate | 58% | 4B | Direct | Before/after delta is +42 points on the frozen domain gate. |
| Frontier calibration | same local file-ops hard gate | ~99-100% | frontier API/teacher | Direct | Used as the ceiling check for whether the eval is a usable ruler. |
| BFCL V4 public leader | overall BFCL V4 accuracy | 75.0% | large public model | Directional | LLM Stats snapshot for Qwen3.7 Max; it marks BFCL-V4 rows as self-reported/unverified, so this is market context only. |
| BFCL V4 public average | overall BFCL V4 accuracy | 61.1% | 13 tracked models | Directional | LLM Stats reports 13 self-reported rows and 0 verified rows; TinyGPT still needs a full BFCL submission for direct comparison. |
| Qwen3.5-4B public BFCL-V4 row | overall BFCL V4 accuracy | 50.3% | 4B | Directional | Closest public 4B-class tool-calling row in the same LLM Stats snapshot, but still not the local file-ops gate. |
Direct rows share this artifact's eval setup. Directional rows are useful market context but should not be read as leaderboard claims.
Measured result
| Gate | Stock | Specialist | Readout |
|---|---|---|---|
| File-ops hard gate | 0.58 | 1.00 | Domain win |
| File-ops hardgen heldout | - | 0.95 | Generalizes within file ops |
| Out-of-domain breadth | 0.596 | 0.423 | Regression; route only |
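The before/after deltas implied by the table can be reproduced with a few lines of arithmetic. This is a minimal sketch: the scores are copied from the rows above, while the `RESULTS` dictionary and the `delta_points` helper are illustrative names, not part of the repo.

```python
# Scores copied from the "Measured result" table, as fractions of 1.0.
# RESULTS and delta_points are hypothetical names used only for this sketch.
RESULTS = {
    "File-ops hard gate": (0.58, 1.00),
    "Out-of-domain breadth": (0.596, 0.423),
}


def delta_points(stock: float, specialist: float) -> float:
    """Return the specialist-minus-stock change in percentage points."""
    return round((specialist - stock) * 100, 1)


for gate, (stock, specialist) in RESULTS.items():
    print(f"{gate}: {delta_points(stock, specialist):+.1f} points")
```

The hardgen heldout row is left out because the table reports no stock score for it, so no delta can be computed.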
Release Blockers
Weight distribution undecided
The package lock points to a local cache path, not a public artifact host.
Unblock: Decide metadata-only release vs durable hosted weight release.
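One way to keep this blocker visible is a pre-release check that rejects any weight location that is not a public HTTPS URL. The sketch below assumes the lock records the location as a plain string. The actual lock format is not shown here, so `is_public_artifact` and the example paths are hypothetical.

```python
from urllib.parse import urlparse


def is_public_artifact(location: str) -> bool:
    """Return True only for HTTPS URLs with a host; local paths fail.

    Hypothetical helper: assumes the lock stores the weight location
    as a single string, which may not match the real lock format.
    """
    parsed = urlparse(location)
    return parsed.scheme == "https" and bool(parsed.netloc)


# Example values are made up for illustration.
print(is_public_artifact("/Users/me/.cache/models/file-ops"))   # local cache path
print(is_public_artifact("https://example.com/weights/file-ops"))  # hosted URL
```

A check like this only confirms the shape of the location, not that the host is durable or that the weights are actually downloadable.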
Breadth regression
The tuned model should not be used as a general planner: out-of-domain breadth fell from 59.6% (stock) to 42.3% after tuning.
Unblock: Keep all public copy routed-only and include the negative-transfer table.
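A routed-only deployment can be illustrated with a very small dispatcher that sends file-ops requests to the specialist and everything else to the stock model. This is a sketch under stated assumptions, not the repo's router: the keyword list, the `pick_model` function, and the model labels are all hypothetical, and a real router would likely use a classifier rather than substring matching.

```python
# Hypothetical keyword hints; a production router would need a better signal.
FILE_OPS_HINTS = (
    "file", "directory", "folder", "rename",
    "copy", "move", "delete", "path",
)


def pick_model(request: str) -> str:
    """Route file-ops requests to the specialist, the rest to the stock model."""
    text = request.lower()
    if any(hint in text for hint in FILE_OPS_HINTS):
        return "file-ops-specialist"  # hypothetical label for the tuned model
    return "stock-general"  # hypothetical label for the untuned model


print(pick_model("Rename report.txt to report_v2.txt"))
print(pick_model("Plan a three-day trip to Lisbon"))
```

Defaulting to the stock model when no hint matches keeps the specialist away from the out-of-domain traffic where it regressed.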