Mac runtime benchmark Report artifact 2026-06-06

Huge Preset Decode Throughput

This is a runtime artifact, not a model-quality claim. Its value is operational: local eval loops need throughput, stable serving, and cheap repeated generation.

Headline Numbers

Huge decode: 696 tok/s (96M Huge preset, ctx 1024)

Mega pilot: 293 tok/s (960M pilot)

Warm TTFT p99: 5.8 ms (reported runtime metric)

Competitive Context

System | Metric | Score | Size / Class | Comparable? | Readout
TinyGPT | Huge preset decode throughput | 696 tok/s | 96M | Direct | Local runtime baseline for cheap repeated eval/smoke loops.
TinyGPT | Mega pilot decode throughput | 293 tok/s | 960M | Direct | Shows the throughput drop as local model size approaches specialist scale.
External serving stacks | same benchmark | not measured | MLX/llama.cpp/Ollama class | Not comparable | Needs a shared prompt/config/device table before public competitive serving claims.

Direct rows share this artifact's eval setup. Rows not marked Direct (Directional or Not comparable) are useful market context but should not be read as leaderboard claims.
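The two Direct rows can be turned into per-token latency and a relative slowdown with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical and not part of TinyGPT, and it assumes steady-state decode with no prefill or sampling overhead.

```python
# Illustrative arithmetic on the two Direct rows reported above.
# Function names are hypothetical; only the tok/s figures come from the report.

HUGE_TOK_PER_S = 696.0   # 96M Huge preset, ctx 1024
MEGA_TOK_PER_S = 293.0   # 960M Mega pilot


def ms_per_token(tok_per_s: float) -> float:
    """Average decode time per token, in milliseconds."""
    if tok_per_s <= 0:
        raise ValueError("throughput must be positive")
    return 1000.0 / tok_per_s


def decode_seconds(n_tokens: int, tok_per_s: float) -> float:
    """Estimated wall time to decode n_tokens at a steady rate (assumption)."""
    if tok_per_s <= 0:
        raise ValueError("throughput must be positive")
    return n_tokens / tok_per_s


def slowdown(baseline_tok_per_s: float, other_tok_per_s: float) -> float:
    """How many times slower `other` decodes relative to `baseline`."""
    return baseline_tok_per_s / other_tok_per_s


if __name__ == "__main__":
    print(f"Huge: {ms_per_token(HUGE_TOK_PER_S):.2f} ms/token")
    print(f"Mega: {ms_per_token(MEGA_TOK_PER_S):.2f} ms/token")
    print(f"Mega is {slowdown(HUGE_TOK_PER_S, MEGA_TOK_PER_S):.2f}x slower")
```

Read this way, a roughly 10x increase in parameter count (96M to 960M) costs a little under 2.4x in decode throughput in this artifact's setup.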

Runtime Numbers

Metric | Value | Use
Decode throughput | 696 tok/s | Fast local eval/smoke loops
Mega pilot throughput | 293 tok/s | Boundary mapping for larger local models
Warm TTFT p99 | 5.8 ms | Interactive serving viability
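The report does not say how the warm TTFT p99 was computed. One common choice is a nearest-rank percentile over time-to-first-token samples taken after warm-up; the sketch below assumes that method, and the function name is hypothetical.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it.

    This is an assumed method, not necessarily the one used by the report.
    """
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank; multiply before dividing to limit float rounding.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]
```

Feeding it warm-request TTFT samples in milliseconds with `p=99` yields a figure comparable in kind to the one in the table, provided cold-start requests are excluded before the call.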

Release Blockers

Preset-specific

The headline number is not a blanket claim for all HF models or specialists.

Unblock: Attach latency, RAM, and tok/s numbers to each future specialist artifact.
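One way to make this unblock step checkable is a small per-artifact record that flags missing numbers. This is a minimal sketch under stated assumptions: the class, field names, and artifact label are hypothetical, and fields this report does not attribute to a specific preset are left as None rather than guessed.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class RuntimeRecord:
    """Hypothetical per-artifact runtime record; field names are assumptions."""
    artifact: str
    preset: str
    context_length: Optional[int]
    decode_tok_per_s: Optional[float]
    warm_ttft_p99_ms: Optional[float]
    peak_ram_mb: Optional[float]


def missing_fields(record: RuntimeRecord) -> List[str]:
    """Names of fields that are still None, in declaration order."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Huge preset row from this report. The report does not tie the TTFT figure
# to a specific preset and gives no RAM number, so both stay None here.
huge = RuntimeRecord(
    artifact="mac-runtime-benchmark",  # hypothetical label
    preset="Huge",
    context_length=1024,
    decode_tok_per_s=696.0,
    warm_ttft_p99_ms=None,
    peak_ram_mb=None,
)

if __name__ == "__main__":
    print("still missing:", missing_fields(huge))
```

A specialist artifact could be treated as blocked until `missing_fields` returns an empty list.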

Evidence

Next Release Action

Use this as the baseline expectation for future artifact performance tables.