Mac runtime benchmark, report artifact, 2026-06-06

Huge Preset Decode Throughput

This is a runtime artifact, not a model-quality claim. Its value is operational: local eval loops need throughput, stable serving, and cheap repeated generation.

Tags: MLX, decode, Mac, throughput

Headline Numbers

Huge decode: 696 tok/s (96M/Huge preset, ctx 1024)

Mega pilot: 293 tok/s (960M pilot)

Warm TTFT p99: 5.8 ms (reported runtime metric)

Competitive Context

System	Metric	Score	Size / Class	Comparable?	Readout
TinyGPT Huge preset	decode throughput	696 tok/s	96M	Direct	Local runtime baseline for cheap repeated eval/smoke loops.
TinyGPT Mega pilot	decode throughput	293 tok/s	960M	Direct	Shows the throughput drop as local model size approaches specialist scale.
External serving stacks	same benchmark	not measured	MLX/llama.cpp/Ollama class	Not comparable	Needs a shared prompt/config/device table before public competitive serving claims.

Direct rows share this artifact's eval setup. The Not comparable row is listed for market context only and should not be read as a leaderboard claim.
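The comparability rule for these rows can be sketched as a small check: two rows count as Direct only when both were measured under the same setup. This is a minimal illustration, not the artifact's actual tooling. The `BenchSetup` class, its field names, and the placeholder values are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BenchSetup:
    """Hypothetical description of one benchmark setup (illustrative fields)."""
    prompt_set: str
    config: str
    device: str


def comparability(a: Optional[BenchSetup], b: Optional[BenchSetup]) -> str:
    """Return 'Direct' only when both setups are known and identical."""
    if a is None or b is None:
        return "Not comparable"  # one side was never measured
    return "Direct" if a == b else "Not comparable"


# Placeholder values; the report does not publish its prompt set or config names.
artifact_setup = BenchSetup(prompt_set="shared-prompts", config="default", device="mac")
external_setup = None  # external serving stacks: not measured in this artifact

print(comparability(artifact_setup, artifact_setup))  # both TinyGPT rows
print(comparability(artifact_setup, external_setup))  # external stacks row
```

Treating an unmeasured setup as `None` keeps the external row out of any ranking until a shared prompt/config/device table exists.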

Runtime Numbers

Metric	Value	Use
Decode throughput	696 tok/s	Fast local eval/smoke loops
Mega pilot throughput	293 tok/s	Boundary mapping for larger local models
Warm TTFT p99	5.8 ms	Interactive serving viability
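Two derived figures help when reading the throughput rows: the average time per decoded token, and the relative drop from the Huge preset to the Mega pilot. The sketch below uses only the reported values; the helper names are illustrative.

```python
# Reported decode throughput values from this artifact.
huge_tok_s = 696.0
mega_tok_s = 293.0


def ms_per_token(tok_s: float) -> float:
    """Average decode time per token, in milliseconds."""
    return 1000.0 / tok_s


def relative_drop(baseline: float, other: float) -> float:
    """Fraction of throughput lost relative to the baseline."""
    return 1.0 - other / baseline


print(f"Huge: {ms_per_token(huge_tok_s):.2f} ms/token")           # Huge: 1.44 ms/token
print(f"Mega: {ms_per_token(mega_tok_s):.2f} ms/token")           # Mega: 3.41 ms/token
print(f"Drop: {relative_drop(huge_tok_s, mega_tok_s):.1%}")       # Drop: 57.9%
```

So a 10x increase in parameter count (96M to 960M) costs roughly 58% of decode throughput in these measurements.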

Release Blockers

Preset-specific

The headline number applies to this preset only; it is not a blanket claim for all Hugging Face (HF) models or specialists.

Unblock: Attach latency, RAM, and tok/s numbers to each future specialist artifact.
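One way to collect those per-artifact numbers is a small timing harness around any token generator. This is a hedged sketch, not the harness used for this report: `RuntimeRecord`, `measure`, and `p99` are hypothetical names, `generate` stands in for whatever streaming decode call a given runtime exposes, and peak RAM is left as a field to fill in from a separate memory tool.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class RuntimeRecord:
    """Hypothetical per-artifact runtime record; field names are illustrative."""
    name: str
    decode_tok_s: float
    ttft_p99_ms: float
    peak_ram_mb: Optional[float] = None  # fill in from a separate memory tool


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def measure(name, generate, prompt, runs=20, warmup=1):
    """Time `generate(prompt)`, any callable that yields tokens one at a time."""
    for _ in range(warmup):
        for _tok in generate(prompt):  # warm-up pass; timings discarded
            pass
    ttfts_ms, decoded, decode_s = [], 0, 0.0
    for _ in range(runs):
        start = time.perf_counter()
        first = last = None
        count = 0
        for _tok in generate(prompt):
            last = time.perf_counter()
            if first is None:
                first = last  # arrival time of the first token
            count += 1
        if first is None:
            continue  # this run produced no tokens
        ttfts_ms.append((first - start) * 1000.0)
        decoded += count - 1      # tokens emitted after the first one
        decode_s += last - first  # time spent producing them
    if not ttfts_ms or decode_s <= 0:
        raise ValueError("not enough tokens to compute metrics")
    return RuntimeRecord(name, decoded / decode_s, p99(ttfts_ms))
```

With only 20 runs, the nearest-rank p99 is simply the slowest run, so a larger `runs` value gives a more meaningful tail figure.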

Evidence

Next Release Action

Use this as the baseline expectation for future artifact performance tables.