
Squeezing a 26B diffusion LLM onto a Mac
TL;DR: I set out to make DiffusionGemma (a 26B-parameter, ~4B-active mixture-of-experts block-diffusion text model) fast enough to be a real interactive tool on a single Apple M5 Pro (48 GB, macOS 27 beta). The stock MLX inference path ran at ~18 tok/s on my original short repro, ~23–27 tok/s on a...
2026-06-15
Optimizing DiffusionGemma (26B-A4B-IT-4Bit) on an Apple M5 Pro: what worked and what didn't.



