Multi-token prediction, on a small laptop
Google shipped multi-token prediction for Gemma 4. The headline 3x speedup is for big GPUs, but the edge-model drafters quietly target laptops like mine.
Google announced multi-token prediction for Gemma 4 this week, and most of the press release talks about big numbers on big hardware: three times the throughput on an RTX PRO 6000, faster batch inference on A100s, the usual diet of tidy graphs. I run Gemma 4 on a Dell laptop running Ubuntu, with an RTX PRO 500 Blackwell and six gigabytes of VRAM, so most of those numbers aren’t aimed at me. The 26-billion-parameter Mixture-of-Experts model isn’t going to fit on my card no matter how clever the inference trick gets. But the announcement also quietly mentioned drafters for the small edge models, E2B and E4B, and those are exactly the ones I run.
The trick isn’t new. Speculative decoding has been around since a Google paper in 2022: a small “drafter” model proposes a handful of tokens at once, the bigger target model verifies them in a single parallel pass, and any tokens it agrees with come for free.
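The loop is simple enough to sketch. The snippet below is a toy greedy variant, not Google's implementation: `draft_model` and `target_model` are hypothetical callables that return per-position logits in one forward pass, and the real method uses a rejection-sampling rule that preserves the target's distribution rather than the exact-match check shown here.

```python
import torch

def speculative_step(target_model, draft_model, tokens, k=4):
    """One round of speculative decoding, greedy variant for illustration.

    target_model and draft_model are hypothetical callables mapping a 1-D
    token tensor to [seq_len, vocab] logits in a single forward pass.
    """
    # 1. The small drafter proposes k tokens autoregressively (cheap per step).
    draft = tokens.clone()
    for _ in range(k):
        next_tok = draft_model(draft)[-1].argmax()
        draft = torch.cat([draft, next_tok.view(1)])

    # 2. The big target scores the whole draft in one parallel pass, which
    #    costs roughly the same as generating a single token on memory-bound
    #    hardware.
    target_logits = target_model(draft)

    # 3. Keep the longest prefix of drafted tokens the target agrees with.
    accepted = tokens
    for i in range(k):
        pos = tokens.shape[0] + i          # position of the i-th drafted token
        target_choice = target_logits[pos - 1].argmax()
        accepted = torch.cat([accepted, target_choice.view(1)])
        if target_choice != draft[pos]:
            break                          # disagreement: the target's token replaces the draft
    else:
        # All k drafts accepted: the target's final logits hand over one more token.
        accepted = torch.cat([accepted, target_logits[-1].argmax().view(1)])
    return accepted
```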
Why it works is the interesting bit. Token generation on a single GPU isn’t really bottlenecked by compute; it’s bottlenecked by memory bandwidth, the time it takes to drag billions of parameters from VRAM to the compute units. Verifying four tokens in parallel costs about the same as verifying one, because the silicon was waiting on memory anyway.
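You can see the shape of it with some crude arithmetic. Every figure below is an assumption picked to be plausible for an E4B-class model on a laptop GPU, not anything measured from the announcement or my machine:

```python
# Back-of-the-envelope decode cost for a memory-bound model. All figures are
# assumptions for illustration, not measurements.
params = 4e9           # assume roughly 4B parameters for an E4B-class model
bytes_per_param = 0.5  # assume 4-bit quantised weights
bandwidth = 250e9      # assume ~250 GB/s of laptop GPU memory bandwidth

weight_bytes = params * bytes_per_param
t_step = weight_bytes / bandwidth   # one decode step ~ one full read of the weights
print(f"~{t_step * 1e3:.0f} ms per step, ~{1 / t_step:.0f} tokens/s ceiling")
# Verifying four drafted tokens rides on the same weight read, so accepted
# tokens are close to free; compute barely moves the needle at this scale.
```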
I keep coming back to local models for reasons that aren’t profound. Latency on a model running on the metal next to me beats almost any cloud round-trip, and the rhythm of a response that arrives in the same second you stopped typing changes which questions you bother to ask. Running things locally means they keep working on a train, on bad hotel WiFi, in a café where the captive portal hates everyone. And there’s a small ergonomic pleasure in owning the whole stack: no quota dashboards, no API keys to rotate, no model deprecations announced via blog post.
What I’d actually try
Two ideas, both speculative.
The first is asking questions of a codebase. I do this in a small way today, mostly with open-source projects I’m picking through, and the experience is somewhere between useful and frustrating. The latency on E4B is just long enough that I sometimes alt-tab away while it’s still composing the answer, which entirely defeats the point. A 1.5x speed-up would tip the whole interaction across a quiet threshold from “novel toy” to “tool I’d reach for again tomorrow.”
The second is summarisation: RFCs, dense blog posts, the occasional academic paper I really should have read three months ago. Summarisation is forgiving on accuracy and sensitive to latency, which is the cleanest fit for a small local model. I’d like it fast enough to be the default first thing I do with any document over a couple of thousand words.
Both of those rest on the speed-up actually landing on my hardware, and that’s where I’d be cautious. The 3x figure in the announcement is a single-stream best case on a data-centre GPU with much more memory bandwidth than my laptop has, and the 2.2x on Apple Silicon is at batch sizes of four to eight, which isn’t the workload a single interactive prompt on a laptop produces. My honest expectation is somewhere between 1.4x and 1.7x for E4B on my hardware, on the kinds of prompts I actually feed it. Worth doing, but not worth getting breathless about.
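That expectation isn’t entirely a guess. The original speculative-decoding paper gives a formula for the expected number of tokens produced per verification pass, and plugging in modest assumptions, an acceptance rate around 0.5 to 0.6 and a drafter costing about a tenth of a target step, lands in the same range. Every number below is an assumption, not a benchmark:

```python
def expected_speedup(alpha, gamma=4, c=0.1):
    """Expected speculative-decoding speedup (formula from Leviathan et al.).

    alpha: assumed per-token probability the target accepts a drafted token
    gamma: tokens drafted per round
    c:     assumed drafter step cost relative to one target step
    """
    tokens_per_round = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    round_cost = 1 + gamma * c   # one target pass plus gamma drafter steps
    return tokens_per_round / round_cost

for alpha in (0.5, 0.6, 0.7, 0.8):
    print(f"acceptance {alpha:.1f}: ~{expected_speedup(alpha):.2f}x")
```

With those made-up inputs the curve runs from roughly 1.4x at an acceptance rate of 0.5 to about 2.4x at 0.8; the 3x headline needs the drafter to be right most of the time.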
The other thing I’m sceptical about is anything resembling an agent loop, where the model calls a tool, reads the output, and decides what to do next. Each round trip pays the latency cost in full, and on six gigabytes of VRAM I suspect we’re still some way off the throughput that makes those loops feel responsive rather than glacial.
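The arithmetic is unforgiving even if the speed-up lands. With some made-up but not unreasonable numbers for a small local model:

```python
# Rough agent-loop arithmetic with assumed numbers. Drafting only speeds up
# the decode half; the per-turn context processing still has to happen.
turns = 8          # assumed tool calls in one task
prefill_s = 2.0    # assumed time to process the growing context each turn
decode_s = 6.0     # assumed time to generate each tool call or reply

today = turns * (prefill_s + decode_s)
with_drafting = turns * (prefill_s + decode_s / 1.5)
print(f"{today:.0f}s per task today, {with_drafting:.0f}s with a 1.5x decode speed-up")
```

Forty-eight seconds a task instead of a minute is an improvement, but it still isn’t the kind of loop you sit and watch.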
The pattern I keep noticing with model releases is that the announcements I should pay attention to aren’t usually the ones with the largest numbers in the headline. They’re the ones that close the gap between “research result on a cluster” and “thing I can run on my laptop on a Tuesday morning.” Multi-token prediction for the edge variants of Gemma 4 sits in that category, which is why I’ll be trying it once the tooling settles.