Since I started posting about how token speed is the easiest optimisation and the one that maintains developer engagement (if that’s a thing you still want!), OpenAI has started offering Codex at 1,200 tokens/sec and Anthropic has added a /fast mode to Claude, where you pay extra to run Opus faster.
I’m really interested in AFK dev and am working on things in that space at the moment, and the effort going into autonomy is still very important. But I do wonder how we would feel about all the effort going into parallelisation and overnight runs if we had, say, 2,000 tokens/sec inference with our favourite models.
