We track performance vs. the all-in cost of completing real engineering tasks, rather than cost per token. [1]
Cost per token is a bit misleading because, as others have noted, different models use tokens in different ways. (Aside: this is also why TPS isn't a great metric.)
We found that 5.5 is about 1.5-2x more expensive overall. On a "Pareto" basis, we only find 5.5 xhigh worth it. At the lower reasoning levels, 5.4 still edges it out on cost/perf.
We take a spec-driven approach and mostly work in TS (on product development), so if you use a more steer-y approach, or work in a different domain, YMMV.
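The "all-in cost vs. cost per token" point can be sketched with a toy calculation. Everything below is hypothetical: the prices, token counts, and turn counts are invented for illustration, not measurements of any real model.

```python
# Hypothetical illustration: a model that is cheaper per token can still
# cost more per completed task if it burns more tokens across more turns.
# All prices and token counts below are invented for the example.

def task_cost(turns, in_price_per_mtok, out_price_per_mtok):
    """Sum the cost of every (input_tokens, output_tokens) turn in a task."""
    total = 0.0
    for input_tokens, output_tokens in turns:
        total += input_tokens / 1e6 * in_price_per_mtok
        total += output_tokens / 1e6 * out_price_per_mtok
    return total

# "Cheap" model: low per-token price, but chatty and needs 12 turns.
cheap_turns = [(20_000, 4_000)] * 12
# "Pricey" model: 2x the per-token price, terse, done in 4 turns.
pricey_turns = [(20_000, 1_500)] * 4

cheap = task_cost(cheap_turns, in_price_per_mtok=1.0, out_price_per_mtok=4.0)
pricey = task_cost(pricey_turns, in_price_per_mtok=2.0, out_price_per_mtok=8.0)

print(f"cheap-per-token model:  ${cheap:.3f} per task")   # $0.432
print(f"pricier-per-token model: ${pricey:.3f} per task")  # $0.208
```

With these made-up numbers the model that looks 2x more expensive on a per-token price sheet is roughly half the price per completed task, which is why a leaderboard indexed on all-in task cost can rank models differently than a price table.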
I feel very lost in these threads. A lot of people talk about getting bad results from gpt 5.5 xH or Opus 4.7 xH.
And here I am daily driving Sonnet 4.6 with medium or high thinking, and I'm actually thoroughly satisfied with the work it does. Perhaps it's because I give it bite-sized pieces of work, which fits my workflow better.
This doesn't seem to be controlling for the number of turns in any way. Am I missing something?
Stronger models needing fewer turns to achieve a task feels like a prime source of efficiency gains for agentic coding, more so than individual responses being shorter.
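The turn-count effect compounds quickly in agentic loops, because each turn typically re-sends the growing transcript as input, so total input tokens grow roughly quadratically with the number of turns. A toy sketch of this (all numbers invented, not measurements of any real model):

```python
# Toy model of an agentic loop: each turn re-sends the growing transcript
# as input, so total input tokens grow roughly quadratically with turns.
# All prices and token counts are invented for illustration.

def agentic_task_cost(num_turns, tokens_per_turn,
                      in_price_per_mtok, out_price_per_mtok):
    """Cost of a task where every turn re-sends all prior turns as context."""
    total = 0.0
    context = 0
    for _ in range(num_turns):
        context += tokens_per_turn                    # transcript grows each turn
        total += context / 1e6 * in_price_per_mtok    # re-send full context
        total += tokens_per_turn / 1e6 * out_price_per_mtok
    return total

# Weaker model: cheaper per token, but needs 20 turns to finish.
weak = agentic_task_cost(20, 3_000, in_price_per_mtok=1.0, out_price_per_mtok=4.0)
# Stronger model: 2x the price per token, done in 6 turns.
strong = agentic_task_cost(6, 3_000, in_price_per_mtok=2.0, out_price_per_mtok=8.0)

print(f"weak:   ${weak:.3f}")    # $0.870
print(f"strong: ${strong:.3f}")  # $0.270
```

Under these assumptions the model with double the per-token price comes out about 3x cheaper per task, which is the sense in which fewer turns can dominate shorter individual responses as a source of efficiency.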
We observed slightly smaller outputs over long-horizon agentic coding for GPT 5.5, along with a significant improvement in overall response scores. For one-shot coding responses, GPT 5.5 was actually more verbose than GPT 5.4, but again, the responses were significantly stronger. The expected cost increases reported by OpenRouter seem reasonably accurate (perhaps a bit optimistic), but in my opinion, highly worth it. GPT 5.5 has a pretty wide lead on the #2 model for understanding complex scenarios.
New model releases are now like new iPhones: mostly imperceptible improvements with a higher price tag. That's one of the major benefits of open source: you can "freeze" what model you're using. Often the model you know wins out over one that is different enough that you have to start from scratch with every major update. Most businesses value cost control and predictability over a cutting edge with limited evidence of profitable output outside of tech.
In terms of work done per dollar, new models from OpenAI and Anthropic are worse than the older ones. They are trying to squeeze their customers.
For personal use I switched to coding plans containing GLM 5.1, Kimi K2.6 and Xiaomi MiMo V2.5 Pro, and I've never been happier. I said goodbye to both Claude Max and Cursor.
It does seem like a step change in token efficiency, though based on the earlier Artificial Analysis reporting it's also quite the cost lottery, and I'm not sure I'm comfortable with that.
GPT-5.5 Price Increase: What It Costs
(openrouter.ai) | 206 points by gmays | 8 May 2026 | 65 comments
Comments
[1] https://voratiq.com/leaderboard?x=cost
[0]: https://aibenchy.com/compare/openai-gpt-5-4-medium/openai-gp...
Rankings at https://gertlabs.com/rankings?mode=agentic_coding. See the efficiency chart at the bottom.
That's got to be a very tricky analysis given how subjective quality is. But I'm sure there are people trying to pin it down.