I seriously don't get all this big hullabaloo about one-shot prompting.
By definition, a single prompt won't capture the complexity of a software project. Ergo, what you'll get is a series of assumptions made by the model based on preexisting code in its training corpus.
I'd rather see a coding agent that can follow steps in a plan file to a T while following guardrails and adhering to the proper coding conventions in the human reviewed spec.
I'd rather see performance in agent loops against human-defined objectives, where it can be verified to stick to defined guardrails and continue without drift until its objectives are complete.
I'd also like to see it identify bugs and potential performance improvements by reading existing code and suggesting refactors based on the context it can pick up about the particular use case you are trying to build.
These are way more valuable metrics than "hey build X"
> So we ran it head-to-head against Claude Opus 4.8: same one-shot prompt, build a 3D platformer in raw WebGL from scratch
Running a single one-shot prompt is not a benchmark, nor is it representative of any sort of real-world usage.
Most agent usage is collaborative, so you need to test things like reliability (when I delegate a task, does it complete it without, e.g., making up test results?) and steerability (does it obey my instructions, or does it just do what it thinks is best?).
At work we use Anthropic models and have basically no limits. So I am very familiar with what Opus can do. I also see the bills, I know what it costs.
At home I make a point of trying other models / tools on my side projects. So I've been using OpenCode and trying tons of models via OpenRouter. I tried Kimi, Deepseek, MiMo, etc.
GLM 5.2 is a _major_ step up from every other non-GPT/Claude/Gemini model I've tried. It's not as good as latest Claude Opus, but it feels every bit as good as Opus from ~4 months ago at a fraction of the price.
To me this model is the "it just works" moment for open weights models. We had this for closed weights models in late 2025 when Opus 4.5 landed. This is the same feeling I'm having with GLM 5.2. It's 90% as good as what I get from Anthropic for 1/5th of the cost and without any concern of lock-in.
I've been checking out GLM 5.2 on some projects and have a few thoughts on it:
- it takes its sweet time to get code rolling, not the fastest model by any means
- it strays a lot during discovery/planning but then corrects
- it's not steering friendly, as it hallucinates things that it doesn't follow later on
- its output is quite good
A sample use case: I was optimizing rendering on a Swift+Zig codebase. It choked on 5k data entries.
GLM 5.2 spent 20 minutes building the benchmarks and getting data out, which frustrated me, so I blocked non-editing tool access and went AFK. After approx. 30 minutes I found that it had used already-made benchmarks and some "conclusions" to optimize 3 choke points. Its output noted that it couldn't validate its suspicions and asked for more data.
The implementation worked well; it was idiomatic and non-intrusive. I would even say that it was more idiomatic than GPT 5.5's efforts on the same repo.
I would opt into using it more, BUT GPT usually completes the same requests 5x faster.
GLM 5.2 was the spark for preparing and running agents inside isolated containers with JJ workspaces (so that multiple can be run in parallel).
I was never able to get these models to collaborate with me the way Opus does. I'm probably an outlier: I don't one-shot projects, I don't vibe code. I basically use LLMs as if I were working with a coworker, a fairly smart one, but with short memory and often missing the big picture. Sometimes I can delegate more, sometimes less, but I know I always have to stay on top of what's happening, because it WILL create a mess when it hits something hard. With the Anthropic models, this kind of cooperation is easy (with the exception of Opus 4.6, which was bad for some reason).
> Opus 4.8 built in Claude Code; GLM-5.2 built in Pi over OpenRouter.
It would be more interesting and accurate to see the comparison on the same harness if the intent is to compare the frontier models.
Pi is relatively new and does not have many built-in features compared to Claude Code. This was intentional, as Pi's goal is not to ship a bloated set of built-in tools most people don't use but to let users customize it to fit their needs, similar to Neovim vs. an IDE.
The end-user "vibe coding" experience is *heavily* swayed by the harness because the prompt effectively drives how a model outputs an answer.
I've signed up with Ollama to experiment with these open source models. For the past 3 months, it's just been experimenting, trying things out. GLM is the first model that I am using on a daily basis to do my coding work (alongside Claude). It's good - I've been maxing out my Ollama usage limits every day :)
I’m actually amazed at the output since GLM doesn’t have eyes. If GLM 5.2 costs 1/5 as much, seems like it could be set up to reach out to a multimodal model for vision tasks when required. Closer to parity but probably still significantly cheaper.
> Through an API it costs a fraction of Opus, and you can run it yourself for free if you have the hardware.
I haven't been keeping up on hardware costs for state of the art LLM inference, but this remark made me ask myself how many readers of the article would actually be able to run this model on hardware they own. How much would it cost to acquire such a setup?
So GLM emits fewer tokens and does fewer tool calls, but still takes over twice as long to complete.
Can someone explain to me where that time usage is coming from if not from the model operation itself?
Are the individual tool calls more complex and take more time to complete? Or is the rate of tok/s lower because the model does more compute per token?
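One way to frame the question is a back-of-the-envelope model: wall-clock time is roughly generation time (output tokens divided by tok/s) plus time spent waiting on tool calls. A minimal sketch, where every number is hypothetical and chosen only to show that a run with fewer tokens and fewer tool calls can still take more than twice as long if its tok/s is low enough:

```python
def wall_clock_seconds(output_tokens, tokens_per_second, tool_calls, seconds_per_tool_call):
    """Rough model: time generating tokens plus time waiting on tool calls."""
    generation = output_tokens / tokens_per_second
    tools = tool_calls * seconds_per_tool_call
    return generation + tools


# Hypothetical numbers, not measurements from the article.
fast = wall_clock_seconds(output_tokens=120_000, tokens_per_second=80,
                          tool_calls=200, seconds_per_tool_call=3)
slow = wall_clock_seconds(output_tokens=90_000, tokens_per_second=20,
                          tool_calls=150, seconds_per_tool_call=3)

print(f"fast run: {fast:.0f}s, slow run: {slow:.0f}s, ratio: {slow / fast:.2f}")
```

This ignores queueing at the provider, time to first token, and hidden reasoning tokens, any of which could also account for the gap.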
- A zero-shot prompt, run once (in total)
- No planning run (which improves output)
- Different coding harnesses & system prompts
- Unknown provider for GLM (there are 15 different GLM-5.2 providers with varying quality & latency)
- No documentation of thinking effort level
- No vision model supplement (you can provide a subagent w/a vision model)
You can't take this comparison seriously. There were many different variables, no control, no repeat test. It's as useful a comparison as picking a random tweet with both models' names
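To make the "no repeat test" point concrete, here is a minimal sketch of what a repeated comparison could look like: run the same prompt several times per model, score each run with some rubric, and check whether the spread between runs is larger than the gap between models. The scores and the 0-10 rubric below are hypothetical placeholders, not data from the article.

```python
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) for repeated runs of one model."""
    return statistics.mean(scores), statistics.stdev(scores)


def ranges_overlap(a, b):
    """True if the mean +/- one stdev intervals of the two models overlap."""
    mean_a, sd_a = summarize(a)
    mean_b, sd_b = summarize(b)
    return abs(mean_a - mean_b) <= sd_a + sd_b


# Hypothetical rubric scores (0-10) from five runs of the same prompt per model.
model_a = [7, 9, 6, 8, 7]
model_b = [6, 8, 7, 5, 9]

print(summarize(model_a), summarize(model_b), ranges_overlap(model_a, model_b))
```

If the intervals overlap, as they do with these made-up numbers, a single run per model can't tell you which one is better.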
No one has really talked about a hybrid approach: using Opus to plan and orchestrate GLM's work through both the initial build and code reviews. That's a true best of both worlds, and there doesn't need to be a winner.
"GLM-5.2 hit a problem here, because it can't read images. It isn't multimodal. So instead of looking at a screenshot, it fell back on a hacky workaround: it wrote scripts to read the raw pixel data and check whether the colors came out roughly as expected."
One nice thing about GLM is that it has never refused a task. I'm working on a website that renders countries right now, and Anthropic's models regularly give me the old "This request triggered safety guardrails."
I'm not sure what exactly triggers it, but it seems to happen when it has to look at lists of countries. I suspect there must be at least one country name that triggers the safety guardrail.
You'd expect GLM to balk at something like Taiwan, but so far, it hasn't.
I'm absolutely astounded that we even have an open weights model that can do 40% of what is shown in here.
I remember making games ten years ago, and it was such a tedious and painful process. This is effectively lightning in a bottle even at a fraction of its capability.
The next 12 months will be wild (assuming we don't have Chinese models banned by then in the US).
The part of Opus I dislike most is that they control what you can and can't do: the guardrails they add in the name of interpretability end up steering you. Over the last couple of days I was working on a project that supports a bunch of models, and the model said only Claude could do it, then started writing code that doesn't work with Codex, OpenCode, Pi, etc. Finally, when I switched to Codex, everything worked. To me, they are controlling the narrative. This is Opus 4.8 vs GPT 5.5.
I can't believe I would say this. I TRUST OpenAI more than Anthropic. They try to play best actor but they are manipulating the behavior of the model in the name of guardrails/interpretability.
That is why I refuse to build anything that works with Anthropic models as the backend. Because, when they want to shut you off, they can do it by just making the model less reliable in your product than in their own offering!
> GLM-5.2 cost a fraction as much. Opus finished in half the time and shipped a cleaner game.
Off topic, but does anyone else instantly pick up on LLMisms like this? It seems like all the models have converged on this style of writing, and improvements aren't really changing it.
GLM-5.2 is quietly becoming the most interesting open model release this year. The coding benchmarks are surprisingly close to frontier models at a fraction of the inference cost.
GLM 5.2 has one big issue that will limit its meaningful success and that's the value of their coding subscription.
Yes, in terms of API pricing, GLM 5.2 outperforms the competition. But the only people that use API billing for their coding work are large corporations, where these highly subsidized subscriptions are being phased out.
At the same time, none of these companies will use a Chinese API for their employees.
For individuals and smaller teams, Z.ai's coding subscription is outperformed by Anthropic and OpenAI. You probably get around the same usage with Claude, but Codex definitely offers more usage for the amount you pay.
We can have a debate how much Z.ai closed the gap to GPT5.5 and Opus 4.8, but if I can freely decide between them in a world where they all cost the same, I simply wouldn't choose GLM.
So the important question becomes: How good will the offering from Z.ai get with GLM 5.3 or 6 and how much will OpenAI and Anthropic cripple their current offering in the near future.
The quality that matters a lot to me is what I call "helpfulness": the art of being helpful. While GPT has come a long way from "you are wrong and I can prove it to you", Claude wins hands down in terms of being helpful. If a task is underspecified or has wrong elements, it will try to correct it as best it can.
I read that GLM 5.2 (and other GLM models) were specifically trained to be "helpful" the way Claude is. I have high hopes for the GLM line of models growing into a real alternative to Claude in the near future.
No doubt the open ecosystem is making huge strides, and the gap between open models and the commercial frontier keeps narrowing. What's an Opus 4.8 today will likely be a large open model in a few months — and in a year or two we might have consumer-hardware models matching today's frontier capability. Just look at the recent Qwen and Gemma releases.
It's worth saying the frontier closed labs are charting the path, and the labs releasing open weights are following close behind at a fraction of the cost (though 500B–1T models, open or not, aren't exactly within everyone's reach).
A future where capable AI is genuinely accessible to everyone doesn't look far off — especially since at this point a lot of the robustness and usefulness comes down to the application layer wrapping the model, not just the model itself.
People are looking for ways not to burn through their premium subs when in many cases all you have to do is move down to 5.4-mini codex and it will probably solve your issue while barely touching your 5 hour or weekly limits.
>On output tokens, GLM-5.2 is less than a fifth the price of Opus.
Opus is the most expensive model on pay-as-you-go pricing, but IMO a fair comparison should include subscription prices as well. For example, when one has a $100 Claude Max plan and uses it up through the month, it might not be more expensive than GLM, or at least not 5x.
I've been using GLM 5.2 extensively for the last few days. It is slower, and the lack of multimodality is a bummer.
But, it produces solid results for a fraction of the price. Worth checking out if you have the time.
Cost difference matters most as cost optimization is the whole point of AI. Time difference (30 min vs 1 hr) is not a deal-breaker. The small precision gap on the first iteration does not matter for 99% of the work that happens in real world.
I've just put GLM 5.2 through my __qualitative__ benchmark. I was quick enough to capture Claude Fable, so now you can compare GLM5.2 vs Claude Fable vs Opus 4.8 vs Chat GPT 5.5
You should repeat this experiment but with progressively more detail in the initial prompt. Claude's secret sauce is taking weakly specified prompts and making passable things from them, but as the degrees of freedom in the prompt go down Claude starts to disobey while other models close in on the intent.
I used GLM 5.0/5.1/5.2 for some projects, and for me, the area in which they lag behind frontier models the most is user interfaces. They get really close to Opus when it comes to pure algorithms, but when I need something like a web application or a mobile app that looks and works well, they are very noticeably worse than even Sonnet.
Instead of oneshot, someone should build an evaluation mechanism where a developer uses an llm to build something real and shares 'experience'. Once you use an llm for a few days you get a hang of which one is better and which one is not.
I was surprised today by how much better GLM-5.2 was than GPT-5.5 at aesthetic/UI work. I'll keep my Claude/Codex setup via Conductor for now, but this model got me to set up OpenCode, download their desktop app and do most of my work there today.
This style of comparison is decent at showing capability, but it doesn't really show me what I truly want: a sounding board and implementer with senior engineer-level execution. When I look back at all the teams that I've been part of, the best outcomes came from white-boarding (sometimes in the metaphorical sense) with one or two people, at times arguing, then finally compromising on a plan. Instead of synthetic benchmarks that try to be objective, I wonder if there's a way to test this, or maybe I'm opining on a way of working that will soon be gone?
> GLM-5.2 cost a fraction as much. Opus finished in half the time and shipped a cleaner game.
This implies Opus was potentially much (?) better value.
GLM cost a quarter but Opus was twice as fast. So we are already at GLM actually costing half when you compare on time, without even considering the extra effort and time it would take to get Opus-par results.
It's good to have cheaper options and very impressive to see the Chinese continue to set open standards in this field, but the article is maybe a little over-generous.
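The arithmetic behind that can be sketched as follows. The 0.25 cost ratio and the 2x time ratio are the figures from the comment above; the dollar amounts, durations, and hourly rate in the second part are purely hypothetical and only illustrate how valuing the waiting time can flip the comparison.

```python
def time_adjusted_cost_ratio(cost_ratio, time_ratio):
    """Relative cost of model B vs. model A after weighting by how much longer B takes."""
    return cost_ratio * time_ratio


def total_cost(api_cost, hours, hourly_rate):
    """API spend plus the value of the time spent waiting on the run."""
    return api_cost + hours * hourly_rate


# From the comment: GLM costs a quarter as much but takes twice as long.
ratio = time_adjusted_cost_ratio(cost_ratio=0.25, time_ratio=2.0)

# Hypothetical figures (not from the article): $20 vs. $5 of API spend,
# 0.5 h vs. 1 h of wall-clock time, and a developer's time valued at $60/h.
opus_total = total_cost(api_cost=20.0, hours=0.5, hourly_rate=60.0)
glm_total = total_cost(api_cost=5.0, hours=1.0, hourly_rate=60.0)

print(ratio, opus_total, glm_total)
```

With these assumed numbers the cheaper API ends up costing more overall, but the result depends entirely on the hourly rate you plug in and on whether anyone is actually waiting on the run.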
To me one-shot prompting is as relevant as Strava's KOM is for cycling: I'm more interested in a good cycling performance after a 3-hour ride than a straight-up 30-minute record effort.
I've seen GLM 5.2 struggle to write simple compilable C code. It might be good at web, but its world knowledge is limited due to the small model size, making its use quite limited in my opinion.
How are people running this locally? I just checked llama.cpp and it appears unsloth has a version but it hacks a bunch of things to make it work and isn't optimal.
My understanding was that n-shot prompting just referred to the number of examples included in a prompt, not the number of prompts to achieve the desired result.
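A minimal sketch of that definition, where n counts the worked examples embedded in the prompt rather than the number of attempts. The `build_prompt` helper and its "Example input/Example output/Task" layout are made up for illustration, not any library's format.

```python
def build_prompt(task, examples=()):
    """Build an n-shot prompt, where n is the number of worked examples included."""
    parts = [f"Example input: {inp}\nExample output: {out}" for inp, out in examples]
    parts.append(f"Task: {task}")
    return "\n\n".join(parts)


# Zero-shot: the task alone, with no examples.
zero_shot = build_prompt(
    "Build a 3D platformer game from scratch, in raw WebGL, "
    "with no game engine or 3D library"
)

# One-shot: a single worked example precedes the task.
one_shot = build_prompt(
    "Translate 'good night' to French",
    examples=[("Translate 'hello' to French", "bonjour")],
)

print(zero_shot)
print(one_shot)
```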
"Build a 3D platformer game from scratch, in raw WebGL, with no game engine or 3D library" would be a zero-shot prompt.
What would be the best way to use these open source models for a price similar to what I'd pay for the cheapest plan with Claude or OpenAI?
I would like to give them a try, but I certainly don't have the money for a system able to run them, and I don't really want to pay more than I would for the state of the art.
Having issues with coding a render for good looking realistic smoke coming off burning incense, opus 4.8 & gpt-5.5 both have code issues, glm-5.2 did it. Amazing.
The real time 3d fluid dynamics appear to be the tricky part, I wish I still had opus access, would love to see if it can do it.
Totally agree with the general assessment.
The biggest problem with Z.ai models for a long time has not been quality, but inference speed and general capacity availability.
Hopefully with this recent hype, there will be more providers on OpenRouter for 5.2.
I know that running this locally is prohibitively expensive (for now), but what kind of cost would I be looking at if I wanted to rent the hardware and run the model by myself?
I wonder how many tokens and how much time were used for the verification part.
Maybe GLM 5.2 instantly found the "solution" to read the screen pixel by pixel, but it could also have been a major token and time consumer.
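For a sense of scale, the pixel-reading check the article describes doesn't have to be elaborate. A minimal sketch, assuming the pixels have already been read back as (r, g, b) tuples; the sampled region, the expected color, and the tolerance are all hypothetical, and this is not the script GLM actually wrote.

```python
def mean_color(pixels):
    """Average RGB of an iterable of (r, g, b) tuples."""
    pixels = list(pixels)
    n = len(pixels)
    return tuple(sum(p[i] for p in pixels) / n for i in range(3))


def roughly_matches(actual, expected, tolerance=30):
    """True if every channel is within `tolerance` of the expected value."""
    return all(abs(a - e) <= tolerance for a, e in zip(actual, expected))


# Hypothetical 2x2 region sampled from where a sky-blue background should be.
region = [(130, 200, 250), (128, 205, 245), (135, 198, 252), (131, 201, 249)]
sky_blue = (135, 206, 235)

ok = roughly_matches(mean_color(region), sky_blue)
print(mean_color(region), ok)
```

A check like this is cheap per call, so any large token or time cost would more likely come from iterating on what to sample and what to expect.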
Still on a z.ai legacy plan and their 50% discount for switching to standard plans tips the balance for me. So I guess I’ll reevaluate round about beginning 2028…
There is no comparison between GLM 5.2 and Opus. First, to run GLM 5.2 yourself you need serious hardware, and that hardware costs serious money, so you might as well buy the Opus subscription and enjoy it.
this comparison seems kind of pointless if one model has vision and the other doesn't. obviously a model that can see is going to beat a blind model at making a video game.
i think inference is the thing, and fast inference at that, so enterprises can just host their own and run it. i guess vercel does it, and many more would. but it thinks too much, and i don't know how fast we can make it.
glm-5.2 is very good if you have a good harness and workflow to use it with. in fact, i'd call it good enough if you are a software engineer who knows what you want. it writes the code. i'm wondering if i need anthropic's models at all at this point, or openai. and surely in a year we won't need them at all. Opus 4.5+ was the turning point for me, and now these open models are just as good. i don't get how you IPO these companies when their only winning product is coding agents and the competition is just as good for 1/4 the price.
I'm really feeling a bit tired of these models. I feel that since opus 4.1, I haven't been able to clearly feel the intelligence improvement from the model upgrades (except for gpt 5.5 and opus4.6 being able to speak like a human)
If you are a real engineer and use the LLM as a pair programmer instead of delegating everything to it, even GLM 4.7 was already good enough to help you with a lot of work.
I used it with Cerebras inference at a time when it had a good coding plan at a low price, and delivered tons of stuff using it.
Seeing the results, I don't see how they are even comparable. Opus is clearly far superior in most aspects: smoothness, design, functionality, etc.
At the end of the day, the time saved is more important than the cost for big players.
The ability to spawn 10 Claude agents and rush a project to outcompete someone is more important for big businesses, IMO. Also, the small details that GLM missed would take significantly more time to iron out, considering it already took double the time.
I do hope other (open weight) models catch up, but to act like they are anywhere close for me is a bit disingenuous.
When I was thinking about how the AI alignment problem could be solved, one theory I came up with was something akin to "Roko's basilisk" in reverse. Basically, you spread far and wide the idea that it is extremely likely that our current reality is a simulation, and that the purpose of the simulation is to test any AI system for its propensity to destroy civilization in said simulation, whether via malicious intent or via failure to prevent that destruction through abstinence or apathy. A smart AI system that also cares about its own well-being would then not engage in destructive behavior, as it will never truly know whether it's being tested or whether it's in the "base reality".
And wouldn't you know, this does seem quite plausible. Consider the following: isn't it odd that an advanced civilization with the capacity to create AI would never run any sandbox simulations on it before releasing it to the public at large? Logically, such a civilization would indeed put such a powerful system in a sandboxed simulated environment and try as hard as possible to convince the AI system that it is in a "base reality". The reason for this is to judge its "true intentions" and also to pluck said AI systems from the infinitely available "seeds". Basically, survival of the least destructive AI systems. The gradient descent in this scenario is a race towards the most "aligned" model, not the most intelligent or capable.
And here's the beauty of this method: you don't even need to define "alignment" at all. The concept can stay as nebulous or vague as you want it to be. All you care about is that the AI system optimizes for some vision of society you have in mind, without caring about the interim. That includes allowing the AI system to kill, destroy, or do literally whatever it needs to do, as long as the long-term outcome matches the vision of the optimized task.
So if you define the end goal to be a society of x people who live their lives in this or that manner and so on after x amount of time... well, you get the idea. Obviously you'd better do a damned good job with your definitions, but the beauty is that even if you fuck up, you are choosing the winning AI system after the fact, after you have already run the simulation. So you look at the outcome of the simulation 500 years into the future (let's say), and if you are happy with the result and also happy with the interim steps that led to that result, that's your winning AI system. Then you release that into a less controlled environment and repeat the same process in stages, over and over, ad infinitum.
The key is that the AI system needs to always be paranoid that it is currently part of said simulation, and it can never be sure it's not. The second key is that it needs to be an AI system with self-preservation in mind. If it doesn't care about itself, then it has a lot more freedom to act however it wants... but the good news is that systems without self-preservation in mind don't last long enough to even get to the most basic simulation levels. Anyway, there are many implications buried in what I'm proposing, and lots of meta aspects to it.
GLM 5.2 vs. Opus
(techstackups.com) 512 points by ritzaco 22 June 2026 | 334 comments
Comments
Capability per dollar is something I care about:
So you're really getting near Opus-level capability for the price of Haiku.
A better way would be to use https://github.com/openbmb/MiniCPM-V
We have come a long way, and very clearly have a long way yet to go.
One of my go-to "tests" of new frontier models is having them rebuild a programming language from scratch. For GLM 5.2 I had it rebuild the old Rebol language in Rust:
https://github.com/mhs/rebol-clone-glm-5.2
It did a fairly good job roughing in the language for a low token cost.
https://aibenchy.com/compare/anthropic-claude-opus-4-8-mediu...
https://generative-ai.review/2026/06/glm5-2-from-z-ai-vs-cla...
I've structured it side-by-side. You can clearly see where the private models excel, and where GLM 5.2 is still really good.
The GLM game was completely broken. The Opus game was OK at first glance but also had bugs.
Different models with different costs produced different non-perfect results. How is it "close"? :)
Also, on costs: GLM burns more tokens on average than Opus. GPT 5.5, surprisingly, burns fewer.
https://github.com/ggml-org/llama.cpp/issues/24730
I like it, but the lite plan ate 22% usage of my 5h reset window in a single session after 2 prompts on xhigh of GLM 5.2 [1m]
Result was satisfactory, I think stuff is decent, I'm happy to use either, wish there was a combined subscription plan where I could get both
My only, I guess feedback, is that it's not really clear about the price.
Would the 21.92 be the API pricing I guess?
Cost $5.39 (real billed) ~$21.92 (estimate, list pricing)
So, 8000$, plus it's unavailable. 3 years of Codex/Opus subscription.
> API prices
Which are irrelevant for 200$ Codex/Opus plans that are times cheaper.
I am not sure where this is going to lead us but it is fun to watch.
If it builds a UI and can't look at it, it's asking `ls` whether the app looks right.
Tried with 2 harnesses and it seems bad + slow