GPT-5.3-Codex

(openai.com)

Comments

Rperry2174 5 February 2026
What's interesting to me is that gpt-5.3 and opus-4.6 are diverging philosophically, and in the same way that actual engineers and orgs have diverged philosophically

With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, course-correct as it works.

With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.

that feels like a reflection of a real split in how people think llm-based coding should work...

some want tight human-in-the-loop control and others want to delegate whole chunks of work and review the result

Interested to see whether models eventually optimize for those two philosophies, and for the 3rd, 4th, and 5th philosophies that will emerge in the coming years.

Maybe it will be less about benchmarks and more about different ideas of what working-with-ai means

granzymes 5 February 2026
I think Anthropic rushed out the release before 10am this morning to avoid having to put in comparisons to GPT-5.3-codex!

The new Opus 4.6 scores 65.4 on Terminal-Bench 2.0, up from GPT-5.2-Codex's 64.7.

GPT-5.3-codex scores 77.3.

xiphias2 5 February 2026
"GPT‑5.3-Codex is the first model we classify as High capability for cybersecurity-related tasks under our Preparedness Framework, and the first we've directly trained to identify software vulnerabilities. While we don't have definitive evidence it can automate cyber attacks end-to-end, we're taking a precautionary approach and deploying our most comprehensive cybersecurity safety stack to date. Our mitigations include safety training, automated monitoring, trusted access for advanced capabilities, and enforcement pipelines including threat intelligence."

While I love Codex and believe it's an amazing tool, I believe their preparedness framework is out of date. As it becomes more and more capable of vibe coding complex apps, it's getting clear that the main security issues will come from more and more security-critical software being vibe coded.

It's great to look at how well Codex can be used against software written by humans, but it's getting more important to measure the opposite: how well humans (or their own software) are able to infiltrate complex systems written mostly by Codex, and to get better on that scale.

In simpler terms: Codex should write secure software by default.

itay-maman 5 February 2026
Something that caught my eye from the announcement:

> GPT‑5.3‑Codex is our first model that was instrumental in creating itself. The Codex team used early versions to debug its own training

I'm happy to see the Codex team moving to this kind of dogfooding. I think this was critical for Claude Code to achieve its momentum.

minimaxir 5 February 2026
I remember when AI labs coordinated so they didn't push major announcements on the same day to avoid cannibalizing each other. Now we have AI labs pushing major announcements within 30 minutes.
SunshineTheCat 5 February 2026
I've always been fascinated to see significantly more people talking about using Claude than I see people talking about Codex.

I know that's anecdotal, but it just seems Claude is often the default.

I'm sure there are key differences in how they handle coding tasks and maybe Claude is even a little better in some areas.

However, the note I see the most from Claude users is running out of usage.

Coding differences aside, this would be the biggest factor for me using one over the other. After several months on Codex's $20/mo. plan (and some pretty significant usage days), I have only come close to my usage limit once (never fully exceeded it).

That (at least to me) seems to be a much bigger deal than coding nuances.

bgirard 5 February 2026
> Using the develop web game skill and preselected, generic follow-up prompts like "fix the bug" or "improve the game", GPT‑5.3-Codex iterated on the games autonomously over millions of tokens.

I wish they would share the full conversation, token counts, and more. I'd like a better sense of how they normalize these comparisons across versions. Is this a 3-prompt, 10M-token game? A 30-prompt, 100M-token game? Are both models using similar prompts/token counts?

I vibe coded a small factorio web clone [1] that got pretty far using the models from last summer. I'd love to compare against this.

[1] https://factory-gpt.vercel.app/

tosh 5 February 2026
Terminal Bench 2.0

  | Name                | Score |
  |---------------------|-------|
  | OpenAI Codex 5.3    | 77.3  |
  | Anthropic Opus 4.6  | 65.4  |
nananana9 5 February 2026
I've been hearing about the insane 100x productivity gains you're all getting with AI, and how "this new crazy model is a real game changer", for a few years now. I think it's about time I asked:

Can you guys point me to a single useful, majority LLM-written, preferably reliable program that solves a non-trivial problem that hasn't already been solved a bunch of times in publicly available code?

RivieraKid 5 February 2026
Do software engineers here feel threatened by this? I certainly am. I'm surprised that this topic is almost entirely missing in these threads.
nickandbro 15 hours ago
I have found GPT 5.3-Codex to do exceedingly well when working with graphics rendering pipelines. They must have better training data or RL approaches than Anthropic: I gave the same prompt and config to Opus 4.6, and it added unwanted rendering artifacts. This may just be an issue specific to my use case, but since OpenAI is partners with MSFT, which makes lots of games, I wonder if this is an area they invested in heavily.
jstummbillig 17 hours ago
It's so interesting that I'm starting to feel a change that is developing as a separate thing from capability. Previously, yeah sure, things changed, but models got so outrageously better at the basic things that I simply wouldn't care.

Now... increasingly it's like a partner changing ever so slightly. I can feel that something is different, and it gives me pause. That's probably not a sign of the improvements diminishing; maybe more so of my capability to appreciate them.

I can see how one might get from here to the whole people being upset about 4o thing.

trilogic 5 February 2026
When two multi-billion-dollar giants advertise on the same day, it is not competition but rather a sign of struggle and survival. With all the power of the "best artificial intelligence" at your disposal, plus a lot of capital and all the brilliant minds, THIS IS WHAT YOU COULD COME UP WITH?

Interesting

tombert 5 February 2026
Actually kind of excited for this. I've been using 5.2 for a while now, and it's already pretty impressive if you set the context window to "high".

Something I have been experimenting with is AI-assisted proofs. Right now I've been playing with TLAPS to help write some more comprehensive correctness proofs for a thing I've been building, and 5.2 didn't seem quite up to it; I was able to figure out proofs on my own a bit better than it was, even when I would tell it to keep trying until it got it right.

I'm excited to see if 5.3 fares a bit better; if I can get mechanized proofs working, then Fields Medal here I come!

arjun810 14 hours ago
Our results on our Rails app surprised us. Codex 5.3 was far and away the best: much faster and cheaper (though cost isn't relevant yet, since you can only access this model via a ChatGPT plan).

https://x.com/sergeykarayev/status/2019541031986032925?s=46

ksynwa 18 hours ago
Why does OpenAI have a separate model for coding (Codex) but Anthropic uses the same model for chatbots and coding?
morleytj 5 February 2026
The behind the scenes on deciding when to release these models has got to be pretty insanely stressful if they're coming out within 30 minutes-ish of each other.
dllrr 5 February 2026
Using opus 4.6 in claude code right now. It's taking about 5x longer to think things through, if not more.
netdevphoenix 17 hours ago
How come OpenAI and Anthropic both released their models at pretty much the same time? Does anyone know if the timing is coincidental?
modeless 5 February 2026
It's so difficult to compare these models because they're not running the same set of evals. I think literally the only eval variant that was reported for both Opus 4.6 and GPT-5.3-Codex is Terminal-Bench 2.0, with Opus 4.6 at 65.4% and GPT-5.3-Codex at 77.3%. None of the other evals were identical, so the numbers for them are not comparable.
energy123 6 February 2026
First impression: It's much faster for the same task.

When they hook it up to Cerebras it's going to be a head-exploding moment.

gallerdude 5 February 2026
Both Opus 4.6 and GPT-5.3 one-shotted a Game Boy emulator for me. Guess I need a better benchmark.
kingstnap 5 February 2026
> GPT‑5.3-Codex was co-designed for, trained with, and served on NVIDIA GB200 NVL72 systems. We are grateful to NVIDIA for their partnership.

This is hilarious lol

ffitch 5 February 2026
> our team was blown away by how much Codex was able to accelerate its own development

they forgot to add “Can’t wait to see what you do with it”

textlapse 5 February 2026
I would love to see a nutrition-facts label on how many prompts, what % of the code, and what ratio of human involvement it took to use the models to develop their latest models, for the various parts of their systems.
karmasimida 5 February 2026
For those who care:

GPT-5.3-Codex dominates terminal coding with a roughly 12% lead (Terminal-Bench 2.0), while Opus 4.6 retains an 8% edge in general computer use (OSWorld).

Anyone know the difference between OSWorld and OSWorld Verified?

ponyous 5 February 2026
I think models are smart enough for most of the stuff; these little incremental changes barely matter now. What I want is a model that is fast.
cheriot 23 hours ago
Page me when codex can run the right version of node. Are we all changing the system node version to match the current project again?

  [shell_environment_policy]
  inherit = "all"
  experimental_use_profile = true

  [shell_environment_policy.set]
  NVM_DIR = "[redacted]"
  PATH = "[redacted]"

Robin_f 5 February 2026
Anthropic's main advantage was speed. With the 25% speed increase in Codex 5.3, it feels like they are now losing that advantage as well.
DrBazza 14 hours ago
It took several decades to get the Language Server Protocol and the Debug Adapter Protocol. Is there a common 'agent' protocol yet? Or are these companies still in the walled-garden phase?
dawidg81 5 February 2026
May AI not write the code for me.

May I at least understand what it has "written". AI help is good, but it shouldn't replace real programmers completely. I've had enough of copy-pasting code I don't understand. What if one day AI falls down and there are no real programmers left to write the software? AI as a helper is good, but I don't want AI to write whole files into my project. Then something may break and I won't know what's broken. I've experienced it many times already: I told the AI to write something for me, and the code didn't work at all. It compiled normally, but the program was bugged. Or when I was building a bigger project with ChatGPT only, it mostly worked, but over time, as I prompted more and more things, everything broke.

exabrial 6 February 2026
After using Anthropic's products, I think it's going to be difficult to go back to OpenAI. It feels more like a discussion with a peer; ChatGPT has always felt like arguing with an idiot on Reddit.
aavci 16 hours ago
To anyone trying this: does it unlock anything you tried with past LLM models and failed at, and can now try again? Do you find it an incremental improvement or something that opens up new opportunities?
imasliev 5 February 2026
GPT-5.2-Codex was so cool in terms of price/value; I hope 5.3 won't ruin the race with Claude.
mmaunder 5 February 2026
Take a screenshot of the ARC-AGI-2 leaderboard now, because GPT-5.3-Codex isn't up there yet and I suspect it'll cram down Claude Opus 4.6, which rules the roost for the next few hours. King for a day.
farazbabar 6 February 2026
I have wanted to hold back from answering comments that ask for proof of real work/productivity gains, because everyone works differently, has different skill levels, and frankly not everyone is working on world-changing stuff. I really liked a comment someone made a few of these posts ago: these models are amazing! amazing! if you don't actually need them, but if you actually do need them, you are going to find yourself in a world of hurt. I cannot agree more.

I (believe) I am a good software engineer. I have developed some interesting pieces of software over the decades, and usually when I got passionate about a project I could do really interesting things within weeks, sometimes months. I will say this: I am working on some really cool stuff, stuff I cannot tell you about, or else. And my velocity is now days for what used to take months, and hours for what used to take weeks. I still review everything. I understand all the gotchas of distributed systems, performance, latency/throughput, C, Java, SQL, data and infra costs. I get all of it, so I am able to catch these mofos when they are about to stab me in the back, but man! My productivity is through the roof. And I am loving it.

Just so I can stop saying I cannot tell you what I am working on, I will start something that I can share soon (as soon as decades of pent-up work is done; it's probably less than a few months away!). Take it with a grain of salt, and know this: these things are not your friends. They WILL stab you in the back when you least expect it, cut a corner, take a shortcut, so you have to be the PHB (Dilbert reference!) with actual experience to catch them slacking. Good luck.
prng2021 5 February 2026
Did they post the knowledge cutoff date somewhere?
koolala 5 February 2026
I want to recompile a Rust project to be f32 instead of f64.

Am I better off buying 1 month of Codex, Claude, or Antigravity?

I want to have the agent continuously recompile and fix compile errors in a loop until all the bugs from switching to f32 are gone.
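
(A minimal sketch of one way to make that switch mechanical before handing it to an agent: centralize the float width behind an alias so the compiler enumerates every leftover f64 assumption. The Float alias, the f64 Cargo feature, and lerp here are illustrative, not from any particular project:)

  // Declare the feature in Cargo.toml:  [features]  f64 = []
  // The default build uses f32; `cargo build --features f64` restores f64.
  #[cfg(feature = "f64")]
  pub type Float = f64;
  #[cfg(not(feature = "f64"))]
  pub type Float = f32;

  // Code written against the alias compiles under either width; any
  // explicit f64 literal or cast left elsewhere becomes exactly the
  // compile error the recompile-and-fix loop would chase down.
  pub fn lerp(a: Float, b: Float, t: Float) -> Float {
      a + (b - a) * t
  }

Running cargo check in a loop (rather than a full build) keeps that error feed fast for the agent.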

binsquare 5 February 2026
On the first try it solved a problem that 5.2 previously couldn't.

Seems to be slower/thinks longer.

EZ-E 6 February 2026
Can someone explain the difference between this and VS Code agent chat, besides the fact that it's a separate app?
aavci 16 hours ago
“our team was blown away by how much Codex was able to accelerate its own development.”

At what point will LLMs autonomously create new versions of themselves?

gwd 5 February 2026
gpt-5.3-codex isn't available on the API yet. From TFA:

> We are working to safely enable API access soon.

jdthedisciple 5 February 2026
Gotta love how the game demo's page title is "threejs" – I guess the point was to demo its vibe-coding abilities anyway, but yeah..
sidgarimella 5 February 2026
Many are saying Codex is more interactive, but ironically I think that very interactivity/determinism works best when using Codex remotely as a cloud agent, in highly async cases. Conversely, I find Opus great locally, where I can ram messages into it to leverage its autonomy best (and interrupt/clean up).
tyfon 5 February 2026
I'm having a hard time parsing the openai website.

Anyone know if it is possible to use this model with opencode on the Plus subscription?

__mharrison__ 5 February 2026
I never really used Codex (found it too slow), just 5.2, which is going to be an excellent model for my work. This looks like another step up.

This week, I'm all local though, playing with opencode and running qwen3 coder next on my little spark machine. With the way these local models are progressing, I might move all my llm work locally.

GenerWork 5 February 2026
I find it very, very interesting how they demoed visuals in the form of the “soft SaaS” website and mentioned how it can do user research. Codex has usually lagged behind Claude and Gemini when it comes to UX, so I’m curious to see if 5.3 will take the lead in real world use. Perhaps it’ll be available in Figma Make now?
kingstnap 5 February 2026
That was fast!

I really do wonder what the chain was here. Did Sam see the Opus announcement and DM someone a minute later?

foft 5 February 2026
Having used Codex a fair bit, I find it really struggles with … almost anything. However, using the equivalent ChatGPT model is fantastic. I guess it's a matter of focus, and of being provided a smaller set of code to tackle.
rustyhancock 5 February 2026
Anyone remember the dot-com era when you would see one provider claim the most miles of fibre and then later that week another would have the title?
ecshafer 5 February 2026
Funny that this and Opus 4.6 were released within minutes of each other. Each showing similar score improvements. Each claiming to be revolutionary.
jpau 5 February 2026
Interesting that this was released without a prior GPT-5.3 release. I wonder if that means we won't see a GPT-5.3?
vatsachak 5 February 2026
AI-designed websites are so easy to spot that I have to actively design my UI so that it doesn't look AI-generated.
synergy20 5 February 2026
I like the Opus 4.6 announcement a lot more: concise and to the point. The 5.3 Codex one is a long post, but still, the most important info, the context window, is nowhere to be found. Thus, I'm going to keep using Opus.
davidmurdoch 5 February 2026
I've been using 5.2 the way they're describing the new use case for 5.3 this whole time.
maheshrijal 5 February 2026
It seems Fast!
edem 5 February 2026
So can I use this from Opencode? Because Anthropic started enforcing their TOS to kill the Opencode integration.
PieUser 5 February 2026
How'd they both release at the same time? Insiders?
simianwords 5 February 2026
Any notes on pricing?
kopollo 5 February 2026
Where is the google?
bryanhogan 5 February 2026
The most important question: Can it do Svelte now?
bg24 5 February 2026
I am on a Max subscription for Claude, and hate the fact that OpenAI has not figured out that $20 => $200 is a big jump. Good luck to them. In terms of the model: just last night, Codex 5.2 solved a problem for me on which other models were going round and round, with almost the same instructions. That said, I still plan to stay on the $100 Claude plan (overall value across many tasks, ability to create docs, co-work), and may bump up my OpenAI subscription to the next tier should they decide to introduce one. Not going to $200, even with 5.3, unless my company pays for it.
virtualzx 5 February 2026
It's so fun that the two releases used almost completely non-overlapping benchmarks!
jiggawatts 5 February 2026
I think this announcement says a lot about OpenAI and their relationship to partners like Microsoft and NVIDIA, not to mention the attitude of their leadership team.

On Microsoft Foundry I can see the new Opus 4.6 model right now, but GPT-5.3 is nowhere to be seen.

I have a pre-paid account directly with OpenAI that has credits, but if I use that key with the Codex CLI, it can't access 5.3 either.

The press release very prominently includes this quote: "GPT‑5.3-Codex was co-designed for, trained with, and served on NVIDIA GB200 NVL72 systems. We are grateful to NVIDIA for their partnership."

Sounds like OpenAI's ties with their vendors are fraying while at the same time they're struggling to execute on the basics like "make our own models available to our own coding agents", let alone via third-party portals like Microsoft Foundry.

petetnt 5 February 2026
Whoa, I think GPT-5.2-Codex was a disappointment, but GPT-5.3-Codex is definitely the future!
roya51788 5 February 2026
what are the benchmarks against opus 4.6?
mrcwinn 5 February 2026
According to Sam Altman, Anthropic is for "rich people." Judging by his $4 million man-baby Koenigsegg, he must be a huge Claude Code user!
drcongo 5 February 2026
Does it insert adverts in your code?
hubraumhugo 5 February 2026
Anybody else not seeing it available in Codex app or CLI yet (with Plus)?
heraldgeezer 5 February 2026
Anthropic and GPT: 2 new models at once?
wahnfrieden 5 February 2026
Pelican seems much worse than the Opus 4.6 one (though the bicycle is more accurate):

https://gist.github.com/simonw/a6806ce41b4c721e240a4548ecdbe...

OutOfHere 5 February 2026
It is absurd to release 5.3-Codex before first releasing 5.3.

Also, there is no reason for OpenAI and Anthropic to be trying to one-up each other's releases on the same day. It is hell for the reader.

nubg 5 February 2026
lmao so cringe that they delay releasing the model until anthropic does
raincole 5 February 2026
Almost like Anthropic and OpenAI are trying to front run each other
I_am_tiberius 5 February 2026
I'd like to know whether, and to what extent, customer prompts are illegally used for training.
shibeprime 5 February 2026
I know we just got a reset and a 2× bump with the native app release, but shipping 5.3 with no reset feels mismatched. If I’d known this was coming, I wouldn’t have used up the quota on the previous model.
maxpert 5 February 2026
Is it just me, or is Sam being the absolute sore loser he is and trying to steal Opus's thunder?