Some engineers on my team at Assembled and I have been a part of the alpha test of Codex, and I'll say it's been quite impressive.
We’ve long used local agents like Cursor and Claude Code, so we didn’t expect too much. But Codex shines in a few areas:
Parallel task execution: You can batch dozens of small edits (refactors, tests, boilerplate) and run them concurrently without context juggling. It's super nice to run a bunch of tasks at the same time (something that's really hard to do in Cursor, Cline, etc.)
It kind of feels like a junior engineer on steroids: you just point it at a file or function, specify the change, and it scaffolds out most of a PR. You still need to do a lot of work to get it production-ready, but it's as if you now have an infinite number of junior engineers at your disposal, all working on different things.
Model quality is good, but it's hard to say it's that much better than other models. In side-by-side tests with Cursor + Gemini 2.5 Pro, naming, style, and logic are relatively indistinguishable, so quality meets our bar but doesn't yet exceed it.
In the preview video, I appreciated Katy Shi's comment: "I think this is a reflection of where engineering work has moved over the past where a lot of my time now is spent reviewing code rather than writing it."
Preview video from OpenAI: https://www.youtube.com/watch?v=hhdpnbfH6NU&t=878s
As I think about what "AI-native" software development, or just the future of building software, looks like, it's interesting to me that, right now, developers are still just reading code and tests rather than looking at simulations.
While a new(ish) concept for software development, simulations could provide a wider range of outcomes and, especially for the front end, are far easier to evaluate than code/tests alone. I'm biased because this is something I've been exploring, but it really hit me over the head looking at the Codex launch materials.
[I'm one of the co-creators of SWE-bench] The team managed to improve on the already very strong o3 results on SWE-bench, but it's interesting that we're just seeing an improvement of a few percentage points. I wonder if getting to 85% from 75% on Verified is going to take as long as it took to get from 20% to 75%.
They mentioned "microVM" in the live stream. Notably there's no browser or internet access. It makes sense: running specialized microVMs (Firecracker, Unikraft, etc.) is way faster and cheaper, so you can scale it up. But there will be a big jump in technical scalability difficulty from this to "agents with their own computers". ChatGPT Operator already has a browser, so they definitely can do this, but I imagine the demand is orders of magnitude different.
There must be room for a Modal/Cloudflare/etc infrastructure company that focuses only on providing full-fledged computer environments specifically for AI with forking/snapshotting (pause/resume), screen access, human-in-the-loop support, and so forth, and it would be very lucrative. We have browser-use, etc, but they don't (yet) capture the whole flow.
I'm sorry if I'm being silly, but I have paid for the Pro version ($200 a month), and every time I click on Try Codex, it takes me to a pricing page with the "Team Plan": https://chatgpt.com/codex#pricing.
Is this still rolling out? I don't need the Team plan too, do I?
I have been using OpenAI products for years now and I am keen to try, but I have no idea what I am doing wrong.
I'm not sure what's wrong with me, but I just wasted several hours wrestling with Codex to make it behave.
Here's my workflow that keeps failing:
- it writes some code. It looks good at first glance
- I push it to GitHub
- automated tests on GitHub show that there's a problem
- I go back to Codex and ask it to fix it
- it does stuff. It looks good again.
Now what do I do? If I ask it to push again to GitHub, it will often create a pull request that doesn't include the changes from the first pull request, but it's not a pull request that stacks on top of the previous one; it's a pull request that stacks on top of main.
When asked to write something that called out to gpt-4.1-mini, it used the deprecated openai.ChatCompletion.create (!?!!?)
I just found myself using Claude to fix Codex's mistakes.
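For context on why that ChatCompletion call stands out: the 1.x OpenAI Python SDK makes the same request through a client object instead. A minimal sketch, assuming the standard OPENAI_API_KEY environment variable and a placeholder prompt:

    from openai import OpenAI

    # openai.ChatCompletion.create is the legacy pre-1.0 interface; the
    # 1.x SDK routes the same call through a client object.
    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    resp = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": "Say hello."}],  # placeholder prompt
    )
    print(resp.choices[0].message.content)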
I used to work for a bank, and the legal team used to ping us to make tiny changes to the app for compliance-related issues. Now they can fix those themselves. I think they'd be very proud and happy.
"23 SWE-Bench Verified samples that were not runnable on our internal infrastructure were excluded."
What does that mean? Surely this should have a bit more elaboration. If you're just excluding a double-digit number of benchmark tasks as uncompleted, that should be reflected in the scores; SWE-bench Verified has 500 tasks, so dropping 23 removes nearly 5% of the set.
Is the point of this to actually assign tasks to an AI to complete end to end? Every task I do with AI requires at least some bit of hand-holding, sometimes reprompting, etc. So I don't see why I would want to run tasks in parallel; I don't think it would increase throughput. Curious if others have better experiences with this.
Reading these threads, it's clear to me people are so cooked they no longer understand (or perhaps never did) how the simple process of sharing, building, and merging source code among multiple editors has ever worked.
What about using it for AI / developing models that compete with our new overlords?
Seems like using this is just asking to get rug-pulled for competing with them when they release something that competes with your thing. Am I just an old who's crowing about nothing? Is it OK for them to tell us we own outputs we can't use to compete with them?
So it's looking like it only runs in the cloud; that is, it will push commits to my remote repo before I have a chance to see if it works?
When I'm using aider, after it makes a commit I immediately run git reset HEAD^ and then git diff (actually I use the GitHub Desktop client to see the diff) to evaluate exactly what it did and whether I like it. Then I usually make some adjustments, and only after that do I commit and push.
Has anyone else been able to get "secrets" to work?
They seem to be injected fine in the "environment setup" but don't seem to be injected when running tasks against the environment. This consistently repros even if I delete and re-create the environment and archive and resubmit the task.
Is this the same idea as when we switched to multicore machines? Has the rate of change in the capabilities of a single agent slowed enough that the only way for OpenAI to appear to be making decent progress is to have many?
Maddening: "codex" is also the name of their open-source Claude-Code-alike, and was previously the name of an at-the-time frontier coding model. It's like they name things just to fuck with us.
When it runs the code, I assume it does so via a Docker container; does anyone know how it is configured, assuming the user hasn't specified an AGENTS.md file or a Dockerfile in the repo? Does it generate it via LLM based on the repo and what it thinks is needed? Does it use static analysis (package.json, requirements.txt, etc.)? Do they just have a super-generic Dockerfile that can handle most envs? A combination of different things?
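To make the static-analysis guess concrete, here's a toy sketch of what manifest sniffing could look like; the filenames and image mappings are made up for illustration, and nothing here is based on how Codex actually builds its environments:

    from pathlib import Path

    # Purely speculative: map well-known manifest files to a base image.
    # The image tags are illustrative guesses, not anything Codex documents.
    MANIFEST_TO_IMAGE = {
        "package.json": "node:20",
        "requirements.txt": "python:3.12",
        "go.mod": "golang:1.22",
        "Cargo.toml": "rust:1.78",
    }

    def guess_base_image(repo: Path) -> str:
        for manifest, image in MANIFEST_TO_IMAGE.items():
            if (repo / manifest).exists():
                return image
        return "ubuntu:24.04"  # generic fallback when nothing is recognized

    print(guess_base_image(Path(".")))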
I remember HN had a recurring popular post on the most important data structures. They are all basic ones that a first-year college student can learn. The youngest was the skip list, invented in 1990. When I was a student, my class literally read the original paper, implemented the data structure, and analyzed its complexity in our first data structures course.
This seems to imply that software engineering as a profession has been quite mature and saturated for a while, to the point that a model can predict most of the output. Yes, yes, I know there are thousands of advanced algorithms and amazing systems in production. It's just that the market does not need millions of engineers with such advanced skills.
Unless we get yet another new domain like the cloud or the internet, I'm afraid the core value of software engineers, trailblazing new business scenarios, will keep diminishing and being marginalized by AI. As a result, we'll see far less demand for our jobs, and many of us will either take lower pay or lose our jobs for extended periods.
So there's this thing called "Setup Scripts," but they don't explicitly say these are the equivalent of AWS metadata and are configured inside the Codex web interface, not via a setup.sh or a package.json preinstall declaration. I wasted several hours (and lots of compute, with Codex as confused as I was) trying to figure out how to convince Codex to pnpm install.
Is there an open-source version of this? Something that essentially uses microVMs to git clone my repo, run codex-cli or an equivalent, and send me a PR.
I've been experimenting with providers offering similar functionality for the last year, and this Codex-like approach really is a vastly superior experience compared to Cursor, Devin, etc.
> To balance safety and utility, Codex was trained to identify and precisely refuse requests aimed at development of malicious software, while clearly distinguishing and supporting legitimate tasks.
I can't say I am a big fan of neutering these paradigm-shifting tools according to one culture's code of ethics / way of doing business / etc.
One man's revolutionary is another's enemy combatant and all that. What if we need top-notch malware to take down the robot dogs lobbing mortars at our madmaxian compound?!
Is anyone using any of these tools to write non-boilerplate code?
I'm very interested.
In my experience ChatGPT and Gemini are absolutely terrible at these types of things. They are constantly wrong. I know I'm not saying anything new, but I'm waiting to personally experience an LLM that does something useful with any of the code I give it.
These tools aren't useless. They're great as search engines and pointing me in the right direction. They write dumb bash scripts that save me time here and there. That's it.
And it's hilarious to me how these people present these tools. It generates a bunch of code, and then you spend all your time auditing and fixing what is expected to be wrong.
That's not the type of code I'm putting in my company's code base, and I could probably write the damn code more correctly in less time than it takes to review for expected errors.
What am I missing?
I'm curious how many ICs are truly excited about these advancements in coding agents. It seems to me the general trend is we become more like PMs managing agents and reviewing PRs, all for the sake of productivity gains.
I imagine many engineers are like myself in that they got into programming because they liked tinkering and hacking and implementation details, all of which are likely to be abstracted over in this new era of prompting.
As someone who works on his own open source agent framework/UI (https://github.com/runvnc/mindroot), it's kind of interesting how announcements from vendors tend to mirror features that I am working on.
For example, in the last month or so, I added a job queue plugin. The ability to run multiple tasks that they demoed today is quite similar. The issue I ran into with users is that without Enterprise plans, complex tasks run into rate limits when trying to run concurrently.
So I am adding the ability to have multiple queues, each possibly using different models and/or providers, to get around rate limits.
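Roughly, the shape of that is something like this sketch; the provider names, concurrency caps, and call_model placeholder are all hypothetical, not mindroot's actual implementation:

    import asyncio

    async def call_model(provider: str, prompt: str) -> str:
        # Placeholder for the real provider client call.
        await asyncio.sleep(0.1)
        return f"[{provider}] {prompt}"

    async def run_task(limits, provider: str, prompt: str) -> str:
        # Each task waits on its provider's semaphore, so concurrency is
        # capped per provider instead of one global limit throttling everything.
        async with limits[provider]:
            return await call_model(provider, prompt)

    async def main() -> None:
        # Hypothetical providers and concurrency caps, made up for illustration.
        limits = {
            "openai": asyncio.Semaphore(2),
            "anthropic": asyncio.Semaphore(2),
            "google": asyncio.Semaphore(4),
        }
        tasks = [
            ("openai", "Refactor the billing module"),
            ("anthropic", "Add tests for the webhook handler"),
            ("google", "Write docs for the queue API"),
        ]
        results = await asyncio.gather(*(run_task(limits, p, t) for p, t in tasks))
        print(results)

    asyncio.run(main())

A real version would key the caps to each provider's actual rate limits and back off on 429s, but per-provider capping is the core idea.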
By the way, my system has features that are somewhat similar not only to this tool they are showing but also to things like Manus. It is quite rough around the edges, though, because I am doing 100% of it myself.
But it is MIT Licensed and it would be great if any developer on the planet wanted to contribute anything.
I believe that code from one of these things will eventually cause a disaster affecting the capital owners. Then all of a sudden you will need a PE license, an ABET degree, 5 years of working experience, etc. to call yourself a software engineer. It would not even be historically unique. Charlatans are the reason that lawyers, medical doctors, and civil engineers have to go through lots of education, exams, and vocational training to get into their professions. AI will probably force software engineering as a profession into that category as well.
On the other hand, if your job was writing code at certain companies whose profits were based on shoving ads in front of people then I would agree that no one will care if it is written by a machine or not. The days of those jobs making >$200k a year are numbered.
I'm super curious to see how this actually does at finding significant bugs. We've been working in this space on https://www.bismuth.sh for a while, and one of the things we're focused on is deep validation of the code being output.
There are so many of these "vibe coding" tools, and there has to be real engineering rigor at some point. I saw them demo "find the bug," but the bugs they found were pretty superficial, and that's something we've seen in our internal benchmark from both Devin and Cursor: a lot of noise and false positives or superficial fixes.
So I just upgraded to the Pro plan, and yet https://chatgpt.com/codex doesn't work for me; it asks me to "try ChatGPT Pro" and shows me the upsell modal, even though I'm already on the higher tier.
I don't understand why OAI puts their alpha release products under a $200 a month plan instead of just charging for tokens.
I made one for GitHub Actions, but it's not as real-time and it's two years old now: https://github.com/asadm/chota
This should be possible today, and surely Linus would also see this in the future.
(I'm trying something.)
What would be an impressive program that an agent should be able to one-shot in one go?
Feels like Codex is for product managers to fix bugs without touching any developer resources. Then it's insanely surprising!
sigh