The "spreadsheet" example video is kind of funny: guy talks about how it normally takes him 4 to 8 hours to put together complicated, data-heavy reports. Now he fires off an agent request, goes to walk his dog, and comes back to a downloadable spreadsheet of dense data, which he pulls up and says "I think it got 98% of the information correct... I just needed to copy / paste a few things. If it can do 90 - 95% of the time consuming work, that will save you a ton of time"
It feels like either finding that 2% that's off (or dealing with 2% error) will be the time consuming part in a lot of cases. I mean, this is nothing new with LLMs, but as these use cases encourage users to input more complex tasks, that are more integrated with our personal data (and at times money, as hinted at by all the "do task X and buy me Y" examples), "almost right" seems like it has the potential to cause a lot of headaches. Especially when the 2% error is subtle and buried in step 3 of 46 of some complex agentic flow.
The security risks with this sound scary. Let's say you give it access to your email and calendar. Now it knows all of your deepest secrets. The linked article acknowledges that prompt injection is a risk for the agent:
> Prompt injections are attempts by third parties to manipulate its behavior through malicious instructions that ChatGPT agent may encounter on the web while completing a task. For example, a malicious prompt hidden in a webpage, such as in invisible elements or metadata, could trick the agent into taking unintended actions, like sharing private data from a connector with the attacker, or taking a harmful action on a site the user has logged into.
A malicious website could trick the agent into divulging your deepest secrets!
I am curious about one thing -- the article mentions the agent will ask for permission before doing consequential actions:
> Explicit user confirmation: ChatGPT is trained to explicitly ask for your permission before taking actions with real-world consequences, like making a purchase.
How does the agent know a task is consequential? Could it mistakenly make a purchase without first asking for permission? I assume it's AI all the way down, so I assume mistakes like this are possible.
I'm not so optimistic as someone that works on agents for businesses and creating tools for it. The leap from low 90s to 99% is classic last mile problem for LLM agents. The more generic and spread an agent is (can-do-it-all) the more likely it will fail and disappoint.
Can't help but feel many are optimizing happy paths in their demos and hiding the true reality. Doesn't mean there isn't a place for agents but rather how we view them and their potential impact needs to be separated from those that benefit from hype.
I've been using OpenAI operator for some time - but more and more websites are blocking it, such as LinkedIn and Amazon. That's two key use-cases gone (applying to jobs and online shopping).
Operator is pretty low-key, but once Agent starts getting popular, more sites will block it. They'll need to allow a proxy configuration or something like that.
This solves a big issue for existing CLI agents, which is session persistence for users working from their own machines.
With claude code, you usually start it from your own local terminal. Then you have access to all the code bases and other context you need and can provide that to the AI.
But when you shut your laptop, or have network availability changes the show stops.
I've solved this somewhat on MacOS using the app Amphetamine which allows the machine to go about its business with the laptop fully closed. But there are a variety of problems with this, including heat and wasted battery when put away for travel.
Another option is to just spin up a cloud instance and pull the same repos to there and run claude from there. Then connect via tmux and let loose.
But there are (perhaps easy to overcome) ux issues with getting context up to that you just don't have if it is running locally.
The sandboxing maybe offers some sense of security--again something that can be possibly be handled by executing claude with a specially permissioned user role--which someone with John's use case in the video might want.
---
I think its interesting to see OpenAI trying to crack the Agent UX, possibly for a user type (non developer) that would appreciate its capabilities just as much but not need the ability to install any python package on the fly.
> Mid 2025: Stumbling Agents
The world sees its first glimpse of AI agents.
Advertisements for computer-using agents emphasize the term “personal assistant”: you can prompt them with tasks like “order me a burrito on DoorDash” or “open my budget spreadsheet and sum this month’s expenses.” They will check in with you as needed: for example, to ask you to confirm purchases. Though more advanced than previous iterations like Operator, they struggle to get widespread usage.
It's very hard for me to imagine the current level of agents serving a useful purpose in my personal life. If I ask this to plan a date night with my wife this weekend, it needs to consult my calendar to pick the best night, pick a bar and restaurant we like (how would it know?), book a babysitter (can it learn who we use and text them on my behalf?), etc. This is a lot of stuff it has to get right, and it requires a lot of trust!
I'm excited that this capability is getting close, but I think the current level of performance mostly makes for a good demo and isn't quite something I'm ready to adopt into daily life. Also, OpenAI faces a huge uphill battle with all the integrations required to make stuff like this useful. Apple and Microsoft are in much better spots to make a truly useful agent, if they can figure out the tech.
Whilst we have seen other implementations of this (providing a VPS to an LLM), this does have a distinct edge others in the way it presents itself. The UI shown, with the text overlay, readable mouse and tailored UI components looks very visually appealing and lends itself well to keeping users informed on what is happening and why at every stage. I have to tip my head to OpenAIs UI team here, this is a really great implementation and I always get rather fascinated whenever I see LLMs being implemented in a visually informative and distinctive manner that goes beyond established metaphors.
Comparing it to the Claude+XFCE solutions we have seen by some providers, I see little in the way of a functional edge OpenAI has at the moment, but the presentation is so well thought out that I can see this being more pleasant to use purely due to that. Many times with the mentioned implementations, I struggled with readability. Not afraid to admit that I may borrow some of their ideas for a personal project.
And I'm still waiting for the simple feature – the ability to edit documents in projects.
I use projects for working on different documents - articles, research, scripts, etc. And would absolutely love to write it paragraph after paragraph with the help of ChatGPT for phrasing and using the project knowledge. Or using voice mode - i.e. on a walk "Hey, where did we finish that document - let's continue. Read the last two paragraphs to me... Okay, I want to elaborate on ...".
I feel like AI agents for coding are advancing at a breakneck speed, but assistance in writing is still limited to copy-pasting.
One the one hand this is super cool and maybe very beneficial, something I definitely want to try out.
On the other, LLMs always make mistakes, and when it's this deeply integrated into other system I wonder how severe these mistakes will be, since they are bound to happen.
It's smart that they're pivoting to using the user's computer directly - managing passwords, access control and not getting blocked was the biggest issue with their operator release. Especially as the web becomes more and more locked down.
> ChatGPT agent's output is comparable to or better than that of humans in roughly half the cases across a range of task completion times, while significantly outperforming o3 and o4-mini.
Hard to know how this will perform in real life, but this could very well be a feel the AGI moment for the broader population.
For me the most interesting example on this page is the sticker gif halfway down the page.
Up until now, chatbots haven't really affected the real world for me†. This feels like one of the first moments where LLMs will start affecting the physical world. I type a prompt and something shows up at my doorstep. I wonder how much of the world economy will be driven by LLM-based orders in the next 10 years.
† yes I'm aware self driving cars and other ML related things are everywhere around us and that much of the architecture is shared, but I don't perceive these as LLMs.
I wonder if this can ever be as extensible/flexible as the local agent systems like Claude Code. Like can I send up my own tools (without some heavyweight "publish extension" thing)? Does it integrate with MCP?
We couldve easily build all these features a year ago, tools are nothing new. Its just barely useful.
Most applications now are more intuitive than our brain can think fast. I think telling an AI to find me a good flight is more work than to type in sk autocomplete for skyscanner having autocomplete for departure and for arrival allowing me to one way or return, having filters its all actually easier than to properly define the task. And we can start executing right away. Agent starts after texting so it will increase more latency. Often modern applications have problems solved that we didn’t even think about before.
Agent to me is another bullshit launch by OPENAI. They have to do something I understand but their releases are really grim to me.
Bad model, no real estate (browser, social media, OS).
Nice action plan on combining Operator and Deep Research.
One thing which stood out to me in a thought-provoking way, is that example of stickers [created first and then] being ordered (obviously: pending ordering confirmation from the user) from StickerSpark (JFYI: This is a fictional company made up in this OpenAI launch post), whereby as mentioned that ChatGPT agent has "its own computer". Thus, if OpenAI is logging into its own account on StickerSpark, then what would be StickerSpark's "normal" user-base like that of any other company's user-base of 1 user per actual person will shift to StickerSpark having a few large users via agents through OpenAI, Anthropic, Google, etc. and a medium-long tail of regular individual users. This exactly reminds of how through pervasive index fund investing that index fund houses such as BlackRock and Vanguard directly own large stakes in many S&P500 companies such that they can sway voting power [1]. Thus, with ChatGPT agent that the fundamental-regular-interaction that we assume with websites like StickerSpark would stand to alter whereby the agents would be business-facing and would have more influence on the website's features (or the Agent due to its innate intelligence will directly find another website for where features match up).
This feels a bit underwhelming to me - Perplexity Comet feels more immediately compelling as new paradigm of a natural way of using LLMs within a browser. But perhaps I'm being short-sighted
It's great to see at least one company creating real AI agents. The last six months have been agonising, reading article after article about people and companies claiming they've built and deployed AI agents, when in reality, they were just using OpenAI's API with a cron job or an event-driven system to orchestrate their GenAI scripts.
I think there will come a time when models will be good enough and SMALL enough to be localized that there will be some type of disintermediation from the big 3-4 models we have today.
Meanwhile, Siri can barely turn off my lights before bed.
The demo will be great and it will not be accurate enough or trustworthy enough to touch, however many people will start automating their jobs with it and producing absolute crap on fairly important things. We are moving from people just making things (post-truth) to the actual information all being corrupted (post-correct? there's got to be a better shorthand for this).
It’s like having a junior executive assistant that you know will always make mistakes, so you can’t trust their exact output and agenda. Seems unreliable .
Yeah but not opensource. The community is pissed off about it, Openai should contribute back to the community. Reddit blow up with angry users about the delays and missed promises of Openai release models. The fact they thay are affraid to be humiliated from the competition like "llama4 did" is not an excuse, in fact should be the motivation.
Today I made like a 100 of merge request reviews, manually inspecting all the diffs, and approving those I evaluated as valid needed contributions. I wonder if agents can help with similar workflows. It requires deep kind of knowledge of project's goals, ability to respect all the constraints and planning. But I'm certain it's doable.
It seems to me that the 2-20% of use cases where ChatGPT Agent isn't able to perform it might make sense to have a plug-in run that can either guide the agent through the complex workflow or perform a deterministic action (e.g. API call).
While they did talk about partial-mitigations to counter prompt-injection, highlighting the risks of cc numbers and other private information leaking, they did not address whether they would be handing all of that data over under the court-order to the NYT.
> These unified agentic capabilities significantly enhance ChatGPT’s usefulness in both everyday and professional contexts. At work, you can automate repetitive tasks, like converting screenshots or dashboards into presentations composed of editable vector elements, rearranging meetings, planning and booking offsites, and updating spreadsheets with new financial data while retaining the same formatting. In your personal life, you can use it to effortlessly plan and book travel itineraries, design and book entire dinner parties, or find specialists and schedule appointments.
None of this interests me but this tells me where it's going capability wise and it's really scary and really exciting at the same time.
Time to start the clock on a new class of prompt injection attacks on "AI agents" getting hacked or scammed during the road to an increase in 10% global unemployment by 2030 or 2035.
ChatGPT agent: bridging research and action
(openai.com)683 points by Topfi 17 July 2025 | 484 comments
Comments
It feels like either finding that 2% that's off (or dealing with 2% error) will be the time consuming part in a lot of cases. I mean, this is nothing new with LLMs, but as these use cases encourage users to input more complex tasks, that are more integrated with our personal data (and at times money, as hinted at by all the "do task X and buy me Y" examples), "almost right" seems like it has the potential to cause a lot of headaches. Especially when the 2% error is subtle and buried in step 3 of 46 of some complex agentic flow.
> Prompt injections are attempts by third parties to manipulate its behavior through malicious instructions that ChatGPT agent may encounter on the web while completing a task. For example, a malicious prompt hidden in a webpage, such as in invisible elements or metadata, could trick the agent into taking unintended actions, like sharing private data from a connector with the attacker, or taking a harmful action on a site the user has logged into.
A malicious website could trick the agent into divulging your deepest secrets!
I am curious about one thing -- the article mentions the agent will ask for permission before doing consequential actions:
> Explicit user confirmation: ChatGPT is trained to explicitly ask for your permission before taking actions with real-world consequences, like making a purchase.
How does the agent know a task is consequential? Could it mistakenly make a purchase without first asking for permission? I assume it's AI all the way down, so I assume mistakes like this are possible.
Can't help but feel many are optimizing happy paths in their demos and hiding the true reality. Doesn't mean there isn't a place for agents but rather how we view them and their potential impact needs to be separated from those that benefit from hype.
just my two cents
Operator is pretty low-key, but once Agent starts getting popular, more sites will block it. They'll need to allow a proxy configuration or something like that.
With claude code, you usually start it from your own local terminal. Then you have access to all the code bases and other context you need and can provide that to the AI.
But when you shut your laptop, or have network availability changes the show stops.
I've solved this somewhat on MacOS using the app Amphetamine which allows the machine to go about its business with the laptop fully closed. But there are a variety of problems with this, including heat and wasted battery when put away for travel.
Another option is to just spin up a cloud instance and pull the same repos to there and run claude from there. Then connect via tmux and let loose.
But there are (perhaps easy to overcome) ux issues with getting context up to that you just don't have if it is running locally.
The sandboxing maybe offers some sense of security--again something that can be possibly be handled by executing claude with a specially permissioned user role--which someone with John's use case in the video might want.
---
I think its interesting to see OpenAI trying to crack the Agent UX, possibly for a user type (non developer) that would appreciate its capabilities just as much but not need the ability to install any python package on the fly.
> Mid 2025: Stumbling Agents The world sees its first glimpse of AI agents.
Advertisements for computer-using agents emphasize the term “personal assistant”: you can prompt them with tasks like “order me a burrito on DoorDash” or “open my budget spreadsheet and sum this month’s expenses.” They will check in with you as needed: for example, to ask you to confirm purchases. Though more advanced than previous iterations like Operator, they struggle to get widespread usage.
I'm excited that this capability is getting close, but I think the current level of performance mostly makes for a good demo and isn't quite something I'm ready to adopt into daily life. Also, OpenAI faces a huge uphill battle with all the integrations required to make stuff like this useful. Apple and Microsoft are in much better spots to make a truly useful agent, if they can figure out the tech.
Comparing it to the Claude+XFCE solutions we have seen by some providers, I see little in the way of a functional edge OpenAI has at the moment, but the presentation is so well thought out that I can see this being more pleasant to use purely due to that. Many times with the mentioned implementations, I struggled with readability. Not afraid to admit that I may borrow some of their ideas for a personal project.
I use projects for working on different documents - articles, research, scripts, etc. And would absolutely love to write it paragraph after paragraph with the help of ChatGPT for phrasing and using the project knowledge. Or using voice mode - i.e. on a walk "Hey, where did we finish that document - let's continue. Read the last two paragraphs to me... Okay, I want to elaborate on ...".
I feel like AI agents for coding are advancing at a breakneck speed, but assistance in writing is still limited to copy-pasting.
On the other, LLMs always make mistakes, and when it's this deeply integrated into other system I wonder how severe these mistakes will be, since they are bound to happen.
> ChatGPT agent's output is comparable to or better than that of humans in roughly half the cases across a range of task completion times, while significantly outperforming o3 and o4-mini.
Hard to know how this will perform in real life, but this could very well be a feel the AGI moment for the broader population.
Up until now, chatbots haven't really affected the real world for me†. This feels like one of the first moments where LLMs will start affecting the physical world. I type a prompt and something shows up at my doorstep. I wonder how much of the world economy will be driven by LLM-based orders in the next 10 years.
† yes I'm aware self driving cars and other ML related things are everywhere around us and that much of the architecture is shared, but I don't perceive these as LLMs.
https://reddit.com/r/OpenAI/comments/1lyx6gj
Most applications now are more intuitive than our brain can think fast. I think telling an AI to find me a good flight is more work than to type in sk autocomplete for skyscanner having autocomplete for departure and for arrival allowing me to one way or return, having filters its all actually easier than to properly define the task. And we can start executing right away. Agent starts after texting so it will increase more latency. Often modern applications have problems solved that we didn’t even think about before.
Agent to me is another bullshit launch by OPENAI. They have to do something I understand but their releases are really grim to me.
Bad model, no real estate (browser, social media, OS).
One thing which stood out to me in a thought-provoking way, is that example of stickers [created first and then] being ordered (obviously: pending ordering confirmation from the user) from StickerSpark (JFYI: This is a fictional company made up in this OpenAI launch post), whereby as mentioned that ChatGPT agent has "its own computer". Thus, if OpenAI is logging into its own account on StickerSpark, then what would be StickerSpark's "normal" user-base like that of any other company's user-base of 1 user per actual person will shift to StickerSpark having a few large users via agents through OpenAI, Anthropic, Google, etc. and a medium-long tail of regular individual users. This exactly reminds of how through pervasive index fund investing that index fund houses such as BlackRock and Vanguard directly own large stakes in many S&P500 companies such that they can sway voting power [1]. Thus, with ChatGPT agent that the fundamental-regular-interaction that we assume with websites like StickerSpark would stand to alter whereby the agents would be business-facing and would have more influence on the website's features (or the Agent due to its innate intelligence will directly find another website for where features match up).
[1] https://manhattan.institute/article/index-funds-have-too-muc...
Meanwhile, Siri can barely turn off my lights before bed.
We can help gather data, crawl pages, make charts and more. Try us out at https://tabtabtab.ai/
We currently work on top of Google Sheets.
There is a widget to listen to the article instead of reading it. When I press play, it says the word ”Undefined” and then stops.
It seems to me that the 2-20% of use cases where ChatGPT Agent isn't able to perform it might make sense to have a plug-in run that can either guide the agent through the complex workflow or perform a deterministic action (e.g. API call).
They seem to fall apart browsing the web, they're slow, they're nondeterministic.
I would be pretty impressed if OpenAI has somehow cracked this.
None of this interests me but this tells me where it's going capability wise and it's really scary and really exciting at the same time.
Also why does the guy sound like he's gonna cry?