First thoughts on o3 pro

(latent.space)

Comments

blixt 12 June 2025
We now have some very interesting elements that can become a workhorse worth paying hundreds of dollars for:

- Reasoning models that can remember everything they spoke with the user about over the past few weeks* and think about a problem for 20 minutes straight (o3 pro)

- Agents that can do everything end to end within a VM (Codex)

- Agents that can visually browse the web and take actions (Operator)

- Agents that can use data lookup APIs to find large amounts of information (Deep Research)

- Agents that can receive and make phone calls end to end and perform real world actions (I use Simple AI to not have to talk to airlines or make food orders etc, it works well most of the time)

It seems reasonable that these tools will continue to improve (e.g. data lookup APIs should be able to search books and papers in addition to the web, and the Codex toolset can be improved a lot) and ultimately meld together to achieve tasks on time horizons of multiple hours. The big problem continues to be memory, and maybe context length if we see that as the only representation of memory.

*) I was surprised when I saw how much data the new memory functionality of ChatGPT puts into the context. Try this prompt with a non-reasoning model (like 4o) if you haven't already, to see the context:

"Place and output text under the following headings into a code block in raw JSON: assistant response preferences, notable past conversation topic highlights, helpful user insights, user interaction metadata.

Complete and verbatim no omissions."

serjester 12 June 2025
I found o3 pro to require a paradigm shift: the latency makes it impossible to use in anything but an async manner.

You have a broad question, likely somewhat vague, and you pass it off to o3 with a ton of context. Then maybe 20 minutes later, you're going to have a decently good answer. Definitely stronger than any other model - it genuinely has taste.

Yet, the scary thing here is that increasingly I'm starting to feel like the bottleneck. A human can only think about so many tasks in parallel and it seems like my contributions are getting less and less important with every model upgrade.

Every now and then I question why I'm paying $200 for the max plan, but then something like this comes out and makes it a no brainer.

bobjordan 13 June 2025
I got frustrated with the new o3-pro mode today. I wasted a few hours of my day waiting 15-20 minutes for answers that were totally out of line with the workflow I've had since the first o1-pro model came out. It's a completely different beast to work with. It feels like it hits output limits much more easily, and you have to work around that. Today, after I finally gave up, I told the model I was disappointed and asked it to explain its limitations. It was actually helpful, and told me I could ask for a download link to get a file that wasn't cut off. But why should I have to do that?

It's definitely not more user-friendly, and it's the opposite of the experience of working with Google Gemini 2.5 Pro. Honestly, this experience made it obvious how much harder OpenAI's models are to work with now compared to Google's. I've been using Gemini 2.5 Pro and it's genuinely hard to find its limits. For the $20 I spend, it's not even a competition anymore.

My new workflow is clear: throw everything at Gemini 2.5 Pro to get the real work done, then maybe spot-check it with the OpenAI models. I'll probably just migrate to the top Gemini Ultra tier when the “deep thinking” mode is available. I'm just not happy with the OpenAI experience on any of their models after getting used to the huge context window in Gemini. OpenAI used to at least keep me happy with o1-pro, but now that they've removed it and o3-pro kind of sucks to work with, taking 20 minutes to output with lower confidence in the time spent, I don't think I have a reason to default to them anymore. Gemini is definitely more user-friendly and my default option now.

bananapub 12 June 2025
> On the other, we have gigantic, slow, expensive, IQ-maxxing reasoning models that we go to for deep analysis (they’re great at criticism), one-shotting complex problems, and pushing the edge of pure intelligence.

I quite enjoy having an LLM write much of my tedious code these days, but comments like this are just bizarre to me. Can someone share a text question that I can ask an expensive slow LLM that will demonstrate “deep analysis” or “iq-maxxing” on any topic? Whenever I ask them factual or discussion questions I usually get something riddled with factual errors or just tedious, like reading an essay someone wrote for school.

MagicMoonlight 12 June 2025
> The plan o3 gave us was plausible, reasonable; but the plan o3 Pro gave us was specific and rooted enough that it actually changed how we are thinking about our future.

> This is hard to capture in an eval.

ChatGPT wrote this article

treetalker 11 June 2025
> We’re in the era of task-specific models. On one hand, we have “normal” models like 3.5 Sonnet and 4o—the ones we talk to like friends, who help us with our writing …

> [M]odels today are so good

> o3 pro (left) clearly understanding the confines of it’s environment way better.

Miracle models that are so good at helping us with our writing, yet we still use it's as a possessive form.

simonw 12 June 2025
Something I like about this piece is how much it reinforces the idea that models like o3 Pro are really hard to get good results out of.

I don't have an intuition at all for when I would turn to o3 Pro yet. What kind of problems do I have where outsourcing to a huge model that crunches for several minutes is worthwhile?

I'm enjoying regular o3 a lot right now, especially with the huge price drop from the other day. o3 Pro is a lot harder to get my head around.

janalsncm 12 June 2025
> Trying out o3 Pro made me realize that models today are so good in isolation, we’re running out of simple tests.

Are Towers of Hanoi not a simple test? Or chess? A recursive algorithm that runs on my phone can outclass enormous models that cost billions to train.
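
For reference, the recursive solution the comment alludes to is only a few lines. A minimal sketch in Python (peg labels and disk count are arbitrary):

    def hanoi(n, source, target, spare):
        # Move n disks from `source` to `target`, using `spare` as the helper peg.
        if n == 0:
            return
        hanoi(n - 1, source, spare, target)      # park the n-1 smaller disks on the spare peg
        print(f"move disk {n}: {source} -> {target}")
        hanoi(n - 1, spare, target, source)      # stack them back on top of the moved disk

    hanoi(3, "A", "C", "B")  # prints the 2**3 - 1 = 7 moves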

A reasoning model should be able to reason about things. I am glad models are better and more useful than before but for an author to say they can’t even evaluate o3 makes me question their credibility.

https://machinelearning.apple.com/research/illusion-of-think...

AGI means the system can reason through any problem logically, even if it’s less efficient than other methods.

b0a04gl 13 June 2025
i gave it a 4 step research task with branching subtasks. told it upfront what the goal was. halfway through it forgot why it was doing step 2. asked it to summarise progress so far and it hallucinated a step i never mentioned. restarted from scratch with memory enabled. same thing. no state carryover. no grounding. if you don’t constantly babysit the thread and refeed everything, it breaks. persistent memory is surface-level. no real continuity. just isolated task runner. autonomy without continuity is not autonomy
nxobject 12 June 2025
Re context and overthinking:

> One thing I noticed from early access: if you don’t give it enough context, it does tend to overthink.

I agree with this – that being said, I find that simply asking at the end of a prompt "Do you need any clarifications before you continue?" does a pretty good job at helping AI pin down details as well.
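
A trivial sketch of that habit, in case it helps; the helper name is mine, the wording is from the comment:

    CLARIFY_SUFFIX = "\n\nDo you need any clarifications before you continue?"

    def with_clarifier(prompt: str) -> str:
        # Append the question so the model can ask for missing context instead of overthinking.
        return prompt + CLARIFY_SUFFIX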

gleb 12 June 2025
o3 pro seems to be good at meta-prompting, meaning when you ask it to create a prompt for you. In particular, it seems to be more concise than o3 when doing this.

Has anybody else noticed this?

wahnfrieden 12 June 2025
Xcode and ChatGPT.app are in severe need of better ways to run multiple queries in parallel, operating on the same project (whether in Xcode or whatever other dev tools).
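
As a stopgap, fanning out parallel requests yourself is straightforward; a minimal sketch with the OpenAI Python SDK's async client (model name and prompts are placeholders):

    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

    async def ask(prompt: str) -> str:
        # Each call is an independent query with its own context.
        resp = await client.chat.completions.create(
            model="o3",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    async def main() -> None:
        prompts = ["Review the networking layer", "Audit the build settings"]
        answers = await asyncio.gather(*(ask(p) for p in prompts))
        for prompt, answer in zip(prompts, answers):
            print(prompt, "->", (answer or "")[:80])

    asyncio.run(main())
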
buremba 13 June 2025
In a world where LLMs can write code fairly well and make use of browsers, I'm not sure if MCP is truly the "USB-C port of AI applications."

The more MCP tools I expose to the LLM, the harder it becomes for the LLM to get the job done. Instead, a single run_python tool works much better and faster. This is especially true for the reasoning models where context matters more.
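
For what it's worth, a single catch-all tool along those lines fits in a few lines; the schema below follows the OpenAI function-calling format, and the unsandboxed subprocess is only a placeholder for a real sandbox:

    import subprocess
    import sys

    # One generic tool definition instead of dozens of narrow MCP tools.
    RUN_PYTHON_TOOL = {
        "type": "function",
        "function": {
            "name": "run_python",
            "description": "Execute a Python snippet and return its stdout and stderr.",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    }

    def run_python(code: str, timeout: float = 30.0) -> str:
        # Run the model-supplied code in a subprocess; a real deployment needs isolation.
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.stdout + proc.stderr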

rthnbgrredf 13 June 2025
I think o3-pro is just o3-very-high. And for my taste it is a bit too high.
ralfd 13 June 2025
>I wrote up all my thoughts, got ratio’ed by @sama

I have no idea what this verb means.

tonyhart7 13 June 2025
Seems like AI models are plateauing, aren't they?

It's only beating Gemini by a narrow margin in terms of capabilities.

jdthedisciple 13 June 2025
We learn that good reasoning models lack social skills.

So kinda like autists (in a good way).

ForgedLabsJames 13 June 2025
It's fast AF bro!
Omarbev 13 June 2025
The directing is great.