OpenAI claims gold-medal performance at IMO 2025

(twitter.com)

Comments

og_kalu 19 July 2025
From Noam Brown

https://x.com/polynoamial/status/1946478258968531288

"When you work at a frontier lab, you usually know where frontier capabilities are months before anyone else. But this result is brand new, using recently developed techniques. It was a surprise even to many researchers at OpenAI. Today, everyone gets to see where the frontier is."

and

"This was a small team effort led by @alexwei_ . He took a research idea few believed in and used it to achieve a result fewer thought possible. This also wouldn’t be possible without years of research+engineering from many at @OpenAI and the wider AI community."

rafael859 19 July 2025
Interesting that the proofs seem to use a limited vocabulary: https://github.com/aw31/openai-imo-2025-proofs/blob/main/pro...

Why waste time say lot word when few word do trick :)

Also worth pointing out that Alex Wei is himself a gold medalist at IOI.

tlb 19 July 2025
I encourage anyone who thinks these are easy high-school problems to try to solve some. They're published (including this year's) at https://www.imo-official.org/problems.aspx. They make my head spin.
hislaziness 20 July 2025
Terence Tao on the matter - https://imgur.com/a/terence-tao-on-supposed-gold-imo-sMKP0bm
dylanbyte 19 July 2025
These are high school level only in the sense of assumed background knowledge; they are extremely difficult.

Professional mathematicians would not get this level of performance, unless they have a background in IMO themselves.

This doesn’t mean that the model is better than them in math, just that mathematicians specialize in extending the frontier of math.

The answers are not in the training data.

This is not a model specialized to IMO problems.

gniv 19 July 2025
From that thread: "The model solved P1 through P5; it did not produce a solution for P6."

It's interesting that it didn't solve the problem that was by far the hardest for humans too. China, the #1 team, got only 21/42 points on it. In most other teams, nobody solved it.

demirbey05 19 July 2025
Google also joined the IMO and got a gold medal.

https://x.com/natolambert/status/1946569475396120653

OAI announced early; we will probably hear an announcement from Google soon.

modeless 19 July 2025
Noam Brown:

> this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques.

> it’s also more efficient [than o1 or o3] with its thinking. And there’s a lot of room to push the test-time compute and efficiency further.

> As fast as recent AI progress has been, I fully expect the trend to continue. Importantly, I think we’re close to AI substantially contributing to scientific discovery.

I thought progress might be slowing down, but this is clear evidence to the contrary. Not the result itself, but the claims that it is a fully general model and has a clear path to improved efficiency.

https://x.com/polynoamial/status/1946478249187377206

johnecheck 19 July 2025
Wow. That's an impressive result, but how did they do it?

Wei references scaling up test-time compute, so I have to assume they threw a boatload of money at this. I've heard talk of running models in parallel and comparing results - if OpenAI ran this 10000 times in parallel and cherry-picked the best one, this is a lot less exciting.

If this is legit, then we need to know what tools were used and how the model used them. I'd bet those are the 'techniques to make them better at hard to verify tasks'.

demirbey05 19 July 2025
Progress is astounding. A report was recently published evaluating LLMs on IMO 2025; o3 high didn't even get bronze.

https://matharena.ai/imo/

Waiting for Terry Tao's thoughts, but these kinds of things are a good use of AI. We need to make science progress faster rather than disrupting our economy before we're ready.

nmca 19 July 2025
It’s interesting that this is a competition elite enough that several posters on a programming website don’t seem to understand what it is.

My very rough napkin math suggests that, against the US reference class, IMO gold is literally a one-in-a-million talent (very roughly 20 people who make camp could get gold, out of very roughly twenty million relevant high schoolers).

meroes 19 July 2025
In the RLHF sphere you could tell some AI company or companies were targeting this, given how many IMO RLHF’ers they were hiring specifically. Given that, I don’t think it’s easy to say how much “progress” this represents.
z7 19 July 2025
Some previous predictions:

In 2021 Paul Christiano wrote he would update from 30% to "50% chance of hard takeoff" if we saw an IMO gold by 2025.

He thought there was an 8% chance of this happening.

Eliezer Yudkowsky said "at least 16%".

Source:

https://www.lesswrong.com/posts/sWLLdG6DWJEy3CH7n/imo-challe...

quirino 19 July 2025
I think equally impressive is the performance of the OpenAI team at the "AtCoder World Tour Finals 2025" a couple of days ago. There were 12 human participants and only one did better than OpenAI.

Not sure there is a good writeup about it yet but here is the livestream: https://www.youtube.com/live/TG3ChQH61vE.

ksec 19 July 2025
I am neither an optimist nor a pessimist about AI; I would likely be called both by the opposing parties. But the fact that AI/LLMs are still rapidly improving is impressive in itself and worth celebrating. Is it perfect, AGI, ASI? No. Is it useless? Absolutely not.

I am just happy the prize is so big for AI that there is enough money involved to push for all the hardware advancement. Foundry, packaging, interconnect, networking, etc.: all the hardware research and tech improvements previously thought too expensive are now in the "Shut up and take my money" scenario.

mehulashah 19 July 2025
The AI scaling that went on for the last five years is going to be very different from the scaling that will happen in the next ten years. These models have latent capabilities that we are racing to unearth. IMO is but one example.

There’s so much to do at inference time. This result could not have been achieved without the substrate of general models. It's not like Go or protein folding. You need the collective public global knowledge of society to build on. And yes, there’s enough left for ten years of exploration.

More importantly, the stakes are high. There may be zero day attacks, biological weapons, and more that could be discovered. The race is on.

gitfan86 19 July 2025
This is such an interesting time: the percentage of people making predictions about AGI happening in the future is going to drop off, and the number of people completely ignoring the term AGI will increase.
reactordev 19 July 2025
The Final boss was:

   Which is greater, 9.11 or 9.9?

/s

I kid, this is actually pretty amazing!! I've noticed over the last several months that I've had to correct it less and less when dealing with advanced math topics so this aligns.

amelius 19 July 2025
If someone told me this say, 10 or 20 years ago, I would have assumed this was worthy of a Nobel/Turing prize ...
skepticATX 19 July 2025
Has anyone independently reviewed these solutions?

My proving skills are extremely rusty so I can’t look at these and validate them. They certainly are not traditional proofs though.

esjeon 19 July 2025
I get the feeling that modern computer systems are so powerful that they can solve almost all well-explored closed problems with a properly tuned model. The problem lies in efficiency, reliability, and cost. Increasing efficiency and reliability would require an exponential increase in cost. QC might solve the cost part, and symbolic reasoning models would significantly boost both efficiency and reliability.
strangescript 19 July 2025
We are the frog in the warm water...
orespo 19 July 2025
Definitely interesting. Two thoughts. First, are the IMO questions somewhat related to other openly available questions online, making it easier for LLMs that are more efficient and better at reasoning to deduce the results from the available content?

Second, happy to test it on open math conjectures or by attempting to reprove recent math results.

another_twist 19 July 2025
I am quite surprised that DeepMind, with MCTS, wasn't able to crack this kind of math performance itself.
amelius 19 July 2025
Makes sense. Mathematicians use intuition a lot to drive their solution seeking, and I suppose an AI such as an LLM could develop intuition too. Of course, where AI really wins is search speed and the fact that an LLM really doesn't get tired when exploring different strategies and steps within each strategy.

However, I expect that geometric intuition may still be lacking, mostly because of the difficulty of encoding it in a form which an LLM can easily work with. After all, ChatGPT still can't draw a unicorn [1], although it seems to be getting closer.

[1] https://gpt-unicorn.adamkdean.co.uk/

ctoth 19 July 2025
Pre-registering a prediction:

When (not if) AI does make a major scientific discovery, we'll hear "well it's not really thinking, it just processed all human knowledge and found patterns we missed - that's basically cheating!"

robinhouston 19 July 2025
There is some relevant context from Terence Tao on Mathstodon:

> It is tempting to view the capability of current AI technology as a singular quantity: either a given task X is within the ability of current tools, or it is not. However, there is in fact a very wide spread in capability (several orders of magnitude) depending on what resources and assistance one gives the tool, and how one reports their results.

> One can illustrate this with a human metaphor. I will use the recently concluded International Mathematical Olympiad (IMO) as an example. Here, the format is that each country fields a team of six human contestants (high school students), led by a team leader (often a professional mathematician). Over the course of two days, each contestant is given four and a half hours on each day to solve three difficult mathematical problems, given only pen and paper. No communication between contestants (or with the team leader) during this period is permitted, although the contestants can ask the invigilators for clarification on the wording of the problems. The team leader advocates for the students in front of the IMO jury during the grading process, but is not involved in the IMO examination directly.

> The IMO is widely regarded as a highly selective measure of mathematical achievement for a high school student to be able to score well enough to receive a medal, particularly a gold medal or a perfect score; this year the threshold for the gold was 35/42, which corresponds to answering five of the six questions perfectly. Even answering one question perfectly merits an "honorable mention".

> But consider what happens to the difficulty level of the Olympiad if we alter the format in various ways:

> * One gives the students several days to complete each question, rather than four and a half hours for three questions. (To stretch the metaphor somewhat, consider a sci-fi scenario in which the student is still only given four and a half hours, but the team leader places the students in some sort of expensive and energy-intensive time acceleration machine in which months or even years of time pass for the students during this period.)

> * Before the exam starts, the team leader rewrites the questions in a format that the students find easier to work with.

> * The team leader gives the students unlimited access to calculators, computer algebra packages, textbooks, or the ability to search the internet.

> * The team leader has the six students on the team work on the same problem simultaneously, communicating with each other on their partial progress and reported dead ends.

> * The team leader gives the students prompts in the direction of favorable approaches, and intervenes if one of the students is spending too much time on a direction that they know to be unlikely to succeed.

> * Each of the six students on the team submit solutions, but the team leader selects only the "best" solution to submit to the competition, discarding the rest.

> * If none of the students on the team obtains a satisfactory solution, the team leader does not submit any solution at all, and silently withdraws from the competition without their participation ever being noted.

> In each of these formats, the submitted solutions are still technically generated by the high school contestants, rather than the team leader. However, the reported success rate of the students on the competition can be dramatically affected by such changes of format; a student or team of students who might not even reach bronze medal performance if taking the competition under standard test conditions might instead reach gold medal performance under some of the modified formats indicated above.

> So, in the absence of a controlled test methodology that was not self-selected by the competing teams, one should be wary of making apples-to-apples comparisons between the performance of various AI models on competitions such as the IMO, or between such models and the human contestants.

Source:

https://mathstodon.xyz/@tao/114881418225852441

https://mathstodon.xyz/@tao/114881419368778558

https://mathstodon.xyz/@tao/114881420636881657

chairhairair 19 July 2025
OpenAI simply can’t be trusted on any benchmarks: https://news.ycombinator.com/item?id=42761648
fxj 20 July 2025
Just tried this: take the graphs of the functions x^n and exp(x). How many points of intersection do they have?

ChatGPT gave me the wrong answer: it claimed 2 points of intersection, but for n=4 there are 3, as one can easily derive: one for negative x and two for positive x, because exp(x) eventually grows faster than x^4.

Then I corrected it and said there are 3 points of intersection. It agreed and gave me the 3 points. Then I said no, there are 4 points of intersection, and it went back to explaining that there are 2 points of intersection, which is wrong.

Then I asked it how many points of intersection there are for n=e, and it said: zero.

Well, exp(x)=x^e at x=e, isn't it?
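For what it's worth, the n=4 count is easy to confirm numerically. A minimal sketch (a sign-change count on a grid, not a proof; tangencies, or crossings closer together than the grid spacing, are missed, which is exactly why this method cannot see the n=e case, where the two curves are tangent at x=e):

```python
import math

def count_crossings(n, lo=-10.0, hi=20.0, steps=200_000):
    """Count sign changes of f(x) = exp(x) - x**n on [lo, hi].

    n must be a positive integer so x**n is defined for x < 0.
    This is a coarse numerical check, not a proof: tangencies
    (where f touches 0 without changing sign) are invisible to it.
    """
    f = lambda x: math.exp(x) - x ** n
    crossings, prev = 0, f(lo)
    for i in range(1, steps + 1):
        cur = f(lo + (hi - lo) * i / steps)
        if (prev < 0) != (cur < 0):  # sign change => one crossing
            crossings += 1
        prev = cur
    return crossings

print(count_crossings(4))  # three crossings: one for x < 0, two for x > 0
```

For n=1 it returns 0, consistent with exp(x) > x everywhere.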

mohsen1 19 July 2025
O3 Pro could solve and prove the first problem when I tried:

https://chatgpt.com/s/t_687be6c1c1b88191b10bfa7eb1f37c07

ghm2180 20 July 2025
While this is nice for splashy headlines, I'd rather see headlines about real-life use cases of math grads using AI as a companion tool for solving novel problems.
pradn 19 July 2025
> Btw, we are releasing GPT-5 soon, and we’re excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model. We don’t plan to release anything with this level of math capability for several months.

GPT-5 finally on the horizon!

nextworddev 20 July 2025
Yep, we are about to get fast takeoff
eboynyc32 19 July 2025
The world is changing and it’s exciting. Either you’re on or you’re off. The world doesn’t wait.
jacquesm 20 July 2025
There's that pesky Fermi Paradox. Who knows, we might solve that one too!
neets 19 July 2025
Solve aging and disease when?
YeGoblynQueenne 19 July 2025
Guys, that's nothing. My new AI system is not LLM-based but neuro-symbolic and yet it just scored 100% on the IMO 2026 problems that haven't even been written yet, it is that good.

What? This is a claim with all the trustworthiness of OpenAI's claim. I mean, I can claim anything I want at this point and it would still be just as trustworthy as OpenAI's claim, with exactly zero details about anything other than "we did it, promise".

TestTime_9000 20 July 2025
"The LLM system's core mechanism is probably a "propose-verify" loop that operates on a vocabulary of special tokens representing formal logic expressions. At inference time, it first proposes a new logical step by generating a sequence of these tokens into its context window, which serves as a computational workspace. It then performs a subsequent computational pass to verify if this new expression is a sound deduction from the preceding steps. This iterative cycle, learned from a vast corpus of synthetic proof traces, allows the model to construct a complete, validated formal argument. This process results in a system with abstract reasoning capabilities and functional soundness across domains that depend on reasoning, achieved at the cost of computation required for its extended inference time."
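The quoted mechanism is pure speculation about OpenAI's system, but the generic loop it describes is easy to sketch. Below is a toy illustration (all names and the arithmetic domain are invented for this example, not OpenAI's method): an unreliable proposer suggests derivation steps, and an independent verifier admits only sound ones into the workspace.

```python
import random

def propose_verify_search(premises, propose, verify, goal, max_steps=5000):
    """Toy propose-verify loop: propose(ctx) returns (claim, justification),
    verify(ctx, claim, why) checks the claim follows from the context, and
    accepted claims extend the workspace until goal(ctx) holds."""
    context = list(premises)
    for _ in range(max_steps):
        claim, why = propose(context)
        if verify(context, claim, why):   # independent soundness check
            context.append(claim)         # accepted step joins the workspace
            if goal(context):
                return context            # validated derivation trace
    return None                           # budget exhausted, no proof found

# Toy domain: derive 10 from the premises {1, 2} by repeated addition.
# The proposer is deliberately unreliable (~30% of its claims are off by
# one), standing in for a generator that sometimes hallucinates; the
# verifier is what keeps the final trace sound.
random.seed(0)

def propose(ctx):
    a, b = random.choice(ctx), random.choice(ctx)
    noise = 1 if random.random() < 0.3 else 0  # occasional bogus claim
    return a + b + noise, (a, b)

def verify(ctx, claim, why):
    a, b = why
    return a in ctx and b in ctx and a + b == claim

trace = propose_verify_search([1, 2], propose, verify,
                              goal=lambda ctx: 10 in ctx)
```

The design point is only the division of labor: the proposer can be sloppy as long as verification is cheap and strict, since bogus claims never enter the workspace.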
charlieyu1 19 July 2025
I tried P1 on chatgpt-o4-high, it tells me the solution is k=0 or 1. It doesn’t even know that k=3 is a solution for n=3. Such a solution would get 0/7 in the actual IMO.
strangeloops85 19 July 2025
It’s interesting how hard and widespread a push they’re making in advertising this - at this particular moment, when there are rumors of more high level recruitment attempts / successes by Zuck. OpenAI is certainly a master at trying to drive narratives. (Independent of the actual significance / advance here). Sorry, there are too many hundreds of billions of dollars involved to not be a bit cautious and wary of claims being pushed this hard.
suddenlybananas 19 July 2025
I don't believe this.
davidguetta 19 July 2025
Wait for the Chinese version
PokemonNoGo 19 July 2025
> level performance on the world’s most prestigious math competition

I don't know which one I would consider the most prestigious math competition, but it wouldn't be the IMO. The Putnam ranks higher to me, and I'm not even an American. But I've come to realise one thing: high school is very important to Americans...

stingraycharles 19 July 2025
My issue with all these citations is that it’s all OpenAI employees that make these claims.

I’ll wait to see third party verification and/or use it myself before judging. There’s a lot of incentives right now to hype things up for OpenAI.

Jackson__ 19 July 2025
Also interesting takeaways from that tweet chain:

>GPT5 soon

>it will not be as good as this secret(?) model

bluecalm 19 July 2025
My view is that it's less impressive than the previous Go and chess results. Humans are worse at competitive math than at those games, and it's still a very limited space with well-defined problems. They may hype "general purpose" as much as they want, but for now it's still the case that AI is superhuman at well-defined, limited-space tasks, yet can't match the performance of a mediocre, below-average human at simple tasks without those limitations, like driving a car.

Nice result, but it's just another game humans got beaten at. This time it's a game which isn't even taken very seriously (in comparison to ones that have a professional scene).

another_twist 19 July 2025
It's a level playing field, IMO. But there's another thread which claims not even bronze, and I really don't want to go to X for anything.
csomar 19 July 2025
The issue is that trust is very hard to build and very easy to lose. Even in today's age where regular humans have a memory span shorter than that of an LLM, OpenAI keeps abusing the public's trust. As a result, I take their word on AI/LLMs about as seriously as I'd take my grocery store clerk's opinion on quantum physics.
mrdependable 19 July 2025
I like how they always say AI will advance science when they want to sell it to the public, but pump how it will replace workers when selling it to businesses. It’s like dangling a carrot while slowly putting a knife to our throats.

Edit: why was my comment moved from the one I was replying to? It makes no sense here on its own.

darkoob12 19 July 2025
I don't know how much novelty you should expect from the IMO every year, but I expect many of the problems to be variations on the same themes.

These models are trained on all the old problems and their various solutions. For LLMs, solving these problems is about as impressive as writing code.

There is no deep generalization.

OtherShrezzing 19 July 2025
The 4 hours time limit is a bit of an odd metric, given that OpenAI have effectively an unlimited amount of compute at their disposal. If they’re running the model on 100,000 GPUs for 4hrs, that’s obviously going to have better outcomes than running it on 5.
zkmon 19 July 2025
This is awesome progress in human achievement, getting these machines to be intelligent. And it is also a fast regression and decline in human wisdom!

We are simply greasing the grooves, letting things slide faster and faster, and calling it progress. How does this help make the integration of humans and nature any better?

Does this improve the climate, or help humans adapt better to a changing climate? Are intelligent machines a burning need for humanity today? Or is it all about business and political dominance? At what cost? What's the fallout of all this?

atleastoptimal 19 July 2025
>AI model performs astounding feat everyone claimed was impossible or won’t be achieved for a while

>Commenters on HN claim it must not be that hard, or OpenAI is lying, or cheated. Anything but admit that it is impressive

Every time on this site lol. A lot of people here have an emotional aversion to accepting AI progress. They’re deep in the bargaining/anger/denial phase.

mikert89 19 July 2025
The cynicism and denial about AI on HN is exhausting. Half the comments are some weird form of explaining away the ever-increasing performance of these models.

I've been reading this website for probably 15 years, and it's never been this bad. Many threads are completely unreadable; all the actual educated takes are on X. It's almost like there was a talent drain.

archon1410 19 July 2025
And of course it's available even in Icelandic, spoken by ~300k people, but not a single Indian language, spoken by hundreds of millions.

The wretched state of India cannot be watched...

chvid 19 July 2025
I believe this company used to present its results and approach in academic papers with enough details so that it could be reproduced by third parties.

Now it is just doing a bunch of tweets?

andrepd 19 July 2025
Am I missing something or is this completely meaningless? It's 100% opaque, no details whatsoever and no transparency or reproducibility.

I wouldn't trust these results as it is. Considering that there are trillions of dollars on the line as a reward for hyping up LLMs, I trust it even less.

up2isomorphism 19 July 2025
In fact, no car company claims "gold medal" performance in Olympic running, even though they could have achieved that 100 years ago. Obviously, since the IMO does not generate much money, it is an easy target.

BTW, "gold medal performance" looks like a promotional term to me.

tester756 19 July 2025
huh?

any details?

gcanyon 19 July 2025
99.99+% of all problems humans face do not require particularly original solutions. Determining whether LLMs can solve truly original (or at least obscure) problems is interesting, and a problem worth solving, but it ignores the vast majority of the (near-term, at least) impact they will have.
mhh__ 19 July 2025
Just a stochastic parrot bro
matt3210 19 July 2025
If stealing content to train models is ok, then is stealing models to merge ok?
sorokod 19 July 2025
Not even bronze.

https://news.ycombinator.com/item?id=44615695

Lionga 19 July 2025
[flagged]
bwfan123 19 July 2025
I fed the problem 1 solution into Gemini and asked if it was generated by a human or an LLM. It said:

Conclusion: It is overwhelmingly likely that this document was generated by a human.

----

Self-Correction/Refinement and Explicit Goals:

"Exactly forbidden directions. Good." - This self-affirmation is very human.

"Need contradiction for n>=4." - Clearly stating the goal of a sub-proof.

"So far." - A common human colloquialism in working through a problem.

"Exactly lemma. Good." - Another self-affirmation.

"So main task now: compute K_3. And also show 0,1,3 achievable all n. Then done." - This is a meta-level summary of the remaining work, typical of human problem-solving.

----