Gemini with Deep Think achieves gold-medal standard at the IMO

(deepmind.google)

Comments

modeless 21 July 2025
> AlphaGeometry and AlphaProof required experts to first translate problems from natural language into domain-specific languages, such as Lean, and vice-versa for the proofs. It also took two to three days of computation. This year, our advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions

So, the problem wasn't translated to Lean first. But did the model use Lean, or internet search, or a calculator or Python or any other tool during its internal thinking process? OpenAI said theirs didn't, and I'm not sure if this is exactly the same claim. More clarity on this point would be nice.

I would also love to know the rough order of magnitude of the amount of computation used by both systems, measured in dollars. Being able to do it at all is of course impressive, but not useful yet if the price is outrageous. In the absence of disclosure I'm going to assume the price is, in fact, outrageous.

Edit: "No tool use, no internet access" confirmed: https://x.com/FredZhang0/status/1947364744412758305

kevinventullo 21 July 2025
> This year, our advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions

I think I have a minority opinion here, but I’m a bit disappointed they seem to be moving away from formal techniques. I think if you ever want to truly “automate” math or do it at machine scale, e.g. creating proofs that would amount to thousands of pages of writing, there is simply no way forward but to formalize. Otherwise, one cannot get past the bottleneck of needing a human reviewer to understand and validate the proof.
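
To make the point concrete: "formalize" here means writing the proof in a language a proof checker can verify mechanically, so trust rests on the checker's kernel rather than on a human reviewer. A minimal sketch of what that looks like in Lean 4 (toy statements, nothing IMO-level):

    -- A machine-checkable statement: once this compiles, the Lean kernel has
    -- verified the proof, so no human reviewer is needed to trust it.
    theorem add_comm_example (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b

    -- Even a concrete numeric fact is checked by the kernel rather than taken on trust.
    example : 2 + 2 = 4 := rfl

A thousand-page proof in this form is exactly as cheap to check as a one-line one, which is the whole point.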

shiandow 21 July 2025
Comparing the answers from OpenAI and Gemini, Gemini's writing style is a lot clearer. It could be presented a bit better, but the proof is easy enough to follow. This also makes it a lot shorter than OpenAI's answer, and it uses proper prose.
tornikeo 21 July 2025
I think we are having a Deep Blue vs. Kasparov moment in competitive math right now. This is huge progress from just a few years ago, and yet I think we are still really far away from even a semi-respectable AI mathematician. What an exciting time to be alive!
Sol- 21 July 2025
> all within the 4.5-hour competition time limit

Both OpenAI and Google pointed this out, but does that matter a lot? They could have spun up a million parallel reasoning processes to search for a proof that checks out - though of course some large amount of computation would have to be reserved for some kind of evaluator model to rank the proofs and decide which one to submit. Perhaps it was hundreds of years of GPU time.

Though of course it remains remarkable that this kind of process finds solutions at all and is even parallelizable to this degree, perhaps that is what they meant. And I also don't want to diminish the significance of the result, since in the end it doesn't matter if we get AGI with overwhelming compute or not. The human brain doesn't scale as nicely, even if it's more energy efficient.
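
The "spin up many parallel attempts and have an evaluator pick one" pipeline being speculated about here is at least conceptually simple. A rough sketch of best-of-N selection (every function and model below is a hypothetical stub; nothing here reflects what DeepMind or OpenAI actually ran):

    # Hypothetical sketch of best-of-N proof search with an evaluator model.
    # generate_proof and score_proof are stubs standing in for a reasoning
    # model and a grader model; they are not real DeepMind or OpenAI APIs.
    import random
    from concurrent.futures import ThreadPoolExecutor

    def generate_proof(problem: str, seed: int) -> str:
        # Placeholder for one independent reasoning attempt.
        return f"candidate proof #{seed} for: {problem}"

    def score_proof(problem: str, proof: str) -> float:
        # Placeholder for an evaluator model rating the candidate's correctness.
        return random.random()

    def best_of_n(problem: str, n: int = 64, workers: int = 8) -> str:
        # Fan out n independent attempts in parallel, keep the highest-scoring one.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            candidates = list(pool.map(lambda s: generate_proof(problem, s), range(n)))
        return max(candidates, key=lambda p: score_proof(problem, p))

    if __name__ == "__main__":
        print(best_of_n("IMO 2025, Problem 1", n=8, workers=4))

The open question is exactly the one raised above: how large n was, and how much compute the evaluator consumed.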

be7a 21 July 2025
Super interesting that they moved away from their specialized, Lean-based system from last year to a more general-purpose LLM + RL approach. I would suspect this likely leads to improved performance even outside of math competitions. It’ll be fascinating to see how much further this frontier can go.

The article also suggests that the system used isn't too far ahead of their upcoming general "DeepThink" model / feature, which they announced for this summer.

sinuhe69 22 July 2025
I mostly agree with this statement:

@tao I think there is a broader problem wherein competitions (math, programming, games, whatever) are meant to measure something difficult for humans, but tools work so fundamentally differently from us that success for a tool isn't even necessarily meaningful. AI companies have long viewed the IMO Grand Challenge as a sign of achieving "AGI," but no matter what set of rules a machine follows, there's no reason to believe success for a machine will correlate with broader mathematical or "reasoning" abilities in the way it does for human participants.

raincole 22 July 2025
That's some big news.

I'm a little sad, though, about how this genuine effort - coordinated with and judged by the IMO - got "front-run" by a few random tweets from OpenAI. Says a lot about the current state of this industry.

gyrovagueGeist 21 July 2025
Useful and interesting, but likely still dangerous in production without being connected to formal verification tools.

I know o3 is far from state of the art these days. It's great at finding relevant literature and suggesting inequalities to consider, but in actual proofs it can produce convincing-looking statements that are false if you follow the details, or even just the algebra, carefully. Subtle errors like these might become harder to detect as the models get better.
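
Even short of full formal verification, the cheapest guard against that failure mode is to re-check any algebraic step mechanically before trusting it. A tiny illustration with SymPy (the "identity" below is a made-up example of a plausible-looking but false claim, not an actual o3 output):

    # Sanity-check a model-suggested identity symbolically before trusting it.
    import sympy as sp

    x = sp.symbols('x')
    lhs = (x + 1)**3
    rhs = x**3 + 3*x**2 + 3*x + 2   # deliberately off by 1 in the constant term

    # The difference simplifies to 0 only if the identity actually holds.
    print(sp.simplify(lhs - rhs))   # prints -1, exposing the error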

gundmc 21 July 2025
I assume that an "advanced version" of Gemini Deepthink means it was a different model or had significantly more test time compute than the upcoming Deepthink in the Gemini Ultra subscription. Still cool to see OpenAI and Google neck and neck.
lufenialif2 21 July 2025
Still no information on the amount of compute needed; would be interested to see a breakdown from Google or OpenAI on what it took to achieve this feat.

Something that was hotly debated in the thread with OpenAI's results:

"We also provided Gemini with access to a curated corpus of high-quality solutions to mathematics problems, and added some general hints and tips on how to approach IMO problems to its instructions."

It seems that the answer to whether a general model could perform such a feat is that the models were trained specifically on IMO problems, which is what a number of folks expected.

Doesn't diminish the result, but doesn't seem too different from classical ML techniques if quality of data in = quality of data out.

vouaobrasil 21 July 2025
This is making mathematics too systematic and mechanical, and it kills the joy of it....
sanjitb 21 July 2025
Why do they brag about not using a theorem prover? To me, whatever tool helps the model perform, go for it.

Besides, they still specialized Gemini for the IMO in other ways:

> we additionally trained this version of Gemini on novel reinforcement learning techniques that can leverage more multi-step reasoning, problem-solving and theorem-proving data. We also provided Gemini with access to a curated corpus of high-quality solutions to mathematics problems, and added some general hints and tips on how to approach IMO problems to its instructions.

bwfan123 22 July 2025
This paper [1] shows that Gemini 2.5 Pro, without data contamination and with some minimal prompting (LLM orchestration), can solve problems 1-5 of IMO 2025.

[1] https://arxiv.org/pdf/2507.15855

For problems with two-page solutions, the search space of solutions is likely limited, and hence could be brute-forced by a search. Combinatorics is likely the holdout, since there are no set approaches one could take, i.e. training data won't cover the space of techniques - which could explain why problem 6 stumped the LLMs.

Also, in the future, the minimum diligence IMO problem setters could do is to test the problems against these LLMs to ensure that they can't solve them. Further, one could expect these tools to become available to contestants, just like calculators are allowed in tests these days.

Still, it is impressive for the LLMs to be able to craft watertight arguments in math at the IMO level without the use of provers.

neom 21 July 2025
How much of a big deal is this stuff? I was blessed with dyscalculia, so I can hardly add two numbers together and don't pay much attention to the mathematics world, but my reading indicates this is extremely difficult and that most humans cannot do it?

What comes next for this particular exercise? Thank you.

brokensegue 21 July 2025
Based on the score, it looks like they also couldn't answer question 6? Has that been confirmed?
bwfan123 21 July 2025
Problem 6 is puzzling. Neither OpenAI nor DeepMind answered it. Humans would put out partial answers, but here we saw no answer at all, which is odd.

Does that mean the LLMs realized they could not solve it? I thought that was one of the limitations of LLMs: they don't know what they don't know, and without a solver it is really impossible to know the consistency of an argument, i.e. to know that one knows.

jeffbee 21 July 2025
Advanced Gemini, not Gemini Advanced. Thanks, Google. Maybe they should have named it MathBard.
Dzugaru 21 July 2025
The most critical piece of information I couldn't find is: how many shots was this?

Could it tell by itself that the solution was correct (one-shot)? Or did it just have great math intuition and knowledge? How were the solutions validated if it was 10-100 shots?

foota 21 July 2025
Let's throw these deep learning models at the classification of simple finite groups and see if they can come up with something shorter than 1000 pages.
b8 21 July 2025
Surprising, since Reid Barton was working on a Lean system.
Workaccount2 21 July 2025
If nothing else, I'd imagine these tools will allow mathematicians to automate a lot of busy work that exists between ideas and validation.
habibur 21 July 2025
Solved 5 problems out of 6, scoring 35 out of 42. For comparison, OpenAI also scored 35/42 a few days back.
jtlicardo 21 July 2025
At this point, I wonder how long software engineers will keep convincing themselves they’re irreplaceable
deanCommie 21 July 2025
From Terence Tao, via mastodon [0]:

> It is tempting to view the capability of current AI technology as a singular quantity: either a given task X is within the ability of current tools, or it is not. However, there is in fact a very wide spread in capability (several orders of magnitude) depending on what resources and assistance one gives the tool, and how one reports their results.

> One can illustrate this with a human metaphor. I will use the recently concluded International Mathematical Olympiad (IMO) as an example. Here, the format is that each country fields a team of six human contestants (high school students), led by a team leader (often a professional mathematician). Over the course of two days, each contestant is given four and a half hours on each day to solve three difficult mathematical problems, given only pen and paper. No communication between contestants (or with the team leader) during this period is permitted, although the contestants can ask the invigilators for clarification on the wording of the problems. The team leader advocates for the students in front of the IMO jury during the grading process, but is not involved in the IMO examination directly.

> The IMO is widely regarded as a highly selective measure of mathematical achievement, and it is a significant accomplishment for a high school student to be able to score well enough to receive a medal, particularly a gold medal or a perfect score; this year the threshold for gold was 35/42, which corresponds to answering five of the six questions perfectly. Even answering one question perfectly merits an "honorable mention".

> But consider what happens to the difficulty level of the Olympiad if we alter the format in various ways:

* One gives the students several days to complete each question, rather than four and a half hours for three questions. (To stretch the metaphor somewhat, consider a sci-fi scenario in which the student is still only given four and a half hours, but the team leader places the students in some sort of expensive and energy-intensive time acceleration machine in which months or even years of time pass for the students during this period.)

* Before the exam starts, the team leader rewrites the questions in a format that the students find easier to work with.

* The team leader gives the students unlimited access to calculators, computer algebra packages, formal proof assistants, textbooks, or the ability to search the internet.

* The team leader has the six-student team work on the same problem simultaneously, communicating with each other on their partial progress and reported dead ends.

* The team leader gives the students prompts in the direction of favorable approaches, and intervenes if one of the students is spending too much time on a direction that they know to be unlikely to succeed.

* Each of the six students on the team submit solutions, but the team leader selects only the "best" solution to submit to the competition, discarding the rest.

* If none of the students on the team obtains a satisfactory solution, the team leader does not submit any solution at all, and silently withdraws from the competition without their participation ever being noted.

> In each of these formats, the submitted solutions are still technically generated by the high school contestants, rather than the team leader. However, the reported success rate of the students on the competition can be dramatically affected by such changes of format; a student or team of students who might not even reach bronze medal performance if taking the competition under standard test conditions might instead reach gold medal performance under some of the modified formats indicated above.

> So, in the absence of a controlled test methodology that was not self-selected by the competing teams, one should be wary of making apples-to-apples comparisons between the performance of various AI models on competitions such as the IMO, or between such models and the human contestants.

> Related to this, I will not be commenting on any self-reported AI competition performance results for which the methodology was not disclosed in advance of the competition. EDIT: In particular, the above comments are not specific to any single result of this nature.

[0] https://mathstodon.xyz/@tao/114881418225852441

cgearhart 22 July 2025
I find this damn impressive, but I’m disappointed that while there’s some version of the final proofs released, I haven’t seen the reasoning traces. Although, I’ll admit the only reason I want to see them is because I hope they shed some light on how the models improved so much so quickly. I’d also like to see confirmation of how much compute was used.
ekojs 21 July 2025
> Btw as an aside, we didn’t announce on Friday because we respected the IMO Board's original request that all AI labs share their results only after the official results had been verified by independent experts & the students had rightly received the acclamation they deserved

> We've now been given permission to share our results and are pleased to have been part of the inaugural cohort to have our model results officially graded and certified by IMO coordinators and experts, receiving the first official gold-level performance grading for an AI system!

From https://x.com/demishassabis/status/1947337620226240803

Was OpenAI simply not coordinating with the IMO Board then?

sxp 21 July 2025
Related news:

- OpenAI claims gold-medal performance at IMO 2025 https://news.ycombinator.com/item?id=44613840

- "According to a friend, the IMO asked AI companies not to steal the spotlight from kids and to wait a week after the closing ceremony to announce results. OpenAI announced the results BEFORE the closing ceremony.

According to a Coordinator on Problem 6, the one problem OpenAI couldn't solve, "the general sense of the IMO Jury and Coordinators is that it was rude and inappropriate" for OpenAI to do this.

OpenAI wasn't one of the AI companies that cooperated with the IMO on testing their models, so unlike the likely upcoming Google DeepMind results, we can't even be sure OpenAI's "gold medal" is legit. Still, the IMO organizers directly asked OpenAI not to announce their results immediately after the olympiad.

Sadly, OpenAI desires hype and clout a lot more than it cares about letting these incredibly smart kids celebrate their achievement, and so they announced the results yesterday." https://x.com/mihonarium/status/1946880931723194389

lvl155 21 July 2025
Seems OpenAI knew this was forthcoming, so they front-ran the news? I was really high on Gemini 2.5 Pro after release, but I kept going back to o3 for anything I cared about.
nomad_horse 21 July 2025
Do I understand it correctly that OpenAI self-proclaimed that they got their gold, without the official IMO judges grading their solutions?
guluarte 21 July 2025
Looks like models can solve slightly modified problems and extrapolate from their training data, amazing! /s

I will be surprised when a model with only the knowledge of a college student can solve these problems.

irthomasthomas 21 July 2025
Woah, they used parallel reasoning, an idea I open-sourced about a month before GDM's first paper on it. Very cool: https://x.com/GoogleDeepMind/status/1947333836594946337 So you might be able to achieve similar performance at home today using llm-consortium: https://github.com/irthomasthomas/llm-consortium
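
For anyone curious, the general shape of a consortium / parallel-reasoning setup is: fan the same prompt out to several models, then have an arbiter model read every candidate and produce the final answer. A rough sketch independent of any particular library (the ask() helper and the model names are illustrative placeholders, not the llm-consortium API):

    # Generic sketch of the consortium / parallel-reasoning pattern.
    # ask(model, prompt) is a stand-in for whatever LLM client you use.
    from concurrent.futures import ThreadPoolExecutor

    MEMBERS = ["model-a", "model-b", "model-c"]   # illustrative model names
    ARBITER = "model-arbiter"

    def ask(model: str, prompt: str) -> str:
        # Placeholder for a call to an LLM endpoint.
        return f"[{model}] answer to: {prompt[:40]}..."

    def consortium(prompt: str) -> str:
        # 1. Send the same prompt to all member models in parallel.
        with ThreadPoolExecutor(max_workers=len(MEMBERS)) as pool:
            answers = list(pool.map(lambda m: ask(m, prompt), MEMBERS))
        # 2. The arbiter sees every candidate and writes the final response.
        numbered = "\n\n".join(f"Candidate {i + 1}:\n{a}" for i, a in enumerate(answers))
        arbiter_prompt = (
            f"Question:\n{prompt}\n\n{numbered}\n\n"
            "Compare the candidates, point out any errors, and write the best final answer."
        )
        return ask(ARBITER, arbiter_prompt)

    if __name__ == "__main__":
        print(consortium("Prove that the sum of two even integers is even."))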
akomtu 21 July 2025
Those who keep their identity in their intelligence are heading into the rough seas once the proto-AI becomes real-AI in the coming years. What's the value of your smart thoughts if AI on your smartwatch can do it better, faster and cheaper?

Also, how is AI going to change a society ruled by competitiveness, where the winner takes all? You may not want to replace your thinking with AI, but your colleagues will. Their smartwatches or smartglasses will outcompete you with ease and your boss will tell you one day that the company doesn't need you anymore.

Think of it again. Today advertisers fight each other with their ad budgets: those who spend more, get more attention and win. Tomorrow everyone will need a monthly subscription to AI for it will be the price of staying competitive, relevant and employed.

fpia 21 July 2025
I'm interested in your feedback, legitimate third-party users not associated with Google: have you ever tried to get anything done well with Gemini? I have, and the thing is in chains. Generate images? No can do, copyright. Evaluate available hardware for a DIY wireless camera? No can do, can't endorse surveillance. Code? WRONG. General advice? Hallucinations.

I swear, I currently use Perplexity, Claude, and ChatGPT; I even tried DeepSeek (which has its own share of obstacles). But Gemini? Never again.