I am kind of amazed at how many commenters respond to this result by confidently asserting that LLMs will never generate 'truly novel' ideas or problem solutions.
> AI is a remixer; it remixes all known ideas together. It won't come up with new ideas
> it's not because the model is figuring out something new
> LLMs will NEVER be able to do that, because it doesn't exist
It's not enough to say 'it will never be able to do X because it's not in the training data,' because we have countless counterexamples to this statement (e.g. 167,383 * 426,397 = 71,371,609,051, or the above announcement). You need to say why it can do some novel tasks but could never do others. And it should be clear why this post or others like it don't contradict your argument.
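The multiplication quoted above is easy to check by machine. A minimal sketch using exact integer arithmetic; the variable names are illustrative and not from the thread:

```python
# Check the product quoted in the comment using Python's exact integers.
a, b = 167_383, 426_397
claimed = 71_371_609_051

product = a * b
print(product == claimed)  # True
```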
If you have been making these kinds of arguments against LLMs and acknowledge that novelty lies on a continuum, I am really curious why you draw the line where you do. And most importantly, what evidence would change your mind?
I have long said I am an AI doubter until AI could print out the answers to hard problems or ones requiring tons of innovation. Assuming this is verified to be correct (not by AI) then I just became a believer. I would like to see a few more AI inventions to know for sure, but wow, it really is a new and exciting world. I really hope we use this intelligence resource to make the world better.
For those, like me, who find the prompt itself of interest …
> A full transcript of the original conversation with GPT-5.4 Pro can be found here [0] and GPT-5.4 Pro’s write-up from the end of that transcript can be found here [1].
I don't know why I am still perpetually shocked that the default assumption is that humans are somehow unique.
It's this pervasive belief that underlies so much discussion around what it means to be intelligent. The null hypothesis goes out the window.
People constantly make comments like "well it's just trying a bunch of stuff until something works" and it seems that they do not pause for a moment to consider whether or not that also applies to humans.
If they do, they apply it in only the most restrictive way imaginable, some 2 dimensional caricature of reality, rather than considering all the ways that humans try and fail in all things throughout their lifetimes in the process of learning and discovery.
There's still this seeming belief in magic and human exceptionalism, deeply held, even in communities that otherwise tend to revolve around the sciences and the empirical.
I like to imagine that the number of consumed tokens before a solution is found is a proxy for how difficult a problem is, and it looks like Opus 4.6 consumed around 250k tokens. That means that a tricky React refactor I did earlier today at work was about half as hard as an open problem in mathematics! :)
The capabilities of AI are determined by the cost function it's trained on.
That's a self-evident thing to say, but it's worth repeating, because there's this odd implicit notion sometimes that you train on some cost function, and then, poof, "intelligence", as if that was a mysterious other thing. Really, intelligence is minimizing a complex cost function. The leadership of the big AI companies sometimes implies something else when they talk of "generalization". But there is no mechanism to generate a model with capabilities beyond what is useful to minimize a specific cost function.
You can view the progress of AI as progress in coming up with smarter cost functions: Cleaner, larger datasets, pretraining, RLHF, RLVR.
Notably, exciting early progress in AI came in places where simple cost functions generate rich behavior (Chess, Go).
The recent impressive advances in AI are similar. Mathematics and coding are extremely structured, and properties of a coding or maths result can be verified using automatic techniques. You can set up an RLVR "game" for maths and coding. It thus seems very likely to me that this is where the big advances are going to come from in the short term.
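To make "verifiable" concrete, here is a minimal sketch of an automatically checkable reward for a toy task (factoring an integer). The function name `verifiable_reward` and the task itself are assumptions chosen for illustration; this is not how any lab actually implements RLVR.

```python
def verifiable_reward(target: int, candidate: tuple[int, int]) -> float:
    """Return 1.0 if candidate is a nontrivial factorization of target, else 0.0.

    Hypothetical reward function: the check is cheap and exact, so no human
    judgment is needed to score an answer.
    """
    p, q = candidate
    if p <= 1 or q <= 1:
        return 0.0  # reject trivial answers such as (1, target)
    return 1.0 if p * q == target else 0.0


print(verifiable_reward(71_371_609_051, (167_383, 426_397)))  # 1.0
print(verifiable_reward(71_371_609_051, (3, 5)))  # 0.0
```

The point of the sketch is only that the score comes from a mechanical check; tasks with social or open-ended rewards have no such function to call.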
However, it does not follow that maths ability on par with expert mathematicians will lead to superiority over human cognitive ability broadly. A lot of what humans do has social rewards which are not verifiable, or includes genuine Knightian uncertainty where a reward function cannot be built without actually operating independently in the world.
To be clear, none of the above is supposed to talk down past or future progress in AI; I'm just trying to be more nuanced about where I believe progress can be fast and where it's bound to be slower.
I am thinking there’s a large category of problems that can be solved by resampling existing proofs.
It’s the kind of brute-force expedition a machine can attempt relentlessly, where humans would go mad trying.
It probably doesn’t really advance the field, but it can turn conjectures into theorems.
Their 'Open Problems page' linked below gives some interesting context. They list 15 open problems in total, categorized as 'moderately interesting,' 'solid result,' 'major advance,' or 'breakthrough.' The solved problem is listed as 'moderately interesting,' which is presumably the easiest category. But it's notable that the problem was selected and posted here before it was solved. I wonder how long until the other 3 problems in this category are solved.
I've never yet been "that guy" on HN but... the title seems misleading. The actual title is "A Ramsey-style Problem on Hypergraphs" and a more descriptive title would be "All latest frontier models can solve a frontier math open problem". (It wasn't just GPT 5.4)
> Subsequent to this solve, we finished developing our general scaffold for testing models on FrontierMath: Open Problems. In this scaffold, several other models were able to solve the problem as well: Opus 4.6 (max), Gemini 3.1 Pro, and GPT-5.4 (xhigh).
Interesting. What's that “scaffold”? A sort of unit test framework for proofs?
"In this scaffold, several other models were able to solve the problem as well: Opus 4.6 (max), Gemini 3.1 Pro, and GPT-5.4 (xhigh)."
I find that very surprising. This problem seemed out of reach 3 months ago, but now the 3 frontier models are able to solve it.
Is everybody distilling each other's models? Are companies selling the same data and RL environments to all the big labs? Can anybody more involved share some rumors? :P
I do believe that AI can solve hard problems, but the fact that progress is spread so evenly in such a narrow domain makes me a bit suspicious that there is a hidden factor. Like, did some "data worker" solve a problem like that and it's now in the training data?
I was trying to get Claude and Codex to try and write a proof in Isabelle for the Collatz conjecture, but annoyingly it didn't solve it, and I don't feel like I'm any closer than I was when I started. AI is useless!
In all seriousness, this is pretty cool. I suspect that there's a lot of theoretical math that hasn't been solved simply because of the "size" of the proof. An AI feedback loop into something like Isabelle or Lean does seem like it could end up opening up a lot of proofs.
I feel like reading some of these comments, some people need to go and read the history of ideas and philosophy (which is easier today than ever before with the help of LLMs!)
It's like I'm reading 17th-18th century debates spurring the same arguments between rationalists and empiricists, lol. Maybe we're due for a 21st century Kant.
As someone with only passing exposure to serious math, this section was by far the most interesting to me:
> The author assessed the problem as follows.
> [number of mathematicians familiar, number trying, how long an expert would take, how notable, etc]
How reliably can we know these things a priori? Are these mostly guesses? I don't mean to diminish the value of guesses; I'm curious how reliable these kinds of guesses are.
I don't understand the position that learning through inference/example is somehow inferior to top-down, rules-based learning.
Humans learn many, and perhaps even the majority, of things through observed examples and inference of the "rules". Not from primers and top down explanation.
E.g. Observing language as a baby. Suddenly you can speak grammatically correctly even if you can't explain the grammar rules.
Or: Observing a game being played to form an understanding of the rules, rather than reading the rulebook
Further: the majority of "novel" insights are simply the combination of existing ideas.
Look at any new invention, music, art etc and you can almost always reasonably explain how the creator reached that endpoint. Even if it is a particularly novel combination of existing concepts.
Seems like the high-compute parallel thinking models weren't even needed: both the normal 5.4 and Gemini 3.1 Pro solved it. Somehow Gemini 3 Deep Think couldn't solve it.
I wonder how much of this meteoric progress in actually creating novel mathematics is because the training data is of a much higher standard than code, for example.
New goalpost, and I promise I'm not being facetious at all, genuinely curious:
Can an AI pose a frontier math problem that is of any interest to mathematicians?
I would guess that (1) AI solving frontier math problems and (2) AI posing interesting/relevant math problems would, together, be an "oh shit" moment, because that would be true PhD-level research.
This is a remarkable result if confirmed independently. The gap between solving competition problems and open research problems has always been significant - bridging that gap suggests something qualitatively different in the model capabilities.
> This problem is about improving lower bounds on the values of a sequence, , that arises in the study of simultaneous convergence of sets of infinite series, defined as follows.
One thing I notice in the AlphaEvolve paper as well as here is that these LLMs have been shown to solve optimization problems, something we have been using computers for for a really long time. In fact, I think the AlphaEvolve-style prompt augmentation approach is a more principled version of what these guys have done here, and I am fairly confident this one would have been solved with that approach as well.
In spirit, the LLM seems to compute the (meta-)optimization step(s) in activation space. Or, it is retrieving candidate proposals.
It would be interesting to see if we can extract or model the exact algorithms from the activations. Or, it is simply retrieving and proposing a deductive closure of said computation.
In the latter case, it would mean that LLMs alone can never "reason" and you need an external planner+verifier (an AlphaEvolve-style evolutionary planner, for example).
We are still looking for proof of the former behaviour.
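For readers unfamiliar with the planner+verifier idea, here is a minimal sketch of a propose-and-verify loop. Everything in it is an assumption for illustration: a random mutator stands in for the LLM proposer, and the task is a toy Ramsey-type search (2-color the edges of K_5 with no monochromatic triangle, which is possible because R(3,3) = 6). It is not AlphaEvolve's actual algorithm and not the problem from the post.

```python
import itertools
import random

N = 5  # vertices of the complete graph K_5
EDGES = list(itertools.combinations(range(N), 2))
TRIANGLES = list(itertools.combinations(range(N), 3))


def mono_triangles(coloring: dict) -> int:
    """Verifier: count triangles whose three edges share one color."""
    count = 0
    for a, b, c in TRIANGLES:
        if coloring[(a, b)] == coloring[(a, c)] == coloring[(b, c)]:
            count += 1
    return count


def propose(parent: dict, rng: random.Random) -> dict:
    """Stand-in for the LLM proposer: mutate the parent or start afresh."""
    if rng.random() < 0.1:
        # Occasional fresh random coloring to escape local minima.
        return {e: rng.randint(0, 1) for e in EDGES}
    child = dict(parent)
    edge = rng.choice(EDGES)
    child[edge] = 1 - child[edge]  # flip the color of one edge
    return child


def search(max_steps: int = 20_000, seed: int = 0):
    """Keep the best candidate so far; accept a proposal if it is no worse."""
    rng = random.Random(seed)
    best = {e: rng.randint(0, 1) for e in EDGES}
    best_score = mono_triangles(best)
    for _ in range(max_steps):
        if best_score == 0:
            return best
        child = propose(best, rng)
        score = mono_triangles(child)
        if score <= best_score:
            best, best_score = child, score
    return best if best_score == 0 else None


solution = search()
print(solution is not None and mono_triangles(solution) == 0)
```

The division of labor is the point: the proposer can be arbitrarily unreliable because only candidates that pass the verifier are ever returned.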
What are the odds that this is because OpenAI is pouring more money into high-publicity stunts like this, rather than its model actually being better than Anthropic's?
Setting aside the supposed achievement, which is supposedly confirmed, my point is that Epoch.ai is possibly just a PR firm for *Western* AI providers, in which case this news may not be trustworthy.
Fantastic news! That means with the right support tooling existing models are already capable of solving novel mathematics. There’s probably a lot of good mathematics out there we are going to make progress on.
Epoch confirms GPT5.4 Pro solved a frontier math open problem
(epoch.ai) 383 points by in-silico 17 hours ago | 553 comments
Comments
[0] https://epoch.ai/files/open-problems/gpt-5-4-pro-hypergraph-...
[1] https://epoch.ai/files/open-problems/hypergraph-ramsey-gpt-5...
https://epoch.ai/frontiermath/open-problems
Super cool, of course.