Quantitative AI progress needs accurate and transparent evaluation

(mathstodon.xyz)

Comments

fsh 25 July 2025
I believe that it may be misguided to focus on compute that much, and it would be more instructive to consider the effort that went into curating the training set. The easiest way of solving math problems with an LLM is to make sure that very similar problems are included in the training set. Many of the AI achievements would probably look a lot less miraculous if one could check the training data. The most crass example is OpenAI paying off the FrontierMath creators last year to get exclusive secret access to the problems before the evaluation [1]. Even without resorting to cheating, competition formats are vulnerable to this. It is extremely difficult to come up with truly original questions, so by spending significant resources on re-hashing all kinds of permutations of previous questions, one will probably end up very close to the actual competition set. The first rule I learned about training neural networks is to make damn sure there is no overlap between the training and validation sets. It is interesting that this rule has gone completely out of the window in the age of LLMs.

[1] https://www.lesswrong.com/posts/8ZgLYwBmB3vLavjKE/some-lesso...
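
As a rough illustration of the train/eval overlap check described above (not something any lab or benchmark is known to run in exactly this form), here is a minimal sketch: flag every eval problem that shares a long word n-gram with the training corpus. The 13-word window and whitespace tokenization are illustrative assumptions.

```python
# Minimal sketch of a train/eval overlap (contamination) check.
# The 13-word n-gram window and whitespace tokenization are illustrative only.

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All word n-grams of a document, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_contaminated(eval_problems: list[str], train_docs: list[str], n: int = 13) -> list[int]:
    """Indices of eval problems that share at least one n-gram with the training data."""
    train_grams: set[tuple[str, ...]] = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    return [i for i, problem in enumerate(eval_problems) if ngrams(problem, n) & train_grams]

# Usage: any flagged problem should be dropped from the eval,
# or at least reported alongside the headline score.
# flagged = flag_contaminated(eval_set, training_corpus)
```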

NitpickLawyer 25 July 2025
The problem with benchmarks is that they are really useful for honest researchers, but extremely toxic if used for marketing, clout, etc. Something something Goodhart's law: when a measure becomes a target, it ceases to be a good measure.

It's really hard to trust any public benchmark (for obvious dataset-contamination reasons), but also some private ones (for the obvious reason that providers see most or all of the questions over time, and they can do sneaky things with them).

The only true tests are the ones you write yourself, never publish, and run only against open models. If you want to test commercial SotA models from time to time, you have to consider those tests "burned" and come up with new ones.

pu_pe 25 July 2025
> For instance, if a cutting-edge AI tool can expend $1000 worth of compute resources to solve an Olympiad-level problem, but its success rate is only 20%, then the actual cost required to solve the problem (assuming for simplicity that success is independent across trials) becomes $5000 on the average (with significant variability). If only the 20% of trials that were successful were reported, this would give a highly misleading impression of the actual cost required (which could be even higher than this, if the expense of verifying task completion is also non-trivial, or if the failures to solve the goal were correlated across iterations).

This is a very valid point. Google and OpenAI announced they got gold-medal results with specialized models, but what exactly does that entail? If one of them used a billion dollars in compute and the other a fraction of that, we should know about it. Error rates are equally important. Since there are conflicts of interest here, academia would be best suited for producing reliable benchmarks, but it would need access to the closed models.
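
To make the arithmetic in the quoted passage concrete, here is a minimal sketch; the $1000 per attempt and 20% success rate are Tao's illustrative numbers, and independence across trials is his stated simplification.

```python
# Sketch of the expected-cost arithmetic from the quoted passage: each attempt
# costs a fixed amount and succeeds independently with probability p, so the
# number of attempts until the first success is Geometric(p) with mean 1/p.
import random

COST_PER_ATTEMPT = 1000.0  # USD, Tao's illustrative figure
P_SUCCESS = 0.2            # Tao's illustrative success rate

expected_cost = COST_PER_ATTEMPT / P_SUCCESS
print(f"expected cost: ${expected_cost:,.0f}")  # $5,000

# Monte Carlo, to show the "significant variability" around that mean.
def cost_of_one_solve() -> float:
    attempts = 1
    while random.random() >= P_SUCCESS:
        attempts += 1
    return attempts * COST_PER_ATTEMPT

samples = sorted(cost_of_one_solve() for _ in range(100_000))
mean = sum(samples) / len(samples)
p95 = samples[int(0.95 * len(samples))]
print(f"simulated mean: ${mean:,.0f}, 95th percentile: ${p95:,.0f}")
```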

ozgrakkurt 25 July 2025
Off topic, but just opening the link and actually being able to read the posts and go to the profile in a browser, without an account, feels really good. Opening a Mastodon profile, fk Twitter.

mhl47 25 July 2025
Side note: what is going on with these comments on Mathstodon? From moon-landing denials, to insults, to allegations that he used AI to write this ... almost all of them are to some degree insane.

pama 25 July 2025
This sounds very reasonable to me.

When considering top-tier labs that optimize inference and own the GPUs: USD 5000 of electricity at a data center paying 4 cents per kWh (which may be possible to arrange or beat in some US counties with special industrial contracts) buys about 2 trillion tokens for the R1-0528 model, assuming a 120 kW draw for the B200 NVL72 hardware and the (still to be fully optimized) sglang inference pipeline: https://lmsys.org/blog/2025-06-16-gb200-part-1/

Although 2T tokens is not an unreasonable budget for getting high-precision answers to challenging math questions, such a high token count would strongly suggest there are lots of unknown techniques deployed at these labs.

If one adds the cost of GPU ownership or rental, say 2 USD/h/GPU, then the number of tokens for 5k USD shrinks dramatically to only 66B tokens, which is still high for usual techniques that try to optimize for a best single answer in the end, but perhaps plausible if the vast majority of these are intermediate thinking tokens and a lot of the value comes from LLM-based verification.
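
For what it's worth, the figures above can be reproduced with back-of-envelope arithmetic. The per-rack throughput below is not quoted anywhere; it is the value implied by the comment's own totals (roughly 540k output tokens per second for a 72-GPU NVL72 rack, i.e. about 7.5k tokens/s per GPU).

```python
# Back-of-envelope check of the figures in the comment above. The per-rack
# throughput is an assumption derived from the comment's own totals.

BUDGET_USD = 5_000
ELECTRICITY_USD_PER_KWH = 0.04
RACK_DRAW_KW = 120
TOKENS_PER_SEC_PER_RACK = 540_000   # assumed: ~7.5k tokens/s/GPU * 72 GPUs
GPUS_PER_RACK = 72
RENTAL_USD_PER_GPU_HOUR = 2.0

# Electricity-only scenario: how long $5000 keeps the rack powered,
# and how many tokens that buys.
hours_on_power = BUDGET_USD / (ELECTRICITY_USD_PER_KWH * RACK_DRAW_KW)
tokens_power_only = hours_on_power * 3600 * TOKENS_PER_SEC_PER_RACK
print(f"electricity only: {hours_on_power:,.0f} h, ~{tokens_power_only / 1e12:.1f}T tokens")

# Rental scenario: ownership/rental cost dominates, so the same budget buys
# far fewer rack-hours and far fewer tokens.
hours_on_rental = BUDGET_USD / (RENTAL_USD_PER_GPU_HOUR * GPUS_PER_RACK)
tokens_rental = hours_on_rental * 3600 * TOKENS_PER_SEC_PER_RACK
print(f"at $2/h/GPU: {hours_on_rental:,.0f} h, ~{tokens_rental / 1e9:.0f}B tokens")
```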

ipnon 25 July 2025
Tao’s commentary is more practical and insightful than that of all the “rationalist” doomers put together.

paradite 25 July 2025
I believe everyone should run their own evals on their own tasks or use cases.

Shameless plug, but I made a simple app for anyone to create their own evals locally:

https://eval.16x.engineer/

stared 25 July 2025
I agree that once a challenge is shown to be achievable at all (heavier-than-air flight, the Moon landing, a gold medal at the IMO), the next question is whether it makes sense economically.

I like the ARC-AGI approach because it shows both axes, score and price, and places a human benchmark on them.

https://arcprize.org/leaderboard

js8 25 July 2025
LLMs could be very useful for formalizing the problem and assumptions (converting from natural language), but once the problem is described in a formal way (it could even be described in some fuzzy logic), more reliable AI techniques should be applied.

Interestingly, Tao mentions https://teorth.github.io/equational_theories/, and I believe this is better progress than LLMs doing math. I believe enhancing Lean with more tactics and formalizing those in Lean itself is a more fruitful avenue for AI in math.
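
As a toy illustration (not taken from the equational_theories project): once a statement is formalized in Lean, an existing automation tactic can sometimes discharge it outright, and "more tactics" means more statements closed this way without hand-written proofs.

```lean
-- Toy Lean 4 example, purely illustrative: a formalized statement closed by
-- an existing automation tactic (`omega` decides linear arithmetic over Nat/Int).
example (a b : Nat) : a + b = b + a := by
  omega
```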

kristianp 14 hours ago
It's going to take a large step up in transparency for AI companies to do this. It was back in the GPT-4 days that OpenAI stopped reporting model size, for example, and the others followed suit.

data_maan 22 hours ago
The concept of a pre-registered eval (by analogy with a pre-registered study) would go a long way towards fixing this.
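
One concrete (hypothetical) way to do that: publish a cryptographic commitment to the exact problem set and grading script before any model is evaluated, then release the files afterwards so third parties can verify nothing was swapped or trimmed. A minimal sketch, with illustrative file names:

```python
# Sketch of a pre-registered eval via a published commitment. File names and
# workflow are illustrative assumptions, not an existing standard.
import hashlib, json, pathlib

def commitment(paths: list[str]) -> str:
    """SHA-256 digest over the sorted, concatenated eval files."""
    h = hashlib.sha256()
    for p in sorted(paths):
        h.update(pathlib.Path(p).read_bytes())
    return h.hexdigest()

# Before evaluation: publish this digest (paper, blog post, public registry).
digest = commitment(["problems.jsonl", "grader.py"])
print(json.dumps({"preregistered_sha256": digest}))

# After evaluation: release the files; anyone can recompute the digest and
# confirm it matches the one published in advance.
```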

More information

https://mathstodon.xyz/@friederrr/114881863146859839

iloveoof 25 July 2025
Moore’s Law for AI Progress: AI metrics will double every two years whether the AI gets smarter or not.

kingstnap 25 July 2025
My own thoughts on it are that it's entirely crazy that we focus so much on "real world" fixed benchmarks.

I should write an article on it sometime, but I think the incessant focus on data someone collected from the mystical "real world" over well designed synthetic data from a properly understood algorithm is really damaging to proper understanding.

akomtu 25 July 2025
The benchmarks should really add the test of data compression. Intelligence is mostly about discovering the underlying principles, the ability to see simple rules behind complex behaviors, and data compression captures this well. For example, if you can look at a dataset of planetary and stellar motions and compress it into a simple equation, you'd be considered wildly intelligent. If you can't remember and reproduce a simple checkerboard pattern, you'd be considered dumb. Another example is drawing a duck in SVG - another form of data compression. Data extrapolation, on the other hand, is the opposite problem, which can be solved by imitation or by understanding the rules producing the data. Only the latter deserves to be called intelligence. Note, though, that understanding the rules isn't always a superior method. When we are driving, we drive by imitation based on our extensive experience with similar situations, hardly understanding the physics of driving.
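
A crude way to see the point: structured data compresses far better than noise precisely because a simple rule generates it. The sketch below uses zlib only as a stand-in for whatever the model or benchmark would actually do the compressing with.

```python
import os, zlib

# 256x256 checkerboard of 0/1 bytes: highly structured, so it compresses well.
checkerboard = bytes((x + y) % 2 for y in range(256) for x in range(256))
# Random bytes: no underlying rule, so compression barely helps.
noise = os.urandom(256 * 256)

for name, data in [("checkerboard", checkerboard), ("random noise", noise)]:
    ratio = len(zlib.compress(data, 9)) / len(data)
    print(f"{name}: compressed to {ratio:.1%} of original size")
```
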
BrenBarn 25 July 2025
I like Tao, but it's always so sad to me to see people talk in this detached rational way about "how" to do AI without even mentioning the ethical and social issues involved. It's like pondering what's the best way to burn down the Louvre.