Grok 3: Another win for the bitter lesson

(thealgorithmicbridge.com)

Comments

bccdee 20 February 2025
The creation of a model which is "co-state-of-the-art" (assuming it wasn't trained on the benchmarks directly) is not a win for scaling laws. I could just as easily claim that xAI's failure to significantly outperform existing models despite "throwing more compute at Grok 3 than even OpenAI could" is further evidence that hyper-scaling is a dead end which will only yield incremental improvements.

Obviously more computing power makes the model better. That is a completely banal observation. The rest of this 2,000-word article is groping around for a way to take an insight about the difference between '70s symbolic AI and the neural networks of the 2010s and apply it to the difference between GPT-4 and Grok 3, off the back of a single set of benchmarks. It's a bad article.

bambax 20 February 2025
This article is weak and just general speculation.

Many people doubt the actual performance of Grok 3 and suspect it has been specifically trained on the benchmarks. And Sabine Hossenfelder says this:

> Asked Grok 3 to explain Bell's theorem. It gets it wrong just like all other LLMs I have asked because it just repeats confused stuff that has been written elsewhere rather than looking at the actual theorem.

https://x.com/skdh/status/1892432032644354192

Which shows that "massive scaling", even enormous, gigantic scaling, doesn't improve intelligence one bit; it improves scope, maybe, or flexibility, or coverage, or something, but not "intelligence".

thatgerhard 16 minutes ago
I've been using Grok 3 with deep think for 2 days now, and the things it has built are way beyond any other LLM I've tried.

smy20011 20 February 2025
Did they? Deepseek spent about 17 months achieving SOTA results with a significantly smaller budget. xAI's model isn't a substantial leap beyond Deepseek R1, yet it uses 100 times more compute.

Given $3 billion, xAI would choose to invest $2.5 billion in GPUs and $0.5 billion in talent. Deepseek would invest $1 billion in GPUs and $2 billion in talent.

I would argue that the latter approach (Deepseek's) is more scalable. It's extremely difficult to increase compute by 100 times, but with sufficient investment in talent, achieving a 10x increase in compute is more feasible.

rfoo 20 February 2025
I'm pretty skeptical of that 75% on GPQA Diamond for a non-reasoning model. Hope that xAI can make Grok 3 API available next week so I can run it against some private evaluations to see if it's really this good.

Another nit-pick: I don't think DeepSeek had 50k Hopper GPUs. Maybe they have 50k now, after getting the world's attention and having a state-sponsored grey market backing them, but that 50k number is certainly dreamed up. During the past year DeepSeek's intern recruitment ads always just mentioned "unlimited access to 10k A100s", suggesting that they may have had very limited H100/H800s, and most of their research ideas were validated on smaller models on an Ampere cluster. The 10k A100 number matches a cluster their parent hedge fund announced a few years ago. All in all, my estimate is they had somewhat more (maybe 20k) A100s, and single-digit thousands of H800s.

viraptor 20 February 2025
This is a weird takeaway from the recent changes. Right now companies can scale because there's a stupid amount of stupid money flowing into the AI craze, but that's going to end. Companies are already discovering the issues with monetising those systems. Sure, they can "let go" and burn the available cash, but the investors will eventually come knocking. Since everyone figures out similar tech anyway, it's the people with the most experience improving the tech who will be in the best position long term, while OpenAI will be stuck trying to squeeze adverts and monitoring into their chat for cash flow.

nickfromseattle 20 February 2025
Side question: let's say Grok is comparable in intelligence to other leading models. Will any serious business switch their default AI capabilities to Grok?

aqueueaqueue 20 February 2025
How bitter is the bitter lesson when throwing more compute at the problem is costing billions? Maybe the bitter lesson is more about money now than about hardware: you are scaling up investment, not just relying on Moore's law. But I think there is a path for less power-hungry models that people can run affordably without VC money.

Rochus 20 February 2025
Interesting, but I think the article's argument for the "bitter lesson" relies on logical fallacies. First, it misrepresents critics of scaling as dismissing compute entirely, then frames scaling and optimization as mutually exclusive strategies (a false dilemma), ignoring their synergy. E.g., DeepSeek's algorithmic innovations under export constraints augmented, not replaced, its scaling efforts. The article also overgeneralizes from limited cases, asserting that compute will dominate the "post-training era" while overlooking potential disruptors like efficient architectures. The CEO's statements are hardly sufficient to support its claims. A balanced view aligned with the "bitter lesson" should recognize that scaling general methods (e.g. learning algorithms) inherently requires both compute and innovation.

user14159265 20 February 2025
It will be interesting to see how talent acquisition evolves. Many great engineers were put off by strong DEI-focused PR, and even more oppose the sudden opportunistic shift to the right. Will Muslims continue to want to work for Google? Will Europeans work for X? Some may have previously avoided close relations with China for ethical reasons. Will the same soon apply to the US?

Amekedl 20 February 2025
Another AI hype blog entry. Not even a mention of the differently colored bars on the benchmark results. For me, Grok 3 does not prove or disprove scaling laws in any meaningful capacity.

ArtTimeInvestor 20 February 2025
It looks like the USA is bringing in-house all the technology needed to build AI.

TSMC has a factory in the USA now, ASML too. OpenAI, Google, xAI and Nvidia are natively in the USA.

Meanwhile, no other country is even close to being able to build AI on its own.

Is the USA going to "own" the world by becoming the keeper of AI? Or is there an alternative future that has a probability > 0?

PaulHoule 20 February 2025
Inference cost rules everything around me.

petesergeant 20 February 2025
The author has come up with their own _unusual_ definition of the bitter lesson. In fact, as a direct quote, the original is:

> Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation.

e.g.: “the study of linguistics doesn’t help you build an LLM” or “you don’t need to know about chicken physiology to make a vision system that tells you how old a chicken is”

The author then uses a narrow and _unusual_ definition of what computation _means_, by saying it simply means access to fast chips, rather than the work you can perform on them, which would obviously include how efficiently you use them.

In short, this article misuses two terms to more simply say “looks like the scaling laws still work”.

s1mplicissimus 20 February 2025
oh what a surprise, a new model performs better on bar charts than the old models. yawn

GaggiX 20 February 2025
The bitter lesson is about the fact that general methods that leverage computation are ultimately the most effective. Grok 3 is not more general than DeepSeek or OpenAI models, so mentioning the bitter lesson here doesn't make much sense; it's just the scaling laws.

dubeye 20 February 2025
I use ChatGPT for general brain dumping.

I've compared my last week's queries and prefer Grok 3

graycat 20 February 2025
> Grok 3 performs at a level comparable to, and in some cases even exceeding, models from more mature labs like OpenAI, Google DeepMind, and Anthropic. It tops all categories in the LMSys arena and the reasoning version shows strong results—o3-level—in math,....

"Math"? Fields Medal level? Tenure? Ph.D.? ... high school plane geometry???

As in

'Grok 3 AI and Some Plane Geometry'

at

https://news.ycombinator.com/item?id=43113949

Grok 3 failed at a plane geometry exercise.

_giorgio_ 20 February 2025
Grok is the best LLM on https://lmarena.ai/.

---

No benchmarks involved, just user preference.

| Rank* (UB) | Rank (StyleCtrl) | Model | Arena Score | 95% CI | Votes | Organization | License |
|---|---|---|---|---|---|---|---|
| 1 | 1 | chocolate (Early Grok-3) | 1402 | +7/-6 | 7829 | xAI | Proprietary |
| 2 | 4 | Gemini-2.0-Flash-Thinking-Exp-01-21 | 1385 | +5/-5 | 13336 | Google | Proprietary |
| 2 | 2 | Gemini-2.0-Pro-Exp-02-05 | 1379 | +5/-6 | 11197 | Google | Proprietary |

cowpig 20 February 2025
I haven't seen Grok 3 on any benchmark leaderboard other than LM Arena. Has anyone else?

sylware 20 February 2025
Is the next step ML-inference fusion, aka an artificial small brain?

readthenotes1 20 February 2025
I had to ask Grok 3 what the bitter lesson was. It gave a plausible answer (compute scale beats human cleverness).

vasco 20 February 2025
That's not what "the exception that proves the rule" means.