Something weird is happening with LLMs and chess Hackernews Viewer

Something weird is happening with LLMs and chess

(dynomight.substack.com)

696 points by crescit_eundo 14 November 2024 | 474 comments

Comments

swiftcoder 15 November 2024

I feel like the article neglects one obvious possibility: that OpenAI decided that chess was a benchmark worth "winning", special-cases chess within gpt-3.5-turbo-instruct, and then neglected to add that special-case to follow-up models since it wasn't generating sustained press coverage.

a_wild_dandan 15 November 2024

Important testing excerpts:

- "...for the closed (OpenAI) models I tried generating up to 10 times and if it still couldn’t come up with a legal move, I just chose one randomly."

- "I ran all the open models (anything not from OpenAI, meaning anything that doesn’t start with gpt or o1) myself using Q5_K_M quantization"

- "...if I gave a prompt like “1. e4 e5 2. ” (with a space at the end), the open models would play much, much worse than if I gave a prompt like “1 e4 e5 2.” (without a space)"

- "I used a temperature of 0.7 for all the open models and the default for the closed (OpenAI) models."

Between the tokenizer weirdness, temperature, quantization, random moves, and the chess prompt, there's a lot going on here. I'm unsure how to interpret the results. Fascinating article though!

azeirah 14 November 2024

Maybe I'm really stupid... but perhaps if we want really intelligent models we need to stop tokenizing at all? We're literally limiting what a model can see and how it percieves the world by limiting the structure of the information streams that come into the model from the very beginning.

I know working with raw bits or bytes is slower, but it should be relatively cheap and easy to at least falsify this hypothesis that many huge issues might be due to tokenization problems but... yeah.

Surprised I don't see more research into radicaly different tokenization.

fabiospampinato 15 November 2024

It's probably worth to play around with different prompts and different board positions.

For context this [1] is the board position the model is being prompted on.

There may be more than one weird thing about this experiment, for example giving instructions to the non-instruction tuned variants may be counter productive.

More importantly let's say you just give the model the truncated PGN, does this look like a position where white is a grandmaster level player? I don't think so. Even if the model understood chess really well it's going to try to predict the most probable move given the position at hand, if the model thinks that white is a bad player, and the model is good at understanding chess, it's going to predict bad moves as the more likely ones because that would better predict what is most likely to happen here.

[1]: https://i.imgur.com/qRxalgH.png

snickerbockers 15 November 2024

Does it ever try an illegal move? OP didn't mention this and I think it's inevitable that it should happen at least once, since the rules of chess are fairly arbitrary and LLMs are notorious for bullshitting their way through difficult problems when we'd rather they just admit that they don't have the answer.

niobe 15 November 2024

I don't understand why educated people expect that an LLM would be able to play chess at a decent level.

It has no idea about the quality of it's data. "Act like x" prompts are no substitute for actual reasoning and deterministic computation which clearly chess requires.

cjbprime 15 November 2024

> I ran all the open models (anything not from OpenAI, meaning anything that doesn’t start with gpt or o1) myself using Q5_K_M quantization, whatever that is.

It's just a lossy compression of all of the parameters, probably not important, right?

jrecursive 14 November 2024

i think this has everything to do with the fact that learning chess by learning sequences will get you into more trouble than good. even a trillion games won't save you: https://en.wikipedia.org/wiki/Shannon_number

that said, for the sake of completeness, modern chess engines (with high quality chess-specific models as part of their toolset) are fully capable of, at minimum, tying every player alive or dead, every time. if the opponent makes one mistake, even very small, they will lose.

while writing this i absently wondered if you increased the skill level of stockfish, maybe to maximum, or perhaps at least an 1800+ elo player, you would see more successful games. even then, it will only be because the "narrower training data" (ie advanced players won't play trash moves) at that level will probably get you more wins in your graph, but it won't indicate any better play, it will just be a reflection of less noise; fewer, more reinforced known positions.

anotherpaulg 15 November 2024

I found a related set of experiments that include gpt-3.5-turbo-instruct, gpt-3.5-turbo and gpt-4.

Same surprising conclusion: gpt-3.5-turbo-instruct is much better at chess.

https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/

dr_dshiv 15 November 2024

OpenAI has a TON of experience making game-playing AI. That was their focus for years, if you recall. So it seems like they made one model good at chess to see if it had an overall impact on intelligence (just as learning chess might make people smarter, or learning math might make people smarter, or learning programming might make people smarter)

codeflo 15 November 2024

At this point, we have to assume anything that becomes a published benchmark is specifically targeted during training. That's not something specific to LLMs or OpenAI. Compiler companies have done the same thing for decades, specifically detecting common benchmark programs and inserting hand-crafted optimizations. Similarly, the shader compilers in GPU drivers have special cases for common games and benchmarks.

underlines 14 November 2024

Can you try increasing compute in the problem search space, not in the training space? What this means is, give it more compute to think during inference by not forcing any model to "only output the answer in algebraic notation" but do CoT prompting: "1. Think about the current board 2. Think about valid possible next moves and choose the 3 best by thinking ahead 3. Make your move"

Or whatever you deem a good step by step instruction of what an actual good beginner chess player might do.

Then try different notations, different prompt variations, temperatures and the other parameters. That all needs to go in your hyper-parameter-tuning.

One could try using DSPy for automatic prompt optimization.

PaulHoule 14 November 2024

Maybe that one which plays chess well is calling out to a real chess engine.

chvid 15 November 2024

Theory 5: GPT-3.5-instruct plays chess by calling a traditional chess engine.

lukev 15 November 2024

I don't necessarily believe this for a second but I'm going to suggest it because I'm feeling spicy.

OpenAI clearly downgrades some of their APIs from their maximal theoretic capability, for the purposes of response time/alignment/efficiency/whatever.

Multiple comments in this thread also say they couldn't reproduce the results for gpt3.5-turbo-instruct.

So what if the OP just happened to test at a time, or be IP bound to an instance, where the model was not nerfed? What if 3.5 and all subsequent OpenAI models can perform at this level but it's not strategic or cost effective for OpenAI to expose that consistently?

For the record, I don't actually believe this. But given the data it's a logical possibility.

quantadev 15 November 2024

We know from experience with different humans that there are different types of skills and different types of intelligence. Some savants might be superhuman at one task but basically mentally disabled at all other things.

It could be that the model that does chess well just happens to have the right 'connectome' purely by accident of how the various back-propagations worked out to land on various local maxima (model weights) during training. It might even be (probably is) a non-verbal connectome that's just purely logic rules, having nothing to do with language at all, but a semantic space pattern that got landed on accidentally, which can solve this class of problem.

Reminds me of how Daniel Tammet just visually "sees" answers to math problems in his mind without even knowing how they appear. It's like he sees a virtual screen with a representation akin to numbers (the answer) just sitting there to be read out from his visual cortex. He's not 'working out' the solutions. They're just handed to him purely by some connectome effects going on in the background.

Miraltar 15 November 2024

related : Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task https://arxiv.org/abs/2210.13382

Chess-GPT's Internal World Model https://adamkarvonen.github.io/machine_learning/2024/01/03/c... discussed here https://news.ycombinator.com/item?id=38893456

ericye16 15 November 2024

I agree with some of the other comments here that the prompt is limiting. The model can't do any computation without emitting tokens and limiting the numbers of tokens it can emit is going to limit the skill of the model. It's surprising that any model at all is capable of performing well with this prompt in fact.

fsndz 15 November 2024

wow I actually did something similar recently and no LLM could win and the centipawn loss was always going through the roof (sort of). I created a leaderboard based on it. https://www.lycee.ai/blog/what-happens-when-llms-play-chess

I am very surprised by the perf of got-3.5-turbo-instruct. Beating stockfish ? I will have to run the experiment with that model to check that out

bryan0 14 November 2024

I remember one of the early "breakthroughs" for LLMs in chess was that if it could actually play legal moves(!) In all of these games are the models always playing legal moves? I don't think the article says. The fact that an LLM can even reliably play legal moves, 20+ moves into a chess game is somewhat remarkable. It needs to have an accurate representation of the board state even though it was only trained on next token prediction.

Havoc 15 November 2024

My money is on a fluke inclusion of more chess data in that models training.

All the other models do vaguely similarly well in other tasks and are in many cases architecturally similar so training data is the most likely explanation

sourcepluck 15 November 2024

Keep in mind, everyone, that stockfish on its lowest level on lichess is absolutely terrible, and a 5-year old human who'd been playing chess for a few months could beat it regularly. It hangs pieces, does -3 blunders, totally random-looking bad moves.

But still, yes, something maybe a teeny tiny bit weird is going on, in the sense that only one of the LLMs could beat it. The arxiv paper that came out recently was much more "weird" and interesting than this, though. This will probably be met with a mundane explanation soon enough, I'd guess.

ChrisArchitect 14 November 2024

[dupe] https://news.ycombinator.com/item?id=42138276

digging 14 November 2024

Definitely weird results, but I feel there are too many variables to learn much from it. A couple things:

1. The author mentioned that tokenization causes something minuscule like a a " " at the end of the input to shatter the model's capabilities. Is it possible other slightly different formatting changes in the input could raise capabilities?

2. Temperature was 0.7 for all models. What if it wasn't? Isn't there a chance one more more models would perform significantly better with higher or lower temperatures?

Maybe I just don't understand this stuff very well, but it feels like this post is only 10% of the work needed to get any meaning from this...

osaatcioglu 15 November 2024

I’ve also been experimenting with Chess and LLMs but have taken a slightly different approach. Rather than using the LLM as an opponent, I’ve implemented it as a chess tutor to provide feedback on both the user’s and the bot’s moves throughout the game.

The responses vary with the user’s chess level; some find the feedback useful, while others do not. To address this, I’ve integrated a like, dislike, and request new feedback feature into the app, allowing users to actively seek better feedback.

Btw, different from OP's setup, I opted to input the FEN of the current board and the subsequent move in standard algebraic notation to request feedback, as I found these inputs to be clearer for the LLM compared to giving the PGN of the game.

AI Chess GPT https://apps.apple.com/tr/app/ai-chess-gpt/id6476107978 https://play.google.com/store/apps/details?id=net.padma.app....

Thanks

mastazi 15 November 2024

If you look at the comments under the post, the author commented 25 minutes ago (as of me posting this)

> Update: OK, I actually think I've figured out what's causing this. I'll explain in a future post, but in the meantime, here's a hint: I think NO ONE has hit on the correct explanation!

well now we are curious!

golol 15 November 2024

My understanding of this is the following: All the bad models are chat models, somehow "generation 2 LLMs" which are not just text completion models but instead trained to behave as a chatting agent. The only good model is the only "generation 1 LLM" here which is gpt-3.5-turbo-instruct. It is a straight forward text completion model. If you prompt it to "get in the mind" of PGN completion then it can use some kind of system 1 thinking to give a decent approximation of the PGN Markov process. If you attempt to use a chat model it doesn't work since these these stochastic pathways somehow degenerate during the training to be a chat agent. You can however play chess with system 2 thinking, and the more advanced chat models are trying to do that and should get better at it while still being bad.

ynniv 14 November 2024

I don't think one model is statistically significant. As people have pointed out, it could have chess specific responses that the others do not. There should be at least another one or two, preferably unrelated, "good" data points before you can claim there is a pattern. Also, where's Claude?

cmpalmer52 14 November 2024

I don’t think it would have an impact great enough to explain the discrepancies you saw, but some chess engines on very low difficulty settings make “dumb” moves sometimes. I’m not great at chess and I have trouble against them sometimes because they don’t make the kind of mistakes humans make. Moving the difficulty up a bit makes the games more predictable, in that you can predict and force an outcome without the computer blowing it with a random bad move. Maybe part of the problem is them not dealing with random moves well.

I think an interesting challenge would be looking at a board configuration and scoring it on how likely it is to be real - something high ranked chess players can do without much thought (telling a random setup of pieces from a game in progress).

abalaji 15 November 2024

An easy way to make all LLMs somewhat good at chess is to make a Chess Eval that you publish and get traction with. Suddenly you will find that all newer frontier models are half decent at chess.

mips_avatar 15 November 2024

Ok whoah, assuming the chess powers on gpt3.5-instruct are just a result of training focus then we don't have to wait on bigger models, we just need to fine tune on 175B?

greatgib 16 November 2024

I would be very curious to know what would be the results with a temperature closer to 1. I don't really understand why he did not test the effect of different temperature on his results.

Here, basically you would like the "best" or "most probable" answer. With 0.7 you ask the llm to be more creative, meaning randomly picking between more less probable moves. This temperature is even lower to what is commonly used for chat assistant (around 0.8).

Peteragain 15 November 2024

I would be interested to know if the good result is repeatable. We had a similar result with a quirky chat interface in that one run gave great results (and we kept the video) but then we couldn't do it again. The cynical among us think there was a mechanical turk involved in our good run. The economics of venture capital means that there is enormous pressure to justify techniques that we think of as "cheating". And of course the companies involved have the resources.

layman51 15 November 2024

It would be really cool if someone could get an LLM to actually launch an anonymous game on Chess.com or Lichess and actually have any sense as to what it’s doing.[1] Some people say that you have to represent the board in a certain way. When I first tried to play chess with an LLM, I would just list out a move and it didn’t do very well at all.

[1]: https://youtu.be/Gs3TULwlLCA

wufufufu 15 November 2024

> And then I tried gpt-3.5-turbo-instruct. This is a closed OpenAI model, so details are very murky.

How do you know it didn't just write a script that uses a chess engine and then execute the script? That IMO is the easiest explanation.

Also, I looked at the gpt-3.5-turbo-instruct example victory. One side played with 70% accuracy and the other was 77%. IMO that's not on par with 27XX ELO.

stefatorus 15 November 2024

The trick to getting a model to perform on something is to have it as a training data subset.

OpenAI might have thought Chess is good to optimize for but it wasn't seen as useful so they dropped it.

This is what people refer to as "lobotomy", ai models are wasting compute on knowing how loud the cicadas are and how wide the green cockroach is when mating.

Good models are about the training data you push in em

ConspiracyFact 15 November 2024

"...And how to construct that state from lists of moves in chess’s extremely confusing notation?"

Algebraic notation is completely straightforward.

sylware 15 November 2024

They did probably acknowledge that the additionnal cost of training those models on chess would not be "cost effective", did drop chess from their training process, for the moment.

That to say, we can literal say anything because this is very shadowy/murky, but since everything is likely a question of money... should, _probably_, be not very fair away from the truth...

tqi 14 November 2024

I assume LLMs will be fairly average at chess for the same reason it cant count Rs in Strawberry - it's reflecting the training set and not using any underlying logic? Granted my understanding of LLMs is not very sophisticated, but I would be surprised if the Reward Models used were able to distinguish high quality moves vs subpar moves...

peter_retief 15 November 2024

It makes me wonder about other games? If LLM's are bad at games then the would be bad at solving problems in general?

astrea 15 November 2024

Well that makes sense when you consider the game has been translated into an (I'm assuming monotonically increasing) alphanumeric representation. So, just like language, you're given an ordered list of tokens and you need to find the next token that provides the highest confidence.

ks2048 15 November 2024

Has anyone tried to see how many chess games models are trained on? Is there any chance they consume lichess database dumps, or something similar? I guess the problem is most (all?) top LLMs, even open-weight ones, don’t reveal their training data. But I’m not sure.

jacknews 15 November 2024

Theory #5, gpt-3.5-turbo-instruct is 'looking up' the next moves with a chess engine.

philipwhiuk 15 November 2024

> I always had the LLM play as white against Stockfish—a standard chess AI—on the lowest difficulty setting.

Okay, so "Excellent" still means probably quite bad. I assume at the top difficult setting gpt-3.5-turbo-instruct will still lose badly.

justinclift 15 November 2024

It'd be super funny if the "gpt-3.5-turbo-instruct" approach has a human in the loop. ;)

Or maybe it's able to recognise the chess game, then get moves from an external chess game API?

misiek08 15 November 2024

For me it’s not only the chess. Chats get more chatty, but knowledge and fact-wise - it’s a sad comedy. Yes, you get a buddy to talk with, but he is talking pure nonsense.

m3kw9 15 November 2024

If it was trained with moves and 100s of thousands of entire games of various level, I do see it generating good moves and beat most players except he high Elo players

Xcelerate 14 November 2024

So if you squint, chess can be considered a formal system. Let’s plug ZFC or PA into gpt-3.5-turbo-instruct along with an interesting theorem and see what happens, no?

leogao 15 November 2024

The GPT-4 pretraining set included chess games in PGN notation from 1800+ ELO players. I can't comment on any other models.

Sparkyte 15 November 2024

Lets be real though most people can't beat a grandmaster. It is impressive to see it last more rounds as it progressed.

smokedetector1 15 November 2024

I feel like an easy win here would be retraining an LLM with a tokenizer specifically designed for chess notation?

amelius 15 November 2024

What would happen if you'd prompted it with much more text, e.g. general advice by a chess grandmaster?

stockboss 15 November 2024

perhaps my understanding of LLM is quite shallow, but instead of the current method of using statistical methods, would it be possible to somehow train GPT how to reason by providing instructions on deductive reasoning? perhaps not semantic reasoning but syntactic at least?

jack_riminton 15 November 2024

Perhaps if it doesn't have enough data to explain but it has enough to go "on gut"

XCSme 15 November 2024

I had the same experience with LLM text-to-sql, 3.5 instruct felt a lot more robust than 4o

ks2048 15 November 2024

How well does an LLM/transformer architecture trained purely on chess games do?

amelius 15 November 2024

I wonder if the llm could even draw the chess board in ASCII if you asked it to.

reallyeli 15 November 2024

My guess is they just trained gpt3.5-turbo-instruct on a lot of chess, much more than is in e.g. CommonCrawl, in order to boost it on that task. Then they didn't do this for other models.

People are alleging that OpenAI is calling out to a chess engine, but seem to be not considering this less scandalous possibility.

Of course, to the extent people are touting chess performance as evidence of general reasoning capabilities, OpenAI taking costly actions to boost specifically chess performance and not being transparent about it is still frustrating and, imo, dishonest.

uneventual 15 November 2024

my friend pointed out that Q5_K_M quantization used for the open source models probably substantially reduces the quality of play. o1 mini's poor performance is puzzling, though.

dr_dshiv 15 November 2024

Has anyone tested a vision model? Seems like they might be better

1024core 15 November 2024

I would love to see the prompts (the data) this person used.

throwawaymaths 15 November 2024

Would be more interesting with trivial Lora training

downboots 15 November 2024

In a sense, a chess game is also a dialogue

DrNosferatu 14 November 2024

What about contemporary frontier models?

nusl 15 November 2024

> I only ran 10 trials since AI companies have inexplicably neglected to send me free API keys

Sure, but nobody is required to send you anything for free.

davvid 15 November 2024

Here is a truly brilliant game. It's Google Bard vs. Chat GPT. Hilarity ensues.

https://www.youtube.com/watch?v=FojyYKU58cw

nabla9 15 November 2024

Theory 5: gpt-3.5-turbo-instruct has chess engine attached to it.

gunalx 15 November 2024

Is it just me or does the author swap descriptions of the instruction finetuned and the base gpt-3.5-turbo? It seemed like the best model was labeled instruct, but the text saying instruct did worse?

permo-w 15 November 2024

if this isn't just a bad result, it's odd to me that the author at no point suggests what sounds to me like the most obvious answer - that OpenAI has deliberately enhanced GPT-3.5-turbo-instruct's chess playing, either with post-processing or literally by training it to be so

teleforce 15 November 2024

TL;DR.

All of the LLM models tested playing chess performed terribly bad against Stockfish engine except gpt-3.5-turbo-instruct, which is a closed OpenAI model.

kmeisthax 15 November 2024

If tokenization is such a big problem, then why aren't we training new base models on randomly non-tokenized data? e.g. during training, randomly substitute some percentage of the input tokens with individual letters.

pseudosavant 14 November 2024

LLMs aren't really language models so much as they are token models. That is how they can also handle input in audio or visual forms because there is an audio or visual tokenizer. If you can make it a token, the model will try to predict the following ones.

Even though I'm sure chess matches were used in some of the LLM training, I'd bet a model trained just for chess would do far better.