Something weird is happening with LLMs and Chess

(dynomight.net)

Comments

songeater 14 November 2024
Is gpt-3.5-turbo-instruct function-calling a chess-playing model instead of generating moves through the base LLM?

This is not "cheating" in my opinion... in general it's better for LLMs to know when to call certain functions, etc.
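
For reference, explicit function calling is only a feature of the chat API; the completions endpoint that serves gpt-3.5-turbo-instruct exposes no such mechanism, so any tool use would have to happen invisibly on OpenAI's side. A rough sketch of what openly delegating to an engine would look like (the best_move tool name and schema here are invented for illustration):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Hypothetical tool: delegate move selection to an external chess engine.
    # The "best_move" name and its schema are made up for this sketch.
    tools = [{
        "type": "function",
        "function": {
            "name": "best_move",
            "description": "Return the engine's best move for a FEN position.",
            "parameters": {
                "type": "object",
                "properties": {"fen": {"type": "string"}},
                "required": ["fen"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "You are White. Choose a reply to 1... e5."}],
        tools=tools,
    )
    # If the model decides to delegate, a tool call (not a move) comes back.
    print(response.choices[0].message.tool_calls)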

superkuh 14 November 2024
There is an obvious difference on the openai side. gpt-3.5-turbo-instruct is the only remaining decent model with raw text completion API access (RIP text-davinci-003 and code-davinci-002). All the others are only available in an abstract fashion through the wrapper that is the "system/role" API.

I still use gpt-3.5-turbo-instruct a lot because raw text completion is so much more powerful than the system/role abstraction. With system/role you literally cannot present the exact text you want to the model and have it go. It's always wrapped in an openai-junk prompt you can't see or inspect (one that also lets openai cache its static pre-prompts internally to share resources, rather than letting users decide exactly what the model sees).
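
For anyone who hasn't used both, a minimal sketch of the two API shapes with the current OpenAI Python SDK (the prompt text is just an example):

    from openai import OpenAI

    client = OpenAI()

    # Raw text completion: the model sees exactly this string and continues it.
    completion = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt="1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4.",
        max_tokens=5,
        temperature=0,
    )
    print(completion.choices[0].text)

    # Chat completion: the same text gets wrapped in a hidden chat template
    # (system prompt, role markers) before the model ever sees it.
    chat = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4."}],
        max_tokens=5,
        temperature=0,
    )
    print(chat.choices[0].message.content)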

bob1029 14 November 2024
This strengthens my theory that some of the older OAI models are far more capable than advertised, but in ways that are difficult to productize.

How unlikely is it that, in training these models, you occasionally run into an arrangement of data & hyperparameters that dramatically exceeds the capabilities of the others, even if the others have substantially more parameters & data to work with?

danjl 14 November 2024
Perhaps the tokenizer is the entire problem? I wonder how the chat-based LLMs would perform if you described each move in plain text rather than chess notation. I can easily imagine that the "instruct" LLM uses a specialized tokenizer designed to process instructions, similar in many ways to chess notation.
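
It's easy to check how a stock GPT BPE tokenizer carves up chess notation; a small sketch using tiktoken (exact splits can vary by model):

    import tiktoken

    # Inspect how the gpt-3.5-turbo tokenizer splits chess notation.
    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    for text in ["1. e4 e5 2. Nf3 Nc6 3. Bb5 a6", "Nxe5", "O-O-O"]:
        pieces = [enc.decode([tok]) for tok in enc.encode(text)]
        print(f"{text!r} -> {pieces}")
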
scotty79 14 November 2024
> Theory 2: GPT-3.5-instruct was trained on more chess games.

Alternatively, somebody who prepared the training materials for this specific ANN had some spare time and decided to preprocess them so that, during training, the model was only asked to predict the moves of the winning player, and that individual whimsy was never repeated in the training of any other model.

A_No_Name_Mouse 14 November 2024
Wouldn't surprise me if the outcome had been exactly the same if, instead of LLMs, this had been tried with a random selection of human beings :-)

pdm55 17 November 2024
I don't think LLMs are suitable for chess. Witness the result of a chess game this year between ChatGPT (OpenAI) and Gemini (Google AI). The LLMs had a very poor record at picking legal moves, let alone good moves:

"Gemini's mistakes and ChatGPT's misses tell a lot of the story. One AI kept giving the other one opportunities, the other kept refusing those gifts. The good news for ChatGPT is that it made more "Good" or better moves than Gemini made mistakes and blunders. The bad news for Gemini is, well, almost everything that happened. ...

The final illegal move tally was Gemini 32, ChatGPT 6. That makes sense; it would have been crazy if the AI good enough to win was also bad enough to make more illegal moves. But it also means Gemini only went 50% in picking a legal move, while ChatGPT was over 80%."

https://www.chess.com/article/view/chatgpt-gemini-play-chess
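
For what it's worth, a tally like that is straightforward to reproduce with the python-chess library; a rough sketch (the helper name and move list are mine):

    import chess

    def tally_illegal(san_moves):
        """Count illegal SAN moves, applying the legal ones to a fresh game."""
        board = chess.Board()
        illegal = 0
        for san in san_moves:
            try:
                board.push_san(san)  # raises a ValueError subclass if illegal
            except ValueError:
                illegal += 1  # skip it, as a human arbiter might
        return illegal

    print(tally_illegal(["e4", "e5", "Nf3", "Nc6", "Ke5"]))  # -> 1 (Ke5 is illegal)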

wenc 14 November 2024
A recent paper shows that computer chess, which was traditionally based on searching a combinatorial space (i.e. NP-hard), can instead be played by transformer models, and without any explicit search they reach grandmaster-level Elo:

https://arxiv.org/pdf/2402.04494

If you think about using search to play chess, it can go several ways.

Brute-forcing all chess moves (NP-hard) doesn’t work because you need almost infinite compute power.

If you use a chess engine with clever heuristics to prune bad lines, you can find strong moves in a practical amount of time.

But if you learn from the best humans performing under different contexts (transformers are really good at capturing context in sequences and predicting the next token from that context, hence their utility in LLMs), you have narrowed your search space even further, to only the set of good moves (by grandmaster standards).
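
To make the middle option concrete, here's a toy sketch of alpha-beta search over python-chess with a material-only evaluation (a real engine adds a far better evaluation function and move ordering). The transformer approach replaces all of this machinery with a single learned next-move distribution:

    import chess

    # Crude material values: pawn=1, knight/bishop=3, rook=5, queen=9.
    VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
              chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

    def evaluate(board):
        """Material balance from White's point of view."""
        score = 0
        for piece in board.piece_map().values():
            value = VALUES[piece.piece_type]
            score += value if piece.color == chess.WHITE else -value
        return score

    def alphabeta(board, depth, alpha=-float("inf"), beta=float("inf")):
        """Minimax with alpha-beta pruning; White maximizes the score."""
        if depth == 0 or board.is_game_over():
            return evaluate(board)
        if board.turn == chess.WHITE:
            best = -float("inf")
            for move in board.legal_moves:
                board.push(move)
                best = max(best, alphabeta(board, depth - 1, alpha, beta))
                board.pop()
                alpha = max(alpha, best)
                if alpha >= beta:
                    break  # prune: the opponent would never allow this line
            return best
        else:
            best = float("inf")
            for move in board.legal_moves:
                board.push(move)
                best = min(best, alphabeta(board, depth - 1, alpha, beta))
                board.pop()
                beta = min(beta, best)
                if alpha >= beta:
                    break
            return best

    print(alphabeta(chess.Board(), 3))  # score for the starting position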

araes 14 November 2024
Anecdotally, I found the same for art and text output in conversations with college students at a bar in my local area. The problem does not appear to be localized to chess. It went something like:

"Wow, these chatbots are amazing, look at this essay or image it made me!" (shows phone)

"Although, that next ChatGPT seems lobotomized. Don't know how to make it give me stuff as cool as what it made before."

Tenoke 14 November 2024
I do wonder how much of this is the prompt. GPT-4 was doing really well against me when it launched, but my prompt was quite different.

zamalek 14 November 2024
I wonder how a transformer (even an existing LLM architecture) would do if it was trained purely on chess moves - no language at all. The limited vocabulary would also be fantastic for training time, as the network would be inherently smaller.
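
The vocabulary really would be tiny; a quick sketch with python-chess, assigning one token per geometrically possible UCI move string:

    import chess

    # One token per from->to square pair (plus promotion variants); illegal
    # combinations never appear in training data, so those tokens go unused.
    vocab = set()
    for frm in chess.SQUARES:
        for to in chess.SQUARES:
            if frm == to:
                continue
            vocab.add(chess.Move(frm, to).uci())
            if chess.square_rank(to) in (0, 7):  # promotion ranks
                for piece in (chess.QUEEN, chess.ROOK, chess.BISHOP, chess.KNIGHT):
                    vocab.add(chess.Move(frm, to, promotion=piece).uci())

    print(len(vocab))  # a few thousand tokens, vs ~100k in a GPT BPE vocab
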
csours 14 November 2024
My comment from substack:

Not criticizing the monocausal theories, but LLMs "do a bunch of stuff with a bunch of data," and if you ask them why they did something in particular, you get a hallucination. To be fair, humans will most often give you a moralized post-hoc rationalization if you ask them why they did something in particular, so we're not far from hallucination ourselves.

To be more specific, the models change BOTH the "bunch of stuff" (training setup and prompts) and the "bunch of data", and those changes interact in deep and chaotic (as in chaos theory) ways.

All of this really makes me think about how we treat other humans. Training an LLM is a one-way operation; you can't really retrain one part of an LLM (as I understand it). You can do prompt engineering, and you can do some more training, but those interact in deep and chaotic ways.

I think you can replace LLM with human in the previous paragraph and not be too far wrong.

jameshart 14 November 2024
Strongly implied but not explicitly stated here: were all these LLMs able to consistently generate legal moves? All the way to the end of a game?

Seems noteworthy enough in itself, before we discuss their performance as chess players.

jtbayly 14 November 2024
Some people are good at playing chess, and others aren’t.

OutOfHere 14 November 2024
Should've tuned the LLM for chess.