Something weird is happening with LLMs and Chess

(dynomight.net)

Comments

songeater 8 hours ago
Is gpt-3.5-turbo-instruct calling out to a chess-playing engine as a function instead of generating moves through the base LLM?

This is not "cheating" in my opinion... in general it's better for LLMs to know when to call out to specialized functions, etc.
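
For reference, *explicit* function calling through the chat API looks roughly like the sketch below (using the OpenAI Python SDK; the `best_move` tool is hypothetical, and note that gpt-3.5-turbo-instruct is a completions-only model that takes no tools parameter, so if an engine were involved there it would have to be wired in server-side, invisibly to the caller):

```python
# Sketch only: what explicit tool calling looks like via the chat API.
# "best_move" is a hypothetical chess-engine wrapper, not a real OpenAI tool.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "best_move",  # hypothetical engine wrapper
        "description": "Return the engine's best move for a FEN position.",
        "parameters": {
            "type": "object",
            "properties": {"fen": {"type": "string"}},
            "required": ["fen"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: any chat model with tool support
    messages=[{"role": "user", "content": "Best move here? <FEN>"}],
    tools=tools,
)

# If the model chose to call the engine, the request shows up here:
print(resp.choices[0].message.tool_calls)
```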

bob1029 8 hours ago
This intensifies my theory that some of the older OAI models are far more capable than advertised but in ways that are difficult to productize.

How unlikely is it that, in training these models, you occasionally run into an arrangement of data & hyperparameters that dramatically exceeds the capabilities of the others, even if the others have substantially more parameters & data to work with?

danjl 6 hours ago
Perhaps the tokenizer is the entire problem? I wonder how the chat-based LLMs would perform if you described the moves in plain text rather than chess notation. I can easily imagine that the "instruct" LLM uses a specialized tokenizer designed to process instructions, which are similar in many ways to chess notation.
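
This is easy to poke at directly. A minimal sketch with OpenAI's tiktoken library (my assumption: both the instruct and chat 3.5 models resolve to the same cl100k_base encoding, so any difference would have to come from training rather than the tokenizer itself):

```python
# Sketch: inspect how PGN move text actually tokenizes.
# Assumption: encoding_for_model maps gpt-3.5-turbo-instruct to cl100k_base.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo-instruct")
pgn = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6"
for token_id in enc.encode(pgn):
    print(repr(enc.decode([token_id])))
# If "Nf3" splits into fragments like "N", "f", "3", the model has to
# reassemble board state from pieces of moves; "knight to f3" in plain
# text might land on more natural word tokens.
```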
wenc 4 hours ago
A recent paper shows that computer chess, which was traditionally solved by searching a combinatorial space (i.e., NP-hard), can instead be played by transformer models with no explicit search at all, and at a competitive Elo level.

https://arxiv.org/pdf/2402.04494

If you think about using search to play chess, it can go several ways.

Brute-forcing all chess moves (NP-hard) doesn't work because you need almost infinite compute power.

If you use a chess engine with clever heuristics to eliminate bad solutions, you can solve it in finite time.

But if you learn from the best humans performing under different contexts (transformers are really good at capturing context in sequences and predicting the next token from that context — hence their utility in LLMs), you have narrowed your search space even further, to only the set of good moves (by grandmaster standards).
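
The middle option is what classical engines do: alpha-beta search prunes lines that a heuristic evaluation says neither side would allow. A toy sketch with the python-chess library and a bare material count (real engines add move ordering, quiescence search, and far better evaluation):

```python
# Toy alpha-beta search with a crude material evaluation. Sketch only.
import chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def evaluate(board: chess.Board) -> float:
    # Positive favors White, negative favors Black.
    score = 0
    for piece_type, value in PIECE_VALUES.items():
        score += value * len(board.pieces(piece_type, chess.WHITE))
        score -= value * len(board.pieces(piece_type, chess.BLACK))
    return score

def alphabeta(board, depth, alpha, beta):
    if depth == 0 or board.is_game_over():
        return evaluate(board)
    if board.turn == chess.WHITE:  # maximizing side
        best = -float("inf")
        for move in board.legal_moves:
            board.push(move)
            best = max(best, alphabeta(board, depth - 1, alpha, beta))
            board.pop()
            alpha = max(alpha, best)
            if alpha >= beta:  # prune: the opponent won't allow this line
                break
        return best
    else:  # minimizing side
        best = float("inf")
        for move in board.legal_moves:
            board.push(move)
            best = min(best, alphabeta(board, depth - 1, alpha, beta))
            board.pop()
            beta = min(beta, best)
            if alpha >= beta:
                break
        return best

print(alphabeta(chess.Board(), 3, -float("inf"), float("inf")))
```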

A_No_Name_Mouse 8 hours ago
Wouldn't surprise me if the outcome would have been exactly the same if instead of LLMs this was tried with a random selection of human beings :-)
araes 5 hours ago
Anecdotally, I've found the same thing with art and text output, in conversations with college students at a bar in my local area. The problem does not appear to be localized to chess output. It went something like:

"Wow, these chatbots are amazing, look at this essay or image it made me!" (shows phone)

"Although, that next ChatGPT seems lobotomized. Don't know how to make it give me stuff as cool as what it made before."

Tenoke 5 hours ago
I do wonder how much of this is the prompt. GPT-4 was doing really well against me when it launched, but my prompt was quite different.
zamalek 5 hours ago
I wonder how a transformer (even an existing LLM architecture) would do if it were trained purely on chess moves - no language at all. The limited vocabulary would also be fantastic for training time, as the network would be inherently smaller.
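
For scale, a rough sketch (using python-chess) of a move-level vocabulary in UCI notation; every token is one whole move, and the count is a generous upper bound since many of these from/to pairs are unreachable by any piece:

```python
# Sketch: a move-level vocabulary where each UCI move string is one token.
import chess

vocab = set()
for from_sq in chess.SQUARES:
    for to_sq in chess.SQUARES:
        if from_sq == to_sq:
            continue
        vocab.add(chess.Move(from_sq, to_sq).uci())
        # Promotions only make sense moving one rank into a back rank,
        # at most one file over.
        ranks = (chess.square_rank(from_sq), chess.square_rank(to_sq))
        if (abs(chess.square_file(from_sq) - chess.square_file(to_sq)) <= 1
                and ranks in ((6, 7), (1, 0))):
            for promo in (chess.KNIGHT, chess.BISHOP, chess.ROOK, chess.QUEEN):
                vocab.add(chess.Move(from_sq, to_sq, promotion=promo).uci())

# Roughly 4,200 tokens, versus ~100k subword tokens in a typical LLM vocab.
print(len(vocab))
```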
jameshart 5 hours ago
Strongly implied but not explicitly stated here: all these LLMs were able to consistently generate legal moves? All the way to the end of a game?

Seems noteworthy enough in itself, before we discuss their performance as chess players.
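
It's easy to verify mechanically, at least. A sketch with the python-chess library (the SAN list stands in for whatever a model emitted):

```python
# Sketch: replay a model's SAN moves and flag the first illegal one.
import chess

def first_illegal(san_moves):
    """Return (index, move) of the first illegal/unparseable move, or None."""
    board = chess.Board()
    for i, san in enumerate(san_moves):
        try:
            board.push_san(san)  # raises a ValueError subclass on bad SAN
        except ValueError:
            return i, san
    return None

print(first_illegal(["e4", "e5", "Nf3", "Nc6", "Ke3"]))  # -> (4, 'Ke3')
```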

csours 7 hours ago
My comment from Substack:

Not criticizing the monocausal theories, but LLMs "do a bunch of stuff with a bunch of data" and if you ask them why they did something in particular, you get a hallucination. To be fair, humans will most often give you a moralized post hoc rationalization if you ask them why they did something in particular, so we're not far from hallucination.

To be more specific, between models BOTH the "bunch of stuff" (training setup and prompts) and the "bunch of data" change, and those changes interact in deep and chaotic (as in chaos theory) ways.

All of this really makes me think about how we treat other humans. Training an LLM is a one-way operation; you can't really retrain one part of an LLM (as I understand it). You can do prompt engineering, and you can do some more training, but those interact in deep and chaotic ways.

I think you can replace LLM with human in the previous paragraph and not be too far wrong.

superkuh 6 hours ago
There is an obvious difference on the OpenAI side. gpt-3.5-turbo-instruct is the only remaining decent model with raw text-completion API access (RIP text-davinci-003 and code-davinci-002). All the others are only available in an abstracted fashion, through the wrapper that is the "system/role" API.

I still use gpt-3.5-turbo-instruct a lot because raw text completion is so much more powerful than the system/role abstraction. With the system/role abstraction you literally cannot present exactly the text you want to the model and have it go. It's always wrapped in OpenAI prompt junk you can't see or know about (which also lets OpenAI cache their static pre-prompts internally to share resources, rather than letting users decide what the model sees).
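
For anyone who hasn't used both, the difference looks like this (a sketch with the OpenAI Python SDK; the PGN prefix is the kind of raw prompt the blog post fed to the instruct model):

```python
# Sketch: raw completion vs. the chat wrapper.
from openai import OpenAI

client = OpenAI()

# Completions API: the model sees *exactly* this text and continues it.
pgn_prefix = '[Result "1-0"]\n\n1. e4 e5 2. Nf3 '
completion = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=pgn_prefix,
    max_tokens=5,
    temperature=0,
)
print(completion.choices[0].text)

# Chat API: your text is wrapped in role markup (plus whatever the server
# adds) before the model sees it -- you never control the raw input.
chat = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Continue this PGN: " + pgn_prefix}],
)
print(chat.choices[0].message.content)
```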

scotty79 8 hours ago
> Theory 2: GPT-3.5-instruct was trained on more chess games.

Alternatively, somebody who prepared training materials for this specific ANN had some spare time and decided to preprocess them so that, during training, the model was only asked to predict the moves of the winning player - and that individual whimsy was never repeated in the training of any other model.
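
Nobody outside OpenAI knows the actual pipeline, but a filter like that is trivial to write. A purely illustrative sketch with python-chess, keeping only the winner's moves as prediction targets:

```python
# Sketch: build (position, next-move) training pairs from a PGN file,
# keeping only the winning side's moves as targets.
import chess
import chess.pgn

def winning_side_targets(pgn_path):
    pairs = []
    with open(pgn_path) as f:
        while (game := chess.pgn.read_game(f)) is not None:
            result = game.headers.get("Result")
            if result == "1-0":
                winner = chess.WHITE
            elif result == "0-1":
                winner = chess.BLACK
            else:
                continue  # skip draws and unfinished games
            board = game.board()
            for move in game.mainline_moves():
                if board.turn == winner:
                    # context = position so far, target = winner's next move
                    pairs.append((board.fen(), move.uci()))
                board.push(move)
    return pairs
```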

jtbayly 5 hours ago
Some people are good at playing chess, and others aren’t.
OutOfHere 5 hours ago
Should've tuned the LLM for chess.