This reinforces my theory that some of the older OAI models are far more capable than advertised, but in ways that are difficult to productize.
How unlikely is it that, in training these models, you occasionally hit an arrangement of data & hyperparameters that dramatically exceeds the capabilities of the others, even when the others have substantially more parameters & data to work with?
Perhaps the tokenizer is the entire problem? I wonder how the chat-based LLMs would perform if you described the moves in plain text rather than in chess notation. I can easily imagine that the "instruct" LLM uses a specialized tokenizer designed to process instructions, which are in many ways similar to chess notation.
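For what it's worth, you can eyeball how notation gets split with tiktoken (assuming the cl100k_base encoding, which as far as I know is what the GPT-3.5-era models use; this says nothing about any special instruct-model preprocessing):

    # Small sketch: how chess notation tokenizes vs. plain English.
    # cl100k_base is an assumption about the relevant encoding.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    pgn = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6"
    prose = "White opens with the king's pawn and Black replies in kind."

    for text in (pgn, prose):
        tokens = enc.encode(text)
        print(len(tokens), [enc.decode([t]) for t in tokens])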
There's a recent result showing that computer chess, which was traditionally based on searching a combinatorial space (i.e. NP-hard), can instead be tackled with transformer models, and that the transformer approach actually plays better in Elo terms.
If you think about using search to play chess, it can go several ways.
Brute-forcing all chess moves (NP-hard) doesn't work because you would need almost unlimited compute.
If you use a chess engine with clever heuristics to prune bad lines, you can solve it in a practical amount of time.
But if you learn from the best humans playing in different contexts (transformers are really good at capturing context in sequences and predicting the next token from that context, hence their utility in LLMs), you have narrowed your search space even further, to only the set of good moves (by grandmaster standards).
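A minimal sketch of that narrowing, using python-chess, with a placeholder policy standing in for a model trained on grandmaster games (the top-k truncation is purely illustrative):

    # Sketch of the narrowed search space. `policy_top_moves` is a
    # hypothetical stand-in for a trained policy; here it just keeps
    # a handful of legal moves to show how the branching factor drops.
    import chess

    def policy_top_moves(board, k=3):
        # a real policy would rank moves by learned probability;
        # we simply truncate the legal-move list for illustration
        return list(board.legal_moves)[:k]

    board = chess.Board()
    print("brute-force branching:", board.legal_moves.count())   # 20 in the opening
    print("policy branching:", len(policy_top_moves(board)))     # 3

    # Over a 40-move game that gap compounds: roughly 20^40 candidate
    # lines versus 3^40, which is the whole point of learning from
    # strong human play.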
Anecdotally, I found the same thing with art and text output, in conversations with college students at a bar in my local area. The problem does not appear to be limited to chess. It went something like:
"Wow, these Chatbots are amazing, look at this essay or image it made me!." (shows phone)
"Although, that next ChatGPT seems lobotomized. Don't know how to make it give me stuff as cool as what it made before."
I wonder how a transformer (even an existing LLM architecture) would do if it was trained purely on chess moves - no language at all. The limited vocabulary would also be fantastic for training time, as the network would be inherently smaller.
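As a back-of-the-envelope sketch of how small that vocabulary could be (UCI-style move strings, deliberately over-counted):

    # Rough upper bound on a move-level vocabulary for a chess-only
    # transformer, using UCI move strings like "e2e4" or "e7e8q".
    # Most from/to pairs are never legal, so this over-counts, yet it
    # is still tiny next to a ~100k-entry general text vocabulary.
    squares = [f + r for f in "abcdefgh" for r in "12345678"]     # 64 squares

    moves = [a + b for a in squares for b in squares if a != b]   # 4032
    promotions = [m + p for m in moves for p in "qrbn"]           # 16128

    vocab = ["<pad>", "<start>", "<end>"] + moves + promotions
    print(len(vocab))   # ~20k, and the embedding table shrinks accordingly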
Not criticizing the monocausal theories, but LLMs "do a bunch of stuff with a bunch of data" and if you ask them why they did something in particular, you get a hallucination. To be fair, humans will most often give you a moralized post hoc rationalization if you ask them why they did something in particular, so we're not far from hallucination.
To be more specific, the models change BOTH the "bunch of stuff" (training setup and prompts) and the "bunch of data", and those changes interact in deep and chaotic (as in chaos theory) ways.
All of this really makes me think about how we treat other humans. Training an LLM is a one-way operation; you can't really retrain one part of an LLM (as I understand it). You can do prompt engineering, and you can do some more training, but those interact in deep and chaotic ways.
I think you can replace LLM with human in the previous paragraph and not be too far wrong.
There is an obvious difference on the openai side. gpt-3.5-turbo-instruct is the only remaining decent model with raw text completion API access (RIP text-davinci-003 and code-davinci-002). All the others are only available in an abstract fashion through the wrapper that is the "system/role" API.
I still use gpt-3.5-turbo-instruct a lot because raw text completion is so much more powerful than the system/role abstraction. With the system/role abstraction you literally cannot present exactly the text you want to the model and have it go: it's always wrapped in an openai-junk prompt you can't see or inspect (one that lets OpenAI cache their static pre-prompts internally to better share resources, rather than letting users decide what the model sees).
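Roughly the difference, sketched with the v1 openai Python SDK (the prompt is just an illustration; the point is that the completion endpoint continues exactly the text you give it):

    # Sketch: raw completion vs. chat completion (openai python SDK, v1
    # style). With completions the model continues your exact text; with
    # chat your text goes inside a message structure plus whatever
    # formatting happens server-side.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    pgn_prefix = "1. e4 e5 2. Nf3 Nc6 3. Bb5 "

    completion = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=pgn_prefix,          # the model sees exactly this string
        max_tokens=8,
        temperature=0,
    )
    print(completion.choices[0].text)

    chat = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Continue this game: " + pgn_prefix}],
        max_tokens=8,
    )
    print(chat.choices[0].message.content)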
> Theory 2: GPT-3.5-instruct was trained on more chess games.
Alternatively, somebody who prepared the training materials for this specific ANN had some spare time and decided to preprocess them so that, during training, the model was only asked to predict the moves of the winning player, and that individual whimsy was never repeated in the training of any other model.
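Purely as a sketch of what that whimsical preprocessing could look like (python-chess, decisive games only; nothing here is known about any actual pipeline):

    # Speculative sketch of winner-only preprocessing with python-chess:
    # read PGN games and emit only the winning side's moves as targets.
    import chess
    import chess.pgn

    def winner_moves(pgn_handle):
        game = chess.pgn.read_game(pgn_handle)
        while game is not None:
            result = game.headers.get("Result", "*")
            if result in ("1-0", "0-1"):           # decisive games only
                winner_is_white = result == "1-0"
                board = game.board()
                for move in game.mainline_moves():
                    if (board.turn == chess.WHITE) == winner_is_white:
                        yield board.san(move)      # prediction target
                    board.push(move)
            game = chess.pgn.read_game(pgn_handle)

    # usage: with open("games.pgn") as f: print(list(winner_moves(f))[:10])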
This is not "cheating" in my opinion... in general it's better for LLMs to know when to call certain functions, etc.
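Something like the OpenAI tools interface is presumably what "knowing when to call certain functions" would look like here, sketched with a hypothetical best_move engine wrapper:

    # Sketch of function/tool calling for the chess case: the model is
    # told it may call a (hypothetical) `best_move` engine wrapper rather
    # than guessing moves itself. Schema follows the OpenAI tools format.
    from openai import OpenAI

    client = OpenAI()

    tools = [{
        "type": "function",
        "function": {
            "name": "best_move",   # hypothetical engine wrapper, not a real API
            "description": "Return the engine's best move for a FEN position",
            "parameters": {
                "type": "object",
                "properties": {"fen": {"type": "string"}},
                "required": ["fen"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "What should White play in this position? <FEN here>"}],
        tools=tools,
    )
    print(resp.choices[0].message.tool_calls)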
> How unlikely is it that, in training these models, you occasionally hit an arrangement of data & hyperparameters that dramatically exceeds the capabilities of the others, even when the others have substantially more parameters & data to work with?
https://arxiv.org/pdf/2402.04494
"Wow, these Chatbots are amazing, look at this essay or image it made me!." (shows phone)
"Although, that next ChatGPT seems lobotomized. Don't know how to make it give me stuff as cool as what it made before."
Seems noteworthy enough in itself, before we discuss their performance as chess players.