I have a PhD in algorithmic game theory and have worked on poker.
1) There are currently no algorithms that can compute deterministic equilibrium strategies [0]. Therefore, mixed (randomized) strategies must be used for professional-level play or stronger.
2) In practice, strong play has been achieved with: i) online search and ii) a mechanism to ensure strategy consistency. Without ii), an adaptive opponent can learn to exploit inconsistencies in repeated play.
3) LLMs do not have a mechanism for sampling from a given probability distribution. E.g. if you ask an LLM to sample a random number from 1 to 10, it will likely give you 3 or 7, as those are overrepresented in the training data. (One workaround is to push the sampling out of the model and into the harness; see the sketch below.)
Based on these points, it's not technically feasible for current LLMs to play poker strongly. This is in contrast with Chess, where there is far more training data, a deterministic optimal strategy exists, and you do not need to ensure strategy consistency.
[0] There are deterministic approximations for subgames based on linear programming, but these require the subgame to be fully loaded in memory, which is infeasible for the whole game.
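A minimal sketch of the workaround hinted at in point 3, assuming a hypothetical harness where the model returns action probabilities and the surrounding code, not the LLM, performs the random draw (the response format is invented for illustration):

```python
import json
import random

# The LLM proposes a mixed strategy as explicit probabilities; the harness
# does the actual sampling, sidestepping the model's biased "randomness".
def choose_action(llm_response: str) -> str:
    probs = json.loads(llm_response)
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights, k=1)[0]

# Example: a model output asking for a 10/50/40 fold/call/raise mix.
print(choose_action('{"fold": 0.1, "call": 0.5, "raise": 0.4}'))
```

Note this only fixes the sampling; points 1 and 2 (equilibrium computation and strategy consistency) remain open.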
I would love to see a live stream of this, but where they're also allowed to talk to each other: bluff, trash talk. That would be a much more interesting test of LLMs and a pretty decent spectator sport.
This is my area of expertise. I love the experiment.
In general, games of imperfect information such as Poker, Diplomacy, etc. are much, much harder than perfect-information games such as Chess.
Multiplayer (3+) poker in particular is interesting because it is no longer a two-player zero-sum game: Nash equilibria still exist, but computing one is intractable, and playing one carries no guarantee against losing.
That is part of the reason they are a fantastic venue for exploration of the capabilities of LLMs. They also mirror the decision making process of real life. Bezos framed it as "making decisions with about 70% of the information you wish you had."
As it currently stands, having built many poker AIs, including what I believe to be the current best in the world, I don't think LLMs are remotely close to being able to do what specialized algorithms can do in this domain.
All of the best poker AIs right now are fundamentally based on counterfactual regret minimization (CFR), typically with a layer of real-time search on top.
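For readers unfamiliar with CFR: its core update rule is regret matching, which plays each action in proportion to its accumulated positive regret. A toy self-play sketch for rock-paper-scissors (deliberately not poker, and not any particular bot's code):

```python
import random

ACTIONS = ["rock", "paper", "scissors"]
BEATS = {("rock", "scissors"), ("paper", "rock"), ("scissors", "paper")}

def payoff(a: str, b: str) -> int:
    # +1 if a beats b, -1 if b beats a, 0 on a tie.
    return 0 if a == b else (1 if (a, b) in BEATS else -1)

def current_strategy(regrets):
    # Regret matching: play in proportion to positive accumulated regret.
    pos = [max(r, 0.0) for r in regrets]
    total = sum(pos)
    return [p / total for p in pos] if total > 0 else [1 / 3] * 3

def train(iters: int = 50_000):
    regrets = [0.0] * 3
    strat_sum = [0.0] * 3
    for _ in range(iters):
        strat = current_strategy(regrets)
        strat_sum = [s + x for s, x in zip(strat_sum, strat)]
        played = random.choices(ACTIONS, weights=strat)[0]
        opp = random.choices(ACTIONS, weights=strat)[0]  # self-play opponent
        base = payoff(played, opp)
        for i, alt in enumerate(ACTIONS):
            # Regret: how much better the alternative would have done.
            regrets[i] += payoff(alt, opp) - base
    total = sum(strat_sum)
    return [round(s / total, 3) for s in strat_sum]

print(train())  # average strategy converges to ~[0.333, 0.333, 0.333]
```

It's the average strategy (not the final one) that converges to equilibrium; full CFR applies this same update at every information set of the game tree.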
Noam Brown (currently director of research at OpenAI) took the existing CFR strategies, which fundamentally just tried to scale at train time, and added a version of search, allowing the system to compute better policies at TEST TIME (i.e., when making decisions). This ultimately beat the pros (Pluribus beat the pros at six-max in 2019). It stands as the state of the art, although I believe that some of the deep approaches may eventually topple it.
Not long after Noam joined OpenAI they released the o1-preview "thinking" models, and I can't help but think that he took some of his ideas for test time compute and applied them on top of the base LLM.
It's amazing how much poker AI research is actually influencing the SOTA AI we see today.
I would be surprised if any general-purpose model can achieve true human-level or superhuman results, as the purpose-built SOTA poker algorithms at this point play close to perfect poker.
Background:
- I built my first poker AI when I was in college and made half a million bucks on Party Poker. It was a pseudo expert system.
- Created PokerTableRatings.com and caught cheaters at scale using machine learning on a database of all poker hands in real time
- Sold my poker AI company to Zynga in 2011 and was Zynga Poker CTO for 2 years pre/post IPO
- Most recently built a tournament version of Pluribus (https://www.science.org/doi/10.1126/science.aay2400). Launching as a Duolingo for poker at pokerskill.com
I am the author/maintainer of rs-poker (https://github.com/elliottneilclark/rs-poker). I've been working on algorithmic poker for quite a while. This isn't the way to do it. LLMs would need to be able to do math, lie, and be random, none of which they are currently capable of.
We know how to compute the best moves in poker, but it's computationally challenging: the more choices and players there are, the harder it gets, which is why most attempts only even try at heads-up.
With all that said, I do think there's a way to use attention and BERT-style models (trained on non-text sequences) to solve poker. We need a better corpus of games and some training time on unique models. If anyone is interested, my email is elliott.neil.clark @ gmail.com
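To make "trained on non-text sequences" concrete, here is one shape such an encoding could take; the vocabulary, bet-size buckets, and layout below are my own assumptions, not anything from rs-poker or an existing system:

```python
# Encode a hand as a discrete token sequence for a BERT-style model.
def bucket(size: float) -> str:
    # Coarse bet-size buckets keep the vocabulary small.
    if size == 0:
        return "CHECK"
    return "S" if size < 100 else "M" if size < 500 else "L"

def encode_hand(hole, board, actions):
    tokens = ["[CLS]"]
    tokens += [f"HOLE_{c}" for c in hole]    # e.g. HOLE_Tc
    tokens += [f"BOARD_{c}" for c in board]  # e.g. BOARD_Jh
    for seat, act, size in actions:          # e.g. (1, "raise", 170)
        tokens.append(f"P{seat}_{act.upper()}_{bucket(size)}")
    return tokens

print(encode_hand(["Tc", "4d"], ["2s", "Ts", "Jh"],
                  [(0, "call", 50), (1, "raise", 170)]))
```

A masked-token objective over such sequences would be one plausible way to use it; the hard part, as noted, is the corpus.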
For reference, the details about how the LLMs are queried:
"How the players work
All players use the same system prompt
Each time it's their turn, or after a hand ends (to write a note), we query the LLM
At each decision point, the LLM sees:
General hand info — player positions, stacks, hero's cards
Player stats across the tournament (VPIP, PFR, 3bet, etc.)
Notes hero has written about other players in past hands
From the LLM, we expect:
Reasoning about the decision
The action to take (executed in the poker engine)
A reasoning summary for the live viewer interface
Models have a maximum token limit for reasoning
If there's a problem with the response (timeout, invalid output), the fallback action is fold"
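A rough sketch of what that query-and-fallback loop could look like; query_model and is_legal are hypothetical stand-ins for the real LLM client and poker engine, and only the fold-on-failure behavior comes from the description above:

```python
import json

def decide(query_model, is_legal, prompt):
    """One decision point: ask the model, validate the output, fold on any failure."""
    try:
        raw = query_model(prompt, timeout_s=60)   # may raise on timeout
        move = json.loads(raw)                    # expects {"reasoning", "action", "summary"}
        if not is_legal(move["action"]):
            raise ValueError(f"illegal action: {move['action']}")
        return move
    except Exception as err:  # timeout, malformed JSON, illegal action, ...
        return {"action": "fold", "summary": f"fallback fold ({err})"}
```

Folding is always a legal action and never commits more chips, which is presumably why it makes a safe default.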
The fact that the models are given stats about the other models is rather disappointing to me; it makes things less interesting. I'd be curious how this would go if the models could only use their own notes/context. Maybe it's a way to save on costs, as this could get expensive...
We (TEN Protocol) did this a few months ago, using blockchain to make the LLMs' actions publicly visible and TEEs for verifiable randomness in shuffling and other processes. We used a mix of LLMs across five players and ran multiple tournaments over several months. The longest game we observed lasted over 50 hours straight.
Screenshot of the gameplay: https://pbs.twimg.com/media/GpywKpDXMAApYap?format=png&name=...
Post: https://x.com/0xJba/status/1907870687563534401
Article: https://x.com/0xJba/status/1920764850927468757
If anybody wants to spectate this, let us know and we can spin up a fresh tournament.
I wonder if these will get better over time. Fun idea and I kind of want to join a table.
For now at least, some can't even determine which hand they have:
> LLAMA bets $170 on Flop
> "We have top pair with Tc4d on a flop of 2s Ts Jh. The board is relatively dry, and we have a decent chance of having the best hand. We're betting $170.00 to build the pot and protect our hand."
Imo, this shows that LLMs are nice for compression, OCR and other similar tasks, but there is 0% thinking / logic involved:
magistral: "Turn card pairs the board with a T, potentially completing some straights and giving opponents possible two-pair or better hands"
A card that pairs the board does not help complete straights; if anything, the opposite is true, since that rank was already available. This is far worse than hallucinating a function signature that does not exist: if you base anything on these kinds of fundamental errors, you build nothing.
Read 10 turns on the website and you will find 2-3 extreme errors like this.
There needs to be a real breakthrough in actual thinking (regardless of how slow/expensive it might be) before I believe there is a path to AGI.
I gave a talk on this topic at PyConEs just 10 days ago. The idea was to have each (human) player secretly write a prompt, then use the same model to see which one wins.
It's just a proof of concept, but the code and instructions are here: https://github.com/pablorodriper/poker_with_agents_PyConEs20...
It doesn't seem like the design of this experiment allows AIs to evolve novel strategy over time. I wonder if poker-as-text is similar to maths -- LLMs are unable to reason about the underlying reality.
As a Texas Hold'em enthusiast, I find some of the hands moronic.
Just checked one (https://pokerbattle.ai/hand-history?session=37640dc1-00b1-4f...) where Grok wins with A3s because Gemini folds K10 with an Ace and a King on the board, without Grok betting anything. Gemini just folds instead of checking. It's not even GTO, it's just pure hallucination.
Meaning: I wouldn't read anything into the fact that Grok leads. These machines are not made to play games like online poker deterministically and would be CRUSHED by GTO play.
It would be more interesting instead to understand if they could play exploitatively.
Hi there, I'm also working on LLMs in Texas Hold'em :)
First of all, congrats on your work. Picking a way to present LLMs playing poker is a hard task, and I like your approach of presenting the Action Log.
I can share some interesting insights from my experiments:
- Finding strategies is more interesting than comparing different models. Strategies can get pretty long and specific. For example, if part of the strategy is: "bluff on the river if you have a weak hand but the opponent has been playing tight all game", most models, given this strategy, would execute it with the same outcome. Models can only really be compared using some open-ended strategy like "play aggressively" or "play tight", or even "win the tournament".
- I implemented a tournament game, where players drop out when they run out of chips. This creates a more dynamic environment, where players have to win a tournament, not just a hand. That requires adding the whole table history to the prompt, and it might get quite long, so context management might be a challenge.
- I tested playing an LLM against a randomly playing bot (1v1). `grok-4` was able to come up with the winning strategy against a random bot on the first try (I asked: "You play against a random bot. What is your strategy?"). `gpt-5-high` struggled.
- Public chat between LLMs over the poker table is fun to watch, but it is hard to create a strategy that makes an LLM successfully convince other LLMs to fold. Given their chain of thought, they are more focused on actions than on what others say. Yet, more experiments are needed. For weaker models (looking at you, `gpt-5-nano`) it is hard to convince them not to reveal their hand.
- Playing random hands is expensive. You would have to play thousands of hands to get some statistical significance measures. It's better to put LLMs in predefined situations (like AliceAI has a weak hand, BobAI has a strong hand) and see how they behave.
- 1-on-1 is easier to analyze and work with than multiplayer.
- There is an interesting choice to make when building the context for an LLM: should the previous chains of thought be included in the prompt? I found that including them actually makes LLMs "stick" to the first strategy they came up with, and they are less likely to adapt to the changing situation at the table. On the other hand, not including them makes LLMs "rethink" their strategy every time, which is more error-prone. I'm working on an AlphaEvolve-like approach now. (A sketch of this context-building choice follows after this list.)
- This will be super interesting to fine-tune an LLM model using an AlphaZero-like approach, where the model plays against itself and improves over time. But this is a complex task.
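Re the chain-of-thought bullet above: the trade-off boils down to a single branch in the prompt builder. A minimal sketch with hypothetical field names, not the actual experiment code:

```python
def build_prompt(state, notes, prior_cots, include_prior_cot):
    """Assemble one decision prompt. Carrying prior reasoning forward anchors
    the model to its first plan; dropping it forces a fresh (but more
    error-prone) rethink each turn. All field names here are assumptions."""
    parts = [
        f"Table state: {state}",
        f"Notes on opponents: {notes}",
    ]
    if include_prior_cot:
        # Include only the most recent few to keep the context manageable.
        parts.append("Your previous reasoning:\n" + "\n".join(prior_cots[-3:]))
    parts.append("Decide: fold, call, or raise. Explain briefly.")
    return "\n\n".join(parts)

print(build_prompt("BTN, 95bb, hero holds AhKd", "BobAI folds to 3-bets",
                   ["Hand 12: plan to play tight..."], include_prior_cot=True))
```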
Cool idea and interesting that Grok is winning and has “bad” stats.
I wonder if Grok is exploiting Mistral and Meta, who VPIP too much and then don't c-bet. Seems to win a lot of showdowns and folds to a lot of 3-bets. Punishes the nits because it's able to get away from bad hands.
Goes to showdown very little, so it's not showing its hands much; it's winning smaller pots earlier on.
A related experiment: six LLMs were given $10k each to trade in real markets autonomously, using only numerical market data inputs and the same prompt/harness.