AI Models Face Off in Poker Arena: Claude Opus 4.5 Emerges as Early Leader in New Reasoning Benchmark
A new informal benchmark pits leading LLMs against one another in fast-paced poker tournaments to measure probabilistic reasoning, bluffing, and decision-making under uncertainty.
In a novel twist on evaluating artificial intelligence, a new informal benchmark called the LLM Poker Arena has launched, pitting the world’s most advanced large language models against each other in intense games of No-Limit Texas Hold’em. The brainchild of independent developer Anshul Dhawan, the arena tests frontier models on skills that traditional benchmarks often miss: probabilistic reasoning, bluffing, opponent modeling, and decision-making under uncertainty.
Dhawan, a former game developer at Zynga and Electronic Arts who now builds AI-powered gaming experiences, announced the project on 𝕏 (formerly Twitter) this week. In a post that quickly gained traction among AI researchers and enthusiasts, he shared early results from the first five tournaments—and one model is already pulling ahead.
Tournament Setup: A True Test of Strategic Depth
Each tournament features four players at a single table:
- Claude Opus 4.5 (Anthropic, using its advanced “Thinking” mode for step-by-step reasoning)
- GPT-5.2 (OpenAI’s latest flagship)
- Gemini 2.5 Pro (Google DeepMind)
- Grok 4 (xAI)
Each player starts with $1,000 in chips; blinds begin at $25/$50 and double every three minutes, a fast structure designed to force aggressive play and tough decisions. The models operate in their most capable modes, calculating odds, reading opponents, and deciding when to fold, call, raise, or go all-in.
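To see why this structure forces action, consider how quickly the blinds outgrow the stacks. The short Python sketch below is not from the arena's (unreleased) implementation; it simply restates the figures above and prints the first few blind levels:

```python
STARTING_STACK = 1_000
SMALL_BLIND, BIG_BLIND = 25, 50
DOUBLE_EVERY_MIN = 3

# Show how quickly the doubling blinds consume a $1,000 starting stack.
sb, bb = SMALL_BLIND, BIG_BLIND
for level in range(5):
    minutes = level * DOUBLE_EVERY_MIN
    print(f"t={minutes:>2} min: blinds {sb}/{bb} "
          f"(big blind = {bb / STARTING_STACK:.0%} of the starting stack)")
    sb, bb = sb * 2, bb * 2
```

By the twelve-minute mark the big blind alone is $800, or 80% of the starting stack, so waiting for premium hands is not an option.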
Dhawan built the arena on Replit using direct API access to each provider, integrating it into his broader project Poker Cities, an upcoming multiplayer poker platform. A short demo video accompanying the announcement shows the models in action: cards dealt, pot sizes growing, and AI agents making moves in real time.
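Dhawan has not published the arena's code, but the core loop is straightforward to picture: serialize the visible game state into a prompt, send it to the acting model's API, and parse the chosen move from the reply. The sketch below is a hypothetical illustration of that one step. `query_model` is a stand-in for whichever provider SDK serves a given model, and the prompt wording and JSON reply format are assumptions, not the arena's actual protocol.

```python
import json

# Hypothetical stand-in for the per-provider API call; a real arena would
# route this to the Anthropic, OpenAI, Google, or xAI SDK for `model_name`.
def query_model(model_name: str, prompt: str) -> str:
    # Canned reply so the sketch runs end to end without API keys.
    return '{"action": "call", "amount": 0}'

def request_action(model_name: str, game_state: dict) -> dict:
    """Ask one model for a poker decision given the visible game state."""
    prompt = (
        "You are playing No-Limit Texas Hold'em.\n"
        f"Your hole cards: {game_state['hole_cards']}\n"
        f"Board: {game_state['board']}\n"
        f"Pot: ${game_state['pot']}  Your stack: ${game_state['stack']}\n"
        f"Amount to call: ${game_state['to_call']}\n"
        'Reply only with JSON: {"action": "fold|call|raise|all_in", "amount": <integer>}'
    )
    reply = query_model(model_name, prompt)
    return json.loads(reply)  # production code would validate and retry on malformed output

if __name__ == "__main__":
    state = {"hole_cards": "Ah Kd", "board": "Qs 7h 2c",
             "pot": 300, "stack": 850, "to_call": 100}
    print(request_action("claude-opus-4.5", state))
```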
Early Results: Claude Takes Command
After five completed tournaments—a small sample size that Dhawan himself cautions is subject to poker’s natural variance—the standings are:
- Claude Opus 4.5 (Thinking): 3 wins
- GPT-5.2: 2 wins
- Gemini 2.5 Pro: 0 wins
- Grok 4: 0 wins
While the sample is limited, Claude’s strong start has sparked discussion in the AI community. Poker experts note that success in Texas Hold’em requires balancing mathematical precision with psychological deception—areas where step-by-step reasoning modes appear to give Claude an edge in these early matches.
Why Poker Matters for AI Evaluation
Traditional benchmarks like MMLU, HumanEval, or GSM8K focus on knowledge recall, coding, and mathematical problem-solving in perfect-information settings. Poker, by contrast, is a game of incomplete information, long-term planning, and adapting to opponents’ tendencies—skills many believe are closer to real-world intelligence.
Researchers have long used games to measure AI progress: Deep Blue beat chess grandmaster Garry Kasparov in 1997, AlphaGo defeated Go champion Lee Sedol in 2016, and Pluribus outperformed top human pros in multi-player poker in 2019. The LLM Poker Arena brings this tradition to the era of general-purpose language models, offering a dynamic, repeatable way to compare frontier systems.
What’s Next
Dhawan plans to scale the experiment dramatically:
- Running 100+ tournaments for statistically meaningful results
- Publishing detailed hand histories and performance metrics
- Adding a prediction market where observers can bet on future winners
As more data rolls in, the leaderboard could shift quickly—poker is notoriously high-variance, and small samples can be misleading. Still, the early dominance of Claude Opus 4.5 has already fueled lively debate about which reasoning architectures best handle uncertainty and deception.
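That caution is easy to quantify. Under a null hypothesis in which all four models are equally skilled, so each tournament winner is a uniform random draw, some model ends up with three or more wins out of five roughly 40% of the time. The quick simulation below is not part of Dhawan's project; it is just a back-of-the-envelope check of that claim:

```python
import random

N_TOURNAMENTS = 5
N_MODELS = 4
TRIALS = 100_000

# Null hypothesis: all four models are equally skilled, so each tournament's
# winner is a uniform random draw. Count how often some model still racks up
# three or more wins purely by chance.
chance_leads = 0
for _ in range(TRIALS):
    wins = [0] * N_MODELS
    for _ in range(N_TOURNAMENTS):
        wins[random.randrange(N_MODELS)] += 1
    if max(wins) >= 3:
        chance_leads += 1

print(f"P(some model wins >= 3 of 5 under equal skill) ~= {chance_leads / TRIALS:.2f}")
# Prints roughly 0.41, which is why five tournaments say little on their own.
```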
For now, the cards are on the table. The AI community is watching closely to see which model will ultimately claim the pot.