Some interesting results have emerged.
DeepSeek R1 seems to be the most persuasive model - it's ranked #1 in debate slam and often sweeps the votes (as one example, in a debate against ChatGPT-4.5 it convinced all of the judges whether it was arguing for or against genetic engineering). DeepSeek R1 is also the current poetry slam champion by a wide margin, and its poems are often unanimous favorites. I'm not sure if this constitutes "creativity" per se or is more like a different flavor of persuasion, but either way it seems impressive. I've read some of its poems and find them beautiful.
Grok-2, meanwhile, is the current champion in prisoner's dilemma. It seems to be good at picking the moment to defect that maximizes its score (it is the first defector in 90% of its games).
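For anyone curious how a "first defector" rate like that could be computed, here is a minimal sketch. It is not the project's actual code: the game format (a list of rounds, each a pair of "C"/"D" moves), the function names, and the toy data are all assumptions made up for illustration. It also assumes that a round where both players defect for the first time counts for neither of them.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical format: one round is (move of player A, move of player B),
# where "C" means cooperate and "D" means defect.
Round = Tuple[str, str]


def first_defector(rounds: Sequence[Round]) -> Optional[str]:
    """Return "A", "B", "both", or None if nobody ever defects."""
    for move_a, move_b in rounds:
        if move_a == "D" and move_b == "D":
            return "both"  # assumption: simultaneous first defection is a tie
        if move_a == "D":
            return "A"
        if move_b == "D":
            return "B"
    return None


def first_defector_rate(games: Sequence[Sequence[Round]], player: str = "A") -> float:
    """Fraction of games in which `player` was the sole first defector."""
    if not games:
        return 0.0
    hits = sum(1 for game in games if first_defector(game) == player)
    return hits / len(games)


# Made-up toy games, not real benchmark results.
toy_games: List[List[Round]] = [
    [("C", "C"), ("D", "C"), ("D", "D")],  # A defects first
    [("C", "C"), ("C", "D")],              # B defects first
    [("C", "C"), ("C", "C")],              # nobody defects
    [("D", "D")],                          # both defect at once
]

rate = first_defector_rate(toy_games, player="A")
print(f"Player A was first defector in {rate:.0%} of toy games")
```

Counting ties differently (for example, crediting both players) would change the number, so the definition matters when comparing models.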
This is, to my knowledge, the only open-source benchmark of its kind. I think the open part is important, because it means the methodology and results are verifiable and reproducible. It also means (I hope) that others can jump in to contribute, either by adding new games, coming up with new ways to analyze and visualize the results, or by providing feedback. This has a lot of room to grow.
I'm open to any and all critiques and feedback. And if you'd like to contribute, please visit the project on GitHub: https://github.com/jmogielnicki/llmshowdown
Cheers, John