Show HN: I built an open-source benchmark that evaluates LLMs through gameplay

a year ago

I built an open-source framework that evaluates LLMs through competitive gameplay. So far there are 3 games: a debate contest where LLMs try to persuade each other of different positions, a poetry slam where they judge each other's creativity, and a simple strategy game of cooperation and defection based on the prisoner's dilemma. The idea is that by pitting models against one another and measuring their relative strengths, the benchmark scales as model capabilities improve, instead of saturating the way a fixed test set does.
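To make the "relative strengths" idea concrete, here is a minimal sketch of ranking models from head-to-head results. It is an illustration under assumptions, not the project's actual scoring code: the `leaderboard` function, the `(model_a, model_b, winner)` tuple format, and the half-point-per-draw win-rate ranking are all hypothetical.

```python
from collections import defaultdict

def leaderboard(matches):
    """Rank models by win rate over pairwise matches.

    `matches` is an iterable of (model_a, model_b, winner) tuples, where
    winner is model_a, model_b, or None for a draw. This format is an
    assumption made for illustration, not the project's data format.
    """
    wins = defaultdict(float)
    played = defaultdict(int)
    for a, b, winner in matches:
        played[a] += 1
        played[b] += 1
        if winner is None:
            # Split the point on a draw.
            wins[a] += 0.5
            wins[b] += 0.5
        else:
            wins[winner] += 1.0
    rates = {model: wins[model] / played[model] for model in played}
    # Highest win rate first; ties broken alphabetically for determinism.
    return sorted(rates.items(), key=lambda kv: (-kv[1], kv[0]))

# Placeholder model names, not real results.
example_matches = [
    ("model_x", "model_y", "model_x"),
    ("model_y", "model_z", None),
    ("model_x", "model_z", "model_x"),
]
ranking = leaderboard(example_matches)
```

Since a score here depends only on results against other models, adding a stronger model just adds more matches. There is no fixed answer key to outgrow.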

Some interesting results have emerged.

DeepSeek R1 seems to be the most persuasive model - it's ranked #1 in debate slam and often sweeps the votes (as one example, in a debate against ChatGPT-4.5 it convinced all of the judges both when arguing for genetic engineering and when arguing against it). DeepSeek R1 is also the current poetry slam champion by a wide margin, and its poems are often unanimous favorites. I'm not sure if this constitutes "creativity" per se or is more like a different flavor of persuasion, but either way it seems impressive. I've read some of its poems and find them beautiful.

Grok-2, meanwhile, is the current champion in prisoner's dilemma. It seems to be good at picking the moment to defect that maximizes its score (it is the first defector in 90% of its games).
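For anyone unfamiliar with the setup, here is a minimal sketch of an iterated prisoner's dilemma match that also records who defected first. It assumes the standard textbook payoffs (3/3 for mutual cooperation, 5/0 for a lone defector, 1/1 for mutual defection) and a 10-round game. The project's actual payoffs, round count, and prompts may differ, and `play_match`, `tit_for_tat`, and `late_defector` are hypothetical names, with simple functions standing in for the LLM players.

```python
from typing import Callable, List, Optional, Tuple

# Textbook payoff matrix (an assumption, not taken from the project).
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

# A strategy maps (own history, opponent history) to "C" or "D".
Strategy = Callable[[List[str], List[str]], str]

def play_match(a: Strategy, b: Strategy, rounds: int = 10) -> Tuple[int, int, Optional[str]]:
    """Play `rounds` rounds and return (score_a, score_b, first_defector).

    first_defector is "a", "b", "both", or None if nobody ever defected.
    """
    hist_a: List[str] = []
    hist_b: List[str] = []
    score_a = score_b = 0
    first_defector: Optional[str] = None
    for _ in range(rounds):
        move_a = a(hist_a, hist_b)
        move_b = b(hist_b, hist_a)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        if first_defector is None:
            if move_a == "D" and move_b == "D":
                first_defector = "both"
            elif move_a == "D":
                first_defector = "a"
            elif move_b == "D":
                first_defector = "b"
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b, first_defector

def tit_for_tat(own: List[str], opp: List[str]) -> str:
    # Cooperate first, then copy the opponent's previous move.
    return "C" if not opp else opp[-1]

def late_defector(own: List[str], opp: List[str]) -> str:
    # Cooperate for 8 rounds, then defect for the rest of the game.
    return "D" if len(own) >= 8 else "C"

result = play_match(tit_for_tat, late_defector, rounds=10)
```

With these assumed payoffs, a well-timed late defection beats tit-for-tat over 10 rounds, which is the kind of behavior the "first defector" statistic is meant to capture.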

This is, to my knowledge, the only open-source benchmark of its kind. I think the open part is important, because it means the methodology and results are verifiable and reproducible. It also means (I hope) that others can jump in to contribute, either by adding new games, coming up with new ways to analyze and visualize the results, or by providing feedback. This has a lot of room to grow.

I'm open to any and all critiques and feedback. And if you'd like to contribute, please visit the project on GitHub: https://github.com/jmogielnicki/llmshowdown

Cheers, John