Show HN: PokemonGym – 387 milestones designed to test agents and LLMs

a year ago

We've developed PokemonGym, an open-source benchmark that uses Pokemon gameplay to evaluate LLM capabilities in tool use, information extraction, and reasoning.

The benchmark features 387 carefully designed milestones (reaching locations, catching Pokemon, earning badges) with assigned difficulty scores to create a standardized evaluation framework.
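To make the scoring idea concrete, here is a minimal sketch of how difficulty-weighted milestones could be rolled up into a single number. It is an illustration only, not the PokemonGym implementation: the `Milestone` class, the `weighted_score` function, the three example milestones, and their difficulty values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Milestone:
    """Hypothetical milestone record; field names are assumptions."""
    name: str
    kind: str          # e.g. "location", "catch", "badge"
    difficulty: float  # higher means harder


def weighted_score(milestones, reached):
    """Fraction of total difficulty earned by the milestones in `reached`."""
    total = sum(m.difficulty for m in milestones)
    if total == 0:
        return 0.0
    earned = sum(m.difficulty for m in milestones if m.name in reached)
    return earned / total


# Made-up examples; the real benchmark defines 387 milestones.
MILESTONES = [
    Milestone("leave_starting_town", "location", 1.0),
    Milestone("catch_first_pokemon", "catch", 2.0),
    Milestone("earn_first_badge", "badge", 5.0),
]

if __name__ == "__main__":
    reached = {"leave_starting_town", "catch_first_pokemon"}
    print(f"score = {weighted_score(MILESTONES, reached):.3f}")
```

Weighting by difficulty means an agent that earns a badge scores higher than one that only reaches easy locations, even if both hit the same number of milestones.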

Our initial testing showed a small performance gap: amateur human players need ~400 steps to catch their first Pokemon, while Claude 3.7 needs ~450 steps, about 12% more. On this early milestone, at least, the model is close to amateur human performance.

The benchmark will soon be available on benchflow.ai with a simple API for testing your own agents and models.

GitHub repo: https://github.com/benchflow-ai/pokemon-gym

We're looking for collaborators interested in improving the harness or running experiments with different models.