The benchmark defines 387 carefully designed milestones (reaching locations, catching Pokemon, earning badges), each with an assigned difficulty score, to provide a standardized evaluation framework.
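To make the idea of difficulty-weighted milestones concrete, here is a minimal sketch of how such a score could be computed. Everything in it is an assumption for illustration: the `Milestone` class, the `weighted_score` function, the three example milestones, and their difficulty values are hypothetical and are not taken from the benchmark's actual milestone list or scoring code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Milestone:
    """Hypothetical milestone record; field names are illustrative only."""
    name: str
    kind: str          # e.g. "location", "pokemon", or "badge"
    difficulty: float  # assumed weight; the real scores may differ


def weighted_score(milestones, reached):
    """Return the fraction of total difficulty covered by reached milestones."""
    total = sum(m.difficulty for m in milestones)
    if total == 0:
        return 0.0
    earned = sum(m.difficulty for m in milestones if m.name in reached)
    return earned / total


# Illustrative milestones only, not the benchmark's real list of 387.
MILESTONES = [
    Milestone("Reach Viridian City", "location", 1.0),
    Milestone("Catch first Pokemon", "pokemon", 2.0),
    Milestone("Earn Boulder Badge", "badge", 5.0),
]

# An agent that reached the first two milestones but not the badge.
score = weighted_score(MILESTONES, {"Reach Viridian City", "Catch first Pokemon"})
print(f"weighted score: {score:.3f}")
```

Weighting by difficulty rather than counting milestones means an agent that earns a badge gets more credit than one that only reaches an easy location, which is presumably the point of assigning scores in the first place.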
Our initial testing showed a small performance gap: amateur human players need ~400 steps to catch their first Pokemon, while Claude 3.7 needs ~450 steps (roughly 12% more). This suggests AI models are approaching human-level performance, at least on this early milestone.
The benchmark will soon be available on benchflow.ai with a simple API for testing your own agents and models.
GitHub repo: https://github.com/benchflow-ai/pokemon-gym
We're looking for collaborators interested in improving the harness or running experiments with different models.