Show HN: I built an open-source benchmark that evaluates LLMs through gameplay | Heykuki News