Measures how well various LLMs navigate a fictional codebase by iteratively expanding its directory tree and observing the result.
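
A minimal sketch of what one expand-and-observe episode could look like, assuming the model sees a rendered tree and replies with either a directory to expand or its final answer. The `TREE` contents, `render`, `navigate`, and `scripted_policy` are all hypothetical names made up for illustration; `scripted_policy` is a deterministic stand-in for a real LLM call, and none of this is taken from the project's actual code.

```python
from typing import Callable, Dict, List, Optional, Set

# Hypothetical fictional codebase: directory path -> child names.
# Names ending in "/" are directories. All entries are invented.
TREE: Dict[str, List[str]] = {
    "/": ["src/", "docs/", "README.md"],
    "/src/": ["auth/", "main.py"],
    "/src/auth/": ["login.py", "tokens.py"],
    "/docs/": ["index.md"],
}


def render(expanded: Set[str], path: str = "/", depth: int = 0) -> List[str]:
    """Return the visible tree as indented lines; collapsed dirs hide children."""
    lines: List[str] = []
    for name in TREE[path]:
        child = path + name
        lines.append("  " * depth + name)
        if name.endswith("/") and child in expanded:
            lines.extend(render(expanded, child, depth + 1))
    return lines


def navigate(
    choose: Callable[[str, str], str], target: str, max_steps: int = 10
) -> Optional[int]:
    """Run one episode; return the number of steps used, or None on failure."""
    expanded: Set[str] = set()
    for step in range(1, max_steps + 1):
        observation = "\n".join(render(expanded))
        action = choose(observation, target)
        if action == target:      # the policy commits to its final answer
            return step
        if action in TREE:        # the policy asks to expand a directory
            expanded.add(action)
    return None


def scripted_policy(observation: str, target: str) -> str:
    """Stand-in for an LLM: expand the first ancestor whose child is hidden."""
    visible = {line.strip() for line in observation.splitlines()}
    parts = target.strip("/").split("/")
    path = "/"
    for i, part in enumerate(parts[:-1]):
        path += part + "/"
        is_dir = i + 1 < len(parts) - 1
        next_name = parts[i + 1] + ("/" if is_dir else "")
        if next_name not in visible:
            return path
    return target


steps = navigate(scripted_policy, "/src/auth/tokens.py")
```

Swapping `scripted_policy` for a function that calls a model would turn the step count (or a failure) into a per-episode score.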
Each model's baseline performance is compared against combinations of prompt-engineering modifications to quantify how much each combination helps or hinders the model.
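
The baseline-versus-combinations comparison could be organized roughly as follows. This is a sketch under stated assumptions: `evaluate` is a hypothetical caller-supplied function that runs the benchmark with a given tuple of modifications and returns a numeric score, and the function names here are illustrative rather than the project's own.

```python
from itertools import combinations
from typing import Callable, Dict, Iterable, Iterator, Tuple

Combo = Tuple[str, ...]


def mod_combinations(mods: Iterable[str]) -> Iterator[Combo]:
    """Yield every non-empty combination of the given modifications."""
    mods = list(mods)
    for size in range(1, len(mods) + 1):
        yield from combinations(mods, size)


def deltas_vs_baseline(
    mods: Iterable[str], evaluate: Callable[[Combo], float]
) -> Dict[Combo, float]:
    """Score each combination relative to the unmodified baseline.

    `evaluate` is an assumption: it should run the model with the given
    modifications applied (an empty tuple means baseline) and return a score.
    Positive deltas mean the combination helped; negative means it hindered.
    """
    baseline = evaluate(())
    return {combo: evaluate(combo) - baseline for combo in mod_combinations(mods)}
```

Because the number of combinations grows as 2^n - 1 for n modifications, a real run would likely cap the combination size or sample a subset.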