You define a test as a series of natural-language steps and run it; the LLM then carries out the test case by looking at browser screenshots and taking actions. If it runs into an issue or a bug, the test case fails with a description of the problem. We aim to eliminate the flakiness of existing paradigms like Playwright/Selenium by letting the LLM decide what to do, based on what it observes, in order to follow the test case. That way, if you change a feature or some selectors move around, the test case should still work, and you don't have to keep updating your tests.
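To make the idea concrete, here is a rough sketch of what an observe-decide-act loop like this could look like. It is not the tool's actual code: `run_test`, `Decision`, `Result`, `FakeBrowser`, and `scripted_decide` are all made-up names, the "browser" is an in-memory fake, and the "LLM" is a scripted stub standing in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Decision:
    """Hypothetical shape of what the model returns after seeing a screenshot."""
    kind: str          # "act", "step_done", or "fail"
    detail: str = ""   # the action to perform, or a description of the problem


@dataclass
class Result:
    passed: bool
    problem: Optional[str] = None


def run_test(
    steps: List[str],
    take_screenshot: Callable[[], bytes],
    decide: Callable[[str, bytes], Decision],
    perform: Callable[[str], None],
    max_actions_per_step: int = 10,
) -> Result:
    """Follow each natural-language step by repeatedly observing and acting."""
    for step in steps:
        for _ in range(max_actions_per_step):
            # Observe the current screen, then let the model pick what to do.
            decision = decide(step, take_screenshot())
            if decision.kind == "step_done":
                break
            if decision.kind == "fail":
                return Result(False, f"{step!r}: {decision.detail}")
            perform(decision.detail)
        else:
            # The inner loop ran out of actions without finishing the step.
            return Result(False, f"{step!r}: not completed within {max_actions_per_step} actions")
    return Result(True)


class FakeBrowser:
    """In-memory stand-in for a real browser (assumption, for illustration)."""

    def __init__(self) -> None:
        self.page = "home"

    def screenshot(self) -> bytes:
        return self.page.encode()

    def perform(self, action: str) -> None:
        if action == "click login":
            self.page = "login form"


def scripted_decide(step: str, screenshot: bytes) -> Decision:
    """Scripted stub where a real implementation would call an LLM."""
    page = screenshot.decode()
    if step == "Open the login form":
        if page == "login form":
            return Decision("step_done")
        return Decision("act", "click login")
    if step == "Check that the dashboard is visible":
        if page == "dashboard":
            return Decision("step_done")
        return Decision("fail", f"expected the dashboard, saw the {page} page")
    return Decision("fail", "unknown step")


if __name__ == "__main__":
    browser = FakeBrowser()
    steps = ["Open the login form", "Check that the dashboard is visible"]
    print(run_test(steps, browser.screenshot, scripted_decide, browser.perform))
```

Note that the steps never mention a selector. The only coupling to the page is whatever the model sees in the screenshot, which is why a renamed CSS class or a moved button should not break the test by itself. The per-step action cap is just a guess at one way to keep a confused model from looping forever.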