Over the course of experimenting with pruned HTML, accessibility trees, and other perception systems for web agents, we've iterated on Tarsier's components to maximize downstream agent/codegen performance.
Here's the Tarsier pipeline in a nutshell:
1. tag interactable elements with IDs for the LLM to act upon & grab a full-sized webpage screenshot
2. for text-only LLMs, run OCR on the screenshot & convert it to whitespace-structured text (this is the coolest part imo)
3. map LLM intents back to actions on elements in the browser via an ID-to-XPath dict
Humans interact with the web through visually-rendered pages, and agents should too. We run Tarsier in production for thousands of web data extraction agents a day at Reworkd (https://reworkd.ai).
By the way, we're hiring backend/infra engineers with experience in compute-intensive distributed systems!