Show HN: Tarsier – Vision utilities for web interaction agents

2 years ago

Hey HN! I built a tool that gives LLMs the ability to understand the visual structure of a webpage even if they don't accept image input. We've found that unimodal GPT-4 + Tarsier's textual webpage representation consistently beats multimodal GPT-4V/4o + webpage screenshot by 10-20%, probably because multimodal LLMs still aren't as performant as they're hyped to be.

Over the course of experimenting with pruned HTML, accessibility trees, and other perception systems for web agents, we've iterated on Tarsier's components to maximize downstream agent/codegen performance.

Here's the Tarsier pipeline in a nutshell:

1. tag interactable elements with IDs for the LLM to act upon & grab a full-sized webpage screenshot

2. for text-only LLMs, run OCR on the screenshot & convert it to whitespace-structured text (this is the coolest part imo)

3. map LLM intents back to actions on elements in the browser via an ID-to-XPath dict
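To make steps 2 and 3 concrete, here's a rough sketch in plain Python. This is not Tarsier's actual code or API: the `OcrWord` shape, the `[N]` tag syntax, the `CLICK [N]` reply format, and the grid sizes are all assumptions made up for illustration. It places OCR'd words on a character grid so the page layout survives as spaces and newlines, then resolves an LLM reply to an XPath through a dict.

```python
import re
from dataclasses import dataclass


@dataclass
class OcrWord:
    """One OCR hit: text plus the top-left pixel of its box (hypothetical shape)."""
    text: str
    x: int
    y: int


def to_whitespace_text(words, char_width=10, line_height=20):
    """Lay words out on a character grid so rough page layout is kept as whitespace.

    char_width and line_height are assumed pixel sizes of one grid cell.
    """
    rows = {}
    for w in words:
        rows.setdefault(w.y // line_height, []).append(w)
    if not rows:
        return ""
    lines = []
    for row in range(max(rows) + 1):  # empty rows become blank lines
        line = ""
        for w in sorted(rows.get(row, []), key=lambda w: w.x):
            col = w.x // char_width
            # Never overwrite earlier text; keep at least one space between words.
            if line and col <= len(line):
                col = len(line) + 1
            line = line.ljust(col) + w.text
        lines.append(line)
    return "\n".join(lines)


def resolve_action(llm_reply, tag_to_xpath):
    """Turn a reply like 'CLICK [1]' into (verb, xpath) via the ID-to-XPath dict."""
    match = re.fullmatch(r"\s*(\w+)\s+\[(\d+)\]\s*", llm_reply)
    if match is None:
        raise ValueError(f"unrecognized reply: {llm_reply!r}")
    tag_id = int(match.group(2))
    if tag_id not in tag_to_xpath:
        raise KeyError(f"no element tagged [{tag_id}]")
    return match.group(1).upper(), tag_to_xpath[tag_id]


if __name__ == "__main__":
    # Made-up OCR output for a page with a search box, a login button, and a heading.
    words = [
        OcrWord("[0]", 0, 0), OcrWord("Search", 50, 0),
        OcrWord("[1]", 300, 0), OcrWord("Login", 340, 0),
        OcrWord("Results", 0, 60),
    ]
    tag_to_xpath = {0: "//input[@name='q']", 1: "//button[@id='login']"}
    print(to_whitespace_text(words))
    print(resolve_action("click [1]", tag_to_xpath))
```

The grid trick is the whole point: a text-only model sees that "Login" sits far to the right of "Search" on the same row, which a flat OCR dump would lose.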

Humans interact with the web through visually-rendered pages, and agents should too. We run Tarsier in production for thousands of web data extraction agents a day at Reworkd (https://reworkd.ai).

By the way, we're hiring backend/infra engineers with experience in compute-intensive distributed systems!

https://reworkd.ai/careers
