Show HN: Trawl – LLM-powered web scraper that calls the AI once - runs pure Go

2 points

3 months ago

Every scraper I've written has the same failure mode: it works for three months, a site redesigns, and my CSS selectors silently return empty strings. The data is still right there on the page — a human can find it instantly — but the scraper is blind.

Trawl fixes this by splitting the problem. You describe what you want:

    trawl "https://books.toscrape.com" --fields "title, price, rating, in_stock"

The LLM (Claude) looks at one sample item and derives a full extraction strategy — CSS selectors, attribute mappings, type coercion, fallback selectors. That strategy gets cached. Every subsequent page with the same structure is extracted with pure Go + goquery. No API calls, no token cost, full concurrency.
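
To make "strategy" concrete, here is a minimal sketch of what a cached strategy with fallback selectors could look like. It is illustrative only and not lifted from the trawl source: the type names, JSON keys, and the `firstMatch` helper are all made up for this example, and `lookup` stands in for a real DOM query (with goquery that would be something along the lines of `item.Find(sel).First().Text()`).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FieldRule is a hypothetical shape for one field's extraction rule.
// None of these names come from trawl itself.
type FieldRule struct {
	Selector  string   `json:"selector"`
	Fallbacks []string `json:"fallbacks,omitempty"`
	Attr      string   `json:"attr,omitempty"`   // empty means "use the element text"
	Coerce    string   `json:"coerce,omitempty"` // e.g. "parse_price"
	Type      string   `json:"type"`
}

// Strategy is a hypothetical cached plan for one page structure.
type Strategy struct {
	ItemSelector string               `json:"item_selector"`
	Fields       map[string]FieldRule `json:"fields"`
	Confidence   float64              `json:"confidence"`
}

// Encode serializes a strategy so it can be written to a cache.
func (s Strategy) Encode() ([]byte, error) { return json.Marshal(s) }

// DecodeStrategy restores a strategy read back from a cache.
func DecodeStrategy(raw []byte) (Strategy, error) {
	var s Strategy
	err := json.Unmarshal(raw, &s)
	return s, err
}

// firstMatch tries the primary selector, then each fallback in order, and
// returns the first non-empty value. lookup stands in for a real DOM query.
func firstMatch(rule FieldRule, lookup func(selector string) (string, bool)) (string, bool) {
	selectors := append([]string{rule.Selector}, rule.Fallbacks...)
	for _, sel := range selectors {
		if v, ok := lookup(sel); ok && v != "" {
			return v, true
		}
	}
	return "", false
}

func main() {
	s := Strategy{
		ItemSelector: "div.product-card",
		Fields: map[string]FieldRule{
			"name":  {Selector: "h2.product-title", Type: "string"},
			"price": {Selector: "span.price", Fallbacks: []string{"[itemprop=price]"}, Coerce: "parse_price", Type: "float"},
		},
		Confidence: 0.95,
	}

	// Round-trip through JSON, as a cache on disk would.
	raw, err := s.Encode()
	if err != nil {
		panic(err)
	}
	cached, err := DecodeStrategy(raw)
	if err != nil {
		panic(err)
	}

	// A fake item where only the fallback price selector matches.
	item := map[string]string{"h2.product-title": "Desk Lamp", "[itemprop=price]": "$24.00"}
	lookup := func(sel string) (string, bool) { v, ok := item[sel]; return v, ok }

	for _, field := range []string{"name", "price"} {
		v, ok := firstMatch(cached.Fields[field], lookup)
		fmt.Println(field, v, ok)
	}
}
```

Because the strategy is plain data, applying it needs no model call. Each page just becomes a loop over items and fields.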

The key insight: LLMs are good at understanding HTML structure, but you don't need them to extract 10,000 rows. Use AI for intelligence, Go for throughput.

When a site redesigns, the structural fingerprint changes, the cache misses, and trawl re-derives automatically.
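
One way to picture a structural fingerprint is the sketch below. It rests on an assumption of mine: the fingerprint is a hash over the set of distinct tag-plus-class signatures on the page, with text content ignored. The post does not say how trawl actually computes it, and a real implementation would use a proper HTML parser rather than regular expressions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// openTag matches an opening tag and captures its name and raw attributes.
var openTag = regexp.MustCompile(`<([a-zA-Z][a-zA-Z0-9]*)\b([^>]*)>`)

// classAttr captures the value of a double-quoted class attribute.
var classAttr = regexp.MustCompile(`(?:^|\s)class="([^"]*)"`)

// fingerprint hashes the sorted set of "tag.class1.class2" signatures.
// Text, attribute values other than class, and item counts do not affect it.
func fingerprint(html string) string {
	seen := map[string]struct{}{}
	for _, m := range openTag.FindAllStringSubmatch(html, -1) {
		sig := strings.ToLower(m[1])
		if c := classAttr.FindStringSubmatch(m[2]); c != nil {
			classes := strings.Fields(c[1])
			sort.Strings(classes) // class order should not matter
			sig += "." + strings.Join(classes, ".")
		}
		seen[sig] = struct{}{}
	}
	sigs := make([]string, 0, len(seen))
	for sig := range seen {
		sigs = append(sigs, sig)
	}
	sort.Strings(sigs)
	sum := sha256.Sum256([]byte(strings.Join(sigs, "|")))
	return hex.EncodeToString(sum[:])
}

func main() {
	original := `<div class="product-card"><h2 class="product-title">A</h2><span class="price">$1</span></div>`
	sameLayout := `<div class="product-card"><h2 class="product-title">B</h2><span class="price">$2</span></div>`
	redesigned := `<article class="card"><h3 class="card__name">A</h3><p class="card__cost">$1</p></article>`

	// The cache is keyed by fingerprint; the value here is just a label.
	cache := map[string]string{fingerprint(original): "strategy derived from the original layout"}

	for _, page := range []string{sameLayout, redesigned} {
		if s, ok := cache[fingerprint(page)]; ok {
			fmt.Println("cache hit:", s)
		} else {
			fmt.Println("cache miss: re-derive strategy")
		}
	}
}
```

Hashing a set rather than the full tag sequence means a page with 20 products fingerprints the same as one with 10, which is what you want for paginated listings. The trade-off is that it ignores nesting.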

You can preview exactly what it figured out:

    $ trawl "https://example.com/products" --fields "name, price" --plan

    Strategy for https://example.com/products
      Item selector: div.product-card
      Fields:
        name:  h2.product-title -> text (string)
        price: span.price -> text -> parse_price (float)
      Confidence: 0.95
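
The `parse_price` step in that plan is a type coercion from display text to a float. A rough sketch of what such a coercion could do is below. The function name and its rules are assumptions for illustration, and it treats commas as thousands separators, so it would misread European-style prices like "1.299,00".

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// numeric finds the first run of digits, dots, and commas that starts with a digit.
var numeric = regexp.MustCompile(`[0-9][0-9.,]*`)

// parsePrice pulls the first number out of a price string such as "£51.77"
// or "$1,299.00". Commas are assumed to be thousands separators.
func parsePrice(text string) (float64, error) {
	m := numeric.FindString(text)
	if m == "" {
		return 0, fmt.Errorf("no number found in %q", text)
	}
	m = strings.TrimRight(m, ".,")     // drop trailing punctuation, e.g. "12."
	m = strings.ReplaceAll(m, ",", "") // "1,299.00" becomes "1299.00"
	return strconv.ParseFloat(m, 64)
}

func main() {
	for _, s := range []string{"£51.77", "$1,299.00", "Out of stock"} {
		v, err := parsePrice(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(v)
	}
}
```

Returning an error instead of 0 for unparseable text matters here, since a silent zero would look like a populated field.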

Some things that took real engineering effort:

- JS-rendered SPAs: headless browser with DOM stability detection. It polls until the element count stabilizes and skeleton loaders resolve, scrolls to trigger lazy loading, and auto-clicks "Show more" buttons.
- Multi-section pages: detects candidate data regions heuristically, lets you target a specific section with --query "Market Share", and scopes extraction via container selectors.
- Self-healing: monitors extraction health (% of fields populated) and re-derives the strategy if it drops below 70%.
- Iframes: auto-detects and extracts from iframes when they contain richer data than the outer page.
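
The self-healing check is the easiest of these to show in code. The sketch below assumes "health" means the share of non-empty field values across all extracted rows, compared against the 70% threshold. The function names and the row representation are hypothetical.

```go
package main

import "fmt"

// healthThreshold is the 70% cutoff mentioned above.
const healthThreshold = 0.70

// fillRate returns the fraction of expected field values that are non-empty.
// An empty result set counts as 0, i.e. completely unhealthy.
func fillRate(rows []map[string]string, fields []string) float64 {
	if len(rows) == 0 || len(fields) == 0 {
		return 0
	}
	filled := 0
	for _, row := range rows {
		for _, f := range fields {
			if row[f] != "" {
				filled++
			}
		}
	}
	return float64(filled) / float64(len(rows)*len(fields))
}

// needsRederive reports whether extraction health fell below the threshold.
func needsRederive(rows []map[string]string, fields []string) bool {
	return fillRate(rows, fields) < healthThreshold
}

func main() {
	fields := []string{"name", "price"}

	// After a redesign, the price selector stops matching on every row.
	rows := []map[string]string{
		{"name": "Desk Lamp", "price": ""},
		{"name": "Bookshelf", "price": ""},
	}

	fmt.Printf("fill rate: %.2f\n", fillRate(rows, fields))
	if needsRederive(rows, fields) {
		fmt.Println("health below threshold: re-derive strategy")
	}
}
```

Measuring across all rows rather than per row keeps one sparse item (say, a product with no rating) from triggering a re-derive on its own.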

Output is JSON, JSONL, CSV, or Parquet. Pipes cleanly:

    trawl "https://example.com/products" --fields "name, price" --format jsonl | jq 'select(.price > 50)'
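
That jq filter relies on price being a JSON number. jq orders strings after all numbers, so a string price would pass `select(.price > 50)` every time. Below is a small sketch of typed JSONL output using only the standard library. The `Row` type and helper names are made up for the example and say nothing about trawl's internals.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
	"strings"
)

// Row is a hypothetical typed record; Price is a float so it encodes as a JSON number.
type Row struct {
	Name  string  `json:"name"`
	Price float64 `json:"price"`
}

// writeJSONL writes one JSON object per line. json.Encoder appends the newline.
func writeJSONL(w io.Writer, rows []Row) error {
	enc := json.NewEncoder(w)
	for _, r := range rows {
		if err := enc.Encode(r); err != nil {
			return err
		}
	}
	return nil
}

// toJSONL is a convenience wrapper that returns the output as a string.
func toJSONL(rows []Row) (string, error) {
	var sb strings.Builder
	err := writeJSONL(&sb, rows)
	return sb.String(), err
}

func main() {
	rows := []Row{
		{Name: "Desk Lamp", Price: 24},
		{Name: "Bookshelf", Price: 89.5},
	}
	if err := writeJSONL(os.Stdout, rows); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```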

Written in Go. MIT licensed.

https://github.com/akdavidsson/trawl
