Trawl fixes this by splitting the problem. You describe what you want:
trawl "https://books.toscrape.com" --fields "title, price, rating, in_stock"
The LLM (Claude) looks at one sample item and derives a full extraction strategy: CSS selectors, attribute mappings, type coercion, fallback selectors. That strategy gets cached. Every subsequent page with the same structure is extracted with pure Go + goquery. No API calls, no token cost, full concurrency.

The key insight: LLMs are good at understanding HTML structure, but you don't need them to extract 10,000 rows. Use AI for intelligence, Go for throughput.
When a site redesigns, the structural fingerprint changes, the cache misses, and trawl re-derives automatically.
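To make the caching idea concrete, here is a rough sketch of a fingerprint-keyed strategy cache. It is not trawl's actual code: the `fingerprint` function (a hash of the opening-tag sequence, found with a regexp), the `Strategy` struct, and the `Cache` type are all hypothetical simplifications. The point is only that pages differing in text share a key, while a changed layout produces a new key and triggers a fresh derivation.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"regexp"
	"strings"
)

// openTag captures the name of each opening tag. Closing tags and comments
// do not match. This is a simplification, not a real HTML parser.
var openTag = regexp.MustCompile(`<([a-zA-Z][a-zA-Z0-9]*)`)

// fingerprint hashes the sequence of opening tag names, ignoring text and
// attributes, so pages with the same layout but different data share a key.
func fingerprint(html string) string {
	var names []string
	for _, m := range openTag.FindAllStringSubmatch(html, -1) {
		names = append(names, strings.ToLower(m[1]))
	}
	sum := sha256.Sum256([]byte(strings.Join(names, ">")))
	return hex.EncodeToString(sum[:])
}

// Strategy is a hypothetical stand-in for a derived extraction plan.
type Strategy struct {
	ItemSelector string
	Fields       map[string]string // field name -> CSS selector
}

// Cache maps structural fingerprints to strategies and counts derivations.
type Cache struct {
	entries     map[string]Strategy
	Derivations int
}

func NewCache() *Cache { return &Cache{entries: map[string]Strategy{}} }

// StrategyFor returns the cached strategy for the page's structure, calling
// derive (the expensive LLM step in this sketch) only on a cache miss.
func (c *Cache) StrategyFor(html string, derive func(string) Strategy) Strategy {
	key := fingerprint(html)
	if s, ok := c.entries[key]; ok {
		return s
	}
	c.Derivations++
	s := derive(html)
	c.entries[key] = s
	return s
}

func main() {
	derive := func(string) Strategy {
		return Strategy{
			ItemSelector: "div.product-card",
			Fields:       map[string]string{"name": "h2.product-title"},
		}
	}
	c := NewCache()
	c.StrategyFor(`<div><h2>Book A</h2><span>10</span></div>`, derive)
	c.StrategyFor(`<div><h2>Book B</h2><span>12</span></div>`, derive) // same structure: cache hit
	c.StrategyFor(`<article><h3>Book A</h3></article>`, derive)        // new layout: re-derive
	fmt.Println("derivations:", c.Derivations)
}
```

A real fingerprint would presumably look at more than tag names (classes, nesting depth), but the hit/miss behavior is the same.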
You can preview exactly what it figured out:
$ trawl "https://example.com/products" --fields "name, price" --plan
Strategy for https://example.com/products
Item selector: div.product-card
Fields:
name: h2.product-title -> text (string)
price: span.price -> text -> parse_price (float)
Confidence: 0.95
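For a sense of what a coercion step like `text -> parse_price (float)` might do, here is a minimal sketch. The `parsePrice` function below is an illustration, not trawl's implementation, and it assumes prices use a dot as the decimal separator and commas only for thousands grouping.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePrice keeps only digits and the decimal point, then parses the rest
// as a float64. Assumption: "." is the decimal separator, so currency
// symbols, spaces, and thousands commas are simply dropped.
func parsePrice(text string) (float64, error) {
	var b strings.Builder
	for _, r := range text {
		if (r >= '0' && r <= '9') || r == '.' {
			b.WriteRune(r)
		}
	}
	return strconv.ParseFloat(b.String(), 64)
}

func main() {
	for _, raw := range []string{"£51.77", "$1,299.00", "free"} {
		price, err := parsePrice(raw)
		if err != nil {
			// No digits left to parse: report the field as unpopulated.
			fmt.Printf("%q: could not parse\n", raw)
			continue
		}
		fmt.Printf("%q: %.2f\n", raw, price)
	}
}
```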
Some things that took real engineering effort:
- JS-rendered SPAs: headless browser with DOM stability detection. It polls until the element count stabilizes and skeleton loaders resolve, scrolls to trigger lazy loading, and auto-clicks "Show more" buttons
- Multi-section pages: detects candidate data regions heuristically, lets you target a specific section with --query "Market Share", and scopes extraction via container selectors
- Self-healing: monitors extraction health (% of fields populated) and re-derives the strategy if it drops below 70%
- Iframes: auto-detects and extracts from iframes when they contain richer data than the outer page
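The self-healing check is simple enough to sketch. The version below is an assumption about how such a check could look, not trawl's code: `Row`, `health`, and `needsRederive` are hypothetical names, a field counts as populated when its string value is non-empty, and an empty result set is treated as unhealthy.

```go
package main

import "fmt"

// Row is a hypothetical extracted record: field name -> raw value.
type Row map[string]string

// healthThreshold mirrors the 70% figure from the list above.
const healthThreshold = 0.70

// health returns the fraction of requested fields that are non-empty
// across all rows. Zero rows yields 0 (an assumption: nothing extracted
// is treated as unhealthy).
func health(rows []Row, fields []string) float64 {
	total := len(rows) * len(fields)
	if total == 0 {
		return 0
	}
	populated := 0
	for _, row := range rows {
		for _, f := range fields {
			if row[f] != "" {
				populated++
			}
		}
	}
	return float64(populated) / float64(total)
}

// needsRederive reports whether extraction health fell below the threshold.
func needsRederive(rows []Row, fields []string) bool {
	return health(rows, fields) < healthThreshold
}

func main() {
	fields := []string{"name", "price"}
	rows := []Row{
		{"name": "Widget", "price": "9.99"},
		{"name": "Gadget"}, // price selector stopped matching
		{"name": "Doohickey"},
	}
	fmt.Printf("health=%.2f rederive=%v\n", health(rows, fields), needsRederive(rows, fields))
}
```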
Output is JSON, JSONL, CSV, or Parquet. Pipes cleanly:
trawl "https://example.com/products" --fields "name, price" --format jsonl | jq 'select(.price > 50)'
Written in Go. MIT licensed.
https://github.com/akdavidsson/trawl