Show HN: Extracting structured data from the web with LLMs

4 points

2 years ago

Hey HN! Caleb, Nick, Garrett, and I from Mendable (YC S22) are excited to launch Intelligent Extraction for FireCrawl, the developer platform for scraping, search, and extraction.

After a successful twitter launch last week, FireCrawl skyrocketed to over 2k stars and we have been getting a ton of feature requests [1]. One that stood out to us in particular was using the data we scrape to extract different types of structured metadata. Think querying “Is this company open source?” to a list of URLs and getting structured JSON back.

Here’s a taste of what the request format / response looks like what scraping and extracting data from Firecrawl.dev

Request format: { "company_mission": { "type": "string" }, "supports_sso": { "type": "boolean" }, "is_open_source": { "type": "boolean" } }, "required": [ "company_mission", "supports_sso", "is_open_source" ] }

Response format: { "company_mission":"transform any website into clean, LLM-ready markdown", "Supports_sso":false, "Is_open_source":true }

The technical implementation for Intelligent Extraction involved: 1. Use Firecrawl to gather content as markdown 2. Use gpt-4 function calling to boil content down into a structured format. Inspiration was drawn from Simon Willison [2] and Mish Ushakov of llm-scraper [3]:

This is just the beginning as we launched FireCrawl about a week ago so we expect a great deal of work will be required to make this as reliable and extendable as we envision! Any feedback would be highly appreciated [4].

[1] https://github.com/mendableai/firecrawl [2] https://til.simonwillison.net/gpt3/openai-python-functions-d... [3] https://github.com/mishushakov/llm-scraper/ [4] https://console.algora.io/org/mendableai