We're leveraging LLMs (GPT and fine-tuned Flan-T5) to semantically understand websites and generate DOM selectors for them. Using GPT for every data extraction, as most comparable tools do, would be far too expensive and slow, but using LLMs to generate the scraper code and then adapt it when a website changes is highly efficient. Extracting and transforming data is one of the underestimated superpowers of LLMs.
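To make the cost argument concrete, here is a rough sketch of the "generate once, reuse, regenerate on change" idea. It is not our actual implementation: `CachedScraper`, `fake_llm`, and the `(tag, class)` selector format are hypothetical stand-ins, and the model call is stubbed out so the example runs with only the standard library.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List, Tuple

# Assumption: a (tag, class name) pair stands in for a real CSS/DOM selector.
Selector = Tuple[str, str]


class ClassTextExtractor(HTMLParser):
    """Collects the text of every element matching a (tag, class) selector."""

    def __init__(self, selector: Selector) -> None:
        super().__init__()
        self.tag, self.cls = selector
        self.depth = 0  # > 0 while we are inside a matching element
        self.results: List[str] = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth == 0 and tag == self.tag and self.cls in classes:
            self.depth = 1
            self.results.append("")
        elif self.depth > 0 and tag == self.tag:
            self.depth += 1  # nested element with the same tag name

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.results[-1] += data


def extract(html: str, selector: Selector) -> List[str]:
    parser = ClassTextExtractor(selector)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


class CachedScraper:
    """Calls the (expensive) selector generator only when it has to."""

    def __init__(self, generate_selector: Callable[[str], Selector]) -> None:
        self.generate_selector = generate_selector  # hypothetical LLM step
        self.cache: Dict[str, Selector] = {}
        self.llm_calls = 0

    def scrape(self, site: str, html: str) -> List[str]:
        selector = self.cache.get(site)
        rows = extract(html, selector) if selector else []
        if not rows:
            # No cached selector yet, or the site changed and it stopped matching.
            self.llm_calls += 1
            selector = self.generate_selector(html)
            self.cache[site] = selector
            rows = extract(html, selector)
        return rows


def fake_llm(html: str) -> Selector:
    # Stub for the model call; a real one would read the HTML and propose a selector.
    return ("span", "price") if 'class="price"' in html else ("div", "cost")


scraper = CachedScraper(fake_llm)
old_page = '<ul><li><span class="price">10</span></li><li><span class="price">12</span></li></ul>'
new_page = '<section><div class="cost">15</div></section>'  # simulated redesign

first = scraper.scrape("shop", old_page)   # generates a selector
second = scraper.scrape("shop", old_page)  # reuses the cached selector
third = scraper.scrape("shop", new_page)   # cached selector fails, regenerate
```

In this toy run the stubbed generator is called twice across three scrapes: once for the initial page and once after the simulated redesign. Treating "no rows matched" as the signal to regenerate is a simplification; a real change detector would need to be more careful.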
Try it out on our playground https://kadoa.com/playground (OpenAI key required) and let me know what you think!
Here is a rather impressive example of extracting gaming stats: https://www.kadoa.com/playground?session=e1226fa8-9fa2-4eba-...
There is still a lot of work ahead of us. Extracting a few data records from a single page with GPT is quite easy. Reliably extracting 100k records from 10 different websites on a daily basis is a whole different beast:
* Ensuring data accuracy (verifying that the data is on the website, adapting to website changes, etc.)
* Handling large data volumes
* Managing proxy infrastructure
* Elements of RPA to automate scraping tasks like pagination, login, and form-filling
We are spending a lot of effort solving each of these points with custom engineering and fine-tuned LLM steps.
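As one illustration of the accuracy point, a simple check is to confirm that every extracted value literally appears in the page's visible text, which catches values a model made up. This is a minimal sketch under that assumption, not our production check; `visible_text` and `unverified_fields` are hypothetical helper names, and a plain substring match is deliberately naive (it would miss reformatted numbers or dates).

```python
from html.parser import HTMLParser
from typing import Dict, List


class TextCollector(HTMLParser):
    """Gathers text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []
        self._skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so formatting differences do not matter.
    return " ".join(text.split()).lower()


def visible_text(html: str) -> str:
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return normalize(" ".join(collector.chunks))


def unverified_fields(record: Dict[str, str], html: str) -> List[str]:
    """Returns the field names whose values cannot be found on the page."""
    page = visible_text(html)
    return [name for name, value in record.items() if normalize(str(value)) not in page]


page = '<h1>Elden Ring</h1><p>Hours played: 120</p><script>var x = "999";</script>'
record = {"title": "Elden Ring", "hours": "120", "score": "999"}
suspicious = unverified_fields(record, page)
```

Here `suspicious` flags only the `score` field, because `999` exists solely inside a script tag and never in the visible text. Flagged records could then be dropped or re-extracted.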