Ask HN: A bit of direction from people who might know. Crawling/screen scraping

1 point

13 years ago

I have been working on a tool to 'train' a crawler to extract specific elements of a page, and I could do with some advice on where to take it from here. Here's how it currently works:

1) It has a queue of domains that I have pre-processed. For now I've restricted it to pages that I think are ecommerce, based on $ signs, add to cart/basket type links, etc.

2) There is a visual tool that I then use to select certain parts of the page, e.g. price, product, image, etc. I save these out as XPaths.

3) Once I have done one URL, I send a crawler to that domain to find other pages that fit the profile of an ecommerce page, and try to use the same mapping from step 2 to extract the data.
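To make the three steps concrete, here is a rough sketch of the idea rather than the actual tool. The regexes, the XPaths, the sample page, and the function names are all made up for illustration. It uses the standard library's ElementTree, which only handles well-formed markup and a subset of XPath, so real pages would need a proper HTML parser instead.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical signals for step 1: a currency amount plus an add-to-cart/basket phrase.
PRICE_RE = re.compile(r"[$£€]\s?\d+(?:[.,]\d{2})?")
CART_RE = re.compile(r"add\s+to\s+(?:cart|basket|bag)", re.IGNORECASE)


def looks_like_ecommerce(page_text):
    """Return True when the page shows both a price and an add-to-cart phrase."""
    return bool(PRICE_RE.search(page_text)) and bool(CART_RE.search(page_text))


def extract_fields(page_text, mapping):
    """Apply a saved {field: xpath} mapping; unmatched fields come back as None."""
    # Assumption: the page is well-formed XML. Real HTML usually is not.
    root = ET.fromstring(page_text)
    result = {}
    for field, xpath in mapping.items():
        node = root.find(xpath)
        if node is None:
            result[field] = None
        else:
            # Images carry their value in the src attribute, other nodes in their text.
            result[field] = node.get("src") or "".join(node.itertext()).strip()
    return result


# Mapping as it might be saved from one hand-labelled page (step 2); paths are made up.
MAPPING = {
    "product": ".//h1[@class='title']",
    "price": ".//span[@class='price']",
    "image": ".//img[@class='main']",
}

# Stand-in for another page on the same domain found by the crawler (step 3).
PAGE = """<html><body>
<h1 class="title">Blue Widget</h1>
<span class="price">$19.99</span>
<img class="main" src="/img/widget.jpg" />
<a href="/cart/add/42">Add to cart</a>
</body></html>"""

if looks_like_ecommerce(PAGE):
    print(extract_fields(PAGE, MAPPING))
```

Returning None for an unmatched field, instead of raising, means a page that no longer fits the mapping shows up as missing data rather than stopping the crawl.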

I have made a short video to show it in action:

http://www.screencast.com/t/riB3iiVMiSk

I'm not sure if I'm doing this the right way. If a site or page changes structure, I may have to re-map the data. I was hoping someone would have pointers on other ways to do this. I've also had some problems with JavaScript-heavy sites.
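One thing that might at least tell me when a re-map is needed is tracking how often the saved mapping still fills every field. A minimal sketch, assuming each crawled page yields a dict of field to value (None when the XPath matched nothing); the function names and the 0.8 threshold are arbitrary choices for illustration, not something the tool does today.

```python
def mapping_health(records, required_fields):
    """Return the fraction of records in which every required field was extracted."""
    if not records:
        return 0.0
    complete = sum(
        1 for record in records
        if all(record.get(field) for field in required_fields)
    )
    return complete / len(records)


def needs_remap(records, required_fields, threshold=0.8):
    """Flag a domain for re-mapping when too few pages extract cleanly."""
    return mapping_health(records, required_fields) < threshold


# Made-up results from the most recent crawl of one domain.
recent = [
    {"product": "Blue Widget", "price": "$19.99"},
    {"product": "Red Widget", "price": None},
    {"product": None, "price": None},
    {"product": "Green Widget", "price": "$4.50"},
]

print(mapping_health(recent, ["product", "price"]))  # 0.5
print(needs_remap(recent, ["product", "price"]))     # True
```

This does not fix a broken mapping, it only points out which domains need another pass with the visual tool.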

If anyone knows of ways to do screen scraping more automatically, I'd really appreciate a steer!

Thanks

Ade