Ask HN: How to approach writing a web crawler that can handle JavaScript?

4 points

10 years ago

Greetings,

I am interested in writing a web crawler that can handle JavaScript, i.e. that can access the DOM after any JavaScript has run.

I recognize that this could get arbitrarily complicated; however, I wanted to know whether anyone had any obvious pointers. There do seem to be some nice crawlers in Java (my preferred language), e.g. https://github.com/yasserg/crawler4j. However, they do not execute JS, so they only see the HTML as served.

Is the standard approach to handling JS to use something like Selenium, i.e. to load each page in a real browser, let its scripts run, and then pull the resulting DOM into the crawler for processing?
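To make the question concrete, here is a rough sketch of the shape I have in mind, not a tested design. The crawl loop is plain Java and only asks a renderer for the post-JavaScript HTML of a URL. The class name RenderedCrawler, the renderer function, and the regex-based link extraction are all my own placeholders; a real crawler would use an HTML parser and respect robots.txt. With Selenium, I assume the renderer would call WebDriver.get(url) followed by getPageSource(), though whether the returned source reflects script changes may depend on the driver.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RenderedCrawler {
    // Hypothetical seam: maps a URL to its HTML *after* JavaScript has run.
    // A Selenium-backed version might look like (assumption, not verified here):
    //   url -> { driver.get(url); return driver.getPageSource(); }
    private final Function<String, String> renderer;

    // Crude href matcher for illustration only; skips pure fragment links.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

    RenderedCrawler(Function<String, String> renderer) {
        this.renderer = renderer;
    }

    // Resolve every href in the rendered HTML against the page URL,
    // keeping only http(s) links.
    static List<String> extractLinks(String baseUrl, String html) {
        List<String> links = new ArrayList<>();
        URI base = URI.create(baseUrl);
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                URI resolved = base.resolve(m.group(1).trim());
                String scheme = resolved.getScheme();
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Malformed href: skip it rather than abort the crawl.
            }
        }
        return links;
    }

    // Breadth-first crawl from the seed, visiting at most maxPages URLs.
    Set<String> crawl(String seed, int maxPages) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already seen
            }
            String html = renderer.apply(url);
            if (html == null) {
                continue; // renderer could not load the page
            }
            for (String link : extractLinks(url, html)) {
                if (!visited.contains(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }
}
```

The idea behind passing the renderer in as a function is that the crawl logic could be tried out against canned HTML first, and the browser (Selenium or anything else) swapped in later without touching the loop.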

Thanks.

3 comments