I am interested in writing a web crawler that can handle JavaScript, i.e. that can access the DOM after any JavaScript has run.
I recognize that this could get arbitrarily complicated; however, I would like to know whether anyone has any obvious pointers. There do seem to be some nice crawlers written in Java (my preferred language), e.g. https://github.com/yasserg/crawler4j. However, as far as I can tell, they do not handle JS.
Is the standard approach to handling JS to use something like Selenium, i.e. to load the page in a browser and then pull the rendered DOM into the crawler for processing?
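To make the idea concrete, here is a rough sketch of the structure I have in mind. `PageRenderer` is a hypothetical interface I made up for illustration; I assume a Selenium-backed implementation would call `driver.get(url)` and return `driver.getPageSource()`, but I have not tried that yet, so the sketch uses a stub with canned HTML in place of a real browser. The URLs are placeholders too.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JsCrawlerSketch {

    /** Hypothetical abstraction: returns a page's HTML after its scripts have run. */
    interface PageRenderer {
        String renderedHtml(String url);
    }

    // Naive href extraction for illustration only; a real crawler would use
    // an HTML parser and resolve relative links against the page URL.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /** Breadth-first crawl from seed, visiting at most maxPages pages. */
    static Set<String> crawl(String seed, PageRenderer renderer, int maxPages) {
        Set<String> visited = new LinkedHashSet<>(); // keeps visit order
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already processed this URL
            }
            // This is the step where a browser would load the page and run its JS.
            String html = renderer.renderedHtml(url);
            for (String link : extractLinks(html)) {
                if (!visited.contains(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Stub standing in for a browser: canned "post-JavaScript" HTML per URL.
        Map<String, String> fakeSite = Map.of(
            "http://example.test/", "<a href=\"http://example.test/a\">a</a>",
            "http://example.test/a", "<a href=\"http://example.test/\">home</a>");
        PageRenderer stub = url -> fakeSite.getOrDefault(url, "");
        System.out.println(crawl("http://example.test/", stub, 10));
    }
}
```

The point of the interface is that the crawl loop would not care whether the HTML comes from a plain HTTP fetch or from a real browser, so the browser part could be swapped in later.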
Thanks.