Ask HN: Is there a great open source crawler?

7 points

haihai

16 years ago

I want to crawl a site, like foxnews.com and find all the URLs that match a pattern.

A pattern like:

http://www.foxnews.com/\w+/\d+/\d+/\d+/.*/

That would find all URLs like:

http://www.foxnews.com/world/2010/06/02/report-natalee-holloway-suspect-sought-murder-peru/
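
As a quick sanity check of the pattern itself, here is a minimal Python sketch. The name `is_article_url` is made up for illustration, and escaping the dots in the host name is my own small tightening (an unescaped `.` matches any character), not something the pattern as written requires.

```python
import re

# Pattern from the post. Assumption: the dots in the host name are escaped
# so they only match a literal "." rather than any character.
ARTICLE_URL = re.compile(r"http://www\.foxnews\.com/\w+/\d+/\d+/\d+/.*/")


def is_article_url(url):
    """Return True if the whole URL matches the article pattern (hypothetical helper)."""
    return ARTICLE_URL.fullmatch(url) is not None


sample = (
    "http://www.foxnews.com/world/2010/06/02/"
    "report-natalee-holloway-suspect-sought-murder-peru/"
)
print(is_article_url(sample))
print(is_article_url("http://www.foxnews.com/world/"))
```

Using `fullmatch` rather than `search` means the pattern has to cover the entire URL, so a section page like /world/ is rejected while the dated article URL is accepted.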

I know I could do this myself. I also know it's a seemingly easy problem that is actually quite hairy.

I'm hoping there's an open source crawler that I can point at a start page and say "Find all the URLs that match this pattern."
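
To make the request concrete, the behavior I have in mind could be sketched roughly like this, using only the Python standard library. This is a rough illustration, not a recommendation of how a real crawler should be built: `LinkParser`, `fetch`, `crawl`, and `max_pages` are all hypothetical names, and it deliberately ignores the hairy parts (robots.txt, rate limiting, retries, redirects, JavaScript-generated links, URL normalization beyond dropping fragments).

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch(url):
    """Hypothetical default fetcher: return the page text, or "" on failure."""
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return ""


def crawl(start_url, pattern, fetch=fetch, max_pages=100):
    """Breadth-first crawl of start_url's host; return sorted URLs matching pattern."""
    host = urlparse(start_url).netloc
    regex = re.compile(pattern)
    seen = {start_url}
    queue = deque([start_url])
    matches = set()
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        pages += 1
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.hrefs:
            # Resolve relative links and drop "#fragment" parts.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc != host:
                continue  # stay on the starting site
            if regex.fullmatch(link):
                matches.add(link)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(matches)
```

Usage would be something like `crawl("http://www.foxnews.com/", r"http://www\.foxnews\.com/\w+/\d+/\d+/\d+/.*/")`. Passing `fetch` as a parameter is just a convenience so the traversal can be tried against canned pages without touching the network.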

I know there are dozens of crawlers out there. I just don't know if there's one or two that are really great. I'm really hoping there's a modern/simple/fast one that would be good for this purpose.

Does such a thing exist? If not, are there any great documents detailing the common problems and how to solve them?

Thank you.

7 comments