A pattern like:
http://www.foxnews.com/\w+/\d+/\d+/\d+/.*/
That would find all URLs like:
http://www.foxnews.com/world/2010/06/02/report-natalee-holloway-suspect-sought-murder-peru/
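To make the pattern concrete, here is a minimal Python sketch that checks it against the example URL. It assumes the dots in the hostname are meant literally, so they are escaped here. The names `ARTICLE_URL` and `is_article` are only for illustration.

```python
import re

# Same pattern as above, with the hostname dots escaped
# (an assumption about the intent; unescaped dots match any character).
ARTICLE_URL = re.compile(r"http://www\.foxnews\.com/\w+/\d+/\d+/\d+/.*/")


def is_article(url: str) -> bool:
    """Return True if the whole URL matches the article pattern."""
    return ARTICLE_URL.fullmatch(url) is not None


if __name__ == "__main__":
    print(is_article(
        "http://www.foxnews.com/world/2010/06/02/"
        "report-natalee-holloway-suspect-sought-murder-peru/"
    ))
    print(is_article("http://www.foxnews.com/world/"))
```

Using `fullmatch` rather than `search` means a section page such as `http://www.foxnews.com/world/` is rejected, because it lacks the three numeric date segments.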
I know I could do this myself. I also know it's a seemingly easy problem that is actually quite hairy.
I'm hoping there's an open source crawler that I can point to a start page and say "Find all the URLs that match this pattern."
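For reference, the single-page part of what I'm after could be sketched roughly like this, using only the Python standard library. This is not a crawler: it takes HTML that has already been downloaded, and it skips the hairy parts (fetching, politeness, deduplication across pages, following links). `LinkCollector` and `matching_links` are hypothetical names of my own.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect <a href> targets on one page that fully match a URL pattern."""

    def __init__(self, base_url, pattern):
        super().__init__()
        self.base_url = base_url
        self.pattern = re.compile(pattern)
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                url = urljoin(self.base_url, value)
                if self.pattern.fullmatch(url) and url not in self.matches:
                    self.matches.append(url)


def matching_links(html, base_url, pattern):
    """Return matching absolute URLs found in `html`, in page order."""
    collector = LinkCollector(base_url, pattern)
    collector.feed(html)
    return collector.matches
```

A real crawler would call something like this on every fetched page and queue the links it has not seen yet, which is exactly the part I would rather not reinvent.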
I know there are dozens of crawlers out there. I just don't know if there's one or two that are really great. I'm really hoping there's a modern/simple/fast one that would be good for this purpose.
Does such a thing exist? If not, are there any great documents detailing the common problems and how to solve them?
Thank you.