What if robots.txt had a way to specify which query parameters in a URL actually identify the content?
e.g.
UniqueContentURLParameters: "articleID","node"
would mean that dupe checkers know to ignore any parameter that isn't in the list.
http://news.ycombinator.com/robots.txt:
UniqueContentURLParameters: "id"
http://nytimes.com/robots.txt:
UniqueContentURLParameters: none
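To make that concrete, here is a rough sketch of how a dupe checker might use such a line. The directive is only my proposal (no crawler supports it), and the function names and sample URLs below are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def parse_directive(line):
    """Parse a hypothetical UniqueContentURLParameters line into a set of names.

    Returns None if the line is some other directive.
    """
    name, _, value = line.partition(":")
    if name.strip() != "UniqueContentURLParameters":
        return None
    value = value.strip()
    if value.lower() == "none":
        return set()  # no query parameter identifies content
    return {p.strip().strip('"') for p in value.split(",") if p.strip()}

def canonicalize(url, unique_params):
    """Drop every query parameter that is not listed in unique_params."""
    parts = urlsplit(url)
    kept = [(k, v)
            for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k in unique_params]
    kept.sort()  # so parameter order doesn't create false "different" URLs
    # The fragment is dropped too, since it never reaches the server.
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                       urlencode(kept), ""))

# Sample URLs are invented; only the "id" parameter comes from the example above.
hn = parse_directive('UniqueContentURLParameters: "id"')
print(canonicalize("http://news.ycombinator.com/item?id=123&utm_source=rss", hn))
nyt = parse_directive("UniqueContentURLParameters: none")
print(canonicalize("http://nytimes.com/a/b.html?partner=rss&emc=rss", nyt))
```

Two submissions would then count as dupes when their canonicalized URLs are equal.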
Of course, this doesn't solve the problem of bad URL design, such as the BBC news site, where the same article shows up at both http://news.bbc.co.uk/2/hi/asia-pacific/7391008.stm
and http://newsvote.bbc.co.uk/mpapps/pagetools/print/news.bbc.co.uk/2/hi/asia-pacific/7391008.stm
I guess you could specify a list of regexes?
As an alternative to robots.txt, I could see a shared database where linksharing/social news sites accumulate this information.
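For the regex idea, I'm picturing something like the sketch below: a list of (pattern, replacement) rules, wherever they end up being stored. The single rule here is one I wrote by hand from the two BBC URLs above, so treat it as an assumption and not something the BBC publishes.

```python
import re

# Hypothetical rewrite rules: (compiled pattern, replacement).
# This one maps the BBC print-view URL back to the normal article URL.
REWRITE_RULES = [
    (re.compile(r"^http://newsvote\.bbc\.co\.uk/mpapps/pagetools/print/"
                r"(news\.bbc\.co\.uk/.+)$"),
     r"http://\1"),
]

def apply_rules(url, rules=REWRITE_RULES):
    """Return the URL rewritten by the first matching rule, or unchanged."""
    for pattern, replacement in rules:
        if pattern.match(url):
            return pattern.sub(replacement, url)
    return url

print(apply_rules("http://newsvote.bbc.co.uk/mpapps/pagetools/print/"
                  "news.bbc.co.uk/2/hi/asia-pacific/7391008.stm"))
```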
How would you solve this problem?