How to prevent duplicate submissions: a standard way to specify URL equivalence?

2 points

18 years ago

NYTimes URLs got me again this morning: I didn't mean to submit a dupe, and HN's dupe catcher didn't catch it.

What if robots.txt had a way to specify which query parameters in a URL actually identify the content?

e.g.

  UniqueContentURLParameters: "articleID","node"

would tell dupe checkers to ignore any query parameter that isn't in the list.

  http://news.ycombinator.com/robots.txt:
  UniqueContentURLParameters: "id"

  http://nytimes.com/robots.txt:
  UniqueContentURLParameters: none
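
Here is a rough sketch of how a dupe checker might apply such a list. It is only an illustration: the `UNIQUE_PARAMS` table stands in for rules that would be parsed from each site's robots.txt (the directive itself is just the proposal above, not an existing standard), and the function name `canonical_url`, the stripping of a leading "www.", and dropping the fragment are all my own assumptions.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical per-host rules, as if read from each site's robots.txt.
# An empty set corresponds to "UniqueContentURLParameters: none".
UNIQUE_PARAMS = {
    "news.ycombinator.com": {"id"},
    "nytimes.com": set(),
}

def canonical_url(url):
    """Drop query parameters the host has not declared as content-identifying."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Assumption: "www.example.com" and "example.com" share one rule set.
    if host.startswith("www."):
        host = host[4:]
    keep = UNIQUE_PARAMS.get(host)
    if keep is None:
        # No rule known for this host: leave the URL untouched.
        return url
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in keep
    ]
    # Sort so that parameter order does not create false "new" URLs.
    query.sort()
    # Assumption: the fragment never identifies content, so it is dropped.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))
```

Two submissions would then count as dupes when their `canonical_url` results are equal. For instance, `http://news.ycombinator.com/item?id=363&ref=rss` (the `ref` parameter is made up for the example) reduces to `http://news.ycombinator.com/item?id=363`.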

Of course, this doesn't solve the problem of bad URL design, such as the BBC news site, where you have

  http://news.bbc.co.uk/2/hi/asia-pacific/7391008.stm

and

  http://newsvote.bbc.co.uk/mpapps/pagetools/print/news.bbc.co.uk/2/hi/asia-pacific/7391008.stm

I guess you could specify a list of regexes?
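
To make the regex idea concrete, here is a minimal sketch under the assumption that a site (or a third party) publishes a list of pattern/replacement pairs. The `REWRITE_RULES` list and the `normalize` function are hypothetical names, and the single rule is one I wrote by hand to cover the two BBC URLs above.

```python
import re

# Hypothetical rewrite rules: each pattern maps an alternate URL form
# back to the form that should be treated as canonical.
REWRITE_RULES = [
    (
        # BBC printer-friendly pages embed the original host and path.
        re.compile(
            r"^http://newsvote\.bbc\.co\.uk/mpapps/pagetools/print/"
            r"(news\.bbc\.co\.uk/.+)$"
        ),
        r"http://\1",
    ),
]

def normalize(url):
    """Apply the first matching rewrite rule; return the URL unchanged otherwise."""
    for pattern, replacement in REWRITE_RULES:
        new_url, count = pattern.subn(replacement, url)
        if count:
            return new_url
    return url
```

With this rule, the print URL above normalizes to `http://news.bbc.co.uk/2/hi/asia-pacific/7391008.stm`, so both forms compare equal. The obvious cost is that someone has to write and maintain a rule for every such quirk.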

As an alternative to robots.txt, I could see link-sharing/social news sites accumulating this information in a shared database.

How would you solve this problem?
