Ask HN: Building a related/unrelated classifier for ad hoc retrieval

2 points

9 years ago

I want to build a classifier that judges an HTML page as related/unrelated to a query.

I have been using tools like [Diffbot][1] to extract the text from the HTML, which turns this into an NLP problem.

To build a model, I have obtained the [ClueWeb12 dataset][2], and I'm planning to train on the relevance judgments from the TREC Web Track [2014][3] and [2013][4]. These judgments range from -2 to 4; I threshold them beforehand so that each query/page pair carries only a related/unrelated label.
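
A minimal sketch of that thresholding step, assuming the judgments come as whitespace-separated qrels lines of the form `topic iteration doc_id grade` and that any grade of 1 or higher counts as related. The function name `binarize_qrels`, the `min_related_grade` cutoff, and the sample document IDs are all hypothetical; the right cutoff is a judgment call.

```python
from typing import Dict, Iterable, Tuple


def binarize_qrels(
    lines: Iterable[str], min_related_grade: int = 1
) -> Dict[Tuple[str, str], int]:
    """Map (topic, doc_id) to 1 (related) or 0 (unrelated).

    Assumes each line has four whitespace-separated fields:
    topic, iteration, doc_id, grade. Lines with any other
    number of fields are skipped.
    """
    labels: Dict[Tuple[str, str], int] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, doc_id, grade = parts
        # Assumption: grades at or above the cutoff are "related";
        # everything below it (including negative grades) is "unrelated".
        labels[(topic, doc_id)] = 1 if int(grade) >= min_related_grade else 0
    return labels


if __name__ == "__main__":
    # Made-up sample lines, not real judgments.
    sample = ["201 0 doc-A 2", "201 0 doc-B 0", "202 0 doc-C -2"]
    for key, label in binarize_qrels(sample).items():
        print(key, label)
```

Raising `min_related_grade` gives a stricter notion of "related" at the cost of fewer positive examples, so it is worth checking the class balance for each candidate cutoff.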

With that explained, I'd like to ask for opinions and suggestions:

1. Is there another dataset I can use?
2. How should I start analyzing my data? Any ideas on how to choose the best model for this classifier?

  [1]: http://www.diffbot.com/ "Diffbot"
  [2]: http://boston.lti.cs.cmu.edu/clueweb12/
  [3]: http://trec.nist.gov/data/web2014.html
  [4]: http://trec.nist.gov/data/web2013.html