I have been using tools like [Diffbot][1] to extract the text from the HTML, which turns this into an NLP problem.
To build a model, I have obtained the [ClueWeb12 dataset][2], and I plan to train on the relevance judgments from the [TREC Web Track 2014][3] and [2013][4]. These judgments range from -2 to 4, so I threshold them beforehand to get binary related/unrelated labels.
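To make the thresholding step concrete, here is a minimal sketch. It assumes the judgments come in the usual four-column qrels layout (topic, iteration, document ID, grade) and that a grade of 1 or higher counts as "related". The cut-off, the function names, and the sample document IDs are all illustrative assumptions, not something fixed by the track.

```python
from typing import Dict, Iterable, Tuple


def binarize_grade(grade: int, threshold: int = 1) -> int:
    """Map a graded judgment (-2..4) to 1 (related) or 0 (unrelated).

    The default threshold of 1 is an assumption; pick the cut-off that
    matches what "related" should mean for the classifier.
    """
    return 1 if grade >= threshold else 0


def parse_qrels(
    lines: Iterable[str], threshold: int = 1
) -> Dict[Tuple[str, str], int]:
    """Turn qrels-style lines into {(topic, doc_id): binary_label}.

    Assumes whitespace-separated columns: topic, iteration, doc_id, grade.
    """
    labels: Dict[Tuple[str, str], int] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, doc_id, grade = parts
        labels[(topic, doc_id)] = binarize_grade(int(grade), threshold)
    return labels


# Made-up sample lines (placeholder document IDs, not real judgments).
sample = [
    "201 0 docA 4",
    "201 0 docB 0",
    "202 0 docC -2",
    "202 0 docD 1",
]

for key, label in parse_qrels(sample).items():
    print(key, label)
```

Raising `threshold` (for example to 2) gives a stricter notion of "related" at the cost of fewer positive examples, so it is worth checking the class balance for each candidate cut-off.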
With that explained, I would like opinions and suggestions on the following:
1. Is there another dataset I can use?
2. How should I start analyzing my data? Any ideas on how to choose the best model for this classifier?
[1]: http://www.diffbot.com/ "Diffbot"
[2]: http://boston.lti.cs.cmu.edu/clueweb12/
[3]: http://trec.nist.gov/data/web2014.html
[4]: http://trec.nist.gov/data/web2013.html