For a fixed set of documents, tf-idf can be calculated by counting term frequencies across all of the documents. This is not possible for webpages, because the number of English webpages on the Internet is effectively unbounded and cannot be enumerated. To work around this, I am considering two approaches:
1. Scraping the number of results returned by Google for a term and taking that as a proxy
2. Using Wikipedia as a proxy for the whole Internet
The problem with the first approach is that it does not scale and it is against Google's TOS. The second approach is more tractable, but the Wikipedia dump (http://static.wikipedia.org/) is about 14 GB zipped (it includes images, which I don't need), which seems like a lot to process.
Does anyone know of an already-processed list of this kind, for example an English corpus with term frequencies? If not, I guess that rather than processing all of Wikipedia, a better approach would be to crawl a (random?) subset of Wikipedia pages and process those. Any suggestions or tips?
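
To make the subset idea concrete, here is a minimal sketch, assuming the sampled pages are already available as plain-text strings. The names `tokenize`, `document_frequencies`, `idf`, `tf_idf` and the three-line `sample` corpus are hypothetical and only for illustration; the tokenizer is deliberately naive, and the smoothed idf is just one common variant, chosen so that terms missing from the sample still get a finite weight.

```python
import math
import re
from collections import Counter

# Naive tokenizer (an assumption): lowercase runs of ASCII letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def document_frequencies(docs):
    """Count, for each term, how many documents in the sample contain it."""
    df = Counter()
    n_docs = 0
    for text in docs:
        n_docs += 1
        # set() so a term is counted once per document, not once per occurrence
        df.update(set(tokenize(text)))
    return df, n_docs


def idf(term, df, n_docs):
    """Smoothed idf: log((1 + N) / (1 + df)) + 1, finite even for unseen terms."""
    return math.log((1 + n_docs) / (1 + df.get(term, 0))) + 1.0


def tf_idf(page_text, df, n_docs):
    """Score the terms of one new page against the sampled background corpus."""
    tf = Counter(tokenize(page_text))
    total = sum(tf.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf(t, df, n_docs) for t, c in tf.items()}


# Hypothetical stand-in for a sample of Wikipedia pages.
sample = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "Stock markets fell sharply today",
]
df, n_docs = document_frequencies(sample)
scores = tf_idf("the cat and the stock", df, n_docs)
```

Only the `df` counter and the document count need to be kept after processing, so the sample can be streamed one page at a time and the raw text discarded. How large the sample has to be before the idf values stabilise is an open question that this sketch does not answer.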