Ask HN: Where can I find tf–idf for English?

3 points

17 years ago

I want to calculate weights for the terms appearing in webpages, and for that I need idf (inverse document frequency) values, which when multiplied by a term's frequency in the webpage give its tf–idf (term frequency–inverse document frequency) weight: a measure of how important the term is in characterizing that page.
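
For a small, closed set of documents the weighting can be sketched roughly as below. This is only an illustration: the function names, the toy corpus, and the smoothed idf formula log((1 + N) / (1 + df)) + 1 are my own assumptions (it is one common variant, not the only definition), and tokenization is just whitespace splitting.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() so a term counts once per document
    return df


def idf_table(df, n_docs):
    """Smoothed idf (assumed variant): log((1 + N) / (1 + df)) + 1."""
    return {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}


def tfidf(tokens, idf, default_idf):
    """Weight each term in one page by raw term count times idf."""
    tf = Counter(tokens)
    return {t: c * idf.get(t, default_idf) for t, c in tf.items()}


# Toy corpus standing in for the reference collection (hypothetical data).
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell".split(),
]
df = document_frequencies(corpus)
idf = idf_table(df, len(corpus))

# Terms never seen in the corpus get the idf of a term with df = 0.
unseen_idf = math.log(1 + len(corpus)) + 1
weights = tfidf("the cat and the market".split(), idf, unseen_idf)
```

Here "the" appears in every document, so it gets the lowest idf, while "market" appears in only one and is weighted more heavily.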

The problem is that for a fixed set of documents you can calculate idf by counting how many documents each term appears in, but this is not possible for webpages, since the number of English pages on the Internet is practically unbounded. To solve this problem, I am considering two approaches:

1. Scraping the number of results returned by Google for a term and taking that as a proxy
2. Using Wikipedia as a proxy for the whole Internet

The problem with the first approach is that it is not scalable and it is against Google's TOS. The second approach is more tractable, but the Wikipedia dump (http://static.wikipedia.org/) is about 14 GB compressed (it includes images, which I don't need), which seems huge.

Does anyone know of an already processed list of this kind? Any English corpus with term frequencies? If not, I guess that rather than processing all of Wikipedia, a better approach would be to crawl a (random?) subset of Wikipedia pages and process those. Any suggestions or tips?
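
One way the random-subset idea might look, assuming the pages can be iterated over one at a time (for example, while streaming through a dump) without knowing the total count in advance. This is a sketch using reservoir sampling; `reservoir_sample` and its parameters are names I made up for illustration, and `range(1000)` merely stands in for a stream of pages.

```python
import random


def reservoir_sample(items, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Hypothetical usage: page IDs stand in for actual page contents.
subset = reservoir_sample(range(1000), 10, seed=0)
```

The sample only needs memory for k pages at a time, so the document frequencies could then be counted over `subset` rather than over the full dump.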

5 comments