What is the industry best practice for indexing and searching through these documents?
Does anyone have any first-hand experience with commercial and/or open-source solutions that are performant on this type of problem?
Does anyone have experience with ranking algorithms other than tf-idf?
I'm sure this is a common problem at the enterprise level. For example, just today, the WSJ published an article mentioning that over "100,000 pages of documents [have been collected from] the companies and agencies involved in the" oil spill: http://online.wsj.com/article/SB10001424052748703339304575240210545113710.html?mod=WSJ_hpp_LEFTTopStories
What is a quick, easy, and reliable way to search through these single-serving corpora?