Ask HN: How to search through 1M scanned TIF images?

11 points

16 years ago

Here's my problem: I've got a corpus of over 1 million scanned text documents in TIFF format. No OCR has been performed.

What is the industry best practice for indexing and searching through these documents?

Does anyone have any first-hand experience with commercial and/or open-source solutions that are performant on this type of problem?

How about experience with ranking algos other than tf-idf?

I'm sure this is a common problem at the enterprise level. For example, just today, the WSJ published an article mentioning that over "100,000 pages of documents [have been collected from] the companies and agencies involved in the" oil spill: http://online.wsj.com/article/SB10001424052748703339304575240210545113710.html?mod=WSJ_hpp_LEFTTopStories

Is there a quick, easy, and reliable way to search through these single-serving corpora?
