Ask HN / Review my startup: DocuHarvest

34 points

16 years ago

DocuHarvest extracts data from documents (PDFs only for now) in a way that is accessible to nontechnical users and economical for small workflows or one-off jobs.

https://docuharvest.com

The original launch announcement on my blog provides more background:

"Getting valuable data out of documents should not require an I.T. staff, outside consultants, building or buying software, or an up-front investment of hundreds or thousands of dollars, regardless of how many documents and how much data is involved."

http://muckandbrass.com/web/x/CwBi

To answer what are probably the three most common questions:

1. Yes, we have an HTTP API coming, probably along with client libraries for some subset of {Java, Python, Ruby, PHP, C#/.NET, ...}.

2. More job types are incoming. As noted in the announcement and on the site's front page, imaging-related jobs are up next. Lots more in the works after that as well.

3. DocuHarvest is largely written in Clojure, and currently uses CouchDB as a backend. More info here: http://groups.google.com/group/clojure/browse_frm/thread/c1c11390caac3dc

SMB is my initial focus (intentionally wide for now). Good potential verticals include legal & public records, medical, finance & accounting. That's all up for grabs depending on how forthcoming job types are received and by whom.

Feedback and suggestions are most welcome: here, via the feedback boxes on the DocuHarvest site, via Twitter messages to @docuharvest or @cemerick, or by email to [email protected] or [email protected].

Thanks!

20 comments