It's always been a pain to make documents (PDFs, Word, Excel, etc.) displayable and searchable on the web, because doing so means extracting images and plain text from them. Docsplit is a command-line utility and Ruby API that makes it a little easier. It wraps the excellent PDFBox, GraphicsMagick, and JODConverter libraries so that you can do things like this:
docsplit images docs/*.pdf --size 700x,50x50 --format gif
docsplit text expenses.doc
docsplit title presentation.ppt
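
If you want to drive those same commands from a script, one option is a thin wrapper that assembles the argument list for you. The sketch below is only an illustration: `docsplit_command` is a hypothetical helper, not part of Docsplit, and it assumes nothing beyond the `docsplit` executable and the `--size` and `--format` flags shown above.

```ruby
require "shellwords"

# Hypothetical helper (not part of Docsplit): builds the argument list
# for a `docsplit` command-line invocation.
#   action  - the subcommand, e.g. :images, :text, or :title
#   files   - an array of document paths
#   options - flag names mapped to a value or an array of values;
#             arrays are joined with commas, as in `--size 700x,50x50`
def docsplit_command(action, files, options = {})
  flags = options.flat_map do |name, value|
    ["--#{name}", Array(value).join(",")]
  end
  ["docsplit", action.to_s, *files, *flags]
end

# Mirrors the first command above.
cmd = docsplit_command(:images, Dir.glob("docs/*.pdf"),
                       size: ["700x", "50x50"], format: "gif")

# Print the command for inspection rather than running it.
puts Shellwords.join(cmd)

# To actually run it (assumes `docsplit` is installed and on your PATH):
# system(*cmd)
```

Passing the arguments to `system` as separate strings, rather than as one interpolated string, bypasses the shell, so file names containing spaces need no extra quoting.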