@danso, if there are any delimiters in the output (as in the Tesseract case) and you are looking for automatic table extraction, check out http://github.com/ahirner/Tabularazr-os
It's been used with different kinds of financial docs such as municipal bonds. Implemented in pure Python, it has a web interface, a simple API, and does nifty type inference (dates, interest rates, dollar amounts...).
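To give a rough idea of what type inference on extracted cells can look like, here is a minimal sketch. It is not Tabularazr's actual code: the names `infer_type` and `infer_column_type`, the regexes, and the accepted date formats are all assumptions made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical patterns for illustration; a real tool would cover many more variants.
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d")
DOLLAR_RE = re.compile(r"^\$\s?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?$")
RATE_RE = re.compile(r"^\d+(?:\.\d+)?\s?%$")


def infer_type(cell):
    """Guess a coarse type label for a single extracted table cell."""
    text = cell.strip()
    if DOLLAR_RE.match(text):
        return "dollar_amount"
    if RATE_RE.match(text):
        return "interest_rate"
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "date"
        except ValueError:
            continue  # try the next date format
    return "text"


def infer_column_type(cells):
    """Label a column by majority vote over its cells, tolerating a few stray values."""
    votes = Counter(infer_type(c) for c in cells)
    return votes.most_common(1)[0][0]


print(infer_type("$1,250,000.00"))
print(infer_column_type(["4.25%", "3.10%", "n/a"]))
```

The majority vote is one simple way to keep a single odd cell (a footnote marker, an "n/a") from changing the label of an otherwise clean column.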
Very cool, thanks for sharing. I'm guessing it doesn't do OCR yet? FWIW, you may be interested in these similar projects, which are popular in the journalism community. They don't provide the same high-level interface or type inference, just PDF-to-delimited-text processing:
- http://tabula.technology/ (Java)
- https://github.com/jsvine/pdfplumber (pure Python as well)