What does HackerNews think of tabula?
Tabula is a tool for liberating data tables trapped inside PDF files
https://github.com/tabulapdf/tabula
Are the documents scans, or do they have real text on them? It’s worth trying to convert them to svg or html using “mutool convert” and then seeing what you can do with the results. If you’re dealing with the same type of document each time you’ll probably find the patterns in there are common enough that you can easily grab what you want.
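As a rough sketch of that first step, the snippet below wraps "mutool convert" from Python. The helper names, the `page-%d` output pattern, and the `document.html` file name are illustrative assumptions, not anything the commenter specified; it also assumes the `mutool` binary from MuPDF is on your PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_mutool_convert_cmd(pdf_path, out_dir, fmt="svg"):
    """Build the argument list for `mutool convert` (hypothetical helper).

    For SVG the output name contains %d so each page gets its own file;
    for other formats a single output file is assumed.
    """
    name = "page-%d.svg" if fmt == "svg" else f"document.{fmt}"
    out_path = Path(out_dir) / name
    return ["mutool", "convert", "-F", fmt, "-o", str(out_path), str(pdf_path)]


def convert_pdf(pdf_path, out_dir, fmt="svg"):
    """Run the conversion and return the files found in out_dir afterwards."""
    if shutil.which("mutool") is None:
        raise RuntimeError("mutool not found on PATH (it ships with MuPDF)")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_mutool_convert_cmd(pdf_path, out_dir, fmt), check=True)
    return sorted(Path(out_dir).glob(f"*.{fmt}"))
```

Calling `convert_pdf("invoice.pdf", "out")` (file names made up here) should leave one SVG per page in `out/`, which you can then inspect for the recurring patterns mentioned above.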
The project I had in mind was similar to this one but I can't remember the name currently: https://github.com/tabulapdf/tabula
However, if you're looking for an ML-based, invoice-specific project, it looks like the other comment to your reply might be more useful.
It's a flexible way of writing one-off applications; you can run them locally, remotely, or on someone else's machine in the cloud. One useful example is Tabula[1], a browser-based utility for extracting tabular data from PDFs. As it is often used by journalists and other organizations that don't want to leak the data they are analyzing all over the place, it is easy to run locally instead of uploading files to their website. You just point the browser to 'localhost:port' while the server is running.
Feel free to reach me at manuel at jazzido dot com
If you want to extract tables from a PDF, there's Tabula[1], but it isn't automated to run over the whole PDF: you have to make a manual rectangular selection around each table you want to extract.
> Caveat: Tabula only works on text-based PDFs, not scanned documents. If you can click-and-drag to select text in your table in a PDF viewer (even if the output is disorganized trash), then your PDF is text-based and Tabula should work.
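The click-and-drag test in that caveat can be approximated in code. The sketch below is only a heuristic: it assumes Poppler's `pdftotext` tool is installed (it is not mentioned in the comments above), and the function names and the 20-character threshold are arbitrary choices made for illustration. A scan that has been run through OCR will also pass, since it carries a text layer.

```python
import shutil
import subprocess


def looks_text_based(extracted_text, min_chars=20):
    """True if the text has at least min_chars non-whitespace characters."""
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return visible >= min_chars


def pdf_has_text_layer(pdf_path, min_chars=20):
    """Extract text with pdftotext and apply the heuristic above."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH (it ships with Poppler)")
    # "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return looks_text_based(result.stdout, min_chars)
```

If `pdf_has_text_layer` returns False, the file is most likely a pure image scan and would need OCR before Tabula can do anything with it.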