What does HackerNews think of tabula?

Tabula is a tool for liberating data tables trapped inside PDF files

Language: CSS

This seems to have stalled, but it has popped up a few times on HN in the past. Might still be worth a look.

https://github.com/tabulapdf/tabula

Are the documents scans, or do they have real text in them? It’s worth trying to convert them to SVG or HTML using “mutool convert” and then seeing what you can do with the results. If you’re dealing with the same type of document each time, you’ll probably find the patterns in there are consistent enough that you can easily grab what you want.
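A rough sketch of that workflow in Python, assuming `mutool` is installed and on the PATH and that the values you want follow a stable label-then-value pattern. The `Invoice No` and `Total` labels, the field names, and the helper functions are all made up for illustration; real `mutool convert` output will need its own patterns.

```python
import html
import re
import subprocess


def convert_to_html(pdf_path: str, html_path: str) -> None:
    """Write an HTML rendering of the PDF using `mutool convert`.

    Assumes the MuPDF command-line tools are installed and on the PATH.
    """
    subprocess.run(["mutool", "convert", "-o", html_path, pdf_path], check=True)


def strip_tags(markup: str) -> str:
    """Crudely drop tags and collapse whitespace; good enough for pattern hunting."""
    text = re.sub(r"<[^>]+>", " ", markup)
    return re.sub(r"\s+", " ", html.unescape(text)).strip()


# Hypothetical patterns for one recurring document layout.
FIELD_PATTERNS = {
    "invoice_no": re.compile(r"Invoice No[.:]?\s*(\S+)"),
    "total": re.compile(r"Total[.:]?\s*([\d.,]+)"),
}


def extract_fields(markup: str) -> dict:
    """Return the first match for each known field, or None if it is absent."""
    text = strip_tags(markup)
    found = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        found[name] = match.group(1) if match else None
    return found
```

You would call `convert_to_html("in.pdf", "out.html")`, read `out.html`, and pass its contents to `extract_fields`. If the PDF is a scan, the converted file will contain images rather than text and the patterns will find nothing.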

While trying to find a specific project I recalled, I encountered this list of projects which might be of interest: https://github.com/tstanislawek/awesome-document-understandi...

The project I had in mind was similar to this one but I can't remember the name currently: https://github.com/tabulapdf/tabula

However, if you're looking for an ML-based, invoice-specific project, it looks like the other comment to your reply might be more useful.

Some applications that don't need a complex GUI use web browsers as a frontend because they are cross-platform and they come with a bunch of 'free' UI elements like buttons, text boxes, sliders, etc. You can also style things pretty easily with CSS and JS, to a point.

It's a flexible way of writing one-off applications; you can run them locally, remotely, or on someone else's machine in the cloud. One useful example is Tabula[1], a browser-based utility for extracting tabular data from PDFs. It is often used by journalists and other organizations that don't want to leak the data they are analyzing all over the place, so it matters that it is easy to run locally instead of uploading files to a website. You just point the browser to 'localhost:port' while the server is running.

[1]: https://github.com/tabulapdf/tabula
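To illustrate the general "local server, browser as frontend" pattern described above, here is a minimal sketch using only the Python standard library. This is not how Tabula itself is built; the page content, `ToolHandler`, and `start_server` are hypothetical names for illustration. The point is that binding to the loopback address keeps the tool and its data on your own machine.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder UI; a real tool would serve its own HTML, CSS and JS.
PAGE = b"<!doctype html><title>Local tool</title><button>Run</button>"


class ToolHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the same page for every path to keep the sketch short.
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def start_server(port: int = 0) -> HTTPServer:
    """Start the server on loopback only; port 0 lets the OS pick a free port."""
    server = HTTPServer(("127.0.0.1", port), ToolHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

After `server = start_server(8080)`, opening `http://localhost:8080/` in a browser shows the page for as long as the process stays alive; `server.shutdown()` stops it.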

Hi, author and maintainer of Tabula (https://github.com/tabulapdf/tabula). We've been trying to contact you about the "Tabula Pro" version that you are offering.

Feel free to reach me at manuel at jazzido dot com

My favorite tool for extracting data from PDFs: Tabula https://github.com/tabulapdf/tabula

Sure. But the tool posted here doesn't do that. It merely extracts text, and the "analysis" is a couple of regexes that are tailor-made for that particular PDF. Awk can do that much and a lot more.

If you want to extract tables from a PDF, there's Tabula[1], but it doesn't run automatically over the whole PDF: you have to manually draw a rectangular selection around each table you want to extract.

1. https://github.com/tabulapdf/tabula

Interesting. I've used Tabula [0] in the past with great success. I wonder how this compares.

[0]: https://github.com/tabulapdf/tabula

Unfortunately, it doesn't. From https://github.com/tabulapdf/tabula:

> Caveat: Tabula only works on text-based PDFs, not scanned documents. If you can click-and-drag to select text in your table in a PDF viewer (even if the output is disorganized trash), then your PDF is text-based and Tabula should work.