What does HackerNews think of tabula?

So you want to modify the text of a PDF by hand (2020) | Sep 2023

This seems to have stalled, but it popped up a few times on HN in the past. Might still be worth a look.

Are the documents scans, or do they have real text on them? It’s worth trying to convert them to svg or html using “mutool convert” and then seeing what you can do with the results. If you’re dealing with the same type of document each time you’ll probably find the patterns in there are common enough that you can easily grab what you want.
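For anyone who wants to script that step, here is a minimal sketch that wraps "mutool convert" in Python. It assumes MuPDF's mutool is installed and on PATH; the helper names (build_mutool_convert_cmd, convert_pdf) and the page-%d output naming are made up for illustration, not part of any tool mentioned here.

```python
import shutil
import subprocess
from pathlib import Path


def build_mutool_convert_cmd(pdf_path, out_dir, fmt="svg"):
    """Build the argument list for `mutool convert`.

    The output name contains %d so that each page gets its own file.
    """
    out_pattern = Path(out_dir) / f"page-%d.{fmt}"
    return ["mutool", "convert", "-F", fmt, "-o", str(out_pattern), str(pdf_path)]


def convert_pdf(pdf_path, out_dir, fmt="svg"):
    """Run the conversion if mutool is available; return the files written."""
    if shutil.which("mutool") is None:
        raise RuntimeError("mutool not found on PATH (it ships with MuPDF)")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_mutool_convert_cmd(pdf_path, out_dir, fmt), check=True)
    return list(Path(out_dir).glob(f"page-*.{fmt}"))
```

Calling convert_pdf("invoice.pdf", "out") should leave one SVG per page in out/, which you can then grep or parse for the recurring patterns. Passing fmt="html" gives the HTML variant instead.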

Pdfsandwich | Nov 2021

While trying to find a specific project I recalled, I encountered this list of projects which might be of interest: https://github.com/tstanislawek/awesome-document-understandi...

The project I had in mind was similar to this one but I can't remember the name currently: https://github.com/tabulapdf/tabula

However, if you're looking for an ML-based, invoice-specific project, it looks like the other comment replying to you might be more useful.

A simple browser-based hexapod robot simulator built from first principles | Apr 2020

Some applications that don't need a complex GUI use web browsers as a frontend because they are cross-platform and they come with a bunch of 'free' UI elements like buttons, text boxes, sliders, etc. You can also style things pretty easily with CSS and JS, to a point.

It's a flexible way of writing one-off applications; you can run them locally, remotely, or on someone else's machine in the cloud. One useful example is Tabula[1], a browser-based utility for extracting tabular data from PDFs. As it is often used by journalists and other organizations that don't want to leak the data they are analyzing all over the place, it is easy to run locally instead of uploading files to their website. You just point the browser to 'localhost:port' while the server is running.

[1]: https://github.com/tabulapdf/tabula
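To make the "browser as frontend, server on localhost" idea concrete, here is a rough sketch using only Python's standard library. This is not how Tabula itself is implemented; the page contents, LocalUIHandler, start_local_server, and fetch_once are all hypothetical names for illustration. The point is that binding to 127.0.0.1 keeps the tool reachable only from the same machine.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder UI; a real tool would serve its own HTML, CSS, and JS.
PAGE = b"<html><body><h1>Local tool</h1><button>Run</button></body></html>"


class LocalUIHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the same page for every path.
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def start_local_server(port=0):
    """Bind to 127.0.0.1 only; port 0 lets the OS pick a free port."""
    server = HTTPServer(("127.0.0.1", port), LocalUIHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_once():
    """Start the server, request the page once, then shut down."""
    server = start_local_server()
    conn = HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
    try:
        conn.request("GET", "/")
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
```

In normal use you would call start_local_server(8080), keep the process alive, and open http://127.0.0.1:8080/ in a browser; fetch_once() is only there to show the round trip.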

What's so hard about PDF text extraction? | Mar 2020

Hi, author and maintainer of Tabula (https://github.com/tabulapdf/tabula). We've been trying to contact you about the "Tabula Pro" version that you are offering.

Feel free to reach me at manuel at jazzido dot com

My New Favorite Tool for Reviewing PDFs: Okular | Mar 2019

My favorite tool for extracting data from PDFs: Tabula https://github.com/tabulapdf/tabula

How to Run SQL on PDF Files | Feb 2019

Sure. But the tool posted here doesn't do that. It merely extracts text, and the "analysis" is a couple of regexes that are tailor-made for that particular pdf. Awk can do that much and a lot more.

If you want to extract tables from a PDF, there's Tabula[1], but it doesn't run automatically over the whole PDF: you have to manually draw a rectangular selection around each table you want to extract.

1. https://github.com/tabulapdf/tabula

A Python Library to extract tabular data from PDFs | Oct 2018

Interesting. I've used Tabula [0] in the past with great success. I wonder how this compares.

[0]: https://github.com/tabulapdf/tabula

New Jersey Open Data Initiative | Jan 2017

Unfortunately, it doesn't. From https://github.com/tabulapdf/tabula:

> Caveat: Tabula only works on text-based PDFs, not scanned documents. If you can click-and-drag to select text in your table in a PDF viewer (even if the output is disorganized trash), then your PDF is text-based and Tabula should work.