What does HackerNews think of pdfplumber?

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

Language: Python

I recently tried pdfplumber [1] to extract data from (relatively) difficult-to-parse tables in PDFs, and it was a great experience. I can recommend it. Before settling on pdfplumber, I tried at least three other PDF packages, and none of them worked as easily or as expected.

[1]: https://github.com/jsvine/pdfplumber
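For readers who want to try the table extraction described above, here is a minimal sketch rather than the commenter's actual code. It assumes the first row of the table is a header row. The helper names `rows_to_records` and `extract_first_table` are hypothetical, as is the file name in the usage note; `pdfplumber.open`, `pdf.pages`, and `page.extract_table()` are the library calls used.

```python
from typing import Dict, List, Optional

# pdfplumber returns a table as a list of rows; each cell is a string or None.
Row = List[Optional[str]]


def rows_to_records(table: List[Row]) -> List[Dict[str, Optional[str]]]:
    """Treat the first row as the header (an assumption) and return one dict per remaining row."""
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    return [dict(zip(header, row)) for row in table[1:]]


def extract_first_table(pdf_path: str, page_index: int = 0) -> List[Dict[str, Optional[str]]]:
    """Open a PDF and return the first table detected on one page as records."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        table = page.extract_table()  # None when no table is detected
    return rows_to_records(table or [])
```

Calling `extract_first_table("report.pdf")` would return a list of dicts keyed by the header cells. Difficult layouts may need the `table_settings` argument of `extract_table`, which this sketch leaves at its defaults.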

>My (flippant) reaction to a friend that brought the comment to my attention was unhelpful; "Step 1 day 1, quit." So he has challenged me to write eight helpful blog posts during the remainder of my Garden Leave.

One way to do it would be impedance matching: maximize usefulness by bringing Python into the world the person who asked the question already lives in: Excel, Word, PDF, email.

That means starting with "Automate the Boring Stuff"[0] to ease into programming through specific tasks that are relevant and useful right off the bat: manipulating files, spreadsheets, PDF documents, etc.

Openpyxl[1] for playing more with Excel files, and basically searching "Excel Python" on the internet for more use cases.

Python-docx[2] for playing with Word documents, extracting content from tables, etc.

PdfPlumber[3] for playing with PDFs. Sometimes you have data in .pptx files, and you can use LibreOffice to convert all of them to PDFs because you hate the pptx format:

  libreoffice --headless --invisible --convert-to pdf *.pptx
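In keeping with the "bring Python to the task" idea, the same conversion could be driven from a script. This is only a sketch under stated assumptions: a `libreoffice` executable is on the PATH, the function names are hypothetical, and the `--outdir` option is an addition that the original command does not use.

```python
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(files: List[Path], out_dir: Path) -> List[str]:
    """Build the argument list for a headless LibreOffice PDF conversion."""
    return [
        "libreoffice", "--headless", "--invisible",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),  # not in the original command; keeps the PDFs in one place
        *[str(f) for f in files],
    ]


def convert_pptx_to_pdf(src_dir: str, out_dir: str) -> List[Path]:
    """Convert every .pptx in src_dir and return the expected PDF paths."""
    files = sorted(Path(src_dir).glob("*.pptx"))
    if not files:
        return []
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if LibreOffice exits with an error.
    subprocess.run(build_convert_command(files, Path(out_dir)), check=True)
    return [Path(out_dir) / (f.stem + ".pdf") for f in files]
```

Passing the files as a list avoids shell globbing, so file names containing spaces are handled without extra quoting.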

Also, if you both are using notebooks, take a look at what we're building at https://iko.ai. I posted about it a bit earlier[4]. It has no-setup real-time Jupyter notebooks, and all the jazz. Focuses on solving machine learning problems since we built it for ourselves to slash projects' time.

- [0]: https://nostarch.com/automatestuff2

- [1]: https://openpyxl.readthedocs.io/en/stable/

- [2]: https://python-docx.readthedocs.io/en/latest/

- [3]: https://github.com/jsvine/pdfplumber

- [4]: https://news.ycombinator.com/item?id=25608098

A shout-out for PDF Plumber: https://github.com/jsvine/pdfplumber

I've done lots of work in this space, including computer vision and ML approaches, and with Tabula[1], which was the gold standard for extraction.

PDF Plumber is better on just about every example I've tried.

[1] https://tabula.technology/

Very cool, thanks for sharing. I'm guessing it doesn't do OCR yet? FWIW, you may be interested in these similar projects, which are popular in the journalism community though they don't provide the same high-level interface or data-inference, just the PDF-to-delimited text processing:

- http://tabula.technology/ (Java)

- https://github.com/jsvine/pdfplumber (pure Python as well)