What does HackerNews think of pdfplumber?

Show HN: I am building a new Python library to read/write PDF files | Nov 2022

I recently tried pdfplumber [1] to extract tables with (relatively) difficult formatting from PDFs, and it was a great experience. I can recommend it. Before settling on pdfplumber, I tried at least three other PDF packages, and none of them worked as easily or as expected.

[1]: https://github.com/jsvine/pdfplumber
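
For readers who want to try the same thing, a minimal sketch of table extraction with pdfplumber might look like the following. It assumes pdfplumber is installed; the helper names `clean_table` and `extract_tables` and the file name `report.pdf` are hypothetical and are not taken from the comment above.

```python
from typing import List, Optional

# pdfplumber returns each table as a list of rows; a cell is a string or None.
Row = List[Optional[str]]


def clean_table(table: List[Row]) -> List[List[str]]:
    """Replace empty (None) cells with "" and strip surrounding whitespace."""
    return [[(cell or "").strip() for cell in row] for row in table]


def extract_tables(pdf_path: str) -> List[List[List[str]]]:
    """Return every table pdfplumber detects in the file, page by page."""
    # Third-party dependency, imported lazily so clean_table works without it.
    import pdfplumber

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                tables.append(clean_table(table))
    return tables
```

Calling `extract_tables("report.pdf")` (a hypothetical file) would return one list of rows per detected table. Difficult layouts may need pdfplumber's table settings tuned, which this sketch leaves at the defaults.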

Ask HN: What should go in an Excel-to-Python equivalent of a couch-to-5k? | Jan 2021

>My (flippant) reaction to a friend who brought the comment to my attention was unhelpful: "Step 1 day 1, quit." So he has challenged me to write eight helpful blog posts during the remainder of my Garden Leave.

One way to do it would be impedance matching: maximize usefulness by bringing Python to the world the person who asked the question already lives in, namely Excel, Word, PDF, and email.

This means starting with "Automate the Boring Stuff"[0], which eases into programming through specific tasks that are relevant and useful right off the bat: manipulating files, spreadsheets, PDF documents, etc.

Openpyxl[1] for playing more with Excel files, and basically searching "Excel Python" on the internet for more use cases.

Python-docx[2] for playing with Word documents, extracting content from tables, etc.

PdfPlumber[3] for playing with PDFs. Sometimes you have data in .pptx files, and you can use LibreOffice to convert all of them to PDFs because you hate the pptx format:

  libreoffice --headless --invisible --convert-to pdf *.pptx
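
The same batch conversion can be driven from Python, which fits the "bring Python to the files you already have" idea. This is only a sketch: the function names are hypothetical, and it assumes the `libreoffice` executable is on the PATH (on some systems the binary is called `soffice` instead).

```python
import glob
import subprocess
from typing import List


def build_convert_command(pptx_files: List[str], outdir: str = ".") -> List[str]:
    """Build the LibreOffice command line that converts the given files to PDF."""
    return [
        "libreoffice",  # assumption: may be "soffice" on some installs
        "--headless",
        "--invisible",
        "--convert-to", "pdf",
        "--outdir", outdir,
        *pptx_files,
    ]


def convert_all(pattern: str = "*.pptx", outdir: str = ".") -> List[str]:
    """Convert every file matching the pattern; return the list of inputs."""
    files = sorted(glob.glob(pattern))
    if not files:
        return []  # nothing to do, so LibreOffice is never started
    subprocess.run(build_convert_command(files, outdir), check=True)
    return files
```

Passing the file list explicitly, rather than relying on shell globbing, avoids going through a shell and keeps the command easy to inspect before it runs.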

Also, if you both are using notebooks, take a look at what we're building at https://iko.ai. I posted about it a bit earlier[4]. It has no-setup real-time Jupyter notebooks, and all the jazz. Focuses on solving machine learning problems since we built it for ourselves to slash projects' time.

- [0]: https://nostarch.com/automatestuff2

- [1]: https://openpyxl.readthedocs.io/en/stable/

- [2]: https://python-docx.readthedocs.io/en/latest/

- [3]: https://github.com/jsvine/pdfplumber

- [4]: https://news.ycombinator.com/item?id=25608098

Ask HN: Software for Reading Pdfs? | Sep 2020

Check the following threads:

- https://news.ycombinator.com/item?id=24460142

- https://news.ycombinator.com/item?id=24471058

There is also a Python library:

- https://github.com/jsvine/pdfplumber

What's so hard about PDF text extraction? | Mar 2020

A shout-out for PDF Plumber: https://github.com/jsvine/pdfplumber

I've done lots of work in this space, including computer vision and ML approaches, and with Tabula[1], which was the gold standard for extraction.

PDF Plumber is better on just about every example I've tried.

[1]: https://tabula.technology/

A Python Library to extract tabular data from PDFs | Oct 2018

An alternative: https://github.com/jsvine/pdfplumber

Using Google Cloud Vision OCR to extract text from photos and scanned documents | Mar 2016

Very cool, thanks for sharing. I'm guessing it doesn't do OCR yet? FWIW, you may be interested in these similar projects, which are popular in the journalism community, though they don't provide the same high-level interface or data inference, just PDF-to-delimited-text processing:

- http://tabula.technology/ (Java)

- https://github.com/jsvine/pdfplumber (pure Python as well)