What does HackerNews think of pdfplumber?
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
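For a sense of what "plumbing" a page looks like, here is a minimal Python sketch. It assumes pdfplumber is installed and relies on its `pdfplumber.open`, `page.chars`, `page.rects`, `page.lines`, and `page.extract_text`. The helper names `summarize_objects` and `summarize_pdf` are made up for illustration and are not part of the library.

```python
from collections import Counter


def summarize_objects(objects):
    """Count objects by their "object_type" key (e.g. "char", "rect", "line")."""
    return dict(Counter(obj.get("object_type", "unknown") for obj in objects))


def summarize_pdf(path):
    """Yield (page number, object counts, first 80 characters of text) per page."""
    # Third-party dependency, imported here so summarize_objects works without it.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # Each of these is a list of dicts describing one object on the page.
            objects = page.chars + page.rects + page.lines
            # Guard against a missing text layer (e.g. scanned pages).
            text = page.extract_text() or ""
            yield page.page_number, summarize_objects(objects), text[:80]
```

Looping over `summarize_pdf("report.pdf")` (a hypothetical file name) and printing each tuple gives a quick per-page overview before deciding how to extract anything.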
One way to do it is impedance matching: maximize usefulness by bringing Python into the world the person who asked the question already lives in, namely Excel, Word, PDF, and email.
That means starting with "Automate the Boring Stuff"[0], which eases you into programming through specific tasks that are relevant and useful right off the bat: manipulating files, spreadsheets, PDF documents, etc.
Openpyxl[1] for playing more with Excel files, and basically searching "Excel Python" on the internet for more use cases.
Python-docx[2] for playing with Word documents, extracting content from tables, etc.
PdfPlumber[3] for playing with PDFs. Sometimes you have data in .pptx files, and you can use LibreOffice to convert all of them to PDFs because you hate the pptx format:
libreoffice --headless --invisible --convert-to pdf *.pptx
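If you would rather drive that conversion from Python, something like the sketch below could work. It assumes the `libreoffice` binary is on your PATH (on some systems it is called `soffice`), and it adds the `--outdir` flag, which the command above does not use. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_convert_command(files, outdir="."):
    """Build the LibreOffice argument list for converting `files` to PDF."""
    return [
        "libreoffice", "--headless", "--invisible",
        "--convert-to", "pdf",
        "--outdir", str(outdir),
        *[str(f) for f in files],
    ]


def convert_all_pptx(folder, outdir=None):
    """Convert every .pptx in `folder` to PDF and return the expected PDF paths."""
    folder = Path(folder)
    outdir = Path(outdir) if outdir is not None else folder
    decks = sorted(folder.glob("*.pptx"))
    if not decks:
        return []  # nothing to convert, so LibreOffice is never started
    # check=True raises CalledProcessError if LibreOffice exits with an error.
    subprocess.run(build_convert_command(decks, outdir), check=True)
    return [outdir / (deck.stem + ".pdf") for deck in decks]
```

Passing the files as a list instead of a shell glob avoids quoting problems with file names that contain spaces.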
Also, if you are both using notebooks, take a look at what we're building at https://iko.ai. I posted about it a bit earlier[4]. It has no-setup real-time Jupyter notebooks, and all that jazz. It focuses on solving machine learning problems, since we built it for ourselves to slash project time.
- [0]: https://nostarch.com/automatestuff2
- [1]: https://openpyxl.readthedocs.io/en/stable/
- [2]: https://python-docx.readthedocs.io/en/latest/
- https://news.ycombinator.com/item?id=24460142
- https://news.ycombinator.com/item?id=24471058
There is also a Python library:
I've done lots of work in this space, including computer vision and ML approaches, and with Tabula[1], which was the gold standard for extraction.
PDF Plumber is better on just about every example I've tried.
- [1]: http://tabula.technology/ (Java)
- https://github.com/jsvine/pdfplumber (pure Python as well)
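For anyone who wants to try the table extraction praised above, here is a rough Python sketch. It uses pdfplumber's `page.extract_tables()`, which returns each table as a list of rows of cell strings (or `None` for empty cells). Treating the first row as a header is an assumption that will not hold for every PDF, and `table_to_records` and `extract_records` are illustrative names, not library functions.

```python
def table_to_records(table):
    """Turn a table (list of rows) into dicts, ASSUMING the first row is a header."""
    if not table:
        return []
    # Empty cells come back as None, so normalize them to empty strings.
    header = [(cell or "").strip() for cell in table[0]]
    return [
        dict(zip(header, [(cell or "").strip() for cell in row]))
        for row in table[1:]
    ]


def extract_records(path):
    """Collect records from every table pdfplumber finds in the PDF at `path`."""
    # Third-party dependency, imported here so table_to_records works without it.
    import pdfplumber

    records = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                records.extend(table_to_records(table))
    return records
```

Keeping the row-to-dict step separate from the PDF handling makes it easy to swap in a different header rule when a document's tables do not start with column names.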