What does HackerNews think of doctr?

docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.

Language: Python

#14 in Deep learning

Yup! But I'm still exploring options. (Any recommendations are welcome!) Here are some candidates I'm considering:

- https://github.com/mindee/doctr

- https://github.com/open-mmlab/mmocr

- https://github.com/PaddlePaddle/PaddleOCR (honestly I don't know Mandarin so I'm a bit stuck)

- https://github.com/clovaai/donut -- While it's primarily an "OCR-free document understanding transformer," I think it's worth experimenting with. Think I can sort this out by letting the LLM reason through it multiple times (although this will impact performance)

- Yesterday I got a suggestion to consider https://github.com/kakaobrain/pororo -- I don't think development is still active, but the results are pretty great on Korean text

EasyOCR is a popular project if you are in an environment where you can run Python and PyTorch (https://github.com/JaidedAI/EasyOCR). Other open source projects of note are PaddleOCR (https://github.com/PaddlePaddle/PaddleOCR) and docTR (https://github.com/mindee/doctr).

Last I checked, I saw a grocery bill example using https://github.com/mindee/doctr and it was fairly accurate. Bear in mind that was last year; hopefully it has gotten even better, or there are other libraries by now.

There's also docTR, which can do text detection and extraction out of the box.

It's command line driven but can display the detected text as an overlay of the document.

https://github.com/mindee/doctr
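
For anyone who wants to try this from Python rather than the command line, here is a minimal sketch. It assumes docTR is installed with one of its deep learning backends and that the pretrained weights can be downloaded on first use. The helper names `words_from_export` and `ocr_image` are made up for illustration; `ocr_predictor`, `DocumentFile.from_images` and `export()` are the library's own calls.

```python
def words_from_export(export):
    """Flatten docTR's exported dict into (text, confidence) pairs.

    The export nests pages -> blocks -> lines -> words.
    """
    pairs = []
    for page in export["pages"]:
        for block in page["blocks"]:
            for line in block["lines"]:
                for word in line["words"]:
                    pairs.append((word["value"], word["confidence"]))
    return pairs


def ocr_image(path):
    """Run docTR's default detection + recognition models on one image file."""
    # Imported lazily so words_from_export works without docTR installed.
    from doctr.io import DocumentFile
    from doctr.models import ocr_predictor

    model = ocr_predictor(pretrained=True)   # default pretrained models
    doc = DocumentFile.from_images(path)     # a single path or a list of paths
    result = model(doc)
    # result.show() would open the overlay view mentioned above (needs matplotlib).
    return words_from_export(result.export())
```

Calling `ocr_image("receipt.jpg")` (a hypothetical file name) should give back a list of recognized words with their confidence scores, which is handy for filtering out low-confidence reads.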

If you want to OCR a document image, modern versions of Tesseract can work well. If you last used it a few years ago, the recognition has since improved thanks to a new text recognition algorithm that uses modern (deep learning) techniques. Browser demo using a modern version: https://robertknight.github.io/tesseract-wasm/.

OCR processing typically consists of two major steps: detecting/locating words or lines of text on the page, and recognizing lines of text.
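
To make the two steps concrete, here is a toy sketch of that pipeline shape. Everything in it is hypothetical and not taken from any OCR library: the "image" is just a dict mapping boxes to the text inside them, so the stand-in detector and recognizer run without a real engine. A real detector would be a heuristic or a neural network, and a real recognizer would read pixels.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Word:
    box: Box
    text: str


def run_ocr(image,
            detect: Callable[[object], List[Box]],
            recognize: Callable[[object, Box], str]) -> List[Word]:
    """Stage 1 locates text regions; stage 2 reads each located region."""
    boxes = detect(image)
    return [Word(box, recognize(image, box)) for box in boxes]


# Toy stand-ins (assumptions for illustration only).
def toy_detect(image):
    # Return boxes in reading order: top to bottom, then left to right.
    return sorted(image, key=lambda b: (b[1], b[0]))


def toy_recognize(image, box):
    return image[box]


page = {(60, 10, 110, 30): "world", (0, 10, 50, 30): "Hello"}
words = run_ocr(page, toy_detect, toy_recognize)
print(" ".join(w.text for w in words))  # Hello world
```

The point of the split is that the two stages can fail independently: a region the detector never returns is never read, no matter how good the recognizer is.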

Tesseract's text recognition uses modern methods, but the text detection phase is still based on classical methods involving a lot of heuristics, and you may need to experiment with various configuration variables to get the best results. As a result it can fail to detect text if you present it with something other than a reasonably clean document image.
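
As a rough sketch of what that experimentation can look like, the snippet below builds a Tesseract config string and passes it through pytesseract. It assumes the `tesseract` binary, `pytesseract` and Pillow are installed; the helper names `tesseract_config` and `ocr_with_config` are made up here. `--psm` (page segmentation mode) and `-c NAME=VALUE` are Tesseract's own command line options, and which values help depends on your images.

```python
def tesseract_config(psm=None, variables=None):
    """Build a config string such as '--psm 6 -c preserve_interword_spaces=1'."""
    parts = []
    if psm is not None:
        parts += ["--psm", str(psm)]            # page segmentation mode
    for name, value in (variables or {}).items():
        parts += ["-c", f"{name}={value}"]      # one -c flag per config variable
    return " ".join(parts)


def ocr_with_config(image_path, psm=None, variables=None):
    """Run Tesseract on one image with the given layout settings."""
    # Imported lazily so tesseract_config works without these packages.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(
        Image.open(image_path),
        config=tesseract_config(psm, variables),
    )


print(tesseract_config(psm=6, variables={"preserve_interword_spaces": 1}))
# --psm 6 -c preserve_interword_spaces=1
```

For example, `--psm 6` treats the page as a single uniform block of text and `--psm 11` looks for sparse text, so trying a few modes on the same image is a cheap first experiment when detection misses things.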

docTR (https://github.com/mindee/doctr) is a newer package that uses modern methods for both text detection and recognition. It is still young, however, and I expect it will take more time and effort to mature.