What does HackerNews think of OCRmyPDF?

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched.

Language: Python

#61 in Python
OCRmyPDF is the typical answer: https://github.com/ocrmypdf/OCRmyPDF

It uses Tesseract under the hood. Results tend to just be OK in my experience.

You might be interested in https://github.com/ocrmypdf/OCRmyPDF then.

It does quite a bit of preprocessing on the PDF pages before passing them on to Tesseract.
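As a rough illustration of that preprocessing, the sketch below assembles an `ocrmypdf` command line with page rotation and deskew turned on. The helper name `build_ocrmypdf_command` and the file names are made up for this example. `-l`, `--rotate-pages` and `--deskew` are options of the OCRmyPDF command line, but it is worth checking `ocrmypdf --help` against your installed version.

```python
def build_ocrmypdf_command(input_pdf, output_pdf, language="eng",
                           rotate_pages=True, deskew=True):
    """Return the argument list for one ocrmypdf run (hypothetical helper)."""
    command = ["ocrmypdf", "-l", language]
    if rotate_pages:
        # Ask OCRmyPDF to detect and correct page orientation before OCR.
        command.append("--rotate-pages")
    if deskew:
        # Ask OCRmyPDF to straighten slightly crooked scans before OCR.
        command.append("--deskew")
    command += [input_pdf, output_pdf]
    return command


if __name__ == "__main__":
    # File names are placeholders; this only prints the command.
    print(" ".join(build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf")))
```

The returned list can be handed to `subprocess.run` on a machine where OCRmyPDF and Tesseract are installed.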

If it's on the dark web then they probably know how to use 'ocrmypdf' as well (which uses tesseract under the hood).

https://github.com/ocrmypdf/OCRmyPDF

Hi, I finished part of the writeup, mainly as the Fujitsu was a gift, but I want to refine it a bit before posting.

Stuff like auto-rotate, auto-deskew, and multipage PDF output can all be handled by external programs, and realistically that's what I'd recommend.

convert (part of ImageMagick) can handle rotation and page merging, as well as conversion to other formats.

For deskew, and possibly other features you're looking for, ScanTailor [1] is probably a good option. Can't speak to its functional quality though, as I prefer to manually fix skew for photos.

Convert also has a deskew option [2], but I've not tested it, and a few online results [4] seem to point to it not working that well.
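To make the convert suggestions concrete, here is a small sketch that builds the command line for merging page images into one PDF, optionally rotating or deskewing them first. The helper `build_convert_command` and the file names are hypothetical. `-rotate` and `-deskew` are ImageMagick options, and `40%` is the threshold its documentation suggests as a starting point. On ImageMagick 7 the program is called `magick` rather than `convert`. As noted above, deskew quality is untested here.

```python
def build_convert_command(page_images, output_pdf, rotate_degrees=None,
                          deskew_threshold=None):
    """Return the argument list for an ImageMagick convert run (hypothetical helper)."""
    # Listing several input images makes convert write a multipage output.
    command = ["convert"] + list(page_images)
    if rotate_degrees is not None:
        # Applied to every image listed before the option.
        command += ["-rotate", str(rotate_degrees)]
    if deskew_threshold is not None:
        command += ["-deskew", deskew_threshold]
    command.append(output_pdf)
    return command


if __name__ == "__main__":
    pages = ["page1.png", "page2.png"]  # placeholder file names
    print(" ".join(build_convert_command(pages, "merged.pdf", rotate_degrees=90)))
    print(" ".join(build_convert_command(pages, "merged.pdf", deskew_threshold="40%")))
```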

There's also OCRmyPDF [3] which I found excellent for OCRing scanned print documents.

---

Below are some notes on the other things you mentioned. Sorry about the formatting, I couldn't find anything about markdown or HTML on HN. I just copied this from Zim

Long page scanning

* In the main xsane control window (normally upper left), select the Window tab, then select Show standard options.

* Keyboard shortcut: CTRL+5

* If the Standard Options window is already open, click the option again after it closes, to reopen it.

* In the Standard Options menu, set page height to 0.000.

* This allows the scanner to scan documents longer than standard A4 sizing. For example, you can scan legal sized or longer documents.

* The scanner will continue scanning the document until the paper physically exits the scanner.

---

Continuous scanning:

On the main xsane window, change the following settings:

* Where it says ADF Front or ADF Back, change to ADF Duplex if you are scanning duplex. Alternatively change to ADF Front or ADF Back if you know which side you want scanned.

* If you are scanning Duplex, change the number next to the icon of 4 papers on top of each other from 1 to 2.

* This represents how many pages are scanned.

  * If you have multiple documents in the feeder (as the S1300i supports loading 20 documents at a time), you should set the number of pages to scan (same option) to either the number of pages, if scanning single sided, or DOUBLE the total number of pages, if scanning Duplex.

  * Alternatively, set it to a number higher than 40, and the scanner will continue scanning until there are no pages left.
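The page-count rule above is simple arithmetic, sketched here as a tiny helper. The function name `pages_to_scan` is made up for illustration and is not part of xsane.

```python
def pages_to_scan(sheets_in_feeder, duplex):
    """Value to enter in the page-count field for a given stack of sheets."""
    if sheets_in_feeder < 0:
        raise ValueError("sheets_in_feeder must not be negative")
    # Duplex scanning produces two page images per sheet of paper.
    return sheets_in_feeder * 2 if duplex else sheets_in_feeder


if __name__ == "__main__":
    print(pages_to_scan(20, duplex=True))
    print(pages_to_scan(20, duplex=False))
```

A full 20-sheet feeder in Duplex mode needs 40, which is presumably where the "higher than 40" figure comes from.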

[1] https://scantailor.org

[2] https://www.imagemagick.org/script/command-line-options.php#...

[3] https://github.com/ocrmypdf/OCRmyPDF

[4] https://stackoverflow.com/questions/41546181/how-to-deskew-a...

I love Vuescan. The UI could use a bit of polishing, but the software works perfectly.

Just don't use the built-in OCR function. For some reason it always made many more errors than OCRmyPDF (Tesseract OCR) [1].

Sadly the Vuescan AUR package (unofficial) [2] breaks several times a week with a sha256sum mismatch, because the Vuescan developer pushes many silent updates.

Also the developer behind Vuescan is against other downloading options [3].
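For anyone unfamiliar with why a silent update breaks the package: the build recipe pins a SHA-256 hash of the download, and any change to the file changes the hash. A minimal sketch of that check, using only the Python standard library (the function names are made up here, and this is not the actual AUR tooling):

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_hex):
    """Compare a file's digest with a pinned value, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

If upstream replaces the file at the same URL without a version bump, `checksum_matches` returns False until the pinned hash is updated.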

[1] https://github.com/ocrmypdf/OCRmyPDF

[2] https://aur.archlinux.org/packages/vuescan-bin

[3] https://lists.archlinux.org/pipermail/aur-general/2021-May/0...

Not off the shelf but here are some tools. I have no experience with them.

Wolf binarization - I think it makes the text clearer before OCR.

https://github.com/chriswolfvision/local_adaptive_binarizati...
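For a feel of what this kind of binarization does, here is a small pure-Python sketch of a Wolf-style local threshold as I understand it: each pixel is compared against a threshold built from the local mean `m`, local standard deviation `s`, the darkest gray level in the image `M`, and the largest local deviation `R`, as `(1 - k) * m + k * M + k * (s / R) * (m - M)`. This is an illustration on a list-of-lists image, not the linked project's code. The window size and `k = 0.5` are assumed defaults.

```python
import math


def wolf_binarize(image, window=3, k=0.5):
    """Binarize a grayscale image given as rows of ints (0-255).

    Returns rows of 0 (dark, likely text) and 1 (light, likely background).
    """
    height, width = len(image), len(image[0])
    half = window // 2
    global_min = min(min(row) for row in image)

    # First pass: local mean and standard deviation around every pixel,
    # with the window clipped at the image borders.
    means = [[0.0] * width for _ in range(height)]
    stds = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            values = [image[j][i]
                      for j in range(max(0, y - half), min(height, y + half + 1))
                      for i in range(max(0, x - half), min(width, x + half + 1))]
            mean = sum(values) / len(values)
            variance = sum((v - mean) ** 2 for v in values) / len(values)
            means[y][x] = mean
            stds[y][x] = math.sqrt(variance)

    max_std = max(max(row) for row in stds)

    # Second pass: compute the per-pixel threshold and compare.
    result = []
    for y in range(height):
        out_row = []
        for x in range(width):
            m, s = means[y][x], stds[y][x]
            # A perfectly flat image has no contrast to normalize by.
            ratio = s / max_std if max_std > 0 else 0.0
            threshold = (1 - k) * m + k * global_min + k * ratio * (m - global_min)
            out_row.append(0 if image[y][x] < threshold else 1)
        result.append(out_row)
    return result
```

Real implementations work on full images and use faster windowed statistics. This version favors readability over speed.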

This thing OCRs the pdf using Tesseract OCR

https://github.com/ocrmypdf/OCRmyPDF/

Two other pdf tools

https://github.com/qpdf/qpdf

https://github.com/pikepdf/pikepdf