What does HackerNews think of OCRmyPDF?
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched.
It uses Tesseract under the hood. Results tend to just be OK in my experience.
It does quite a bit of preprocessing on the PDF pages before passing them on to Tesseract.
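For anyone who wants to try it, here is a minimal sketch of driving OCRmyPDF from Python with two of its preprocessing options turned on. `--rotate-pages`, `--deskew` and `-l` are OCRmyPDF command-line options; the function name and the file names are made up for illustration, and the script only prints the command if OCRmyPDF or the input file is missing.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocrmypdf_command(src, dst, language="eng"):
    """Return the argv list for one OCRmyPDF run with common preprocessing on."""
    return [
        "ocrmypdf",
        "--rotate-pages",  # fix pages that were scanned sideways or upside down
        "--deskew",        # straighten slightly crooked pages before OCR
        "-l", language,    # Tesseract language pack to use
        str(src),
        str(dst),
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_ocrmypdf_command(Path("scan.pdf"), Path("scan_ocr.pdf"))
    if shutil.which("ocrmypdf") and Path("scan.pdf").exists():
        subprocess.run(cmd, check=True)
    else:
        print("Would run:", " ".join(cmd))
```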
Stuff like auto-rotate, auto-de-skew, and multipage PDF output can all be handled by external programs, and realistically that's what I'd recommend.
convert (part of ImageMagick) can handle rotation and page merging, as well as conversion to other formats.
For de-skew, and possibly other features you're looking for, ScanTailor [1] is probably a good option. Can't speak to its functional quality though; I prefer to manually fix skew for photos.
convert also has a deskew option [2], but I've not tested it, and a few online results [4] seem to suggest it doesn't work that well.
There's also OCRmyPDF [3] which I found excellent for OCRing scanned print documents.
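The ImageMagick steps mentioned above (rotate, deskew, merge pages into one PDF) could be sketched roughly like this. `-rotate` and `-deskew` are ImageMagick options; the helper name, file names, and the 40% threshold are assumptions for illustration, and newer ImageMagick versions prefer calling `magick` instead of `convert`. As noted above, the deskew results may not be great.

```python
def build_convert_command(pages, output_pdf, rotate_degrees=0, deskew_threshold=None):
    """Build an ImageMagick argv that merges page images into a single PDF."""
    cmd = ["convert", *[str(page) for page in pages]]
    if rotate_degrees:
        # Rotates every page read so far by the same angle.
        cmd += ["-rotate", str(rotate_degrees)]
    if deskew_threshold is not None:
        # Threshold is a percentage, e.g. 40 becomes "40%".
        cmd += ["-deskew", f"{deskew_threshold}%"]
    cmd.append(str(output_pdf))
    return cmd


if __name__ == "__main__":
    # Hypothetical file names; pass the list to subprocess.run to execute it.
    print(" ".join(build_convert_command(
        ["page1.png", "page2.png"], "merged.pdf",
        rotate_degrees=90, deskew_threshold=40,
    )))
```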
---
Below are some notes on the other things you mentioned. Sorry about the formatting, I couldn't find anything about markdown or HTML on HN. I just copied this from Zim.
Long page scanning
* In the main xsane control window (upper left normally), select the Window tab, select Show standard options
* Keyboard shortcut: CTRL+5
* If the Standard Options window is already open, click the option again after it closes, to reopen it.
* In the Standard Options menu, set page height to 0.000.
* This allows the scanner to scan documents longer than standard A4 sizing. For example, you can scan legal sized or longer documents.
* The scanner will continue scanning the document until the paper physically exits the scanner.
---
Continuous scanning:
On the main xsane window, change the following settings:
* Where it says ADF Front or ADF Back, change to ADF Duplex if you are scanning duplex. Alternatively change to ADF Front or ADF Back if you know which side you want scanned.
* If you are scanning duplex, change the number next to the icon of four stacked papers from 1 to 2.
* This represents how many pages are scanned.
* If you have multiple documents in the feeder (as the S1300i supports loading 20 documents at a time), you should set the number of pages to scan (same option) to either the number of pages, if scanning single sided, or DOUBLE the total number of pages, if scanning duplex.
* Alternatively, set it to a number higher than 40, and the scanner will continue scanning until there are no pages left.
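The page-count rule above boils down to simple arithmetic. A tiny sketch (the function name is made up, it is not anything xsane provides):

```python
def pages_to_scan(sheets_in_feeder, duplex):
    """Value to enter in xsane's page-count field for a stack of sheets."""
    if sheets_in_feeder < 1:
        raise ValueError("need at least one sheet in the feeder")
    # Duplex scanning produces two images (front and back) per sheet.
    return sheets_in_feeder * 2 if duplex else sheets_in_feeder


if __name__ == "__main__":
    print(pages_to_scan(20, duplex=False))
    print(pages_to_scan(20, duplex=True))
```

A full 20-sheet feeder scanned duplex comes out to 40, which is why any number above 40 works as a "scan until empty" setting.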
[1] https://scantailor.org
[2] https://www.imagemagick.org/script/command-line-options.php#...
[3] https://github.com/ocrmypdf/OCRmyPDF
[4] https://stackoverflow.com/questions/41546181/how-to-deskew-a...
Just don't use the built-in OCR function. For some reason it always made many more errors than OCRmyPDF (Tesseract OCR) [1].
Sadly the unofficial Vuescan AUR package [2] breaks several times a week with a sha256sum mismatch, since the Vuescan developer pushes many silent updates.
Also, the developer behind Vuescan is against other download options [3].
[1] https://github.com/ocrmypdf/OCRmyPDF
[2] https://aur.archlinux.org/packages/vuescan-bin
[3] https://lists.archlinux.org/pipermail/aur-general/2021-May/0...
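To illustrate the checksum breakage described above: a package recipe pins the SHA-256 of the file it downloads, so if upstream replaces that file without changing its name, the pinned value no longer matches. A minimal sketch of such a check, using only Python's standard `hashlib`; the function names are hypothetical and this is not how the AUR tooling itself is implemented.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pinned_checksum(path, pinned_hex):
    """True if the downloaded file still matches the checksum pinned in the recipe."""
    return sha256_of_file(path) == pinned_hex.strip().lower()
```

If the upstream file is silently replaced, `matches_pinned_checksum` returns False until the package maintainer updates the pinned value.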
Wolf binarization - I think it makes the text clearer before OCR.
https://github.com/chriswolfvision/local_adaptive_binarizati...
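For a rough idea of what this kind of binarization does, here is a slow, pure-Python sketch of the Wolf-Jolion threshold as it is usually stated: T = (1 - k)*m + k*M + k*(s/R)*(m - M), where m and s are the local mean and standard deviation, M is the darkest gray value in the image, and R is the largest local standard deviation. This is an independent illustration, not code from the linked repository; the function name, the window size, and k = 0.5 are assumptions.

```python
import math


def wolf_binarize(image, window=15, k=0.5):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    Returns a same-sized image where 0 marks text and 255 marks background.
    """
    height, width = len(image), len(image[0])
    half = window // 2
    means = [[0.0] * width for _ in range(height)]
    stds = [[0.0] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            # Collect the window around (x, y), clipped at the image border.
            values = [
                image[yy][xx]
                for yy in range(max(0, y - half), min(height, y + half + 1))
                for xx in range(max(0, x - half), min(width, x + half + 1))
            ]
            mean = sum(values) / len(values)
            variance = sum((v - mean) ** 2 for v in values) / len(values)
            means[y][x] = mean
            stds[y][x] = math.sqrt(variance)

    min_gray = min(min(row) for row in image)  # M: darkest pixel in the image
    max_std = max(max(row) for row in stds)    # R: largest local deviation

    result = []
    for y in range(height):
        out_row = []
        for x in range(width):
            if max_std == 0:
                # Perfectly uniform image: treat everything as background.
                out_row.append(255)
                continue
            m, s = means[y][x], stds[y][x]
            # Algebraically the same as (1-k)*m + k*M + k*(s/R)*(m - M).
            threshold = m + k * (s / max_std - 1.0) * (m - min_gray)
            out_row.append(0 if image[y][x] < threshold else 255)
        result.append(out_row)
    return result
```

The threshold adapts to local contrast, so low-contrast or unevenly lit regions are handled differently from a single global cutoff. For real scans you would want an optimized implementation rather than this loop.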
This thing OCRs the PDF using Tesseract OCR.
https://github.com/ocrmypdf/OCRmyPDF/
Two other pdf tools