As university libraries have moved online, one ends up reading many poorly scanned journal articles. I often wonder about taking the time to clean them up. What replaces temporal information here is that the same characters appear over and over.

So of course I read this article hoping to learn about an off-the-shelf tool that would do a great job of scanned text reconstruction. Alas, the best candidates were "no code available."

Not off the shelf, but here are some tools. I have no experience with them.

Wolf binarization: I think it makes the text clearer before OCR.

https://github.com/chriswolfvision/local_adaptive_binarizati...
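
For what it's worth, here is roughly how I understand the Wolf-Jolion threshold to work. This is a minimal pure-Python sketch, not the code from that repository, and I have not compared its output against it. The function name, the 3x3 window, and k = 0.5 are my own assumptions. Each pixel is compared against a threshold built from the local mean m, the local standard deviation s, the darkest gray level M in the image, and the largest local standard deviation R.

```python
from statistics import mean, pstdev


def wolf_binarize(img, window=3, k=0.5):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    Returns rows of 0 (ink) and 255 (paper). Hypothetical helper for
    illustration; window and k are assumed values.
    """
    h, w = len(img), len(img[0])
    r = window // 2

    # Local mean and standard deviation around every pixel
    # (the window is clipped at the image borders).
    means = [[0.0] * w for _ in range(h)]
    stds = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            patch = [img[j][i]
                     for j in range(max(0, y - r), min(h, y + r + 1))
                     for i in range(max(0, x - r), min(w, x + r + 1))]
            means[y][x] = mean(patch)
            stds[y][x] = pstdev(patch)

    min_gray = min(min(row) for row in img)          # M: darkest pixel overall
    max_std = max(max(row) for row in stds) or 1.0   # R: largest local std (avoid /0)

    out = []
    for y in range(h):
        row = []
        for x in range(w):
            m, s = means[y][x], stds[y][x]
            # Threshold: T = m + k * (s/R - 1) * (m - M)
            t = m + k * (s / max_std - 1) * (m - min_gray)
            row.append(0 if img[y][x] < t else 255)
        out.append(row)
    return out
```

On a toy image with one dark vertical stroke on a light background, only the stroke comes out as ink. A real scan would want a much larger window and something faster than nested Python loops.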

This one OCRs the PDF using Tesseract.

https://github.com/ocrmypdf/OCRmyPDF/
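
Going by its documentation rather than experience, driving it from a script might look something like the sketch below. The helper names and the file names are hypothetical. The only options used are `--language` and `--deskew`, and it assumes the `ocrmypdf` command is installed and on the PATH.

```python
import shutil
import subprocess


def build_ocrmypdf_command(src, dst, language="eng", deskew=True):
    """Assemble the ocrmypdf command line as a list of arguments."""
    cmd = ["ocrmypdf", "--language", language]
    if deskew:
        # Straighten crooked page scans before OCR.
        cmd.append("--deskew")
    cmd += [src, dst]
    return cmd


def ocr_pdf(src, dst, **options):
    """Run ocrmypdf on src, writing a searchable PDF to dst."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(build_ocrmypdf_command(src, dst, **options), check=True)


if __name__ == "__main__":
    # Hypothetical file names; prints the command without running it.
    print(build_ocrmypdf_command("scanned_article.pdf", "searchable_article.pdf"))
```

Keeping the command builder separate from the call that runs it makes it easy to check what would be executed before pointing it at a whole folder of scans.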

Two other PDF tools:

https://github.com/qpdf/qpdf

https://github.com/pikepdf/pikepdf