As university libraries have moved online, one ends up reading many poorly scanned journal articles. I often wonder about taking the time to clean them up. In place of temporal information, a scanned document offers a redundancy of its own: the same characters appear over and over.
So of course I read this article hoping to learn about an off-the-shelf tool that would do a great job of reconstructing scanned text. Alas, the best candidates were listed as "no code available."
These are not off the shelf, but here are some tools. I have no experience with any of them.
Wolf binarization - I think it thresholds the scan to pure black and white so the text is clearer before OCR.
This tool OCRs a PDF using Tesseract.
Two other PDF tools.
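To make the first two entries concrete, here is a rough sketch of how binarization and Tesseract might chain together. It is not the code behind either tool. The threshold is my reading of the Wolf-Jolion formula, T = (1 - k)·m + k·M + k·(s/R)·(m - M), where m and s are the local mean and standard deviation, M is the darkest gray level on the page, and R is the largest local standard deviation. The function names, the window size of 15, and k = 0.5 are assumptions for illustration. The only external piece is the `tesseract` command-line program, whose `pdf` config writes a searchable PDF.

```python
import math
import shutil
import subprocess


def wolf_binarize(gray, window=15, k=0.5):
    """Binarize a grayscale page (rows of 0-255 ints) with a Wolf-Jolion style
    local threshold. Returns rows of 0 (ink) and 255 (paper)."""
    h, w = len(gray), len(gray[0])
    r = window // 2
    # Integral images of values and squared values give O(1) window sums.
    s1 = [[0] * (w + 1) for _ in range(h + 1)]
    s2 = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            v = gray[y][x]
            s1[y + 1][x + 1] = v + s1[y][x + 1] + s1[y + 1][x] - s1[y][x]
            s2[y + 1][x + 1] = v * v + s2[y][x + 1] + s2[y + 1][x] - s2[y][x]

    def box(s, y0, x0, y1, x1):
        return s[y1][x1] - s[y0][x1] - s[y1][x0] + s[y0][x0]

    mean = [[0.0] * w for _ in range(h)]
    std = [[0.0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - r), min(h, y + r + 1)  # window clipped at the edges
        for x in range(w):
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            n = (y1 - y0) * (x1 - x0)
            m = box(s1, y0, x0, y1, x1) / n
            var = box(s2, y0, x0, y1, x1) / n - m * m
            mean[y][x] = m
            std[y][x] = math.sqrt(max(var, 0.0))

    min_gray = min(min(row) for row in gray)        # M in the formula
    max_std = max(max(row) for row in std) or 1.0   # R; 1.0 avoids dividing by zero on a flat page

    out = []
    for y in range(h):
        row = []
        for x in range(w):
            m, s = mean[y][x], std[y][x]
            t = (1 - k) * m + k * min_gray + k * (s / max_std) * (m - min_gray)
            row.append(0 if gray[y][x] < t else 255)
        out.append(row)
    return out


def write_pgm(rows, path):
    """Save the page as a binary PGM, a simple image format Tesseract can read."""
    h, w = len(rows), len(rows[0])
    with open(path, "wb") as f:
        f.write(f"P5\n{w} {h}\n255\n".encode("ascii"))
        for row in rows:
            f.write(bytes(row))


def tesseract_pdf_command(image_path, output_base, lang="eng"):
    """Build the CLI call that writes a searchable PDF to output_base + '.pdf'."""
    return ["tesseract", image_path, output_base, "-l", lang, "pdf"]


def ocr_to_pdf(image_path, output_base, lang="eng"):
    """Run Tesseract on one page image; requires the tesseract binary on PATH."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    subprocess.run(tesseract_pdf_command(image_path, output_base, lang), check=True)
```

Usage would be something like `write_pgm(wolf_binarize(gray), "page.pgm")` followed by `ocr_to_pdf("page.pgm", "page")`. Getting a PDF page into `gray` means rasterizing it first, which this sketch leaves out. The pure-Python loops are slow on a full 300 dpi page, so a NumPy or OpenCV version would be the practical choice.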