What does HackerNews think of tesseract?

Tesseract Open Source OCR Engine (main repository)

Language: C++

> Would love to find a cheaper (local) option vs AWS

How about tesseract (https://github.com/tesseract-ocr/tesseract)?

There’s even a library for PHP (https://github.com/thiagoalessio/tesseract-ocr-for-php), though I haven’t used it. I did use Python's pytesseract, and it works fairly well.

https://www.elastic.co/guide/en/elasticsearch/plugins/curren... :

> [The Elasticsearch Core Ingest Attachment Processor Plugin]: The ingest attachment plugin lets Elasticsearch extract file attachments in common formats (such as PPT, XLS, and PDF) by using the Apache text extraction library Tika.

> The source field must be a base64 encoded binary. If you do not want to incur the overhead of converting back and forth between base64, you can use the CBOR format instead of JSON and specify the field as a bytes array instead of a string representation. The processor will skip the base64 decoding then.
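
A minimal sketch of what the quoted base64 requirement looks like in practice. The field name `data` is an illustrative choice, not something from this thread, and nothing is sent to a server here; the snippet only builds the two JSON bodies (pipeline definition and document) that an `attachment` processor setup would need.

```python
import base64
import json


def attachment_pipeline(field: str = "data") -> dict:
    """Body for an ingest pipeline that runs the attachment processor on `field`."""
    return {
        "description": "Extract attachment information",
        "processors": [{"attachment": {"field": field}}],
    }


def attachment_document(raw: bytes, field: str = "data") -> dict:
    """Document body whose `field` carries the file as base64 text."""
    return {field: base64.b64encode(raw).decode("ascii")}


if __name__ == "__main__":
    # Placeholder bytes; in real use this would be the content of a PDF, PPT, etc.
    doc = attachment_document(b"%PDF-1.4 example bytes")
    print(json.dumps(attachment_pipeline(), indent=2))
    print(json.dumps(doc))
```

The CBOR route mentioned in the quote avoids the encode/decode step entirely, at the cost of needing a CBOR library on the client side.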

Apache Tika supported formats > Images > TesseractOCR: https://tika.apache.org/2.4.0/formats.html https://tika.apache.org/2.4.0/formats.html#Image_formats :

> When extracting from images, it is also possible to chain in Tesseract, via the TesseractOCRParser, to have OCR performed on the contents of the image.

/? Meilisearch "ocr" GitHub;

Looks like e.g. paperbase (agpl) also implements ocr with tesseractocr: https://docs.paperbase.app/

tesseract-ocr/tesseract https://github.com/tesseract-ocr/tesseract

/? https://github.com/awesome-selfhosted/awesome-selfhosted#sea... ctrl-f "ocr"

I've had pretty good outcomes with Tesseract, but it's command line only and it's not even a small fraction as simple and straightforward and plain useful as the built in tools on Mac OS X.

https://github.com/tesseract-ocr/tesseract

> we still don't have an easy to use and free high end OCR available

There is Tesseract: https://github.com/tesseract-ocr/tesseract

> Australia's Trove is getting humans to translate them‽

Your link clearly refers to correcting an existing transcription:

"While viewing digitised newspaper and gazette articles, you may notice that the text transcript doesn’t always match the text in the article. You have the power to fix this by editing the transcript to match the article text."

Most likely they used OCR software to generate the initial transcript, but allow users to correct the OCR output because they know the software is not perfect.

It in fact screams at me to "update" my browser "to the lates [sic] version" of Chrome, Brave, and Opera in an alarming red banner:

Sorry, your Browser does not support the latest features we use. Our Application was tested on the following Browsers: Chrome, Brave and Opera. Please, update your browser to the lates version of one of supported browsers.

I'll just stick to the CLI commands I managed to cobble together for the same purpose. I've even managed to get OCR down with Tesseract[0]

Photo: https://imgur.com/a/wrqUs93 [0]: Tesseract: https://github.com/tesseract-ocr/tesseract

I was building an image search engine[0] a while back and faced the same issues you mentioned with OCR. What I realized is that Tesseract[1] (one of the more popular OCR frameworks) works so long as you are able to provide it data similar to what it was trained on.

We were basically trying to transcribe message screenshots, which should have been relatively straightforward given the homogeneity of the font. But this was not the case, as Tesseract was not trained on the layout of message screenshots. The accuracy of raw Tesseract on our test dataset was somewhere around 0.5-0.6 BLEU.

Once we were able to isolate individual parts of the image and feed it to tesseract, we were able to get around 0.9 BLEU on the same dataset.

TL;DR: Some nifty image processing is required to make Tesseract perform as expected.

[0] (https://www.askgoose.com) [1] (https://github.com/tesseract-ocr/tesseract)
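
The comment does not say how the parts of each screenshot were isolated, so the following is only a sketch of the general idea: assuming some earlier step (hypothetical here) has already produced bounding boxes for the message bubbles, each box is cropped and passed to Tesseract on its own. It relies on the third-party Pillow and pytesseract packages; `clamp_box` and `ocr_regions` are names made up for this example.

```python
def clamp_box(box, width, height):
    """Clip a (left, top, right, bottom) box to the image bounds."""
    left, top, right, bottom = box
    return (max(0, left), max(0, top), min(width, right), min(height, bottom))


def ocr_regions(image_path, boxes, psm=6):
    """OCR each region separately and return the texts in box order."""
    # Third-party packages, imported lazily so clamp_box stays usable
    # without them: Pillow and pytesseract (which needs the tesseract binary).
    from PIL import Image
    import pytesseract

    texts = []
    with Image.open(image_path) as img:
        for box in boxes:
            crop = img.crop(clamp_box(box, img.width, img.height))
            # --psm 6 asks Tesseract to assume a single uniform block of text.
            text = pytesseract.image_to_string(crop, config=f"--psm {psm}")
            texts.append(text.strip())
    return texts
```

Where the boxes come from is the hard part and depends on the layout; for chat screenshots, simple contour or color-based detection may be enough, but that is a guess rather than what the commenter did.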

Why does it have to be Python based? You can call out to other processes or services. Tesseract[1], for example, is pretty easy to work with.

1: https://github.com/tesseract-ocr/tesseract
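
For illustration, calling out to the Tesseract command-line tool from Python might look roughly like this. It assumes the `tesseract` binary is on the PATH; the function names are made up for this sketch, and the default language and page segmentation mode are just Tesseract's usual defaults written out explicitly.

```python
import subprocess


def tesseract_command(image_path, lang="eng", psm=3):
    """Build the CLI call; 'stdout' as the output base prints the text."""
    return ["tesseract", str(image_path), "stdout", "-l", lang, "--psm", str(psm)]


def ocr_file(image_path, lang="eng", psm=3):
    """Run Tesseract as a child process and return the recognized text."""
    result = subprocess.run(
        tesseract_command(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


if __name__ == "__main__":
    import sys

    print(ocr_file(sys.argv[1]))
```

The same pattern works from any language that can spawn a process, which is the point being made above.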

Although https://www.willus.com/k2pdfopt/ is meant for reformatting PDFs to view on e-readers, it does do a reasonable job of extracting text via OCR and storing as a PDF layer. The underlying engine can be either https://github.com/tesseract-ocr/tesseract or http://jocr.sourceforge.net/

I have a ScanSnap scanner too (mine's an S1500 - I have had it for c10 years or so and it still works perfectly) and it's great to be able to search what used to be paper documents quickly and easily. It saves a lot of physical space as well, most documents I scan then shred immediately once I've verified the scan is good and backed up.

There are some reasonably good OCR tools on Linux now as well - I've been pretty happy with Tesseract[0]. It was an absolute pain to script everything to "just work" when I press the button on my scanner though.

Recoll[1] works very well for indexing documents for me, including my OCR'd scans. When that's not enough, I revert to pdfgrep.

0. https://github.com/tesseract-ocr/tesseract 1. https://www.lesbonscomptes.com/recoll/
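
As a rough idea of what such scripting can look like (not the commenter's actual setup): the sketch below turns every PNG in a scan folder into a searchable PDF using Tesseract's `pdf` output mode and skips files that were already converted. The folder layout, the PNG-only filter, and the function names are assumptions made for the example.

```python
import subprocess
from pathlib import Path


def pdf_command(image: Path, out_dir: Path, lang: str = "eng") -> list[str]:
    """`tesseract <image> <outputbase> -l <lang> pdf` writes <outputbase>.pdf."""
    return ["tesseract", str(image), str(out_dir / image.stem), "-l", lang, "pdf"]


def ocr_new_scans(scan_dir: Path, out_dir: Path, lang: str = "eng") -> list[Path]:
    """Convert every scan that has no PDF yet; return the PDFs created."""
    out_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for image in sorted(scan_dir.glob("*.png")):
        target = out_dir / f"{image.stem}.pdf"
        if target.exists():
            continue  # already processed on an earlier run
        subprocess.run(pdf_command(image, out_dir, lang), check=True)
        created.append(target)
    return created
```

Hooking this up to the scanner button is the part that varies by hardware and desktop environment, so it is left out here.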

In case it's not clear, Tesseract has been developed by Google since 2006, having been started at HP in 1985 and open-sourced by HP in 2005. [1]

As far as I know, it powers all OCR at Google (e.g. in Keep, Docs, etc.).

This (Tesseract.js) is a WASM port of the project by a separate group of people.

I investigated using this port a couple years ago, but as you can see from the demo, it's fairly slow to initialize and run, so I never found a practical use for OCR client-side rather than server-side, but I still think it's tremendously cool.

In case anyone's interested (shameless plug), because I do a lot of academic research that involves tons of copying from webpages, PDFs and screenshots and pasting into notes documents, I created a tool at https://pastemagic.com that helps selectively remove rich text formatting, remove line breaks and does OCR on screenshots and camera photos. Setting up Tesseract on my server and creating a simple HTTP endpoint for it took less than an hour, and for free I had OCR as powerful as Google's. Pretty cool I thought.

[1] https://github.com/tesseract-ocr/tesseract
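
To give a sense of how small such an endpoint can be, here is a sketch using only the Python standard library. It is not the code behind the site mentioned above; the port, the localhost-only binding, and the names `ocr_bytes` and `OCRHandler` are arbitrary choices. The POST body is piped to `tesseract stdin stdout` and the recognized text is returned as plain text.

```python
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer


def ocr_bytes(image_bytes, runner=subprocess.run):
    """Pipe image bytes to `tesseract stdin stdout` and return the text."""
    result = runner(
        ["tesseract", "stdin", "stdout"],
        input=image_bytes,
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


class OCRHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes the client announced.
        length = int(self.headers.get("Content-Length", 0))
        image_bytes = self.rfile.read(length)
        try:
            body = ocr_bytes(image_bytes).encode("utf-8")
            status = 200
        except (subprocess.CalledProcessError, OSError) as exc:
            body = f"OCR failed: {exc}".encode("utf-8")
            status = 500
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Port 8000 and the localhost-only binding are arbitrary for this sketch.
    HTTPServer(("127.0.0.1", 8000), OCRHandler).serve_forever()
```

With the server running, something like `curl --data-binary @shot.png http://127.0.0.1:8000/` should return the text. A public deployment would also want upload size limits and authentication, which are omitted here.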

Tesseract[0] is the classic example. There's a bunch of advice for improving your accuracy with it, like making your images larger (literally just scale it up x2 or x4).

It would be interesting to see the benchmark from the article repeated with different scaling options (or other preprocessing, depending on platform).

[0]: https://github.com/tesseract-ocr/tesseract
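
The scaling trick can be sketched as below, assuming the third-party Pillow package is available. The helper names are made up, and Lanczos resampling is just one reasonable filter choice; whether x2 or x4 helps will depend on the source images.

```python
def scaled_size(width, height, factor):
    """Return the pixel dimensions after scaling both sides by `factor`."""
    return (width * factor, height * factor)


def upscale_for_ocr(src_path, dst_path, factor=2):
    """Write an enlarged copy of the image for Tesseract to read."""
    from PIL import Image  # Pillow, third-party; imported lazily

    with Image.open(src_path) as img:
        new_size = scaled_size(img.width, img.height, factor)
        # Lanczos resampling keeps glyph edges fairly smooth when enlarging.
        bigger = img.resize(new_size, Image.LANCZOS)
        bigger.save(dst_path)
    return dst_path
```

Running the same OCR step on the outputs for factors 1, 2, and 4 would be one way to repeat a benchmark with different scaling options.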

I used Tesseract OCR (https://github.com/tesseract-ocr/tesseract).

This is what it returned:

https://gist.github.com/jwilk/4bd58278fe9a6b88af1010616afe2b...

I manually corrected a few recognition errors, fixed the order of sections, unhyphenated a few words and added formatting.

I wonder how hard it would be to run a couple pages of dense print (though in a monospaced and consistent format) through an OCR system.

I might play with Tesseract[1] this weekend and see if this is even a feasible idea. If so, it makes the paper key storage a lot more palatable.

[1]:https://github.com/tesseract-ocr/tesseract

For handwritten character recognition, see:

https://www.tensorflow.org/tutorials/mnist/beginners/ (also google "tensorflow ocr")

http://yann.lecun.com/exdb/mnist/

CROHME: Competition on Recognition of Online Handwritten Mathematical Expressions http://www.isical.ac.in/~crohme/

Closed-sourced API: http://mathpix.com https://photomath.net/en/

Best off-the-shelf OCR (originally developed by HP, now Google):

https://github.com/tesseract-ocr/tesseract

https://github.com/tesseract-ocr/tesseract/wiki

Two Clojure talks...

Machine Learning Live - Mike Anderson https://www.youtube.com/watch?v=QJ1qgCr09j8

Adventures in Understanding Documents - Scott Tuddenham https://www.youtube.com/watch?v=94NjRg8zoCA

Your comment (zoomed in Chrome on Win 10): http://i.imgur.com/uuFhw90.png

Tesseract.js analysis:

    Although Googie's API is certaihiy better,
    Tesseract.js should work simiiarly if you
    increase the font size.
    Screenshots taken
    on 'retiha’ devices are around the smailest
    text it can handie well.
    
    Edit:
    
    A screenshot of the same text at a higher
    resolution:
    httgs:[[imgurxomZaN/UGu
    
    Tesseract.js
    output: httgs://imguricom[a[hiIfM

This is a neat toy, but not impressive compared to the results from tesseract-ocr/tesseract [0]:

    $  curl -s http://i.imgur.com/uuFhw90.png \
        | tesseract stdin stdout

    Although Google's API is certainly better,
    Tesseract.js should work similarly if you
    increase the font size.
    Screenshots taken on 'retina' devices are
    around the smallest text it can handle well.
    
    Edit:
    A screenshot of the same text at a higher
    resolution: https:[ZimguncomlalWHGu
    Tesseract.js output:
    https:[[imgur.com[a[nilfM

Notice how Tesseract.js results suffer from being unable to differentiate between n's and h's, i's and l's.

[0] https://github.com/tesseract-ocr/tesseract
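
One way to make the n/h and i/l observation concrete is to count character substitutions between the two outputs. The sketch below is a deliberately simple comparison: it pairs words by position and only looks at pairs of equal length, so it will miss or misalign anything involving inserted or dropped words. The function name is made up for the example; the sample strings are the first line of each output above.

```python
from collections import Counter


def char_confusions(ocr_text, reference_text):
    """Count (ocr_char, reference_char) mismatches in same-length word pairs."""
    confusions = Counter()
    for got, want in zip(ocr_text.split(), reference_text.split()):
        if len(got) != len(want):
            continue  # insertions/deletions need a real alignment; skipped here
        confusions.update((g, w) for g, w in zip(got, want) if g != w)
    return confusions


if __name__ == "__main__":
    js_output = "Although Googie's API is certaihiy better,"
    cli_output = "Although Google's API is certainly better,"
    print(char_confusions(js_output, cli_output).most_common())
```

On that one line the counts come out as two i-for-l swaps and one h-for-n swap, which matches the pattern described above.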

Is this at all affiliated with the already-existing tesseract OCR library? It doesn't seem to be from my cursory check so if not you need to rename your library, because you're ripping off their name.

https://github.com/tesseract-ocr/tesseract

Tesseract[0] is a system that is broken into different parts: at least one does layout analysis and another does the actual OCR. Output is a different layer again. I believe it is an open source adaptation of what Google used for its books project. The interface was less than polished a few years ago, to the point where getting it running at all was rather difficult. However, for multilingual work (including Chinese) it is probably ideal.[1] Note that if you are scanning books there are now some interesting open hardware systems appearing online that turn pages and take photos with cameras, so you can scan books - without cutting them up - to a high resolution.

[0] https://github.com/tesseract-ocr/tesseract [1] https://github.com/tesseract-ocr/langdata

For scanned images we use https://github.com/tesseract-ocr/tesseract. For text based PDFs we pull the text directly from the file and all languages are supported.

Given that the OCR'ed PDFs use the "GlyphLessFont" font, it seems that tesseract [1] is used.

[1] https://github.com/tesseract-ocr/tesseract
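
A quick way to repeat that check on a file is to look for the font name in the raw PDF bytes. This is only a heuristic sketch with a made-up function name: a hit suggests Tesseract's PDF renderer produced the text layer, but a miss proves nothing, since other tools may rewrite or compress the objects that hold the font name.

```python
from pathlib import Path


def mentions_glyphless_font(pdf_bytes: bytes) -> bool:
    """True if the raw PDF data contains the font name Tesseract embeds."""
    return b"GlyphLessFont" in pdf_bytes


if __name__ == "__main__":
    import sys

    data = Path(sys.argv[1]).read_bytes()
    print(mentions_glyphless_font(data))
```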

> Are there any open source tools that would slurp in content like this ...

Yes, tesseract[1] can do a pretty good job. Here[2] is a blog post which describes using it to perform OCR on PDFs.

As for searching the PDF contents, Solr[3] might be what you are looking for instead.

1 - https://github.com/tesseract-ocr/tesseract

2 - http://fransdejonge.com/2012/04/ocr-text-in-pdf-with-tessera...

3 - http://stackoverflow.com/questions/6694327/indexing-pdf-with...