What does HackerNews think of tesseract?

Tesseract Open Source OCR Engine (main repository)

Language: C++

#8 in Hacktoberfest

#2 in Machine learning

Google Calls In Help From Larry Page and Sergey Brin for A.I. Fight | Jan 2023

> (…) better than Tesseract

Isn’t Tesseract also neural network-based?

https://github.com/tesseract-ocr/tesseract

PDF processing and analysis with open-source tools | Oct 2022

> Would love to find a cheaper (local) option vs AWS

How about tesseract (https://github.com/tesseract-ocr/tesseract)?

There’s even a library for PHP (https://github.com/thiagoalessio/tesseract-ocr-for-php). Haven’t used it. I did use Python’s pytesseract, and it works fairly well.

macOS screenshot tricks to impress your co-workers | Jun 2022

For linux (or GNOME more specifically) there is Frog[1]. It uses Tesseract OCR[2] under the hood.

[1]: https://flathub.org/apps/details/com.github.tenderowl.frog

[2]: https://github.com/tesseract-ocr/tesseract

Show HN: An open source alternative to Evernote (Self Hosted) | Jun 2022

https://www.elastic.co/guide/en/elasticsearch/plugins/curren... :

> [The Elasticsearch Core Ingest Attachment Processor Plugin]: The ingest attachment plugin lets Elasticsearch extract file attachments in common formats (such as PPT, XLS, and PDF) by using the Apache text extraction library Tika.

> The source field must be a base64 encoded binary. If you do not want to incur the overhead of converting back and forth between base64, you can use the CBOR format instead of JSON and specify the field as a bytes array instead of a string representation. The processor will skip the base64 decoding then
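As a rough illustration of the base64 requirement quoted above, the sketch below encodes a file's bytes and wraps them in a JSON document. The field name `data` and the helper name `build_attachment_document` are assumptions chosen for illustration; use whichever field your ingest pipeline is configured to read.

```python
import base64
import json


def build_attachment_document(raw_bytes: bytes, field: str = "data") -> str:
    """Return a JSON document whose `field` holds the bytes as base64 text."""
    # JSON cannot carry raw binary, so the bytes are base64-encoded first.
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return json.dumps({field: encoded})


if __name__ == "__main__":
    sample = b"%PDF-1.4 example bytes"  # stand-in for a real file's contents
    print(build_attachment_document(sample))
```

The resulting string is what would be sent as the document body; the CBOR alternative mentioned in the quote avoids this encoding step and is not shown here.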

Apache Tika supported formats > Images > TesseractOCR: https://tika.apache.org/2.4.0/formats.html https://tika.apache.org/2.4.0/formats.html#Image_formats :

> When extracting from images, it is also possible to chain in Tesseract, via the TesseractOCRParser, to have OCR performed on the contents of the image.

/? Meilisearch "ocr" GitHub;

Looks like e.g. paperbase (agpl) also implements ocr with tesseractocr: https://docs.paperbase.app/

tesseract-ocr/tesseract https://github.com/tesseract-ocr/tesseract

/? https://github.com/awesome-selfhosted/awesome-selfhosted#sea... ctrl-f "ocr"

I returned my Remarkable2 | Nov 2021

I've had pretty good outcomes with Tesseract, but it's command line only and it's not even a small fraction as simple and straightforward and plain useful as the built in tools on Mac OS X.

https://github.com/tesseract-ocr/tesseract

Georgian African American newspapers from 1886-1926 now available freely online | Mar 2021

> we still don't have an easy to use and free high end OCR available

There is Tesseract: https://github.com/tesseract-ocr/tesseract

> Australia's Trove is getting humans to translate them‽

Your link clearly refers to correcting an existing transcription:

"While viewing digitised newspaper and gazette articles, you may notice that the text transcript doesn’t always match the text in the article. You have the power to fix this by editing the transcript to match the article text."

Most likely they used OCR software to generate the initial transcript, but allow users to correct the OCR output because they know the software is not perfect.

Rga: Ripgrep, but also search in PDFs, E-Books, Office documents, zip, tar.gz | Dec 2020

https://github.com/tesseract-ocr/tesseract seems to be written in C++, not Python.

Free PDF merging tool that works without any file upload to ensure privacy | Oct 2020

It in fact screams at me to "update" my browser "to the lates [sic] version" of Chrome, Brave, and Opera in an alarming red banner:

Sorry, your Browser does not support the latest features we use. Our Application was tested on the following Browsers: Chrome, Brave and Opera. Please, update your browser to the lates version of one of supported browsers.

I'll just stick to the CLI commands I managed to cobble together for the same purpose. I've even managed to get OCR down with Tesseract[0]

Photo: https://imgur.com/a/wrqUs93 [0]: Tesseract: https://github.com/tesseract-ocr/tesseract

EasyOCR: Ready-to-use OCR with 40 languages | Jul 2020

I was building an image search engine[0] a while back and faced the same issues you mentioned with OCR. What I realized is that tesseract[1] (one of the more popular OCR frameworks) works as long as you can provide it with data similar to what it was trained on.

We were basically trying to transcribe message screenshots, which should have been relatively straightforward given the homogeneity of the font. But this was not the case, as tesseract was not trained on the layout of message screenshots. The accuracy of raw tesseract on our test dataset was somewhere around 0.5-0.6 BLEU.

Once we were able to isolate individual parts of the image and feed it to tesseract, we were able to get around 0.9 BLEU on the same dataset.

TL;DR: Some nifty image processing is required to make tesseract perform as expected.

[0] (https://www.askgoose.com) [1] (https://github.com/tesseract-ocr/tesseract)
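The comment does not say how the individual parts of the image were isolated. One simple possibility, assumed here purely for illustration, is to split a screenshot into horizontal bands separated by background-only rows and then run OCR on each band separately. The sketch works on a grayscale image given as a list of pixel rows; the function name and the white background value of 255 are assumptions.

```python
from typing import List, Tuple


def find_content_bands(
    rows: List[List[int]], background: int = 255
) -> List[Tuple[int, int]]:
    """Return (start, end) row ranges, end exclusive, that contain ink."""
    bands = []
    start = None  # first row of the band currently being scanned, if any
    for y, row in enumerate(rows):
        has_ink = any(pixel != background for pixel in row)
        if has_ink and start is None:
            start = y  # a new band begins on this row
        elif not has_ink and start is not None:
            bands.append((start, y))  # a blank row closes the band
            start = None
    if start is not None:
        bands.append((start, len(rows)))  # band runs to the bottom edge
    return bands
```

Each returned range could then be cropped out and passed to tesseract on its own, which is closer to the single-block input the commenter describes it handling well.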

EasyOCR: Ready-to-use OCR with 40 languages | Jul 2020

Why does it have to be Python based? You can call out to other processes or services. Tesseract[1], for example, is pretty easy to work with.

1: https://github.com/tesseract-ocr/tesseract
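A minimal sketch of calling out to the tesseract process from Python, assuming the `tesseract` binary is installed and on the PATH. The helper names are hypothetical; the command uses `stdout` as the output base so the recognized text is printed rather than written to a file.

```python
import subprocess
from typing import List


def build_tesseract_command(image_path: str, lang: str = "eng") -> List[str]:
    """Build an argument list that makes tesseract print text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path: str, lang: str = "eng") -> str:
    """Run the tesseract binary and return its text output."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

The same argument list would work from any language that can spawn a process, which is the point the comment is making.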

Ask HN: OCR framework for extracting formatted text | May 2020

Although https://www.willus.com/k2pdfopt/ is meant for reformatting PDFs to view on e-readers, it does do a reasonable job of extracting text via OCR and storing as a PDF layer. The underlying engine can be either https://github.com/tesseract-ocr/tesseract or http://jocr.sourceforge.net/

Building personal search infrastructure for your knowledge and code | Jan 2020

I have a ScanSnap scanner too (mine's an S1500 - I have had it for c. 10 years or so and it still works perfectly) and it's great to be able to search what used to be paper documents quickly and easily. It saves a lot of physical space as well; most documents I scan then shred immediately once I've verified the scan is good and backed up.

There are some reasonably good OCR tools on Linux now as well - I've been pretty happy with Tesseract[0]. It was an absolute pain to script everything to "just work" when I press the button on my scanner though.

Recoll[1] works very well for indexing documents for me including my OCRd scans. When that's not enough, I revert to pdfgrep.

0. https://github.com/tesseract-ocr/tesseract 1. https://www.lesbonscomptes.com/recoll/

Tesseract.js: Pure JavaScript OCR for 100 Languages | Dec 2019

In case it's not clear, Tesseract has been developed by Google since 2006, having been started at HP in 1985 and open-sourced by HP in 2005. [1]

As far as I know, it powers all OCR at Google (e.g. in Keep, Docs, etc.).

This (Tesseract.js) is a WASM port of the project by a separate group of people.

I investigated using this port a couple years ago, but as you can see from the demo, it's fairly slow to initialize and run, so I never found a practical use for OCR client-side rather than server-side, but I still think it's tremendously cool.

In case anyone's interested (shameless plug), because I do a lot of academic research that involves tons of copying from webpages, PDF's and screenshots and pasting into notes documents, I created a tool at https://pastemagic.com that helps selectively remove rich text formatting, remove line breaks and does OCR on screenshots and camera photos. Setting up Tesseract on my server and creating a simple HTTP endpoint for it took less than an hour, and for free I had OCR as powerful as Google's. Pretty cool I thought.

[1] https://github.com/tesseract-ocr/tesseract
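The comment does not show its endpoint, so the following is only a sketch of what a simple HTTP wrapper around Tesseract might look like, using Python's standard library. It assumes the `tesseract` binary is on the PATH and that the request body is the raw image; the names `ocr_bytes`, `make_handler`, and `serve`, and the port, are hypothetical.

```python
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer


def ocr_bytes(image_bytes: bytes) -> str:
    """Pipe image bytes through `tesseract stdin stdout` and return the text."""
    result = subprocess.run(
        ["tesseract", "stdin", "stdout"],
        input=image_bytes,
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def make_handler(ocr=ocr_bytes):
    """Create a request handler class that answers POSTs with OCR text."""

    class OCRHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length)
            try:
                text, status = ocr(body), 200
            except (subprocess.CalledProcessError, FileNotFoundError) as exc:
                # tesseract failed or is not installed
                text, status = f"OCR failed: {exc}", 500
            payload = text.encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return OCRHandler


def serve(port: int = 8000) -> None:
    """Serve OCR requests on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), make_handler()).serve_forever()
```

Calling `serve()` starts the server; a real deployment would also want request size limits and authentication, which are left out here.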

2019 examples to compare OCR services: Amazon vs. Google vs. Microsoft | Jul 2019

Tesseract[0] is the classic example. There's a bunch of advice for improving your accuracy with it, like making your images larger (literally just scale it up x2 or x4).

It would be interesting to see the benchmark from the article repeated with different scaling options (or other preprocessing, depending on platform).

[0]: https://github.com/tesseract-ocr/tesseract
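The scaling advice can be sketched as follows, assuming the Pillow imaging library is installed. The function names and the choice of Lanczos resampling are illustrative assumptions, not something the comment specifies.

```python
from typing import Tuple


def scaled_size(size: Tuple[int, int], factor: int) -> Tuple[int, int]:
    """Return (width, height) multiplied by an integer scale factor."""
    width, height = size
    return (width * factor, height * factor)


def upscale_image(src_path: str, dst_path: str, factor: int = 2) -> None:
    """Write an enlarged copy of the image for the OCR engine to read."""
    from PIL import Image  # assumption: Pillow is available

    with Image.open(src_path) as img:
        enlarged = img.resize(
            scaled_size(img.size, factor), resample=Image.LANCZOS
        )
        enlarged.save(dst_path)
```

Running OCR on the output of `upscale_image("page.png", "page_x2.png", 2)` and again with a factor of 4 would give the kind of comparison the comment suggests.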

Coverity Scan Update | Jan 2019

Systemd and tesseract-ocr both use it for example:

https://github.com/tesseract-ocr/tesseract https://github.com/systemd/systemd

systemd have also written their own QL query: https://github.com/systemd/systemd/blob/master/.lgtm/cpp-que... https://lgtm.com/projects/g/systemd/systemd/alerts/?mode=tre...

(full disclosure, I also work at Semmle)

Fine Print: Unusual legal footnotes | Apr 2018

I used Tesseract OCR (https://github.com/tesseract-ocr/tesseract).

This is what it returned:

https://gist.github.com/jwilk/4bd58278fe9a6b88af1010616afe2b...

I manually corrected a few recognition errors, fixed the order of sections, unhyphenated a few words and added formatting.

Strategies for offline PGP key storage | Oct 2017

I wonder how hard it would be to run a couple pages of dense print (though in a monospaced and consistent format) through an OCR system.

I might play with Tesseract[1] this weekend and see if this is even a feasible idea. If so, it makes the paper key storage a lot more palatable.

[1]:https://github.com/tesseract-ocr/tesseract

Ask HN: Does software exist to digitize scanned books and articles? | Jan 2017

For handwritten character recognition, see:

https://www.tensorflow.org/tutorials/mnist/beginners/ (also google "tensorflow ocr")

http://yann.lecun.com/exdb/mnist/

CROHME: Competition on Recognition of Online Handwritten Mathematical Expressions http://www.isical.ac.in/~crohme/

Closed-sourced API: http://mathpix.com https://photomath.net/en/

Best off-the-shelf OCR (originally developed by HP, now Google):

https://github.com/tesseract-ocr/tesseract

https://github.com/tesseract-ocr/tesseract/wiki

Two Clojure talks...

Machine Learning Live - Mike Anderson https://www.youtube.com/watch?v=QJ1qgCr09j8

Adventures in Understanding Documents - Scott Tuddenham https://www.youtube.com/watch?v=94NjRg8zoCA

Show HN: Tesseract.js – Pure JavaScript OCR for 60 Languages | Oct 2016

Your comment (zoomed in Chrome on Win 10): http://i.imgur.com/uuFhw90.png

Tesseract.js analysis:

    Although Googie's API is certaihiy better,
    Tesseract.js should work simiiarly if you
    increase the font size.
    Screenshots taken
    on 'retiha’ devices are around the smailest
    text it can handie well.
    
    Edit:
    
    A screenshot of the same text at a higher
    resolution:
    httgs:[[imgurxomZaN/UGu
    
    Tesseract.js
    output: httgs://imguricom[a[hiIfM

This is a neat toy, but not impressive compared to the results from tesseract-ocr/tesseract [0]:

    $  curl -s http://i.imgur.com/uuFhw90.png \
        | tesseract stdin stdout

    Although Google's API is certainly better,
    Tesseract.js should work similarly if you
    increase the font size.
    Screenshots taken on 'retina' devices are
    around the smallest text it can handle well.
    
    Edit:
    A screenshot of the same text at a higher
    resolution: https:[ZimguncomlalWHGu
    Tesseract.js output:
    https:[[imgur.com[a[nilfM

Notice how Tesseract.js results suffer from being unable to differentiate between n's and h's, i's and l's.

[0] https://github.com/tesseract-ocr/tesseract
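To make the n/h and i/l observation concrete, here is a small sketch that counts character substitutions between a handful of word pairs taken from the two outputs quoted above. The helper name is hypothetical, and the comparison is a plain position-by-position check that only works for words of equal length.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_confusions(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count (recognized, expected) character substitutions per word pair."""
    confusions = Counter()
    for recognized, expected in pairs:
        if len(recognized) != len(expected):
            continue  # position-by-position comparison needs equal lengths
        for got, want in zip(recognized, expected):
            if got != want:
                confusions[(got, want)] += 1
    return confusions


# Word pairs from the quoted outputs: (Tesseract.js, tesseract).
PAIRS = [
    ("Googie's", "Google's"),
    ("certaihiy", "certainly"),
    ("simiiarly", "similarly"),
    ("retiha", "retina"),
    ("smailest", "smallest"),
    ("handie", "handle"),
]

if __name__ == "__main__":
    for (got, want), n in count_confusions(PAIRS).most_common():
        print(f"{got!r} read instead of {want!r}: {n}")
```

For these six pairs the only substitutions are `i` for `l` and `h` for `n`, matching the pattern the comment describes.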

Show HN: Tesseract.js – Pure JavaScript OCR for 60 Languages | Oct 2016

Is this at all affiliated with the already-existing tesseract OCR library? It doesn't seem to be, from my cursory check, so if not, you need to rename your library, because you're ripping off their name.

https://github.com/tesseract-ocr/tesseract

Ask HN: What is the best open source OCR software supporting multiple languages? | Sep 2016

Tesseract[0] is a system that is broken into different parts: at least one does layout analysis and another does the actual OCR. Output is a different layer again. I believe it is an open source adaptation of what Google used for its books project. The interface was less than polished a few years ago, to the point where getting it running at all was rather difficult. However, for multilingual work (including Chinese) it is probably ideal.[1] Note that if you are scanning books there are now some interesting open hardware systems appearing online that turn pages and take photos with cameras, so you can scan books - without cutting them up - to a high resolution.

[0] https://github.com/tesseract-ocr/tesseract [1] https://github.com/tesseract-ocr/langdata

Show HN: Convert PDF files into structured data | Jul 2016

For scanned images we use https://github.com/tesseract-ocr/tesseract. For text based PDFs we pull the text directly from the file and all languages are supported.

Show HN: Convert scanned documents into searchable PDFs | Dec 2015

Given that the OCR'ed PDFs use the "GlyphLessFont" font, it seems that tesseract [1] is used.

[1] https://github.com/tesseract-ocr/tesseract
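The font-name heuristic from this comment can be sketched as a raw byte search. This is an assumption-laden check: it only finds the name when the font dictionary is stored uncompressed in the file, and a match is a hint rather than proof that tesseract produced the PDF. The function names are hypothetical.

```python
def mentions_glyphless_font(pdf_bytes: bytes) -> bool:
    """Return True if the raw PDF bytes contain the name 'GlyphLessFont'."""
    return b"GlyphLessFont" in pdf_bytes


def file_mentions_glyphless_font(path: str) -> bool:
    """Read a PDF from disk and apply the same raw byte search."""
    with open(path, "rb") as handle:
        return mentions_glyphless_font(handle.read())
```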

Hosted Microsoft OCR library: Free OCR API web service | Oct 2015

Tesseract:

https://github.com/tesseract-ocr/tesseract

http://neilshroff.com/tesseract-ocr/doc/tesseracticdar2007.p...

https://ryanfb.github.io/etc/2014/11/13/command_line_ocr_on_...

Code to transform Hillary's emails from raw PDF documents to a SQLite database | Sep 2015
