What does HackerNews think of tesseract?
Tesseract Open Source OCR Engine (main repository)
Isn’t Tesseract also neural network-based?
How about tesseract (https://github.com/tesseract-ocr/tesseract)
There’s even a library for PHP (https://github.com/thiagoalessio/tesseract-ocr-for-php), though I haven’t used it. I did use Python’s Pytesseract, and it works fairly well.
[1]: https://flathub.org/apps/details/com.github.tenderowl.frog
> [The ElasticSearch Core Ingest Attachment Processor Plugin]: The ingest attachment plugin lets Elasticsearch extract file attachments in common formats (such as PPT, XLS, and PDF) by using the Apache text extraction library Tika.
> The source field must be a base64 encoded binary. If you do not want to incur the overhead of converting back and forth between base64, you can use the CBOR format instead of JSON and specify the field as a bytes array instead of a string representation. The processor will skip the base64 decoding then.
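To make the base64 requirement concrete, here is a minimal sketch of the JSON route the quote describes. The field name `data` and the pipeline description are illustrative assumptions, and actually sending these bodies to an Elasticsearch cluster is left out.

```python
import base64
import json


def build_attachment_pipeline(field: str = "data") -> dict:
    """Return an ingest pipeline body with a single attachment processor.

    The field name is an assumption; it must match the document field below.
    """
    return {
        "description": "Extract attachment text (illustrative)",
        "processors": [{"attachment": {"field": field}}],
    }


def build_document(file_bytes: bytes, field: str = "data") -> str:
    """Return a JSON document whose `field` holds the base64-encoded file."""
    encoded = base64.b64encode(file_bytes).decode("ascii")
    return json.dumps({field: encoded})


if __name__ == "__main__":
    print(json.dumps(build_attachment_pipeline(), indent=2))
    print(build_document(b"%PDF-1.4 example bytes"))
```

With CBOR the same field would carry the raw bytes instead, which is the overhead the quote says can be avoided.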
Apache Tika supported formats > Images > TesseractOCR: https://tika.apache.org/2.4.0/formats.html https://tika.apache.org/2.4.0/formats.html#Image_formats :
> When extracting from images, it is also possible to chain in Tesseract, via the TesseractOCRParser, to have OCR performed on the contents of the image.
/? Meilisearch "ocr" GitHub;
Looks like e.g. paperbase (agpl) also implements ocr with tesseractocr: https://docs.paperbase.app/
tesseract-ocr/tesseract https://github.com/tesseract-ocr/tesseract
/? https://github.com/awesome-selfhosted/awesome-selfhosted#sea... ctrl-f "ocr"
There is Tesseract: https://github.com/tesseract-ocr/tesseract
> Australia's Trove is getting humans to translate them‽
Your link clearly refers to correcting an existing transcription:
"While viewing digitised newspaper and gazette articles, you may notice that the text transcript doesn’t always match the text in the article. You have the power to fix this by editing the transcript to match the article text."
Most likely they used OCR software to generate the initial transcript, but allow users to correct the OCR output because they know the software is not perfect.
Sorry, your Browser does not support the latest features we use. Our Application was tested on the following Browsers: Chrome, Brave and Opera. Please, update your browser to the lates version of one of supported browsers.
I'll just stick to the CLI commands I managed to cobble together for the same purpose. I've even managed to get OCR down with Tesseract[0]
Photo: https://imgur.com/a/wrqUs93 [0]: Tesseract: https://github.com/tesseract-ocr/tesseract
We were basically trying to transcribe message screenshots, which should have been relatively straightforward given the homogeneity of the font. But this was not the case, as tesseract was not trained on the layout of message screenshots. The accuracy of raw tesseract on our test dataset was somewhere around 0.5-0.6 BLEU.
Once we were able to isolate individual parts of the image and feed it to tesseract, we were able to get around 0.9 BLEU on the same dataset.
TL;DR: Some nifty image processing is required to make tesseract perform as expected.
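The comment does not say how the individual parts were isolated. One simple way to do it, shown here purely as a hypothetical sketch, is to split a grayscale image into horizontal bands wherever there are fully blank rows, and then OCR each band separately. The image representation (a list of pixel rows) and the function names are assumptions made for illustration.

```python
from typing import List, Tuple

# Grayscale image as rows of pixel values: 0 = black ink, 255 = white background.
Image = List[List[int]]


def split_into_bands(image: Image, ink_threshold: int = 128) -> List[Tuple[int, int]]:
    """Return (start_row, end_row_exclusive) for each run of rows containing ink."""
    bands = []
    start = None
    for y, row in enumerate(image):
        has_ink = any(pixel < ink_threshold for pixel in row)
        if has_ink and start is None:
            start = y  # a new band begins at the first row with ink
        elif not has_ink and start is not None:
            bands.append((start, y))  # a blank row closes the current band
            start = None
    if start is not None:
        bands.append((start, len(image)))  # band runs to the bottom edge
    return bands


def crop_rows(image: Image, band: Tuple[int, int]) -> Image:
    """Return only the rows belonging to one band."""
    start, end = band
    return image[start:end]
```

Each cropped band could then be written out as its own image and passed to tesseract, so the engine never has to guess at the overall screenshot layout.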
[0] (https://www.askgoose.com) [1] (https://github.com/tesseract-ocr/tesseract)
There are some reasonably good OCR tools on Linux now as well - I've been pretty happy with Tesseract[0]. It was an absolute pain to script everything to "just work" when I press the button on my scanner though.
Recoll[1] works very well for indexing documents for me including my OCRd scans. When that's not enough, I revert to pdfgrep.
0. https://github.com/tesseract-ocr/tesseract 1. https://www.lesbonscomptes.com/recoll/
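For the "press the scanner button and get a searchable document" workflow, the OCR step itself can be a thin wrapper around the tesseract command line. This is only a sketch: the function names and the idea of naming the PDF after the scanned image are assumptions, and hooking it up to an actual scanner button is not shown.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_ocr_command(image: Path, output_base: Path, lang: str = "eng") -> List[str]:
    """Build a tesseract call that writes <output_base>.pdf with a text layer."""
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]


def ocr_scan(image: Path, output_dir: Path, lang: str = "eng") -> Path:
    """Run tesseract on one scanned image and return the path of the PDF."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    output_base = output_dir / image.stem
    subprocess.run(build_ocr_command(image, output_base, lang), check=True)
    # tesseract appends ".pdf" to the output base itself.
    return output_base.parent / (output_base.name + ".pdf")
```

The resulting PDFs carry a text layer, so an indexer or pdfgrep can search them like any other document.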
As far as I know, it powers all OCR at Google (e.g. in Keep, Docs, etc.).
This (Tesseract.js) is a WASM port of the project by a separate group of people.
I investigated using this port a couple years ago, but as you can see from the demo, it's fairly slow to initialize and run, so I never found a practical use for OCR client-side rather than server-side, but I still think it's tremendously cool.
In case anyone's interested (shameless plug): because I do a lot of academic research that involves tons of copying from webpages, PDFs and screenshots and pasting into notes documents, I created a tool at https://pastemagic.com that helps selectively remove rich text formatting, removes line breaks and does OCR on screenshots and camera photos. Setting up Tesseract on my server and creating a simple HTTP endpoint for it took less than an hour, and for free I had OCR as powerful as Google's. Pretty cool, I thought.
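A simple HTTP endpoint of this kind might look roughly like the following. This is not the author's implementation, just a minimal sketch using Python's standard library: it accepts an image in a POST body, pipes it to `tesseract stdin stdout`, and returns the text. The handler name, port, and error handling are assumptions.

```python
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer


def ocr_bytes(image_bytes: bytes, run=subprocess.run) -> str:
    """Pipe image bytes to `tesseract stdin stdout` and return the text."""
    result = run(
        ["tesseract", "stdin", "stdout"],
        input=image_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


class OcrHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes the client announced.
        length = int(self.headers.get("Content-Length", 0))
        image_bytes = self.rfile.read(length)
        try:
            text, status = ocr_bytes(image_bytes), 200
        except (subprocess.CalledProcessError, FileNotFoundError) as exc:
            text, status = f"OCR failed: {exc}", 500
        body = text.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host: str = "127.0.0.1", port: int = 8080) -> None:
    """Start the endpoint; blocks until interrupted."""
    HTTPServer((host, port), OcrHandler).serve_forever()
```

Calling `serve()` and then posting an image with something like `curl --data-binary @shot.png http://127.0.0.1:8080/` would return the recognized text. A public deployment would also need request size limits and authentication, which are omitted here.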
It would be interesting to see the benchmark from the article repeated with different scaling options (or other preprocessing, depending on platform).
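As a rough illustration of the scaling variable such a benchmark would sweep, here is a nearest-neighbour upscale written in plain Python. Real pipelines would normally use an imaging library with better interpolation; the list-of-rows image format and the function name are assumptions for the sketch.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale pixel values


def upscale_nearest(image: Image, factor: int) -> Image:
    """Enlarge an image by an integer factor by repeating each pixel."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    scaled = []
    for row in image:
        # Repeat every pixel horizontally, then repeat the widened row vertically.
        wide = [pixel for pixel in row for _ in range(factor)]
        scaled.extend(list(wide) for _ in range(factor))
    return scaled
```

Running the same OCR engine on the outputs for, say, factors 1, 2 and 3 would show how much of the accuracy gap comes from text size alone.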
https://github.com/tesseract-ocr/tesseract https://github.com/systemd/systemd
systemd have also written their own QL query: https://github.com/systemd/systemd/blob/master/.lgtm/cpp-que... https://lgtm.com/projects/g/systemd/systemd/alerts/?mode=tre...
(full disclosure, I also work at Semmle)
This is what it returned:
https://gist.github.com/jwilk/4bd58278fe9a6b88af1010616afe2b...
I manually corrected a few recognition errors, fixed the order of sections, unhyphenated a few words and added formatting.
I might play with Tesseract[1] this weekend and see if this is even a feasible idea. If so, it makes the paper key storage a lot more palatable.
https://www.tensorflow.org/tutorials/mnist/beginners/ (also google "tensorflow ocr")
http://yann.lecun.com/exdb/mnist/
CROHME: Competition on Recognition of Online Handwritten Mathematical Expressions http://www.isical.ac.in/~crohme/
Closed-sourced API: http://mathpix.com https://photomath.net/en/
Best off-the-shelf OCR (originally developed by HP, now Google):
https://github.com/tesseract-ocr/tesseract
https://github.com/tesseract-ocr/tesseract/wiki
Two Clojure talks...
Machine Learning Live - Mike Anderson https://www.youtube.com/watch?v=QJ1qgCr09j8
Adventures in Understanding Documents - Scott Tuddenham https://www.youtube.com/watch?v=94NjRg8zoCA
Tesseract.js analysis:
Although Googie's API is certaihiy better,
Tesseract.js should work simiiarly if you
increase the font size.
Screenshots taken
on 'retiha’ devices are around the smailest
text it can handie well.
Edit:
A screenshot of the same text at a higher
resolution:
httgs:[[imgurxomZaN/UGu
Tesseract.js
output: httgs://imguricom[a[hiIfM
This is a neat toy, but not impressive compared to the results from tesseract-ocr/tesseract [0]:
$ curl -s http://i.imgur.com/uuFhw90.png \
  | tesseract stdin stdout
Although Google's API is certainly better,
Tesseract.js should work similarly if you
increase the font size.
Screenshots taken on 'retina' devices are
around the smallest text it can handle well.
Edit:
A screenshot of the same text at a higher
resolution: https:[ZimguncomlalWHGu
Tesseract.js output:
https:[[imgur.com[a[nilfM
Notice how the Tesseract.js results suffer from being unable to differentiate between n's and h's, and between i's and l's.
[0] https://github.com/tesseract-ocr/tesseract
[1] https://github.com/tesseract-ocr/langdata
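The difference between the two outputs can also be quantified rather than eyeballed. A minimal sketch using Python's standard `difflib`, with the first line of each output above as sample input (the function names are illustrative):

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def char_similarity(expected: str, actual: str) -> float:
    """Return a 0..1 character-level similarity ratio between two strings."""
    return SequenceMatcher(None, expected, actual).ratio()


def differing_chars(expected: str, actual: str) -> List[Tuple[str, str]]:
    """Return (expected_fragment, actual_fragment) pairs where the texts differ."""
    matcher = SequenceMatcher(None, expected, actual)
    return [
        (expected[i1:i2], actual[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    reference = "Although Google's API is certainly better,"
    tesseract_js = "Although Googie's API is certaihiy better,"
    print(char_similarity(reference, tesseract_js))
    print(differing_chars(reference, tesseract_js))
```

Collecting the differing fragments over a larger sample would show which character confusions dominate for a given engine.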
Yes, tesseract[1] can do a pretty good job. Here[2] is a blog post which describes using it to perform OCR on PDFs.
As for searching the PDF contents, Solr[3] might be what you are looking for instead.
1 - https://github.com/tesseract-ocr/tesseract
2 - http://fransdejonge.com/2012/04/ocr-text-in-pdf-with-tessera...
3 - http://stackoverflow.com/questions/6694327/indexing-pdf-with...