(1) For note taking, I stumbled across anno[1] via[2] two weeks ago. It's a Python Flask application that you run on localhost. You write Markdown, which is stored locally as a file and rendered as HTML using pandoc[3]. It's really basic, but I love it.

(2) For physical documents, I use a Fujitsu ScanSnap iX500[4] for scanning. A runtime license of ABBYY FineReader for OCR is included. The resulting PDF has embedded text, which I extract using pdftotext[5]. I wrote a Python application to search and tag these documents. It loads all the text into memory, which is perfectly fine as I have < 10,000 documents. I have used it for 5 years and it works OK.
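For anyone curious what such an in-memory search-and-tag tool might look like, here is a minimal sketch. It is not the actual application: the `DocumentStore` class, its method names, and the sample file names are all made up for illustration. The only external piece is the `pdftotext file.pdf -` invocation, which writes the extracted text to stdout.

```python
import subprocess
from pathlib import Path


def extract_text(pdf_path):
    """Return the embedded text of a PDF by running `pdftotext <file> -`."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


class DocumentStore:
    """Hypothetical in-memory store mapping a document name to its text and tags."""

    def __init__(self):
        self.texts = {}  # name -> lowercased full text
        self.tags = {}   # name -> set of tag labels

    def add(self, name, text):
        self.texts[name] = text.lower()
        self.tags.setdefault(name, set())

    def tag(self, name, *labels):
        self.tags[name].update(labels)

    def search(self, query, tag=None):
        """Sorted names of documents containing every query word (substring match)."""
        words = query.lower().split()
        return sorted(
            name for name, text in self.texts.items()
            if all(word in text for word in words)
            and (tag is None or tag in self.tags[name])
        )

    def load_directory(self, directory):
        """Add every *.pdf in a directory (requires pdftotext on the PATH)."""
        for pdf in sorted(Path(directory).glob("*.pdf")):
            self.add(pdf.name, extract_text(pdf))


# Small demo with hand-written text, so it runs without any PDFs.
store = DocumentStore()
store.add("invoice-2019.pdf", "Invoice for document scanner, total 399 EUR")
store.add("insurance-2020.pdf", "Insurance policy renewal, invoice attached")
store.tag("invoice-2019.pdf", "tax")

all_invoices = store.search("invoice")
tax_invoices = store.search("invoice", tag="tax")
```

A linear scan over every document on each query is the simplest possible design, and with fewer than 10,000 documents it is likely fast enough that an index is not needed.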

[1] https://github.com/gwgundersen/anno

[2] https://news.ycombinator.com/item?id=22033792

[3] https://pandoc.org/

[4] https://www.fujitsu.com/global/products/computing/peripheral...

[5] https://en.wikipedia.org/wiki/Pdftotext

I have a ScanSnap scanner too (mine's an S1500; I've had it for about 10 years and it still works perfectly), and it's great to be able to search what used to be paper documents quickly and easily. It saves a lot of physical space as well: I shred most documents immediately after scanning, once I've verified the scan is good and backed up.

There are some reasonably good OCR tools on Linux now as well - I've been pretty happy with Tesseract[0]. It was an absolute pain to script everything to "just work" when I press the button on my scanner though.
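As a rough idea of the OCR step in such a script, the sketch below turns one scanned image into a searchable PDF using Tesseract's `pdf` output config. The function names and the `page1.tif` file name are hypothetical, and hooking this up to the scanner button (the painful part) is not shown.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, lang="eng"):
    """Build the argument list for `tesseract <image> <outputbase> -l <lang> pdf`."""
    image = Path(image_path)
    output_base = image.with_suffix("")  # Tesseract appends ".pdf" itself
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]


def ocr_to_pdf(image_path, lang="eng"):
    """Run Tesseract on one image and return the path of the resulting PDF."""
    subprocess.run(tesseract_command(image_path, lang), check=True)
    return Path(image_path).with_suffix(".pdf")


# Building the command does not require Tesseract to be installed.
command = tesseract_command("page1.tif")
```

Calling `ocr_to_pdf("page1.tif")` would then write `page1.pdf` next to the image, assuming Tesseract and the English language data are installed.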

Recoll[1] works very well for me for indexing documents, including my OCR'd scans. When that's not enough, I fall back to pdfgrep.

[0] https://github.com/tesseract-ocr/tesseract

[1] https://www.lesbonscomptes.com/recoll/