What does HackerNews think of sonic?
🦔 Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.
- Tries (patricia, radix, etc...)
- Trees (b-trees, b+trees, merkle trees, log-structured merge-tree, etc..)
- Consensus (raft, paxos, etc..)
- Block storage (disk block size optimizations, mmap files, delta storage, etc..)
- Probabilistic filters & sketches (HyperLogLog, Bloom filters, etc...)
- Binary Search (sstables, sorted inverted indexes, roaring bitmaps)
- Ranking (pagerank, tf/idf, bm25, etc...)
- NLP (stemming, POS tagging, subject identification, sentiment analysis etc...)
- HTML (document parsing/lexing)
- Images (exif extraction, removal, resizing / proxying, etc...)
- Queues (SQS, NATS, Apollo, etc...)
- Clustering (k-means, density, hierarchical, gaussian distributions, etc...)
- Rate limiting (leaky bucket, windowed, etc...)
- Compression
- Applied linear algebra
- Text processing (unicode-normalization, slugify, sanitation, lossless and lossy hashing like metaphone and document fingerprinting)
- etc...
I'm sure there's plenty more I've missed. There are also lots of generic structures involved, like hashes, linked lists, skip lists, heaps, and priority queues — and this is just to get to 2000s-level basic tech.
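To make one of those bullets concrete, here's a minimal Bloom filter in Rust (std-only, purely illustrative — the table size, hash seeds, and double-hashing scheme are my own choices, not anything a particular engine uses):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct BloomFilter {
    bits: Vec<bool>,
    k: u64, // number of hash functions
}

impl BloomFilter {
    fn new(m: usize, k: u64) -> Self {
        BloomFilter { bits: vec![false; m], k }
    }

    // Derive k bit positions as h1 + i*h2 (the Kirsch–Mitzenmacher
    // double-hashing trick), so we only compute two base hashes.
    fn indexes(&self, item: &str) -> Vec<usize> {
        let mut h = DefaultHasher::new();
        item.hash(&mut h);
        let h1 = h.finish();
        42u64.hash(&mut h);
        let h2 = h.finish() | 1; // odd, so it cycles the whole table
        (0..self.k)
            .map(|i| (h1.wrapping_add(i.wrapping_mul(h2)) as usize) % self.bits.len())
            .collect()
    }

    fn insert(&mut self, item: &str) {
        for i in self.indexes(item) {
            self.bits[i] = true;
        }
    }

    // May return false positives, never false negatives.
    fn maybe_contains(&self, item: &str) -> bool {
        self.indexes(item).iter().all(|&i| self.bits[i])
    }
}

fn main() {
    let mut bf = BloomFilter::new(1024, 3);
    bf.insert("sonic");
    bf.insert("tantivy");
    assert!(bf.maybe_contains("sonic"));
    assert!(bf.maybe_contains("tantivy"));
}
```

In a real index you'd use a packed bitset and a tuned m/k for your target false-positive rate, but the "cheap membership test before touching disk" idea is the same.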
If you are comfortable with Go or Rust you should look at the latest projects in this space:
- https://github.com/quickwit-oss/tantivy
- https://github.com/valeriansaliou/sonic
- https://github.com/mosuka/phalanx
- https://github.com/meilisearch/MeiliSearch
- https://github.com/blevesearch/bleve
- https://github.com/thomasjungblut/go-sstables
A lot of people new to this space mistakenly think you can just throw Elasticsearch or Postgres full-text search in front of terabytes of records and have something decent. The problem is that search with good rankings often requires custom storage, so calculations can be sharded among multiple nodes and you can do layered ranking without passing huge blobs of results between systems.
Source: I'm currently working on the third version of my search engine and I've been studying this for 10 years.
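The "layered ranking" mentioned above usually bottoms out in a lexical scoring formula like BM25. A hedged sketch of that formula for a single query term — the k1/b values are the common defaults, and the corpus numbers are made up for illustration:

```rust
// BM25's smoothed inverse document frequency: rarer terms weigh more.
fn idf(n_docs: f64, doc_freq: f64) -> f64 {
    ((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0).ln()
}

// Per-term BM25 score: term frequency saturates (k1) and is
// normalized by document length relative to the corpus average (b).
fn bm25_term_score(tf: f64, doc_len: f64, avg_doc_len: f64, idf: f64) -> f64 {
    let (k1, b) = (1.2, 0.75);
    idf * (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * doc_len / avg_doc_len))
}

fn main() {
    // 1000 docs in the shard, term appears in 10 of them,
    // 3 times in this 120-token doc (corpus average: 100 tokens).
    let w = idf(1000.0, 10.0);
    let score = bm25_term_score(3.0, 120.0, 100.0, w);
    assert!(score > 0.0);
    // More occurrences of the term should never lower the score.
    assert!(score > bm25_term_score(1.0, 120.0, 100.0, w));
    println!("{:.3}", score);
}
```

The sharding point follows from the shape of the formula: per-term scores only need local statistics plus a few global counts, so each node can score its own partition and ship back just the top-k ids.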
A lot of people new to this space mistakenly think you can just throw Elasticsearch or Postgres full-text search in front of terabytes of records and have something decent. That might work for something small, like a curated collection of a few hundred sites.
If not, I would love to see this add a "forward to webhook" option. I would be happy to write up a real backend that parsed the content and indexed it.
Actually, there are lots of open-source projects for this: https://github.com/quickwit-oss/tantivy, https://github.com/valeriansaliou/sonic, https://github.com/mosuka/phalanx, https://github.com/meilisearch/MeiliSearch, etc...
I would think that, with the thousands (tens of thousands?) of pages I would index just from browsing each year, it would be relatively easy to find ways to automatically expand the index — say, to include links that appear multiple times in those pages, or some other heuristic.
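That heuristic is cheap to prototype. A sketch in Rust with purely illustrative page/link data: count how many indexed pages each outgoing URL appears on, and queue it for crawling once it crosses a threshold.

```rust
use std::collections::HashMap;

// Given the outgoing links found on pages already in the index,
// return every URL that appears on at least `threshold` pages.
fn links_to_index(pages: &[Vec<&str>], threshold: usize) -> Vec<String> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for page in pages {
        for link in page {
            *counts.entry(link).or_insert(0) += 1;
        }
    }
    let mut out: Vec<String> = counts
        .into_iter()
        .filter(|&(_, c)| c >= threshold)
        .map(|(url, _)| url.to_string())
        .collect();
    out.sort(); // deterministic order for the caller
    out
}

fn main() {
    // Hypothetical link sets extracted from three browsed pages.
    let pages = vec![
        vec!["https://a.example", "https://b.example"],
        vec!["https://a.example", "https://c.example"],
        vec!["https://a.example"],
    ];
    let expanded = links_to_index(&pages, 2);
    assert_eq!(expanded, vec!["https://a.example".to_string()]);
    println!("{:?}", expanded);
}
```

A real crawler would dedupe links within a page and normalize URLs first, but the frequency threshold alone already filters out one-off links.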
The only letdown for me is the very slow indexing speed when it comes to millions or tens of millions of documents (personal experience, also reported by other users in their public Slack workspace).
Other competitors / alternatives:
- https://github.com/typesense/typesense
- https://github.com/quickwit-oss/quickwit
- https://github.com/elastic/elasticsearch
- https://github.com/valeriansaliou/sonic
A decent feature-by-feature comparison:
- https://typesense.org/typesense-vs-algolia-vs-elasticsearch-...
If you know of other open-source competitors / alternatives, I'd love to check them out!
[1] https://github.com/toshi-search/Toshi
And I personally stopped using them after a really bad experience with their "developers". They don't really care about you, and it shows; they were also kind of rude when I reported some bugs to them.
I moved to Typesense and it's a whole different world: its creators truly enjoy that you're using their product. Same thing with Sonic — Valerian is the kind of hacker you'd want as a friend: super talented, super easy-going. You could ask a completely dumb question on their GitHub and he'd take the time to explain things to you at length. I know it's open source, I know I didn't pay a dime, but for me that kind of attitude makes or breaks it. Plus, you actually get a superior product.
If you just want the fuzzy-search part (query string -> list of matching document ids) and don't want to pay for GBs of RAM, sonic [1] seems to be an interesting project. It's very fast (μs-level queries) and uses very little RAM, but it doesn't offer DB-like features such as sorting, schemas/fields, scoring, etc. It's more of a low-level primitive for building your own search feature than an integrated search DB that's ready to use out of the box.
[1] https://github.com/valeriansaliou/sonic
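To be clear, the following is not Sonic's actual API or wire protocol — just a sketch of the kind of low-level primitive described above: normalized query terms mapped to document ids through an inverted index, with no fields, schemas, or scoring.

```rust
use std::collections::{BTreeSet, HashMap};

// A toy "term -> set of document ids" store. All names are
// illustrative; Sonic itself speaks a TCP channel protocol and
// persists its index on disk rather than in a HashMap.
struct Index {
    postings: HashMap<String, BTreeSet<u64>>,
}

impl Index {
    fn new() -> Self {
        Index { postings: HashMap::new() }
    }

    // Ingest: lowercase + whitespace split stands in for real
    // tokenization and unicode normalization.
    fn push(&mut self, doc_id: u64, text: &str) {
        for term in text.to_lowercase().split_whitespace() {
            self.postings.entry(term.to_string()).or_default().insert(doc_id);
        }
    }

    // Query: intersect the posting sets of every term, returning
    // only ids — resolving ids to documents is the caller's job.
    fn query(&self, q: &str) -> Vec<u64> {
        let mut ids: Option<BTreeSet<u64>> = None;
        for term in q.to_lowercase().split_whitespace() {
            let hits = self.postings.get(term).cloned().unwrap_or_default();
            ids = Some(match ids {
                None => hits,
                Some(acc) => acc.intersection(&hits).cloned().collect(),
            });
        }
        ids.unwrap_or_default().into_iter().collect()
    }
}

fn main() {
    let mut idx = Index::new();
    idx.push(1, "fast lightweight search backend");
    idx.push(2, "heavy search cluster");
    assert_eq!(idx.query("search backend"), vec![1]);
    assert_eq!(idx.query("search"), vec![1, 2]);
}
```

The "low-level primitive" framing falls out of the return type: you get ids back, so sorting, field filtering, and ranking all live in your own application layer.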
1. Toshi https://github.com/toshi-search/Toshi (Rust, 3.1k stars)
2. Tantivy https://github.com/tantivy-search/tantivy (Rust, 4.5k stars)
3. PISA https://github.com/pisa-engine/pisa (C++, 486 stars)
4. Bleve https://github.com/blevesearch/bleve (Go, 7.4k stars)
5. Sonic https://github.com/valeriansaliou/sonic (Rust, 10.9k stars)
6. Partial comparison https://mosuka.github.io/search-benchmark-game/ (Tantivy vs Lucene vs PISA vs Bleve)
7. Bayard https://github.com/bayard-search/bayard (Rust, on top of Tantivy, 1.4k stars)
8. Blast https://github.com/mosuka/blast (Go, on top of Bleve, 930 stars)
Algolia alternatives with some compatibility
1. MeiliSearch https://github.com/meilisearch/MeiliSearch (Rust, 12.4k stars)
2. typesense https://github.com/typesense/typesense (C++, 5.1k stars)
> Sonic can be used as a simple alternative to super-heavy and full-featured search backends such as Elasticsearch in some use-cases. It is capable of normalizing natural language search queries, auto-completing a search query and providing the most relevant results for a query....
> When reviewing Elasticsearch (ELS) and others, we found those were full-featured heavyweight systems that did not scale well with Crisp's freemium-based cost structure.
> At the end, we decided to build our own search backend, designed to be simple and lightweight on resources
This is actually a very interesting project. Are there any benchmarks showing that Sonic works at huge scale, with lots of data?