What does HackerNews think of sonic?

🦔 Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.

Language: Rust

#6 in Database
#28 in Rust
#4 in Server
If you don't need advanced search features, you can use Sonic (https://github.com/valeriansaliou/sonic). It's blazing fast and you can save a lot of money on servers.
I've never worked on a project that encompasses as many computer science algorithms as a search engine. There are a lot of topics you can look up in "Information Storage and Retrieval":

- Tries (patricia, radix, etc...)

- Trees (b-trees, b+trees, merkle trees, log-structured merge-tree, etc..)

- Consensus (raft, paxos, etc..)

- Block storage (disk block size optimizations, mmap files, delta storage, etc..)

- Probabilistic filters (HyperLogLog, bloom filters, etc...)

- Binary Search (sstables, sorted inverted indexes, roaring bitmaps)

- Ranking (pagerank, tf/idf, bm25, etc...)

- NLP (stemming, POS tagging, subject identification, sentiment analysis etc...)

- HTML (document parsing/lexing)

- Images (exif extraction, removal, resizing / proxying, etc...)

- Queues (SQS, NATS, Apollo, etc...)

- Clustering (k-means, density, hierarchical, gaussian distributions, etc...)

- Rate limiting (leaky bucket, windowed, etc...)

- Compression

- Applied linear algebra

- Text processing (unicode-normalization, slugify, sanitization, lossless and lossy hashing like metaphone and document fingerprinting)

- etc...

I'm sure there is plenty more I've missed. There are lots of generic structures involved like hashes, linked-lists, skip-lists, heaps and priority queues, and this is just to get to basic 2000s-level tech.
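To make one item from the list concrete, here is a minimal sketch of the "Ranking" entry: Okapi BM25 over pre-tokenized documents. The function name, the sample documents and the `k1`/`b` defaults are illustrative assumptions, not taken from any of the projects mentioned in this thread.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer documents need more hits to score the same.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Made-up sample documents, already split into terms.
docs = [
    "sonic is a fast search backend".split(),
    "elasticsearch is a full featured search engine".split(),
    "rust is a systems language".split(),
]
scores = bm25_scores(["sonic", "search"], docs)
```

The first document matches both query terms and ranks highest, the second matches only "search", and the third scores zero.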

If you are comfortable with Go or Rust you should look at the latest projects in this space:

- https://github.com/quickwit-oss/tantivy

- https://github.com/valeriansaliou/sonic

- https://github.com/mosuka/phalanx

- https://github.com/meilisearch/MeiliSearch

- https://github.com/blevesearch/bleve

- https://github.com/thomasjungblut/go-sstables

A lot of people new to this space mistakenly think you can just throw Elasticsearch or Postgres full-text search in front of terabytes of records and have something decent. The problem is that search with good rankings often requires custom storage so calculations can be sharded among multiple nodes and you can do layered ranking without passing huge blobs of results between systems.
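One way to picture the "no huge blobs between systems" point is a sketch like the following, where each shard trims its own results to a top-k list and the coordinator only merges those short lists. The function names and the scores are hypothetical; a real engine would add further ranking layers on top.

```python
import heapq

def shard_top_k(scored_docs, k):
    """Runs on each shard: keep only the k best (score, doc_id) pairs."""
    return heapq.nlargest(k, scored_docs)

def merge_top_k(shard_results, k):
    """Runs on the coordinator: merge the small per-shard lists."""
    return heapq.nlargest(k, (hit for hits in shard_results for hit in hits))

# Made-up (score, doc_id) pairs as two shards might compute them locally.
shards = [
    [(0.91, "a1"), (0.40, "a2"), (0.15, "a3")],
    [(0.87, "b1"), (0.85, "b2"), (0.02, "b3")],
]
partial = [shard_top_k(s, k=2) for s in shards]
top = merge_top_k(partial, k=2)
```

Each shard must return at least as many hits as the coordinator wants overall, otherwise a shard holding several of the best documents would be cut short.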

Source: I'm currently working on the third version of my search engine and I've been studying this for 10 years.

There isn't a one-size-fits-all approach, but I've never worked on a project that encompasses as many computer science algorithms as a search engine.

- Tries (patricia, radix, etc...)

- Trees (b-trees, b+trees, merkle trees, log-structured merge-tree, etc..)

- Consensus (raft, paxos, etc..)

- Block storage (disk block size optimizations, mmap files, delta storage, etc..)

- Probabilistic filters (HyperLogLog, bloom filters, etc...)

- Binary Search (sstables, sorted inverted indexes)

- Ranking (pagerank, tf/idf, bm25, etc...)

- NLP (stemming, POS tagging, subject identification, etc...)

- HTML (document parsing/lexing)

- Images (exif extraction, removal, resizing / proxying, etc...)

- Queues (SQS, NATS, Apollo, etc...)

- Clustering (k-means, density, hierarchical, gaussian distributions, etc...)

- Rate limiting (leaky bucket, windowed, etc...)

- Text processing (unicode-normalization, slugify, sanitization, lossless and lossy hashing like metaphone and document fingerprinting)

- etc...

I'm sure there is plenty more I've missed. There are lots of generic structures involved like hashes, linked-lists, skip-lists, heaps and priority queues, and this is just to get to basic 2000s-level tech.

- https://github.com/quickwit-oss/tantivy

- https://github.com/valeriansaliou/sonic

- https://github.com/mosuka/phalanx

- https://github.com/meilisearch/MeiliSearch

- https://github.com/blevesearch/bleve

- https://github.com/thomasjungblut/go-sstables

A lot of people new to this space mistakenly think you can just throw Elasticsearch or Postgres full-text search in front of terabytes of records and have something decent. That might work for something small like a curated collection of a few hundred sites.

Is there a tool that automatically forwards every URL + HTML of the page you visit to a webhook so you could write an endpoint that would index everything?

If not, I would love to see this add a "forward to webhook" option. I would be happy to write up a real backend that parsed the content and indexed it.
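A rough sketch of what such a backend could do with each forwarded page, assuming the webhook posts a JSON body with `url` and `html` fields (that payload shape, and every name below, is a guess for illustration, not an existing tool's format):

```python
import json
import re
from collections import defaultdict
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

index = defaultdict(set)  # term -> set of URLs containing it

def handle_webhook(body):
    """Index one forwarded page; `body` is the raw JSON request body."""
    payload = json.loads(body)  # assumed shape: {"url": ..., "html": ...}
    extractor = TextExtractor()
    extractor.feed(payload["html"])
    extractor.close()
    text = " ".join(extractor.parts).lower()
    for term in set(re.findall(r"\w+", text)):
        index[term].add(payload["url"])

handle_webhook(json.dumps({
    "url": "https://example.com/sonic",
    "html": "<html><body><h1>Sonic</h1><script>var x = 1;</script>"
            "<p>Fast search backend</p></body></html>",
}))
```

In practice `handle_webhook` would sit behind an HTTP endpoint and push the extracted text into one of the engines above rather than a Python dict.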

Actually, there are lots of open-source projects for this: https://github.com/quickwit-oss/tantivy, https://github.com/valeriansaliou/sonic, https://github.com/mosuka/phalanx, https://github.com/meilisearch/MeiliSearch, etc...

I would think that with the thousands (tens of thousands?) of pages I would index just by browsing each year, it would be relatively easy to find ways to automatically expand the index to include links that appear multiple times in those pages, or to use some other heuristic.
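The "links that appear multiple times" heuristic could be sketched like this, assuming the visited pages are available as a dict of URL to HTML (the function name and the `min_pages` threshold are made up for the example):

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects absolute link targets from the <a href> tags of one page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.add(urljoin(self.base_url, href))

def crawl_candidates(pages, min_pages=2):
    """Return unvisited links that appear on at least `min_pages` visited pages."""
    counts = Counter()
    for url, html in pages.items():
        collector = LinkCollector(url)
        collector.feed(html)
        counts.update(collector.links)  # a set, so each page counts a link once
    return sorted(link for link, n in counts.items()
                  if n >= min_pages and link not in pages)

pages = {
    "https://a.example/": '<a href="https://docs.example/x">x</a> <a href="/about">about</a>',
    "https://b.example/": '<a href="https://docs.example/x">x</a> <a href="https://a.example/">a</a>',
}
candidates = crawl_candidates(pages)
```

Here only `https://docs.example/x` qualifies: it is linked from both pages and has not been visited yet.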

Awesome. I've tried MeiliSearch and it is very easy to set up and use. It works very well with fewer than a million records.

The only letdown for me is their very slow indexing speed when it comes to millions to tens of millions of records (personal experience, also experienced by other users in their public Slack workspace).

Other competitors / alternatives:

- https://github.com/typesense/typesense

- https://github.com/quickwit-oss/quickwit

- https://github.com/elastic/elasticsearch

- https://github.com/valeriansaliou/sonic

A decent feature-by-feature comparison:

- https://typesense.org/typesense-vs-algolia-vs-elasticsearch-...

If you guys know other open-source competitors / alternatives, I'd love to check those out!

There are also Toshi[1] and Sonic[2] in Rust. And Vector[3] as a Logstash alternative too. There is an issue[4] proposing to integrate Vector with Sonic and Toshi. Maybe Zinc can pursue this goal too. Always good to see people who realize that Java is an unwieldy monster that will eat all your memory. Native is the way to go for big systems.

[1] https://github.com/toshi-search/Toshi

[2] https://github.com/valeriansaliou/sonic

[3] https://vector.dev/

[4] https://github.com/vectordotdev/vector/issues/988

If your dataset is small enough or you have plenty of resources, and you don’t need any fancy customizability or multitenancy (always searching only a subset of all documents, filtered by tenant ID), then use Typesense. Otherwise, if Typesense can’t fit the index in your RAM it will not work, and if you need to filter every search it will become slow.

If you need lots of customizability in how you index your documents and how you search (what you prioritize in the search, facets), nothing beats Elastic. But it will need plenty of resources, or it will be slow.

If you need fast but absolutely non-customizable search that can live off a lot less than 1GB of RAM (less than 100MB even), then you might have some success with https://github.com/valeriansaliou/sonic

If you’re constrained on resources but sonic is too limiting, then finally you might have some success with Manticore Search. It’s featureful, but using it with different languages can be a lot of work (I’m not sure why they don’t ship a distribution with all language plugins enabled and configured); the docs will be good enough to get you started. It can live off a few hundred MB of RAM with large indexes, and will still be faster than Elastic.

If you're in the market for lightweight but fast search engines, I would recommend you take a look at typesense [1] instead, or even sonic [2], if it fits your use case. MeiliSearch does not give you anything on top of them (i.e. it is neither as feature complete as [1], nor as fast as [2]).

And I personally stopped using them after a really bad experience I had with their "developers". They don't really care about you and it shows, also, they were kind of rude when I reported some bugs to them.

I moved to typesense and it's a whole different world: their creators truly enjoy that you're using their product. Same thing with sonic: Valerian is the kind of hacker you'd want as a friend, super talented, super easygoing. You could ask a completely dumb question on his GH and he takes the time to explain things to you at length. I know it's open source, I know I didn't pay a dime, but for me, that kind of attitude makes or breaks it. Plus, you actually get a superior product.

1: https://typesense.org/

2: https://github.com/valeriansaliou/sonic

There are also two good Elasticsearch alternatives in Rust - Sonic[1] and Toshi[2].

[1] https://github.com/valeriansaliou/sonic

[2] https://github.com/toshi-search/Toshi

Typesense seems like a good fully-featured alternative to Elasticsearch. I.e. it's basically a database with fuzzy-search features (schemas, fields, facets, ordering, scoring profiles, etc), and its speed is enabled by holding everything in RAM.

If you just want the fuzzy-search part (query string -> list of matching document ids) and don't want to pay for GBs of RAM, sonic [1] seems to be an interesting project. It's very fast (μs) and uses very little RAM but doesn't offer DB-like features such as sorting, schemas/fields, scoring etc. It's more of a low-level primitive for building your own search feature than an integrated search db that's ready to use out of the box.

[1]: https://github.com/valeriansaliou/sonic
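To illustrate what "query string -> list of matching document ids" means as a primitive, here is a toy in-memory inverted index. It is not Sonic's implementation or API, it does exact term matching only (no fuzziness or autocomplete), and every name in it is made up for the example.

```python
import re
from collections import defaultdict

class TinyIndex:
    """Toy inverted index: stores only term -> document ids, no fields or scores."""
    def __init__(self):
        self.postings = defaultdict(set)

    @staticmethod
    def tokenize(text):
        return re.findall(r"\w+", text.lower())

    def push(self, doc_id, text):
        for term in self.tokenize(text):
            self.postings[term].add(doc_id)

    def query(self, text):
        terms = self.tokenize(text)
        if not terms:
            return []
        # .get avoids creating empty entries in the defaultdict.
        sets = [self.postings.get(t, set()) for t in terms]
        # A document matches only if it contains every query term.
        return sorted(set.intersection(*sets))

idx = TinyIndex()
idx.push("msg:1", "Sonic is a fast search backend")
idx.push("msg:2", "Elasticsearch is a heavyweight search backend")
idx.push("msg:3", "Fast cars")
```

The caller gets back ids such as `msg:1` and has to load, sort and display the actual records from its own database, which is the trade-off described above.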

They also could sponsor Elasticsearch alternatives in Rust - Sonic[1] and Toshi[2]. Even more, integration[3] with Vector.

[1] https://github.com/valeriansaliou/sonic

[2] https://github.com/toshi-search/Toshi

[3] https://github.com/timberio/vector/issues/988

FWIW...

1. Toshi https://github.com/toshi-search/Toshi (Rust, 3.1k stars)

2. Tantivy https://github.com/tantivy-search/tantivy (Rust, 4.5k stars)

3. PISA https://github.com/pisa-engine/pisa (C++, 486 stars)

4. Bleve https://github.com/blevesearch/bleve (Go, 7.4k stars)

5. Sonic https://github.com/valeriansaliou/sonic (Rust, 10.9k stars)

6. Partial comparison https://mosuka.github.io/search-benchmark-game/ (tantivy vs lucene vs pisa vs bleve)

7. Bayard https://github.com/bayard-search/bayard (Rust, on top of Tantivy, 1.4k stars)

8. Blast https://github.com/mosuka/blast (Go, on top of Bleve, 930 stars)

Algolia alternatives with some compatibility

1. MeiliSearch https://github.com/meilisearch/MeiliSearch (Rust, 12.4k stars)

2. typesense https://github.com/typesense/typesense (C++, 5.1k stars)

I'm personally very fond of sonic [0] for full text search.

> Sonic can be used as a simple alternative to super-heavy and full-featured search backends such as Elasticsearch in some use-cases. It is capable of normalizing natural language search queries, auto-completing a search query and providing the most relevant results for a query....

> When reviewing Elasticsearch (ELS) and others, we found those were full-featured heavyweight systems that did not scale well with Crisp's freemium-based cost structure.

> At the end, we decided to build our own search backend, designed to be simple and lightweight on resources

[0] - https://github.com/valeriansaliou/sonic

> https://github.com/valeriansaliou/sonic

This is actually a very interesting project. Are there any benchmarks showing that sonic works at huge scale with a lot of data?