I'd love to see this dataset used as a performance and relevance benchmark for different search engines!

That was definitely part of the original plan! I spotted two other attempts [1] [2] here using BERT and ElasticSearch respectively.

The main performance issue with the Postgres FTS approach (and possibly the others?) is ranking. Finding the matching rows uses the index, but ts_rank cannot, so every matching row has to be ranked individually.

Most of the time only a few results are returned, and the front end gets its answer in ~300ms, including formatting the text for display (~20ms without the formatting).

However, a reasonably common sentence will return tens or hundreds of thousands of rows, which takes a minute or more to get ranked. In production, this could be worked around by tracking and caching such queries if they are common enough.
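A rough sketch of that caching workaround, in Python. Everything here is an assumption for illustration: the `SlowQueryCache` name, the one-second threshold, the LRU eviction, and the `search` callable standing in for whatever actually runs the ranked query. It only caches queries that turned out to be slow, so rare, fast queries never take up space.

```python
import time
from collections import OrderedDict


class SlowQueryCache:
    """Cache results only for queries whose search call was slow (hypothetical helper)."""

    def __init__(self, search, slow_threshold_s=1.0, max_entries=1000,
                 clock=time.perf_counter):
        self._search = search                  # callable: query text -> results
        self._slow_threshold_s = slow_threshold_s
        self._max_entries = max_entries
        self._clock = clock                    # injectable so it can be tested
        self._cache = OrderedDict()            # insertion order doubles as LRU order

    def query(self, text):
        # Collapse whitespace so trivially different spellings share one entry.
        key = " ".join(text.split())
        if key in self._cache:
            self._cache.move_to_end(key)       # mark as most recently used
            return self._cache[key]

        started = self._clock()
        results = self._search(key)
        elapsed = self._clock() - started

        # Only slow queries are worth remembering.
        if elapsed >= self._slow_threshold_s:
            self._cache[key] = results
            if len(self._cache) > self._max_entries:
                self._cache.popitem(last=False)  # evict least recently used
        return results

    def cached_keys(self):
        return list(self._cache)
```

A real deployment would also need some invalidation or expiry when the underlying data changes, which this sketch leaves out.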

I'd love to hear from anyone experienced with the other options (Lucene, Solr, ElasticSearch, etc.) whether and how they get around this.

[1] https://news.ycombinator.com/item?id=19095963

[2] https://news.ycombinator.com/item?id=6562126 (the link does not load for me)

I suggest having a look at https://github.com/postgrespro/rum if you haven’t yet. It addresses the issue of slow ranking in PostgreSQL FTS.
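For anyone who wants to try it, here is a minimal sketch of what using RUM might look like, written as SQL strings in Python. The table `sentences` and its columns `tsv` (a tsvector) and `body` are hypothetical, as is the `ranked_search` helper; the `rum_tsvector_ops` operator class and the `<=>` distance operator are the ones the RUM README describes. Ordering by `<=>` lets the index return rows already sorted by relevance, instead of ranking every match after the fact.

```python
# Hypothetical schema: table "sentences" with a tsvector column "tsv"
# and a text column "body". Adjust names to the real schema.
SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS rum;
CREATE INDEX sentences_tsv_rum ON sentences USING rum (tsv rum_tsvector_ops);
"""

# %s placeholders follow the psycopg-style parameter convention.
SEARCH_SQL = (
    "SELECT body, tsv <=> plainto_tsquery('english', %s) AS distance "
    "FROM sentences "
    "WHERE tsv @@ plainto_tsquery('english', %s) "
    "ORDER BY tsv <=> plainto_tsquery('english', %s) "
    "LIMIT %s"
)


def ranked_search(query_text, limit=20):
    """Return (sql, params) for a relevance-ordered search (smaller distance = better)."""
    return SEARCH_SQL, (query_text, query_text, query_text, int(limit))
```

The pair returned by `ranked_search("some sentence")` can be passed to a database cursor's `execute` call; no driver is imported here, so the sketch does not depend on a particular client library.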