What does HackerNews think of rum?

RUM access method - inverted index with additional information in posting lists

Language: C

#6 in PostgreSQL
>Reduce the memory usage of prepared queries

Yes, query-plan reuse like every other DB. It still blows me away that PG replans every time unless you explicitly prepare, and even then it's per connection.

Better full-text scoring is one that's missing from that list for me: TF/IDF or BM25, please. See: https://github.com/postgrespro/rum

For Postgres, I highly recommend the RUM index over the core FTS. RUM is written by Postgres Pro, who also wrote the core FTS and JSON indexing in PG.

    https://github.com/postgrespro/rum
rum handles 20M+ PDF pages for us, interactively.
My experience with Postgres FTS (did a comparison with Elastic a couple years back), is that filtering works fine and is speedy enough, but ranking crumbles when the resulting set is large.

If you have a large-ish data set with lots of similar data (4M addresses and location names was the test case), Postgres FTS just doesn't perform.

There is no index that helps scoring results. You would have to install an extension like RUM index (https://github.com/postgrespro/rum) to improve this, which may or may not be an option (often not if you use managed databases).
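A minimal sketch of what the RUM extension enables (table and query terms here are illustrative, not from the comment): because the index stores positional information, ranking can be driven by the index through RUM's `<=>` distance operator instead of per-row heap scans.

```sql
-- Requires the rum extension to be installed on the server.
CREATE EXTENSION IF NOT EXISTS rum;

CREATE TABLE docs (id serial PRIMARY KEY, body text, tsv tsvector);
CREATE INDEX docs_tsv_rum ON docs USING rum (tsv rum_tsvector_ops);

-- Index-supported ranking: <=> returns a distance, so ascending order
-- means "best match first" without fetching every tsvector from the heap.
SELECT id, tsv <=> to_tsquery('english', 'beautiful | place') AS rank
FROM docs
WHERE tsv @@ to_tsquery('english', 'beautiful | place')
ORDER BY tsv <=> to_tsquery('english', 'beautiful | place')
LIMIT 10;
```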

If you want a best of both worlds, one could investigate this extension (again, often not an option for managed databases): https://github.com/matthewfranglen/postgres-elasticsearch-fd...

Either way, writing something that indexes your postgres database into elastic/opensearch is a one time investment that usually pays off in the long run.

I still look forward to RUM indexes [1] getting integrated into PG, which would lay the foundation for better ranking functions such as TF/IDF or BM25. PG seems to lag behind here, and there hasn't been a lot of movement in a while.

1. https://github.com/postgrespro/rum

Mandatory mention of the RUM extension (https://github.com/postgrespro/rum) if this caught your eye. Lots of tutorials and conference presentations out there showcasing the advantages in terms of ranking, timestamps...
You might be just fine adding an unindexed tsvector column, since you've already filtered down the results.

The GIN indexes for FTS don't really work in conjunction with other indices, which is why https://github.com/postgrespro/rum exists. Luckily, it sounds like you can use your existing indices to filter and let postgres scan for matches on the tsvector. The GIN tsvector indices are quite expensive to build, so don't add one if postgres can't make use of it!
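A sketch of that filter-first pattern (table and column names are hypothetical): let an existing btree index narrow the set, then match the unindexed tsvector on the small remainder.

```sql
-- author_id has a regular btree index; body_tsv is an unindexed tsvector.
-- Postgres filters by author_id first, then evaluates the @@ match on the
-- few remaining rows, so no GIN index on body_tsv is needed.
SELECT id, title
FROM articles
WHERE author_id = 42
  AND body_tsv @@ to_tsquery('english', 'replication');
```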

Keep wondering if RUM Indexes [1] will ever get merged for faster and better ranking (TF/IDF). Really would make PG a much more complete text search engine.

https://github.com/postgrespro/rum

My experience has been that sorting by relevance ranking is quite expensive. I looked into this a bit and found https://github.com/postgrespro/rum (and some earlier slide decks about it) that explains why the GIN index type can't support searching and ranking itself (meaning you need to do heap scans for ranking). This is especially problematic if your users routinely do searches that match a lot of documents and you only want to show the top X results.

Edit: if any of the Crunchy Data people are reading this: support for RUM indexes would be super cool to have in your managed service.

We have been bitten by the same behavior. I gave a talk with a friend about this exact topic (diagnosing GIN pending list updates) at PGCon 2019 in Ottawa[1][2].

What you need to know is that the pending list will be merged into the main b-tree during several operations. Only one of them is critical for your insert performance: the merge that happens during an actual insert, which is triggered when you exceed gin_pending_list_limit. Both vacuum and autovacuum (including autovacuum analyze, but not a direct analyze) will also merge the pending list, so frequent autovacuums are the first thing you should tune. It is also worth knowing which memory parameter is used to rebuild the index, as that impacts how long it will take: work_mem (when triggered on insert), autovacuum_work_mem (when triggered during autovacuum) and maintenance_work_mem (when triggered by a call to gin_clean_pending_list()) define how much memory can be used for the rebuild.

What you can do is:

- tune the size of the pending list (like you did)

- make sure vacuum runs frequently

- if you have a bulk-insert-heavy workload (eg. nightly imports), drop the index and recreate it after inserting the rows (doesn't always make sense business-wise, depends on your app)

- disable fastupdate, you pay a higher cost per insert but remove the fluctuation when the merge needs to happen

The first thing was done in the article. However, I believe the author still relies on the list being merged on insert. If vacuums were tuned aggressively along with the limit (vacuum settings can be tuned per table), the list would be merged outside the path of ongoing inserts.
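A sketch of the knobs described above (index and table names are hypothetical):

```sql
-- Raise the merge threshold for one index (gin_pending_list_limit is a
-- global GUC that can also be set as a per-index storage parameter).
ALTER INDEX docs_tsv_gin SET (gin_pending_list_limit = 2048);  -- in kB

-- Make autovacuum visit this table far more often, so it merges the
-- pending list instead of an unlucky insert paying the cost.
ALTER TABLE docs SET (autovacuum_vacuum_scale_factor = 0.01);

-- Or flush the pending list explicitly (uses maintenance_work_mem).
SELECT gin_clean_pending_list('docs_tsv_gin'::regclass);

-- Or disable the pending list entirely: higher per-insert cost, but no
-- latency spikes when a merge would otherwise kick in.
ALTER INDEX docs_tsv_gin SET (fastupdate = off);
```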

I also had the pleasure of speaking with one of the main authors of GIN indexes (Oleg Bartunov) during the mentioned PGCon. He gave probably the best solution and informed me to "just use RUM indexes". RUM[3] indexes are like GIN indexes, but without the pending list and with faster ranking, faster phrase searches and faster timestamp-based ordering. It is however outside the main PostgreSQL release, so it might be hard to get it running if you don't control which extensions are loaded into your Postgres instance.

[1] - video https://www.youtube.com/watch?v=Brt41xnMZqo&t=1s

[2] - slides https://www.pgcon.org/2019/schedule/attachments/541_Let's%20...

[3] - https://github.com/postgrespro/rum

I suggest to have a look at https://github.com/postgrespro/rum if you haven’t yet. It solves the issue of slow ranking in PostgreSQL FTS.
for text search the rum index is quite robust. my guess is that rum (or something like it) will be introduced into the pg core soon. we index many terabytes of pdf files with excellent performance.

   https://github.com/postgrespro/rum
also, postgrespro are behind the json/b indexing.
I wish some more effort would be put into Postgres full text search; it seems to have stagnated somewhat. It does a lot, but is also lacking compared to, say, Lucene.

Not that it's going to be as good as Algolia, but it could be a whole lot better for many use cases, specifically TF/IDF and BM25 scoring:

https://github.com/postgrespro/rum

PG could really do most of it if they got serious about https://github.com/postgrespro/rum with TF/IDF ranking.
be sure to investigate the new "rum" index for text search, which is not in pg core. rum is written by the same folks who wrote the core fts. stunningly fast and flexible. also, a query is a first class data type, allowing for a simple, elegant classification: find all queries which match a certain document.

https://github.com/postgrespro/rum
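The "query is a first class data type" bit refers to Postgres's tsquery type. A sketch of that reverse lookup (table names hypothetical), which works in plain SQL and which rum can additionally index via its tsquery operator class:

```sql
-- Store saved searches as tsquery values.
CREATE TABLE saved_searches (id serial PRIMARY KEY, q tsquery);
INSERT INTO saved_searches (q)
VALUES (to_tsquery('english', 'postgres & index')),
       (to_tsquery('english', 'rust | golang'));

-- Reverse match: which saved queries does this new document satisfy?
SELECT id
FROM saved_searches
WHERE q @@ to_tsvector('english', 'New index types land in Postgres');
```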

PG really needs better ranking at least TF-IDF but also BM25.

There has been some work but not sure when it will be stable, needs a new kind of index:

https://github.com/postgrespro/rum

Ah, this is good to know. My site doesn't yet need to scale, so this is definitely A Problem I Would Love To Have ;)

EDIT: This seems to help with the ranking problem: https://github.com/postgrespro/rum

Still waiting for PostgreSQL to implement TF-IDF ranking:

https://github.com/postgrespro/rum

My understanding is gin/gist indexes do not store enough information to use better scoring and RUM indexes could change that:

https://github.com/postgrespro/rum

the biggest issue with postgres search is the inability to use TF-IDF or BM25 (the latter being the current default and state of the art on elasticsearch). The current ranking functions don't produce very relevant results.

Anyone who is familiar with PG internals: is there something in the internal data structures that prevents BM25 or TF-IDF style rank generation?

The work on RUM seems to have stagnated (and TF-IDF was a TODO there anyhow): https://github.com/postgrespro/rum

I have a theory that if they incorporate these algorithms, it makes postgres potent enough that a lot of people may choose not to use elasticsearch/lucene.
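For reference (this is the standard definition, not something from the comment), BM25 scores a document D against a query Q as:

```latex
\mathrm{BM25}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t)\,
  \frac{f(t, D)\,(k_1 + 1)}
       {f(t, D) + k_1\!\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(t) = \ln\!\left(\frac{N - n(t) + 0.5}{n(t) + 0.5} + 1\right)
```

where f(t, D) is the frequency of term t in D, |D| is the document length, avgdl the average document length in the corpus, N the number of documents, n(t) the number of documents containing t, and k1 ≈ 1.2, b ≈ 0.75 are the usual defaults (also Lucene/Elasticsearch's defaults).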

Thanks for following up :-)

I was asking because ranking can be slow in PostgreSQL. PostgreSQL can use a GIN or GiST index for filtering, but not for ranking, because the index doesn't contain the positional information needed for ranking.

This is not an issue when your query is highly selective and returns a low number of matching rows. But when the query returns a large number of matching rows, PostgreSQL has to fetch the ts_vector from the heap for each matching row, and this can be really slow.
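A sketch of the slow pattern just described (schema hypothetical): even with LIMIT 10, every matching row's tsvector is fetched from the heap to compute a rank before anything can be sorted.

```sql
SELECT id, ts_rank(body_tsv, q) AS rank
FROM documents, to_tsquery('english', 'common & term') AS q
WHERE body_tsv @@ q   -- GIN answers this part from the index...
ORDER BY rank DESC    -- ...but ranking needs every tsvector from the heap
LIMIT 10;
```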

People are working on this but it's not in PostgreSQL core yet: https://github.com/postgrespro/rum.

This is why I'm a bit surprised by the numbers you shared: full-text search on 270 million rows in under 65ms on commodity hardware (sub 8€/mo).

A few questions, if I may:

- What is the average number of rows returned by your queries? Is there a LIMIT?

- Is the ts_vector stored in the table?

- Do you use a GIN or GiST index on the ts_vector?

Cheers.

Used it for projects with 10,000+ documents and with 1 million documents. Was good enough for me (but I guess I wouldn't have written the article if it was bad).

I've also used Elasticsearch, and I reckon that's pretty damn amazing.

Anyone wanting more in-depth information should read or watch this FTS presentation from last year. It's by some of the people who have done a lot of work on the implementation, and it talks about 9.6 improvements, current problems, and things we might expect to see in version 10. https://www.pgcon.org/2016/schedule/events/926.en.html

There's also some previous presentations on the same topic which are interesting. You can see the RUM index (which has faster ranking) here: https://github.com/postgrespro/rum

The PostgreSQL team is working on some of the weak spots. Please have a look at the new 'RUM' index, which should improve ranking:

https://www.pgcon.org/2016/schedule/attachments/436_pgcon-20...

https://github.com/postgrespro/rum