Interesting quote: "we decided to compute all statistics on demand. This was something we previously tried in RethinkDB, but the results were not good... When we tried implementing statistics as SQL queries in Postgres, we were amazed by the performance. We could implement complex statistics involving data from many tables."
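To make the quoted approach concrete: "statistics as SQL queries" just means computing each number with an aggregate query at read time instead of maintaining counters. A rough sketch of the idea, with made-up tables that are not from the article:

    -- Hypothetical schema: orders(customer_id, total, created_at), customers(id, country).
    -- The "statistic" is simply an aggregate query run on demand.
    SELECT c.country,
           count(*)     AS order_count,
           sum(o.total) AS revenue
    FROM   orders o
    JOIN   customers c ON c.id = o.customer_id
    WHERE  o.created_at >= now() - interval '30 days'
    GROUP BY c.country
    ORDER BY revenue DESC;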
I think the standard line "use the right tool for the job" is still the ultimate answer. Data in most applications is relational, and you need to query it in ways that weren't anticipated at the beginning, hence the longevity of SQL.
That said, I too often see HN commenters say something like "this data was only 100 GB? Why didn't they just put it in Postgres?", which is not as clever as the writer may think. Try doing text search on a few million SQL rows, or generating product recommendations, or finding trending topics... Elasticsearch and other 'big data' tools will do it much quicker than SQL because it's a different category of problem. It's not about the data size, it's about the type of processing required. (Edited my last line here a bit based on replies below.)
We do exactly what you describe on a dataset of about 80 TB, stored across a number of PG instances. The 100 GB comments are extremely clever, because running a query against a dataset that fully fits in RAM will be blazingly fast. As for "try doing text search on a few million SQL rows": we are doing it on billions of rows.
Do you have FULLTEXT indexes on those rows? I find it hard to believe that searching for something like a fragment of text in YouTube comments can be as fast in a SQL system (even in RAM) as in Elasticsearch.
As for the other stuff I mentioned (recommendations, etc.), I'm not just basing it on my personal experience: here's a write-up from Pinterest about having to dump all their MySQL data to Hadoop to drive various types of analysis. I doubt they would do it if just putting the SQL DBs in RAM were adequate! https://medium.com/@Pinterest_Engineering/tracker-ingesting-...
I'm doing full-text search on 270 million rows in Postgres (with tsvector indices) in under 65 ms on commodity hardware (sub 8€/mo).
And it works better than Elasticsearch, simply because there is less additional overhead.
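For anyone who wants to reproduce something similar, the query shape is just the built-in Postgres full-text machinery. A minimal sketch with made-up names (my actual schema and ranking are more involved):

    -- Minimal sketch, names are made up. Assumes a matching full-text index, e.g.:
    --   CREATE INDEX ON messages USING gin (to_tsvector('english', body));
    SELECT id, body
    FROM   messages
    WHERE  to_tsvector('english', body) @@ plainto_tsquery('english', 'some search terms')
    LIMIT  50;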
How do you sort the results? Do you use ts_rank?
What is the size of your database on disk?
I use ts_rank combined with many other factors (I also rank by time, by message type, etc.).
The size of the database is a few dozen gigabytes by now, but that isn't really what matters with tsvector; it's the row count that affects search speed.
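To give a rough idea of what "combined with many other factors" looks like in practice, the ORDER BY is shaped roughly like the sketch below; the table, columns, and weights are illustrative, not my real ones:

    -- Illustrative only: made-up names and weights.
    SELECT id, body
    FROM   messages, plainto_tsquery('english', 'some search terms') AS q
    WHERE  to_tsvector('english', body) @@ q
    ORDER BY ts_rank(to_tsvector('english', body), q)
             + 1.0 / (1 + extract(epoch FROM now() - created_at) / 86400.0)  -- newer messages rank higher
             + CASE WHEN message_type = 'post' THEN 0.1 ELSE 0 END           -- boost by message type
             DESC
    LIMIT  50;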
I was asking because ranking can be slow in PostgreSQL. PostgreSQL can use a GIN or GiST index for filtering, but not for ranking, because the index doesn't contain the positional information needed for ranking.
This is not an issue when your query is highly selective and returns a small number of matching rows. But when the query matches a large number of rows, PostgreSQL has to fetch the tsvector from the heap for each matching row in order to compute the rank, and that can be really slow.
People are working on this, but it's not in PostgreSQL core yet: https://github.com/postgrespro/rum.
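For reference, the idea behind that extension is to keep the extra information needed for ranking inside the index itself, so the ordering can be served from the index instead of the heap. Based on the project's README (double-check the repo for current syntax), usage looks roughly like this, with made-up table/column names:

    -- Sketch based on the rum README; verify against the repository before relying on it.
    CREATE EXTENSION rum;
    CREATE INDEX ON messages USING rum (tsv rum_tsvector_ops);

    -- The <=> distance operator lets the index return matches already in rank order:
    SELECT id, body
    FROM   messages
    WHERE  tsv @@ to_tsquery('english', 'search & terms')
    ORDER BY tsv <=> to_tsquery('english', 'search & terms')
    LIMIT  50;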
All of which is why I'm a bit surprised by the numbers you shared: full-text search on 270 million rows in under 65 ms on commodity hardware (sub 8€/mo).
A few questions, if I may:
- What is the average number of rows returned by your queries? Is there a LIMIT?
- Is the tsvector stored in the table?
- Do you use a GIN or GiST index on the tsvector? (Rough sketches of the two setups I have in mind are below.)
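To be explicit about what I mean by those last two questions, the two common setups look roughly like this (illustrative names; for GiST it would be USING gist instead of gin):

    -- (a) tsvector stored in the table (e.g. a generated column on newer Postgres,
    --     or a trigger-maintained column), with the index on that column:
    ALTER TABLE messages
        ADD COLUMN tsv tsvector
        GENERATED ALWAYS AS (to_tsvector('english', coalesce(body, ''))) STORED;
    CREATE INDEX ON messages USING gin (tsv);

    -- (b) no stored column, only an expression index; ranking then has to
    --     recompute to_tsvector() for each matching row:
    CREATE INDEX ON messages USING gin (to_tsvector('english', body));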
Cheers.