I'm admittedly not at all familiar with typeorm (not really great at Node either for that matter), but a few questions:
What kind of testing have you done on app/DB performance under heavy load?
Why are you indexing nearly the same thing twice in the AppConnection table? [0]
What kind of column is your pkey, as defined in BaseEntity? [1] It says String, but AFAIK that's not a Postgres type.
Excited to see more work in this space!
Disclaimer: I'm a DBRE at Zapier.
[0]: https://github.com/activepieces/activepieces/blob/main/packa...
[1]: https://github.com/activepieces/activepieces/blob/main/packa...
Development began two months ago and we have not encountered any scaling issues yet, as the majority of users self-host, so our priority has been building apps. Is there anything you believe we should consider?
You are correct: defining the same index twice is a mistake. Thanks for pointing that out!
We are using nano id (https://www.npmjs.com/package/nanoid) for all entities; it's stored as varchar in the database.
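Roughly, the pattern looks like this in TypeORM (a simplified sketch, not our exact BaseEntity):

```typescript
import { PrimaryColumn, BeforeInsert } from 'typeorm'
import { nanoid } from 'nanoid'

export abstract class BaseEntity {
  // nanoid's default output is 21 chars, stored as plain varchar
  @PrimaryColumn('varchar', { length: 21 })
  id: string

  @BeforeInsert()
  generateId(): void {
    if (!this.id) {
      this.id = nanoid()
    }
  }
}
```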
There are pros and cons to using random IDs as a PK. For RDBMS clustering on the PK (InnoDB), it's a terrible idea. If you're going to sort by the PK, it's usually a terrible idea as well (UUIDv1 isn't as bad since it includes a timestamp, but that assumes your access pattern is based on insertion time). There is ULID [0] if you'd like something that's sortable. You could also just use a secondary index. An advantage is that random IDs _can_ be a good way (again, this depends heavily on your access patterns) to do sharding.
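For instance, a quick sketch with the `ulid` npm package (my illustration, not something your codebase uses):

```typescript
import { ulid } from 'ulid'

// A ULID's leading characters encode a 48-bit millisecond timestamp,
// so lexicographic order approximates insertion order.
const earlier = ulid()
setTimeout(() => {
  const later = ulid()
  console.log(earlier < later) // true: later IDs sort after earlier ones
}, 2)
```

(Within the same millisecond you'd want the package's monotonicFactory to keep the ordering strict.)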
My other concern with nano id is twofold, both around its PRNG. They mention using Node's crypto.randomBytes(), but their source code instead references crypto.randomFill() [1]. Node's docs warn that this can have "surprising and negative performance implications for some applications" [2], related to libuv's thread pool. See my later comment about libuv and containers. Also, the docs for Node's crypto.randomBytes() mention that it "will not complete until there is sufficient entropy available." That sounds suspiciously like it's using `/dev/random` instead of `/dev/urandom`, which, at least for this application, would be an odd decision. I did note that elsewhere in nano id they're creating their own entropy pool, so it may not matter either way.
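To make the thread-pool point concrete, a small sketch (illustrative only, not nano id's code):

```typescript
import { randomBytes, randomFillSync } from 'node:crypto'

// The async form is dispatched to libuv's thread pool (4 threads by
// default), where it queues behind fs/dns/zlib work sharing the pool.
// That queuing is the "surprising" performance behavior the docs warn about.
randomBytes(16, (err, buf) => {
  if (err) throw err
  console.log('async:', buf.toString('hex'))
})

// The sync form runs on the main thread: no pool contention, but it
// blocks the event loop for the duration of the call.
console.log('sync:', randomFillSync(Buffer.alloc(16)).toString('hex'))
```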
With that out of the way:
If the plan is only for self-hosting, then yeah, you don't really need to consider schema design that carefully. Databases are really good at their job. Also, honestly nearly none of this matters until you have significant scale.
If you plan on starting a SaaS, there's a lot to consider. An incomplete list, in no particular order:
* Foreign keys. They're very handy, but they can introduce performance problems with some access patterns. Consider indexing child table FKs (but not always - benchmark first; see the sketch after this list).
* DDL like ALTER TABLE. I suggest getting intimately familiar with Postgres' locks [3]. The good news is that instant ADD COLUMN with {DEFAULT, NOT NULL} is safer now. The bad news is that it works by lazy-loading the default, so if your queries are doing silly things like SELECT *, you're still going to end up with a ton of contention. There's a migration sketch after this list.
* Connection pooling. You don't want to eat up RAM dealing with connections. PgBouncer [4] and Pgpool-II [5] are two that come to mind, but there are others as well. The latter also handles replication and load balancing, which is nice; if you aren't using it, you'll need to handle replication and load balancing on your own. See the pooling sketch after this list.
* Load balancing. HAProxy [6] is good for load balancing, but has its own huge set of footguns. Read their docs [7]. A few things that come to mind are:
* Any kind of abstraction away from the CPU, like containers, may cause contention. Same with VMs (e.g. EC2), for that matter, since a noisy neighbor can drop the single-core turbo of Xeons A LOT. Look into CPU pinning if possible.
* On x86, HAProxy cares about fast clocks more than anything else: Xeons will beat Epycs. ARM can beat x86 if tuned correctly.
* If you're using Kubernetes, look into Intel's CPU Management [8], which is also now native in K8s v1.26 [9].
* Overall for containers, learn about cgroups: specifically, how they (both v1 and v2) expose the `/proc` filesystem to applications, and then how your application detects that for any kind of multiprocessing. Hint: Node [10] uses libuv, which reads `/proc/cpuinfo` [11]. There's a sketch of this after the list.
* If you have access to the disk (e.g. you're running bare metal, or VMs with this capability), think carefully about the filesystem you use and its block size (and record size, if you use ZFS).
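On the foreign key bullet: Postgres requires an index on the referenced (parent) side but creates nothing on the referencing (child) side, so parent-to-child joins and ON DELETE checks can turn into sequential scans. A sketch in TypeORM, with hypothetical Project/Flow entities:

```typescript
import {
  Entity, PrimaryGeneratedColumn, Column, Index, ManyToOne, JoinColumn,
} from 'typeorm'

@Entity()
export class Project {
  @PrimaryGeneratedColumn()
  id: number
}

@Entity()
export class Flow {
  @PrimaryGeneratedColumn()
  id: number

  // Explicit index on the FK column; Postgres won't add one for you.
  @Index()
  @Column()
  projectId: number

  @ManyToOne(() => Project)
  @JoinColumn({ name: 'projectId' })
  project: Project
}
```

(And per the caveat above, measure first: every index taxes writes.)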
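On the DDL bullet, the usual guard is a lock_timeout, so a blocked ALTER TABLE (which takes ACCESS EXCLUSIVE) can't queue behind a long transaction and stall every other query on the table. A hedged TypeORM migration sketch, with hypothetical table/column names:

```typescript
import { MigrationInterface, QueryRunner } from 'typeorm'

export class AddStatusToFlow1677000000000 implements MigrationInterface {
  public async up(queryRunner: QueryRunner): Promise<void> {
    // Give up quickly instead of blocking everyone behind the lock.
    await queryRunner.query(`SET lock_timeout = '5s'`)
    // Instant on PG 11+: the default lives in the catalog and is
    // materialized lazily when rows are read.
    await queryRunner.query(
      `ALTER TABLE "flow" ADD COLUMN "status" varchar NOT NULL DEFAULT 'draft'`,
    )
  }

  public async down(queryRunner: QueryRunner): Promise<void> {
    await queryRunner.query(`ALTER TABLE "flow" DROP COLUMN "status"`)
  }
}
```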
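On the pooling bullet, PgBouncer/Pgpool-II handle the server side, but the app should cap its connections too. A sketch with node-postgres, assuming PgBouncer on its conventional port 6432 and a hypothetical database name:

```typescript
import { Pool } from 'pg'

const pool = new Pool({
  host: 'localhost',
  port: 6432, // PgBouncer, not Postgres' 5432
  database: 'activepieces',
  max: 20, // hard ceiling on connections per app instance
  idleTimeoutMillis: 30_000,
  connectionTimeoutMillis: 5_000,
})

// With PgBouncer in transaction pooling mode, session state (prepared
// statements, SET, advisory locks) doesn't survive between queries, so
// keep each unit of work inside a single transaction.
async function main(): Promise<void> {
  const { rows } = await pool.query('SELECT now()')
  console.log(rows)
  await pool.end()
}

main().catch(console.error)
```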
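And the cgroup sketch promised above: Node sizes things from the host's `/proc/cpuinfo`, not the container's quota, so a pod limited to 2 CPUs on a 64-core node looks like 64 cores. A rough workaround, assuming cgroup v2 at its standard mount point:

```typescript
import { readFileSync } from 'node:fs'
import { cpus } from 'node:os'

// os.cpus() reflects the host (libuv reads /proc/cpuinfo), so derive
// the container's allowance from its cgroup v2 CPU quota instead.
export function effectiveCpuCount(): number {
  try {
    // Format is "<quota> <period>" in microseconds, or "max <period>".
    const [quota, period] = readFileSync('/sys/fs/cgroup/cpu.max', 'utf8')
      .trim()
      .split(' ')
    if (quota !== 'max') {
      return Math.max(1, Math.floor(Number(quota) / Number(period)))
    }
  } catch {
    // No cgroup v2 controller visible; fall through to the host count.
  }
  return cpus().length
}

console.log(effectiveCpuCount())
```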
Good luck!
[0]: https://github.com/ulid/spec
[1]: https://github.com/ai/nanoid/blob/main/async/index.js#L5
[2]: https://github.com/nodejs/node/blob/main/doc/api/crypto.md#c...
[3]: https://www.postgresql.org/docs/current/explicit-locking.htm...
[4]: https://www.pgbouncer.org/
[5]: https://www.pgpool.net/mediawiki/index.php/Main_Page
[6]: https://www.haproxy.org/
[7]: https://cbonte.github.io/haproxy-dconv/2.4/configuration.htm...
[8]: https://networkbuilders.intel.com/solutionslibrary/cpu-pin-a...
[9]: https://kubernetes.io/docs/tasks/administer-cluster/cpu-mana...
[10]: https://github.com/nodejs/node/blob/main/src/node_os.cc#L100
[11]: https://github.com/libuv/libuv/blob/v1.x/src/unix/linux.c#L8...