Without a fully connected NVLink network, the 3090s will be underutilized for models that distribute the layers across multiple GPUs.
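
For concreteness, here is a minimal sketch (assuming PyTorch; the layer sizes and device placement are illustrative, not from this build) of what distributing layers across GPUs looks like, and why the inter-GPU link ends up on the critical path:

    import torch
    import torch.nn as nn

    class TwoGPUModel(nn.Module):
        def __init__(self):
            super().__init__()
            # First half of the layers lives on GPU 0, second half on GPU 1.
            self.part0 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
            self.part1 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:1")

        def forward(self, x):
            x = self.part0(x.to("cuda:0"))
            # This copy is the chokepoint: without NVLink it crosses PCIe
            # on every forward pass (and again in the backward pass).
            x = x.to("cuda:1")
            return self.part1(x)

    model = TwoGPUModel()
    out = model(torch.randn(8, 4096))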

If AMD were better supported, the most economical option would be 4x MI60s, giving 128GB of VRAM over an Infinity Fabric bridge. However, you would really have to know what you are doing to get to the end of that journey.

The bifurcated risers also mean that some of the cards are running at only x8 PCIe width, and they mention the setup is limited to PCIe 3.0, not 4.0.

This would severely limit training using model parallelism.

For data parallelism, where the full model fits on each card and only the aggregate batch size is increased, it wouldn't matter as much, and maybe that is the primary use for this rig.
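As a rough illustration, here is what that data-parallel setup looks like with PyTorch DistributedDataParallel (assumed; the model and hyperparameters are placeholders). Each GPU holds a full replica, so the only cross-GPU traffic is the gradient all-reduce at the end of each step, rather than per-layer activations:

    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # Launch with: torchrun --nproc_per_node=<num_gpus> train.py
        dist.init_process_group("nccl")
        rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(rank)

        model = nn.Linear(4096, 4096).to(rank)  # full model fits on one card
        model = DDP(model, device_ids=[rank])
        opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

        for _ in range(10):
            # Per-GPU batch; the global batch is 32 * world_size.
            x = torch.randn(32, 4096, device=rank)
            loss = model(x).pow(2).mean()
            opt.zero_grad()
            loss.backward()  # gradients are all-reduced across GPUs here
            opt.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Launched with torchrun across all seven cards, the effective batch size scales with the GPU count while the PCIe links only carry gradients once per step, so x8 lanes hurt far less than in the model-parallel case.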

I wonder how this is handled on vast.ai rentals, because there is a huge difference between needing 7x 3090s with all 168GB to load the weights of a single giant LLM and just wanting to run a 4GB Stable Diffusion model for parallel inference with a massive batch size.
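To make the contrast concrete, a hedged sketch of the two workloads using the Hugging Face transformers/accelerate loading path (the checkpoint name is hypothetical):

    from transformers import AutoModelForCausalLM

    # Case 1: one giant LLM sharded layer-by-layer across every visible GPU.
    # device_map="auto" (requires accelerate) splits the weights across the
    # cards, so each generated token hops across the inter-GPU links.
    big_model = AutoModelForCausalLM.from_pretrained(
        "some-org/some-168gb-model",  # hypothetical; substitute a real checkpoint
        device_map="auto",
        torch_dtype="auto",
    )

    # Case 2: a small model replicated once per GPU for embarrassingly
    # parallel inference; each process pins itself to one card and the
    # cards never talk to each other:
    #   CUDA_VISIBLE_DEVICES=0 python generate.py &
    #   CUDA_VISIBLE_DEVICES=1 python generate.py &
    #   ...

In the first case every token generated crosses the x8 PCIe links between layer groups; in the second the cards never communicate at all, so riser bandwidth is irrelevant.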

It was primarily being used to train TTS models (see https://github.com/neonbjb/tortoise-tts), which largely fit into a single GPU's memory. So, for data parallelism, x8 PCIe isn't much of a concern.