As you say, there are a ton of "mini PCs" on the market that directly compete with the Raspberry Pi on cost and power usage. They're typically slightly larger, but IMO the expansion options (bring your own RAM/storage), real I/O (with real PCIe), proper disk interfaces, etc significantly outweigh that. They're also typically more performant, and while aarch64 platform support is improving dramatically there are still occasions where a project, Docker container, etc doesn't support it.
Taking it a step further, there are a TON of decommissioned/recycled corporate/enterprise SFF desktops on the market. They don't compete in terms of size (13" x 15" or so) but they can actually get close in power usage. Many of them have multiple SATA ports, real NVMe, multiple real half-height PCIe slots, significantly better USB and PCIe bandwidth, etc.
With my project Willow and Willow Inference Server[0] we're trying to drive this approach in the self-hosting community, with an initial emphasis on Home Assistant. Those users are generally sick of Raspberry Pi supply shortages, very limited performance, poor I/O, flaky SD cards, etc. The Raspberry Pi is still pretty popular for "my first Home Assistant", but once people get bitten by the self-hosting bug their setups start looking more like a homelab very quickly.
For Willow in particular we emphasize the use of GPUs because a voice assistant can't be waiting > 10 seconds for speech recognition and speech synthesis. There are approaches out there trying to kind-of get something working with Whisper tiny, but based on our ample internal testing and community feedback we feel Whisper small is the bare minimum for voice assistant tasks, with many users going all out and using Whisper large-v2 at beam size 5. On a GPU even that is still so fast it doesn't really matter.
The Raspberry Pi is especially poorly suited to this use case (and even amd64 CPUs struggle). We have some benchmarks here[1]. TL;DR: a roughly seven-year-old Tesla P4 (single slot, slot power only, half-height, ~$70 used) does speech recognition 87x faster, with the multiple growing for more complex models and longer speech segments. A 3.8 second voice command takes 586 ms on the Tesla P4 and 51 seconds on the Raspberry Pi 4. Even if the Pi 5 is twice as fast, that's still ~25 seconds, which is completely unusable. It's not fair to compare a GPU to a Raspberry Pi, but consider the economics and practicality...
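If you want a rough feel for where those numbers come from, a minimal latency sketch with faster-whisper (a ctranslate2-based Whisper wrapper - not our actual benchmark code, and the model size, beam size, and audio path are just placeholders) looks something like this:

    # Rough end-to-end speech recognition latency check (not the WIS benchmark code).
    import time
    from faster_whisper import WhisperModel

    # "small" is roughly the floor we recommend; beam_size=5 matches the
    # "all out" large-v2 configuration mentioned above.
    model = WhisperModel("small", device="cuda", compute_type="float16")

    start = time.perf_counter()
    segments, info = model.transcribe("command.wav", beam_size=5)
    text = " ".join(segment.text for segment in segments)  # segments is a generator; decoding happens here
    print(f"{time.perf_counter() - start:.3f}s -> {text}")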
You can get an SFF desktop and Tesla P4 from eBay for ~$200 shipped to your door. It will idle (with GPU and models loaded) at ~30 watts. The CPU, RAM, disk (NVMe), I/O, etc will walk all over any Raspberry Pi. Add the GPU and obviously it's not even close - you end up with a machine that can easily do 10x-100x what a Raspberry Pi can, for roughly 2x the cost and power usage. You can even throw a 2.5GbE card in another slot for $20 and replace your router if you want to go really dense.
Even factoring in power usage (10-15 W vs ~30 W, so 2-3x) the cost difference comes down to nearly nothing, and for many users this configuration is essentially future-proof for anything they may want to do for years (my system with everything running maxes out around 50% of one core). Many people have also gradually grown their self-hosted setups over the years and ended up with three or more Raspberry Pis for different tasks (Pi-hole, Home Assistant, Plex, etc). At that point the SFF configuration starts to pull far ahead in every way, including power usage.
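To put rough numbers on that (the electricity rate here is just an assumed example - plug in your own):

    # Back-of-the-envelope yearly cost delta, SFF+GPU vs a Raspberry Pi, at idle.
    # The $/kWh rate is an assumption for illustration, not a quoted figure.
    PI_WATTS = 12.5          # midpoint of the 10-15 W figure above
    SFF_WATTS = 30.0         # SFF desktop with GPU and models loaded
    RATE_USD_PER_KWH = 0.15  # example rate; varies a lot by locale

    extra_kwh_per_year = (SFF_WATTS - PI_WATTS) * 24 * 365 / 1000
    print(f"~{extra_kwh_per_year:.0f} kWh/year extra, "
          f"~${extra_kwh_per_year * RATE_USD_PER_KWH:.0f}/year at the assumed rate")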
Users were initially very skeptical of GPU use, likely carrying over their experience from the desktop market and assuming things like "300 watt power usage and a huge > $500 card". Now they love having a GPU around for Willow and miscellaneous other CUDA tasks like encoding/decoding/transcoding with Plex/Jellyfin, accelerated Frigate, and all kinds of other applications. Willow Inference Server (depending on configuration) uses somewhere between 1-4GB of VRAM, so an 8GB card leaves plenty of room for additional tasks. We even have users who started with the Tesla P4, got the LLM bug, and figured out how to get an RTX 3090 working in their setup, which of course also leads to absurd performance with Willow - my local RTX 3090 goes from end of speech to command completion in HA to TTS feedback in ~250 ms. It's "speak, blink, done" fast.
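If you're curious how much headroom is left on the card once everything is loaded, a quick check with the nvidia-ml-py (pynvml) bindings - just an illustrative snippet, not part of Willow:

    # Quick VRAM headroom check via NVML; illustrative only, not part of Willow.
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"VRAM used {mem.used / 2**30:.1f} GiB of {mem.total / 2**30:.1f} GiB")
    pynvml.nvmlShutdown()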
[0] - https://heywillow.io/
[1] - https://heywillow.io/components/willow-inference-server/#ben...
The issue (among others) is that we achieve the speech recognition performance we do largely thanks to ctranslate2[0]. They've gone on record saying that they essentially have no interest in ROCm[1].
Of course with open source anything is possible, but we see this as one of several fundamental issues with supporting AMD GPGPU hardware.
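To make the ctranslate2 dependency concrete, this is roughly what the Whisper path looks like (per the ctranslate2 docs rather than our exact code; paths and model size are placeholders):

    # Roughly how a ctranslate2 Whisper model is prepared and loaded (per its docs,
    # not our exact code). Conversion happens once, offline, e.g.:
    #   ct2-transformers-converter --model openai/whisper-small \
    #       --output_dir whisper-small-ct2 --quantization float16
    import ctranslate2

    # device="cuda" (or "cpu") is what's on offer - there is no ROCm backend,
    # which is the crux of the issue above.
    model = ctranslate2.models.Whisper("whisper-small-ct2",
                                       device="cuda",
                                       compute_type="float16")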