Actually the multicast audio is a pretty good system. There are multiple elevators. Broadcasting makes it simple to sync up the music on them. Using a WiFi speaker has much lower cost than adding a wired speaker to a moving elevator. It’s also simple and low cost to add extra WiFi speakers in other areas of the hotel, creating an ad hoc PA system.
Remember that networks introduce latency. It might be tiny but the human ear can detect speakers being _slightly_ off.
For example you wouldn't want a wifi speaker in an elevator using a repeater at the top of the shaft trying to match up to a hardwired speaker in a ground floor vestibule.
You can use NTP to get the devices' clocks synced to a much tighter tolerance than necessary, and schedule playback accordingly.
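For anyone curious what "synced up" boils down to, here is a minimal sketch of the standard NTP offset/delay arithmetic from the four timestamps of one request/response exchange. The timestamps and the local_play_time helper are made up for illustration; a real client filters many exchanges rather than trusting a single one.

```python
def ntp_offset_and_delay(t0, t1, t2, t3):
    """One NTP exchange, all times in seconds.

    t0: client sends request (client clock)
    t1: server receives request (server clock)
    t2: server sends reply (server clock)
    t3: client receives reply (client clock)
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2.0  # server clock minus client clock
    delay = (t3 - t0) - (t2 - t1)           # round trip, minus server processing
    return offset, delay


def local_play_time(server_play_time, offset):
    """Hypothetical helper: convert a 'play at' time on the server clock to the local clock."""
    return server_play_time - offset


# Made-up exchange: client clock runs 5 ms behind, 2 ms network latency each way.
offset, delay = ntp_offset_and_delay(t0=10.000, t1=10.007, t2=10.008, t3=10.005)
start_local = local_play_time(12.000, offset)
```

Note the offset estimate assumes the two network legs take equally long; an asymmetric path (say, a congested WiFi uplink) shows up directly as error.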
And then you "just" have the same problems that you have with purely electrically connected, analogue speakers (which are effectively 100% in sync in terms of receiving the signal): Sound is relatively slow, and so the audio from a speaker that is far away will reach you later than the nearby speaker.
You can mitigate that by adding a precise delay to the speaker nearer the listener, so that its sound arrives together with the far one's... but of course that does not work if you're standing on the other side. Nevertheless, as said, that problem exists regardless of whether your speaker is network-connected or not.
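The size of that delay is just the difference in travel time. A quick sketch, assuming roughly 343 m/s for the speed of sound (dry air at about 20 °C) and made-up distances:

```python
SPEED_OF_SOUND_M_S = 343.0  # assumption: dry air at about 20 degrees C


def arrival_gap_ms(near_m, far_m):
    """Extra time (ms) the far speaker's sound needs to reach the listener."""
    return (far_m - near_m) / SPEED_OF_SOUND_M_S * 1000.0


# Hypothetical listener standing 3 m from one speaker and 20 m from another.
gap = arrival_gap_ms(near_m=3.0, far_m=20.0)  # roughly 50 ms
```

Whichever speaker's sound arrives first is the one to hold back by that amount, and the figure is only right for that one listening position.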
Kind of. The bigger problem you will have if you try this is that the audio is not clocked by the system clock, and the audio clock is almost always free-running (and even if it were derived from the system clock, NTP et al don't generally discipline the clock itself, just the OS's presentation of it). So in the case of a long-running playback (or continuous, as in this case), you will drift out of sync over time, and it doesn't take that long to become noticeable. And at some point you'll start dropping out due to buffer underflow or overflow. So you do still need to take care of this.
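To put rough numbers on "doesn't take that long": a sketch of the drift arithmetic, assuming a 48 kHz stream and a 50 ppm mismatch between two audio clocks. Both figures are illustrative assumptions, not measurements of any particular hardware.

```python
def drift_samples_per_second(sample_rate_hz, ppm_mismatch):
    """Samples gained or lost per second between two free-running audio clocks."""
    return sample_rate_hz * ppm_mismatch * 1e-6


def seconds_until_offset(ppm_mismatch, offset_s):
    """Time for the two clocks to slide apart by offset_s seconds."""
    return offset_s / (ppm_mismatch * 1e-6)


rate = drift_samples_per_second(48_000, 50)   # about 2.4 samples per second
t_20ms = seconds_until_offset(50, 0.020)      # about 400 s to be 20 ms apart
```

So under those assumptions two speakers that start perfectly aligned are audibly apart within a few minutes.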
So to work well you do need to resync the audio to the local audio clock using a sample rate converter, or build some custom hardware that lets you sync the playback audio clocks somehow. Or if you want to be sloppy about it, keep close track and stuff or drop individual samples as you drift.
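The sloppy variant can be sketched in a few lines. This is only an illustration of the stuff-or-drop idea: the stuff_or_drop name and the "excess" bookkeeping are assumptions, and a real player would spread the corrections out (or use a proper resampler) to avoid audible clicks.

```python
def stuff_or_drop(block, excess):
    """Absorb clock drift by trimming or padding one block of samples.

    excess > 0: the buffer holds that many samples too many, so drop
                samples from the end of the block.
    excess < 0: the buffer is running short, so repeat the last sample.
    """
    if not block:
        return []
    if excess > 0:
        n = min(excess, len(block) - 1)  # always keep at least one sample
        return block[:len(block) - n]
    if excess < 0:
        return block + [block[-1]] * (-excess)
    return list(block)


dropped = stuff_or_drop([1, 2, 3, 4], 1)    # one sample removed
stuffed = stuff_or_drop([1, 2, 3, 4], -2)   # last sample repeated twice
```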
But yeah, this is all more or less 'solved'.
Sonos has a remarkably good implementation of all of this.
For URL-based streams they buffer and NTP to sync. For live streams (e.g. gaming) they p2p multicast and tweak the wifi params in real-time to minimize drops.
The speakers create their own wifi and use MST network heuristics to latency-min route over that versus native wifi or ethernet if you've plugged it in. Sound drops when the wifi spectrum blinks (rarely), but I have never encountered the speakers being out of sync or noticing an echo effect.
And the speakers can use your phone's mic to scan the soundscape of a room to acoustically balance the sound when you set them up. I particularly like how consistent the sound volume is room-to-room even with very different speaker setups.
IIRC they've patented their specific mechanism. So ya, it's solved, but it may be expensive to license.
(Not affiliated with Sonos, I just have a bunch of them and like them a lot.)
If you are just interested in the synchronized Audio-over-Ethernet part, AES67 is the industry standard, and a pretty complete open-source implementation can be found at https://github.com/bondagit/aes67-linux-daemon . AES67 is itself a composition of existing standards, though: fundamentally it is mostly SDP for session description, RTP for media, and PTP for clock sync, so you can build it out of a variety of implementations too.
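To give a feel for the RTP part, here is a rough sketch of packing one media packet: the fixed 12-byte RTP header followed by 24-bit big-endian PCM (L24). The payload type 96, the SSRC, the 48 kHz rate and the 1 ms packet time are assumptions picked for illustration; in a real session they come from the SDP description, and the timestamp would be derived from the PTP-disciplined media clock.

```python
import struct


def rtp_header(seq, timestamp, ssrc, payload_type=96, marker=False):
    """Fixed 12-byte RTP header: version 2, no padding, no extension, no CSRCs."""
    byte0 = 2 << 6
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1,
                       seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def pack_l24(samples):
    """Signed integers to 24-bit big-endian two's complement bytes."""
    return b"".join((s & 0xFFFFFF).to_bytes(3, "big") for s in samples)


SAMPLE_RATE_HZ = 48_000      # assumed
PACKET_TIME_S = 0.001        # assumed 1 ms of audio per packet
CHANNELS = 2                 # assumed stereo
frames = int(SAMPLE_RATE_HZ * PACKET_TIME_S)   # 48 frames per packet

silence = [0] * (frames * CHANNELS)            # interleaved samples
packet = rtp_header(seq=1, timestamp=frames, ssrc=0x1234) + pack_l24(silence)
```

With those assumptions each packet carries 288 bytes of audio behind the 12-byte header, and the RTP timestamp advances by 48 per packet.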
For room correction you can look at https://drc-fir.sourceforge.net/ to generate FIR filter coefficients, then you can apply it in realtime with https://github.com/wwmm/easyeffects or https://github.com/HEnquist/camilladsp .
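Applying the coefficients is conceptually just a convolution. A naive pure-Python sketch follows; the taps here are made up and are not DRC-FIR output, and with the thousands of taps a room-correction filter typically has, real-time tools would normally use FFT-based convolution rather than this direct form.

```python
def fir_filter(samples, taps):
    """Direct-form FIR: y[n] = sum over k of taps[k] * x[n - k]."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:          # samples before the start count as zero
                acc += h * samples[n - k]
        out.append(acc)
    return out


# Feeding in an impulse returns the taps themselves, which is a handy sanity check.
impulse_response = fir_filter([1, 0, 0, 0], [0.5, 0.25])
```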
Of course some people just want it to work, in which case you can shell out for Sonos :p.