Inherent to any e2e encryption scheme is the question: are you talking to who you think you are talking to? In other words, are you the victim of a man-in-the-middle attack?
So if you ever encounter a system that has the ease-of-use feature where you don't have to verify the identity of the other participant(s) with something like an identity fingerprint, then you already know you do not have all the protection that e2e encryption can provide. This is particularly relevant in a case like Zoom, where all the data goes through servers that Zoom controls, making a MITM attack trivial.
So we really should have known already that Zoom doesn't provide complete e2e encryption, just from the lack of an identity check.
Skipping the identity verification step seems to be common these days. Even Signal does that by default, but they at least make the verification of what they call "safety numbers" fairly easy and straightforward.
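To make the identity check concrete, here is a minimal sketch of how a fingerprint comparison could work. This is not Signal's actual safety number algorithm; the function names, the use of SHA-256, and the six-groups-of-five-digits format are all assumptions chosen for illustration.

```python
import hashlib
import hmac


def fingerprint(public_key: bytes, groups: int = 6) -> str:
    """Render a public key as short digit groups that can be read aloud.

    Hypothetical format: SHA-256 the key, then turn each 5-byte slice of
    the digest into a 5-digit decimal group (6 groups use 30 of 32 bytes).
    """
    digest = hashlib.sha256(public_key).digest()
    chunks = []
    for i in range(groups):
        value = int.from_bytes(digest[i * 5:(i + 1) * 5], "big") % 100_000
        chunks.append(f"{value:05d}")
    return " ".join(chunks)


def matches(key_seen_locally: bytes, fingerprint_read_out: str) -> bool:
    """Compare the key our client received against a fingerprint the other
    person read to us over a separate channel (in person, by phone)."""
    return hmac.compare_digest(fingerprint(key_seen_locally), fingerprint_read_out)


# Placeholder key material, for illustration only.
alice_real_key = b"alice-public-key"
attacker_key = b"key-substituted-by-server"

# Alice reads out the fingerprint of her real key.
read_out = fingerprint(alice_real_key)

# If the server swapped in its own key, Bob's client holds attacker_key
# and the comparison fails, exposing the MITM.
print(matches(alice_real_key, read_out))
print(matches(attacker_key, read_out))
```

The point is that the comparison has to happen over a channel the server does not control. If the service never gives you a way to do this, a key substitution by the server goes unnoticed.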
Added: So can true e2e encryption ever be practical for conferences involving a large number of participants? Perhaps Zoom is claiming the impossible... The issues surrounding the addition of OMEMO encryption to XMPP conferences make for an entirely relevant example. What do you do if one of the participants is not known to all the others? There are lots of possible answers to that question.
Added2: >The only feature of Zoom that does appear to be end-to-end encrypted is in-meeting text chat.
I don't see how this can be true either, based on the same reasoning.
Many conference calls are implemented using what's called a Selective Forwarding Unit (SFU), where each sending client sends multiple resolutions (either as independent streams, called "simulcast", or as dependent layers, called "SVC", for scalable video coding). In that case, the adaptation is done by the server, which selects which resolution to forward at any given time. This is fairly common practice in the industry. For example: https://github.com/jitsi/jitsi-videobridge and https://tools.ietf.org/html/draft-aboba-avtcore-sfu-rtp-00 and https://www.w3.org/TR/webrtc-svc/.
For those types of conference calls, the server only needs to know the sizes of the various streams and which packet belongs to which stream. It does not need to see the decrypted media, so one can implement e2e encryption for such group calls. This is less common in the industry, but it is possible. For example: https://support.google.com/duo/answer/9280240?hl=en
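As a rough illustration of why the server can do this without decrypting anything, here is a toy sketch of the forwarding decision. It is not how any particular SFU is implemented; the `Packet` fields, the function names, and the bitrate numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Packet:
    stream_id: str   # cleartext metadata: which simulcast stream this is
    size_bytes: int  # cleartext metadata: size on the wire
    payload: bytes   # e2e-encrypted media, opaque to the server


def pick_stream(bitrates_bps: Dict[str, int], receiver_budget_bps: int) -> str:
    """Pick the highest-bitrate stream that fits the receiver's budget,
    falling back to the lowest-bitrate stream if none fits."""
    fitting = {s: b for s, b in bitrates_bps.items() if b <= receiver_budget_bps}
    if fitting:
        return max(fitting, key=fitting.get)
    return min(bitrates_bps, key=bitrates_bps.get)


def forward(packets: List[Packet], chosen: str) -> List[Packet]:
    """Forward only the chosen stream. The payload is never inspected."""
    return [p for p in packets if p.stream_id == chosen]


# Made-up bitrates for three simulcast streams from one sender.
bitrates = {"low": 150_000, "mid": 500_000, "high": 1_500_000}
incoming = [
    Packet("low", 200, b"\x8f\x01"),
    Packet("mid", 600, b"\x22\xab"),
    Packet("high", 1200, b"\x7c\x10"),
]

chosen = pick_stream(bitrates, receiver_budget_bps=600_000)
print(chosen, forward(incoming, chosen))
```

Everything the server touches here is metadata (stream id and size). The payload bytes are passed through untouched, which is what leaves room for e2e encryption in this kind of group call.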
(I used to work at Google on WebRTC, Duo, and Hangouts, but now work on video calling at Signal).