I think there's been talk of doing speaker diarization with whisper-asr-webservice[0], which is also written in Python and should be able to make use of goodies such as pyannote-audio, py-webrtcvad, etc.
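
To give a rough idea of what that glue could look like, here's a minimal sketch. It assumes you already have timestamped transcript segments from Whisper and speaker turns from some diarizer (pyannote-audio or otherwise), and it labels each segment with whichever speaker overlaps it most in time. The `Span` class, the `assign_speakers` helper, and the sample data are all made up for illustration; none of this comes from whisper-asr-webservice or any of the libraries above.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """A labelled time interval in seconds (hypothetical helper type)."""
    start: float
    end: float
    label: str  # transcript text for ASR segments, speaker id for diarizer turns


def overlap(a: Span, b: Span) -> float:
    """Length in seconds of the intersection of two spans (0.0 if disjoint)."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def assign_speakers(segments, turns, unknown="UNKNOWN"):
    """Pair each transcript segment with the speaker turn it overlaps most.

    Segments that overlap no turn at all get the `unknown` label.
    """
    labelled = []
    for seg in segments:
        best = max(turns, key=lambda t: overlap(seg, t), default=None)
        if best is not None and overlap(seg, best) > 0:
            speaker = best.label
        else:
            speaker = unknown
        labelled.append((speaker, seg.label))
    return labelled


# Made-up sample data standing in for real ASR and diarizer output.
segments = [
    Span(0.0, 2.5, "Hi, thanks for joining."),
    Span(2.7, 5.0, "Happy to be here."),
    Span(9.0, 10.0, "(crosstalk)"),
]
turns = [
    Span(0.0, 2.6, "SPEAKER_00"),
    Span(2.6, 5.2, "SPEAKER_01"),
]

for speaker, text in assign_speakers(segments, turns):
    print(f"{speaker}: {text}")
```

Max-overlap matching is about the simplest thing that works. It falls apart when a single segment spans a speaker change, which is where word-level timestamps or a toolkit designed for this start to look attractive.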

Whisper is great, but once we get to kludging various things together it might start to make more sense to use something like NVIDIA NeMo[1], which was built with all of this in mind and more.

[0] - https://github.com/ahmetoner/whisper-asr-webservice

[1] - https://github.com/NVIDIA/NeMo