>Matching faces to voices relies on simple co-occurence heuristic, and will not work in certain scenarios (e.g. if the whole conversation between two people is recorded from a single angle)

This seems like the really hard part. maybe if there was a way to find the time lips moves for a face. or guess gender and age of both face and voice.. or If the audio is a stereo mix, using relative position

There's this that can differentiate speakers

https://github.com/CorentinJ/Real-Time-Voice-Cloning