What’s really surprising to me is that syncing subtitles isn’t a solved problem.

Why can’t someone just loosely transcribe without time stamps and sync it to the video?

With tools like alass[1] (using it to synchronise against the original-language subtitles), it is about as close to solved as you can get.
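For reference, the basic alass workflow is just one command: give it a correctly timed reference (the original-language subtitle file, or the video itself) and the out-of-sync subtitles, and it writes a corrected file. The filenames here are placeholders; check `alass --help` for the exact options in your installed version.

```shell
# Sync a translated subtitle file against correctly timed
# original-language subtitles (filenames are examples):
alass original_language.srt out_of_sync_translation.srt corrected.srt

# alass can also extract timing information from the video directly:
alass movie.mkv out_of_sync_translation.srt corrected.srt
```

Because it aligns subtitle timing against subtitle (or audio) timing rather than trying to understand the dialogue, it can also correct splits introduced by ad breaks or different cuts, not just a constant offset.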

All of the attempts I've seen at using audio information to synchronise subtitles have been awful. One issue is that some languages subtitle everything, even screams and incoherent shouts (Japanese, for example), while others only subtitle dialogue, and often rework that dialogue so the subtitles are short enough to read easily. It feels like you need too much domain knowledge about how different languages handle subtitling to correctly match up subtitles that only share the general meaning of what is being said.

[1]: https://github.com/kaegi/alass