A little off topic...

I'm trying to help a friend in the media industry. He needs to identify the different voices in a movie and get time-stamped output, e.g. Voice A: 00.00.00sec - 00.05.30sec, Voice B: 00.05.31sec - 00.06.30sec, etc.

It would be very helpful if anyone could point me to any existing tools that can do that (open source or otherwise).

AWS offers this for $1.44 per hour.

Thanks. Are you referring to Transcribe? https://aws.amazon.com/transcribe/

Transcribe seems to be more for speech-to-text, irrespective of who is speaking.

Here the requirement is to identify the unique voices. E.g. if "Mary had a little lamb" is spoken by two different voices, then the engine should identify that Voice A said "Mary" at 00.00.00sec-00.00.01sec, then Voice B said "had a little" at 00.00.02sec-00.00.03sec, then Voice A said "lamb", etc.
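To make the desired output concrete, here is a rough Python sketch of the last step only: collapsing per-word speaker labels into time-stamped voice segments. The input shape (a list of `(speaker, word, start_sec, end_sec)` tuples) and the function names are made up for illustration; a real service would return its own format that you would have to map into this first.

```python
from itertools import groupby


def fmt(seconds):
    """Format a time in seconds as hh.mm.ss (the style used in the example above)."""
    s = int(seconds)
    return f"{s // 3600:02d}.{s % 3600 // 60:02d}.{s % 60:02d}"


def voice_segments(words):
    """Collapse consecutive words from the same speaker into one segment.

    `words` is a hypothetical list of (speaker, word, start_sec, end_sec)
    tuples, already sorted by start time.
    Returns a list of (speaker, start_sec, end_sec, text) tuples.
    """
    segments = []
    for speaker, run in groupby(words, key=lambda w: w[0]):
        run = list(run)
        text = " ".join(w[1] for w in run)
        # Segment spans from the first word's start to the last word's end.
        segments.append((speaker, run[0][2], run[-1][3], text))
    return segments


# Illustrative data only, not real tool output.
words = [
    ("A", "Mary", 0.0, 1.0),
    ("B", "had", 2.0, 2.3),
    ("B", "a", 2.3, 2.5),
    ("B", "little", 2.5, 3.0),
    ("A", "lamb", 3.2, 4.0),
]

for speaker, start, end, text in voice_segments(words):
    print(f"Voice {speaker}: {fmt(start)}sec - {fmt(end)}sec  {text}")
```

The hard part (deciding which voice said which word) is what the engine itself has to do; this only covers the reporting side.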

Transcribe does “Recognize Multiple Speakers” — it’s on the page that you linked to, in the list of features.

I would also recommend comparing it to Google’s and maybe MS/Azure’s services.

Aside from the fidelity of the transcription itself, and the accuracy of disambiguating the different speakers, I’m also not convinced that all of these services will give you an end timestamp (start timestamps are sometimes there, but not necessarily for every word or sentence).

You could try a multi-pronged approach: run the transcription to get the text and speakers only (using the services mentioned above), and then use an aligner such as “gentle” [0] to find the start / end times.

You will have some gaps and wrongly transcribed words, but it may be a start..!

[0] https://github.com/lowerquality/gentle
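A minimal sketch of how the two steps could be glued together, in Python. It assumes the aligner was fed exactly the transcript produced by the transcription step, so both lists contain the same words in the same order. The record shape (dicts with `word`, `start` and `end`, where the timings are absent for words the aligner could not place) is an assumption for illustration, so check the actual output of whichever aligner you use. `attach_times` is a hypothetical helper, not part of any library.

```python
def attach_times(speaker_words, aligned_words):
    """Pair speaker-labelled words with aligner timings, position by position.

    speaker_words: list of (speaker, word) from the transcription step.
    aligned_words: list of dicts from the aligner (assumed shape, see above).
    Returns (timed, gaps): timed is a list of (speaker, word, start, end),
    gaps is the list of words that have no usable timing.
    """
    if len(speaker_words) != len(aligned_words):
        raise ValueError("transcript and alignment must contain the same words")

    timed, gaps = [], []
    for (speaker, word), rec in zip(speaker_words, aligned_words):
        start, end = rec.get("start"), rec.get("end")
        if start is None or end is None:
            gaps.append(word)  # the aligner could not place this word
            continue
        timed.append((speaker, word, start, end))
    return timed, gaps


# Illustrative data only, not real service or aligner output.
speaker_words = [("A", "Mary"), ("B", "had"), ("B", "a"), ("A", "lamb")]
aligned_words = [
    {"word": "Mary", "start": 0.0, "end": 1.0},
    {"word": "had", "start": 2.0, "end": 2.3},
    {"word": "a"},  # no timing: an alignment gap
    {"word": "lamb", "start": 3.2, "end": 4.0},
]

timed, gaps = attach_times(speaker_words, aligned_words)
print(timed)
print("unaligned:", gaps)
```

Matching by position keeps the sketch short; if the two steps tokenize the text differently, you would need a fuzzier match between the word lists.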