Yes, exactly. We do forced alignment when you edit your transcript. The new words don't have any timestamps, so we need to align them. For short sections, we interpolate between the surrounding words' timestamps. If we need to align whole sections, we use Gentle[^1].
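Roughly, the interpolation step works like this (a simplified sketch, not our production code; the "words" list and its "start"/"end" fields are placeholder names):

    def interpolate_timestamps(words):
        """Spread untimed (edited) words evenly across the gap between the
        nearest words that still have known timestamps.
        `words` is a list of dicts like {"text": "hi", "start": 1.2, "end": 1.5},
        with start/end set to None for newly typed words."""
        i = 0
        while i < len(words):
            if words[i]["start"] is not None:
                i += 1
                continue
            # Find the run of consecutive untimed words.
            j = i
            while j < len(words) and words[j]["start"] is None:
                j += 1
            gap_start = words[i - 1]["end"] if i > 0 else 0.0
            # If the edit is at the very end, assume ~300 ms per word as a fallback.
            gap_end = words[j]["start"] if j < len(words) else gap_start + 0.3 * (j - i)
            step = (gap_end - gap_start) / (j - i)
            for k in range(i, j):
                words[k]["start"] = gap_start + (k - i) * step
                words[k]["end"] = gap_start + (k - i + 1) * step
            i = j
        return words

The 300 ms per word fallback for trailing edits is an arbitrary placeholder; anything based on the speaker's actual rate would do better.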
I would also recommend comparing it to Google’s and maybe MS/Azure’s services.
Aside from the fidelity of the transcription itself, and the accuracy of disambiguating the different speakers, I’m also not convinced that all of these services will give you an end timestamp (start timestamps are sometimes there, but not necessarily for every word or sentence).
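If you want to check what a given service actually returns, with Google's v1 Python client you can request per-word timing explicitly; a sketch (field names and types have moved around between client library versions, so treat it as approximate):

    from google.cloud import speech

    client = speech.SpeechClient()

    with open("clip.wav", "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        enable_word_time_offsets=True,  # ask for per-word start/end times
    )

    response = client.recognize(config=config, audio=audio)
    for result in response.results:
        for w in result.alternatives[0].words:
            # start_time / end_time are offsets from the start of the clip
            print(w.word, w.start_time.total_seconds(), w.end_time.total_seconds())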
You could try a multi-pronged approach: run the transcription to get text and speakers only (using the services mentioned above), and then use an aligner such as “gentle” [0] to find the start/end times.
You will have some gaps and wrongly transcribed words, but it may be a start!
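If you go that route, the aligner step against a locally running gentle server looks roughly like this (a sketch following the endpoint in gentle's README; the file names are placeholders):

    import requests

    # Align a known transcript against the audio using a local gentle server
    # (started with serve.py, default port 8765).
    with open("episode.wav", "rb") as audio, open("transcript.txt", "rb") as text:
        resp = requests.post(
            "http://localhost:8765/transcriptions",
            params={"async": "false"},                 # block until alignment finishes
            files={"audio": audio, "transcript": text},
        )
    resp.raise_for_status()
    alignment = resp.json()

    for w in alignment.get("words", []):
        if w.get("case") == "success":                 # unaligned words are flagged, not timed
            print(w["word"], w["start"], w["end"])

Words gentle can't place (crosstalk, mumbles, transcription errors) come back flagged rather than timed, which is where those gaps will show up.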
Large-company APIs will usually be better at generic-speaker, generic-language recognition, but if you can do speaker adaptation and customize the language model, there are some insane gains possible, since you prune out a lot of uncertainty and complexity.
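To illustrate the language-model side of that: with the older cmusphinx/pocketsphinx-python API [1], dropping in a domain LM and dictionary is roughly the following sketch (podcast.lm.bin / podcast.dict are hypothetical files you would build from your own domain text):

    import os
    from pocketsphinx import Decoder, get_model_path

    model_path = get_model_path()

    config = Decoder.default_config()
    config.set_string("-hmm", os.path.join(model_path, "en-us"))  # generic acoustic model
    config.set_string("-lm", "podcast.lm.bin")    # hypothetical domain language model
    config.set_string("-dict", "podcast.dict")    # hypothetical domain pronunciation dict
    decoder = Decoder(config)

    with open("clip.raw", "rb") as f:             # 16 kHz, 16-bit mono PCM
        decoder.start_utt()
        decoder.process_raw(f.read(), False, True)
        decoder.end_utt()

    print(decoder.hyp().hypstr if decoder.hyp() else "")

Constraining the vocabulary and grammar to your domain is what buys you most of those gains.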
If you are more interested in recognition and alignment to a script, "gentle" is great [2][3]. The guts also have raw Kaldi recognition, which is pretty good for a generic speech recognizer but you would need to do some coding to pull out that part on its own.
For a decently performing deep model, check out Mozilla's implementation of Baidu's DeepSpeech [4].
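Inference with the DeepSpeech Python package is roughly this (0.7-0.9 era API, which changed across releases; the model/scorer filenames are release artifacts used here as placeholders):

    import wave
    import numpy as np
    from deepspeech import Model

    # Load the released acoustic model plus an external scorer (language model).
    ds = Model("deepspeech-0.9.3-models.pbmm")
    ds.enableExternalScorer("deepspeech-0.9.3-models.scorer")

    # DeepSpeech expects 16 kHz, 16-bit mono PCM.
    with wave.open("clip.wav", "rb") as w:
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

    print(ds.stt(audio))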
If you are doing full-on development, my colleague has been using a bridge between PyTorch (for training) and Kaldi (to use its decoders) with good success [5].
[0] how I use pocketsphinx to get phonemes, https://github.com/kastnerkyle/ez-phones
[1] https://github.com/cmusphinx/pocketsphinx-python
[2] https://github.com/lowerquality/gentle
[3] how I use gentle for forced alignment, https://github.com/kastnerkyle/raw_voice_cleanup/tree/master...