If you already have the transcript without timestamps (e.g. for an audiobook where you know the source text), you could use https://github.com/readbeyond/aeneas , which infers the timestamps by aligning text-to-speech output with the audio using dynamic time warping.

If you don't have the transcript, you'd use a transcription service that also gives you timestamps. E.g. there was a frontpage submission yesterday where someone used AWS Transcription to count the number of words in each minute of a talk: https://news.ycombinator.com/item?id=21635939