What does HackerNews think of aeneas?

aeneas is a Python/C library and a set of tools to automagically synchronize audio and text (aka forced alignment)

Language: Python

#61 in Linux
#42 in macOS
#79 in Python
#30 in Windows
That is funny. For audio books I'm currently working on an `epub` command for `tone` which will be able to extract text from `epub` files, e.g.:

  tone epub --format="markdown" --extract-sentences --one-file-per-chapter output-path/
As a result, you can use https://github.com/readbeyond/aeneas with the generated text / markdown files to create a JSON mapping file that looks like this:

  {
   "fragments": [
    {
     "begin": "0.000",
     "children": [],
     "end": "7.920",
     "id": "f000001",
     "language": "eng",
     "lines": [
      "This is the first sentence of the audio book."
     ]
    }
   ]
  }
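A mapping file of this shape is easy to consume from Python. The following is a minimal sketch, assuming only the fields shown above (`fragments`, `begin`, `end`, `id`, `lines`); the helper name `load_fragments` and the inline example string are made up for illustration and are not part of aeneas or tone.

```python
import json


def load_fragments(sync_map_text):
    """Parse a JSON sync map into (id, begin, end, text) tuples."""
    data = json.loads(sync_map_text)
    rows = []
    for frag in data["fragments"]:
        # Timestamps are stored as strings, so convert them to seconds.
        begin = float(frag["begin"])
        end = float(frag["end"])
        # A fragment may span several lines of text; join them for display.
        text = " ".join(frag["lines"])
        rows.append((frag["id"], begin, end, text))
    return rows


# Inline example mirroring the structure shown above (hypothetical data).
example = """
{
 "fragments": [
  {
   "begin": "0.000",
   "children": [],
   "end": "7.920",
   "id": "f000001",
   "language": "eng",
   "lines": ["This is the first sentence of the audio book."]
  }
 ]
}
"""

for frag_id, begin, end, text in load_fragments(example):
    print(f"{frag_id}: {begin:.3f}-{end:.3f} s  {text}")
```

In practice you would read the text from the generated mapping file instead of an inline string.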
Since aeneas is a bit inaccurate, I'm also working on an improvement with silence detection for these mapping files.
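One way such a correction could work is to snap each fragment boundary to the nearest pause in the audio. The sketch below is only a guess at the idea, not the commenter's actual method: it assumes the silent intervals have already been detected by some other tool (for example ffmpeg's `silencedetect` filter), and the function name `snap_to_silence` and the `max_shift` tolerance are hypothetical.

```python
def snap_to_silence(boundaries, silences, max_shift=0.5):
    """Move each boundary to the midpoint of the nearest silent interval.

    boundaries: times in seconds (e.g. the "end" values of the fragments).
    silences:   (start, end) silent intervals in seconds, detected elsewhere.
    max_shift:  hypothetical tolerance in seconds; a boundary farther than
                this from every silence midpoint is left unchanged.
    """
    midpoints = [(start + end) / 2.0 for start, end in silences]
    snapped = []
    for t in boundaries:
        if not midpoints:
            # No silence information at all: keep the original boundary.
            snapped.append(t)
            continue
        nearest = min(midpoints, key=lambda m: abs(m - t))
        snapped.append(nearest if abs(nearest - t) <= max_shift else t)
    return snapped


# Hypothetical numbers: three fragment boundaries and three detected pauses.
boundaries = [7.92, 15.10, 30.00]
silences = [(7.60, 8.00), (15.30, 15.70), (22.00, 22.40)]
print(snap_to_silence(boundaries, silences))
```

With these made-up numbers the first two boundaries move to the middle of a nearby pause, while the third has no pause within the tolerance and stays where it is.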

If you are looking for something that is "ready to use", you could check out https://github.com/r4victor/syncabook or the corresponding library https://github.com/r4victor/afaligner

If you have audio files that are NOT audio books, the epub approach will not help you, and the other comments are more helpful.

I use Aeneas[1], a set of tools for forced alignment. I found it to be, in equal measure, an amazing and a hard-to-navigate resource. It took me a while to set up and configure everything to the point that it was usable. But when it works, it works well.

[1] https://github.com/readbeyond/aeneas
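For orientation, a basic alignment run is a single command once everything is installed. The sketch below only builds that command line, following the `aeneas.tools.execute_task` invocation shown in the project README; the helper name `build_aeneas_command` and the file names are placeholders, and the three configuration keys are just the minimal set for plain text in and JSON out.

```python
import shlex


def build_aeneas_command(audio_path, text_path, output_path,
                         language="eng", output_format="json"):
    """Build the argv list for a basic aeneas alignment task."""
    # aeneas takes its task settings as one pipe-separated string.
    config = "|".join([
        f"task_language={language}",
        "is_text_type=plain",  # one text fragment per line
        f"os_task_file_format={output_format}",
    ])
    return ["python", "-m", "aeneas.tools.execute_task",
            audio_path, text_path, config, output_path]


# Placeholder file names, for illustration only.
cmd = build_aeneas_command("chapter01.mp3", "chapter01.txt", "chapter01.json")
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would run the task, assuming aeneas and its dependencies (ffmpeg, espeak) are installed and on the path.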

https://github.com/readbeyond/aeneas can output SMIL e-books where each line of text is associated with the corresponding segment from the audiobook. But I don't know which e-book readers support that format.
You are close enough, so I have to keep my word. I'm not a genius, just a Lego builder. I've tried a lot of methods, from DL to ML, but the aeneas project (with some optimizations) gave me the best results. Amazing project and even better personality. Take a look at https://github.com/readbeyond/aeneas. Together with espeak-ng, you can get good results for line-level alignment for 108 languages.
If you already have the transcript without timestamps (e.g. for an audiobook where you know the source text), you could use https://github.com/readbeyond/aeneas , which infers the timestamps by aligning text-to-speech output with the audio using dynamic time warping.
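To make the dynamic time warping part concrete, here is a toy sketch of classic DTW on two short one-dimensional sequences. It is not aeneas's implementation: real aligners compare audio feature vectors per frame rather than single numbers, and the function name `dtw_path` and the sample sequences are invented for illustration.

```python
def dtw_path(a, b):
    """Align two non-empty numeric sequences with classic DTW.

    Returns (total_cost, path), where path is a list of (i, j) index pairs
    saying that a[i] is matched with b[j].
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1],      # b advances alone
                                 cost[i - 1][j - 1])  # both advance
    # Walk back from the end to recover which elements were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


# "synthetic" stands in for the text-to-speech signal, "recorded" for the
# real narration, which lingers longer on some values.
synthetic = [0, 1, 2, 1, 0]
recorded = [0, 0, 1, 2, 2, 1, 0]
total, path = dtw_path(synthetic, recorded)
print(total, path)
```

Because the synthesized audio has known positions for each text fragment, mapping those positions through the path gives the corresponding times in the real recording.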

If you don't have the transcript, you'd use a transcription service that also gives you timestamps. E.g. there was a frontpage submission yesterday where someone used AWS Transcription to count the number of words in each minute of a talk: https://news.ycombinator.com/item?id=21635939

Very interesting application. I am not sure if you guys have looked into this, but there is a Python library that can determine timestamps at the word level when given the audio and the transcript. It's pretty accurate for English: https://github.com/readbeyond/aeneas