While doing my PhD some years ago (it wasn't a PhD on AI, but it was closely related), I trained several models with the usual stack back then (PyTorch, and some others in TF). I realized that a lot of this stack could be rewritten in much simpler terms without sacrificing much fidelity or performance in the end.

Submissions like yours, and other projects like this one (recently featured here as well): https://github.com/ggerganov/whisper.cpp, make it pretty clear to me that this intuition is correct.

There are a couple of tools I created back then that could push things further in this direction. Unfortunately they're not mature enough to warrant a release, but the ideas behind them are worth taking a look at (IMHO) and I'll be happy to share them. If there's interest on your side (or from anyone reading this thread), I'd love to talk more about it.