How good is the speech-to-text in practice? I have found that open-source speech-to-text models are unfortunately far behind Google's and Apple's, which makes a complete assistant really frustrating to use.
Did you try Vosk (https://github.com/alphacep/vosk-api) or Wenet (https://github.com/wenet-e2e/wenet)?