What does HackerNews think of CTranslate2?
Fast inference engine for Transformer models
Language: C++
#30 in C++ · #33 in Deep learning
We'd love to move beyond Nvidia.
The issue (among others) is that we achieve the speech recognition performance we do largely thanks to ctranslate2[0]. They've gone on record saying that they essentially have no interest in ROCm[1].
Of course, with open source anything is possible, but we see this as one of several fundamental issues in supporting AMD GPGPU hardware.
The original Whisper implementation from OpenAI uses the PyTorch deep learning framework. faster-whisper, on the other hand, is implemented with CTranslate2 [1], a custom inference engine for Transformer models. So it is basically running the same model, but on a different backend that is specifically optimized for inference workloads.
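A minimal sketch of how faster-whisper is typically invoked (model name, device, and audio path are placeholders):

```python
from faster_whisper import WhisperModel

# Load a Whisper model in CTranslate2 format; compute_type selects
# the inference precision (e.g. float16 on GPU, int8 on CPU).
model = WhisperModel("small", device="cuda", compute_type="float16")

# Transcribe an audio file; segments is a generator of timestamped results.
segments, info = model.transcribe("audio.wav", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```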
Python is just the glue language. All the heavy lifting happens in CUDA, cuBLAS, cuDNN, and so on.
Most memory-saving optimizations come from using lower-precision numbers (float16 or smaller), quantization (int8 or int4), sparsification, etc. But this is all handled by the underlying framework, such as PyTorch.
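As a rough illustration of what "handled by the framework" means, here is a hedged PyTorch sketch (the model is a placeholder, not an actual Transformer) of requesting lower precision and int8 quantization:

```python
import torch
import torch.nn as nn

# Placeholder model; in practice this would be a full Transformer.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

# int8: dynamic quantization of the Linear layers for CPU inference.
model_int8 = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# float16: cast the weights to half precision (usually done for GPU inference).
model_fp16 = model.half()
```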
There are C++ implementations, but they optimize for different aspects. For example: https://github.com/OpenNMT/CTranslate2/
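For reference, a minimal sketch of loading a converted model with CTranslate2's Python API and requesting int8 quantization at load time (the model directory and input tokens are placeholders; it assumes a model already converted to the CTranslate2 format):

```python
import ctranslate2

# Load a model previously converted with one of the ct2 converter tools;
# compute_type="int8" asks the engine to quantize the weights at load time.
translator = ctranslate2.Translator(
    "ende_ctranslate2/", device="cpu", compute_type="int8"
)

# Input must already be tokenized (e.g. with SentencePiece); placeholder tokens here.
results = translator.translate_batch([["▁Hello", "▁world", "!"]])
print(results[0].hypotheses[0])
```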