It's really interesting that these models are written in Python. Does anyone know how much of a speedup a faster language would give here? Maybe it's already offloading a lot of the computation to C (I know many Python libraries do this), but I'd love to know.

Python is just the glue language. All the heavy lifting happens in CUDA kernels, cuBLAS, cuDNN, and so on.
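To make the "glue" point concrete, here's a minimal sketch (assuming PyTorch and a CUDA GPU; the matrix sizes are arbitrary): the Python line only queues a kernel launch and returns almost immediately, while the matmul itself runs in cuBLAS on the GPU.

    import time
    import torch

    a = torch.randn(4096, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")

    # Warm-up so cuBLAS initialization doesn't skew the timing.
    (a @ b); torch.cuda.synchronize()

    start = time.perf_counter()
    c = a @ b                                    # only queues the kernel launch
    launch_time = time.perf_counter() - start

    torch.cuda.synchronize()                     # wait for the GPU to finish
    total_time = time.perf_counter() - start

    print(f"Python-side launch: {launch_time * 1e3:.3f} ms")
    print(f"GPU matmul total:   {total_time * 1e3:.3f} ms")

The launch time is typically a tiny fraction of the total, which is why rewriting the Python layer in a faster language buys relatively little.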

Most memory-saving optimizations come from using lower-precision numbers (float16 or below), quantization (int8 or int4), sparsification, etc. But this is all handled by the underlying framework, such as PyTorch.
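A minimal sketch of those knobs with stock PyTorch APIs (the toy model and layer sizes are placeholders, and supported layers vary by version):

    import copy
    import torch
    import torch.nn as nn

    # Toy model standing in for a transformer's linear layers.
    model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))

    # Lower precision: keep weights in float16, halving parameter memory.
    model_fp16 = copy.deepcopy(model).half()

    # Quantization: dynamic int8 quantization of the Linear weights.
    model_int8 = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )

    def param_bytes(m):
        return sum(p.numel() * p.element_size() for p in m.parameters())

    print("float32:", param_bytes(model), "bytes")
    print("float16:", param_bytes(model_fp16), "bytes")
    # The int8 model stores packed quantized weights internally, so its
    # footprint isn't visible through .parameters() the same way.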

There are C++ implementations, but they optimize for different aspects. For example: https://github.com/OpenNMT/CTranslate2/
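CTranslate2 still exposes Python bindings over its C++ engine, so usage looks roughly like this (the model directory and tokens are placeholders for a model converted with the project's tools; check its docs for the exact API in your version):

    import ctranslate2

    # "ende_ctranslate2/" is a placeholder for a converted model directory.
    translator = ctranslate2.Translator("ende_ctranslate2/", device="cpu")
    results = translator.translate_batch([["▁Hello", "▁world", "!"]])
    print(results[0].hypotheses[0])

So even there, Python stays the front end; the speedups come from the C++ runtime's own optimizations (fused ops, int8/float16 compute, CPU-friendly layouts), not from dropping Python per se.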