I think Python is dominant in AI because the Python-C relationship mirrors the CPU-GPU relationship.

GPUs are extremely performant and also very hard to program, so people just use highly abstracted libraries like PyTorch to command the GPU.

C is very performant and hard to write, so people just use Python as an abstraction layer over C.
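A minimal illustration of that abstraction layer (names here are my own, not from any library): the pure-Python loop is interpreted bytecode, while the builtin `sum()` does the same loop inside CPython's compiled C internals — the Python code stays simple and the heavy lifting happens in C.

```python
# Pure-Python loop: every iteration goes through the bytecode interpreter.
def py_sum(xs):
    total = 0
    for x in xs:
        total += x
    return total

data = list(range(1_000_000))

# The builtin sum() is implemented in C inside CPython: one call drops
# into compiled code for the entire loop.
assert py_sum(data) == sum(data)  # same result, very different speed
```

NumPy takes the same idea further: whole-array operations dispatch to compiled C loops, which is why idiomatic numeric Python avoids explicit `for` loops entirely.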

It's not clear that most people need to understand GPUs that much (unless you are deep in AI training/ops land). Now that Moore's law has ended and multithreading is becoming the dominant route to speed increases, there'll probably be brand-new languages dedicated to this paradigm of parallel programming. Mojo is a start.

I've wondered for a while: is there space for a (new?) language that invisibly maximises performance on whatever hardware it runs on?

As in, every instruction, from a simple loop of calculations onward, is compiled behind the scenes so that it intelligently uses every available CPU core in parallel, and also farms everything possible out to the GPU?
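For contrast, here's roughly what that looks like when done by hand today (a sketch with made-up function names, not any existing auto-parallelising system): the programmer has to choose the chunking, the worker count, and the reduction — all the bookkeeping an "invisible" language would hide.

```python
from concurrent.futures import ThreadPoolExecutor
import os

def sum_squares(chunk):
    # The per-core unit of work a smart compiler would have to carve out.
    return sum(x * x for x in chunk)

def parallel_sum_squares(data, workers=None):
    """Manually split a loop across worker threads, then reduce."""
    workers = workers or os.cpu_count() or 1
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_squares, chunks))

data = list(range(1000))
assert parallel_sum_squares(data) == sum(x * x for x in data)
```

And even this doesn't actually buy parallel speed here: CPython's GIL stops pure-Python threads from running arithmetic concurrently, which is exactly why the heavy lifting gets farmed out to C or the GPU in the first place.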

Has this been done? Is it possible?

You might be interested in https://github.com/HigherOrderCO/HVM