GPUs are extremely performant, and also very hard to program directly, so people mostly drive them through highly abstracted libraries like PyTorch rather than writing GPU code themselves.
C is very performant, and hard to write, so people just use Python as an abstraction layer over C.
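For concreteness, a minimal sketch of what those two abstraction layers look like in practice (assuming NumPy and PyTorch are installed; neither call requires the user to touch C or CUDA directly):

```python
import numpy as np
import torch

# Python over C: this high-level call dispatches to compiled C/BLAS under the hood.
a = np.random.rand(1024, 1024)
b = np.random.rand(1024, 1024)
c_cpu = a @ b

# Python over the GPU: the same kind of high-level call, routed to the GPU if one
# is available; the library selects the GPU kernels, the user never writes them.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)
y = torch.randn(1024, 1024, device=device)
z = x @ y
```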
It's not clear most people need to understand GPUs that deeply (unless you're deep in AI training/ops land). In time, now that Moore's law has ended and multithreading has become the dominant route to speed increases, there'll probably be brand new languages dedicated to this paradigm of parallel programming. Mojo is a start.
As in, every instruction, from a simple loop of calculations onward, is handled behind the scenes so that it intelligently maximises usage of every available CPU core in parallel, and also farms everything it can out to the GPU?
Has this been done? Is it possible?
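For concreteness, something like the sketch below, but applied automatically to every loop rather than opted into with a decorator. The sketch assumes Numba is installed; its parallel=True mode already splits the prange loop across available CPU cores, while the "farm it out to the GPU" half is still largely manual (separate explicit kernels), which is part of what I'm asking about.

```python
from numba import njit, prange
import numpy as np

# Opt-in today: parallel=True tells Numba to compile this function and
# distribute the prange loop (including the reduction) across CPU cores.
@njit(parallel=True)
def scaled_sum(x):
    total = 0.0
    for i in prange(x.shape[0]):
        total += x[i] * 2.0
    return total

print(scaled_sum(np.arange(1_000_000, dtype=np.float64)))
```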