I'm very curious to see whether the scientific computing community will find uses for 'tensor cores'[1] beyond GEMMs. As this paper indicates, there may be more cases in HPC that can efficiently leverage small matrix multiplication instructions. This might be particularly so as NVIDIA Ampere claims to have extended support to single precision, instead of just the precisions of interest to the machine learning community.

[1] Tensor cores is a poor name, likely from marketing, as the units really only compute fixed, small matrix sizes.

I'm also wondering, how many people are using a NumPy replacement that uses the GPU?

SuiteSparse doesn't support GPUs just yet, but when it does, you can do this with pygraphblas:

https://github.com/michelp/pygraphblas

(I am the pygraphblas author)

That's cool, but I was thinking more of a drop-in replacement for NumPy / SciPy, like (I just searched for it) CuPy [1]

It would be nice if people could share some experiences.

[1] https://towardsdatascience.com/heres-how-to-use-cupy-to-make...

NumPy (and CuPy) provide dense matrices. They're super awesome and certainly very useful for many kinds of problems, but they are not useful for storing adjacency matrices of sparse graphs. That is the point of the paper and the purpose of SuiteSparse and The GraphBLAS.

Dense matrices are great, and their implementation is straightforward: a dense chunk of memory contains every element of the matrix, so an N-by-N square matrix requires N squared elements of storage. Finding an element is a simple matter of indexing math. For large adjacency matrices this is horribly inefficient: most elements end up being zero, and the bigger the graph gets, the worse the cache and memory locality become.
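
To make the contrast concrete, here is a minimal pure-Python sketch of the two layouts. The five-node edge list and the helper names are made up for illustration, and compressed sparse row (CSR) is used only because it is a common sparse format; this is not a claim about how SuiteSparse stores matrices internally.

```python
# Hypothetical 5-node directed graph given as (source, target) edges.
edges = [(0, 1), (0, 3), (2, 4), (3, 1), (4, 0)]
n = 5

# Dense storage: one flat row-major buffer with n * n slots.
dense = [0] * (n * n)
for i, j in edges:
    dense[i * n + j] = 1  # "indexing math": element (i, j) lives at i * n + j

def dense_has_edge(i, j):
    return dense[i * n + j] == 1

# CSR storage: column indices of the nonzeros grouped by row,
# plus n + 1 offsets marking where each row's slice begins.
row_ptr = [0] * (n + 1)
for i, _ in edges:
    row_ptr[i + 1] += 1          # count the nonzeros in each row
for i in range(n):
    row_ptr[i + 1] += row_ptr[i]  # prefix sum turns counts into offsets
col_idx = [j for _, j in sorted(edges)]

def csr_has_edge(i, j):
    # Scan only the stored entries of row i instead of indexing a full row.
    return j in col_idx[row_ptr[i]:row_ptr[i + 1]]

dense_slots = len(dense)                 # n * n
csr_slots = len(row_ptr) + len(col_idx)  # (n + 1) + number of edges
print(dense_slots, csr_slots)
```

The dense lookup is a single multiply-and-add, while the CSR lookup scans one row's stored entries. The trade is that CSR storage grows with the number of edges rather than with N squared.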

Hypersparse graphs, like, say, a large social network, may have only a few hundred billion edges, but fitting that into a dense adjacency matrix would require quadrillions of mostly empty elements. This is clearly impossible, so sparse matrices are required to store a large graph.
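
A rough back-of-the-envelope version of that arithmetic, as a sketch. The node count, the edge count, the one-byte dense element, and the 8-byte CSR indices are all assumptions picked for illustration, not measurements of any real network.

```python
def dense_elements(num_nodes):
    """Number of slots in a dense num_nodes x num_nodes adjacency matrix."""
    return num_nodes * num_nodes

def csr_bytes(num_nodes, num_edges, index_bytes=8):
    """Pattern-only CSR: one column index per edge plus num_nodes + 1 row offsets."""
    return (num_edges + num_nodes + 1) * index_bytes

num_nodes = 100_000_000      # assumption: 10^8 vertices
num_edges = 200_000_000_000  # assumption: 2 * 10^11 edges

dense_count = dense_elements(num_nodes)
dense_bytes = dense_count * 1  # assumption: 1 byte per dense element
sparse_bytes = csr_bytes(num_nodes, num_edges)

print(f"dense: {dense_count:.1e} elements, {dense_bytes / 1e15:.0f} PB at 1 byte each")
print(f"CSR:   {sparse_bytes / 1e12:.1f} TB with 8-byte indices")
print(f"fraction of dense elements that are nonzero: {num_edges / dense_count:.0e}")
```

Under these assumed numbers the dense matrix lands in the petabyte range while the sparse pattern stays in the terabyte range, and only about two in every hundred thousand dense elements would be nonzero.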

CuPy's sparse matrix support is still more limited than its dense functionality, but it's expanding quickly, in version 8.0 in particular: https://docs.cupy.dev/en/stable/reference/sparse.html

The C++/CUDA backend to cuGraph contains many low-level graph operations on really sparse graph structures as well: https://github.com/rapidsai/cugraph