Is it possible to do inference with Falcon 40B on this type of hardware or similar?
Yes, but it would be massive overkill. Falcon 40B takes ~35GB of VRAM to load right now, and will probably need less in the future as quantization improves in llama.cpp and similar projects.
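As a rough sanity check on that ~35GB figure, here's a back-of-envelope weights-only estimate. The parameter count (40B), bit widths, and the 2GB overhead constant are my assumptions, not measured numbers; real usage runs higher because of activations, KV cache growth, and allocator fragmentation.

```python
def vram_gb(n_params: float, bits: int, overhead_gb: float = 2.0) -> float:
    """Weights-only VRAM estimate plus a fixed overhead allowance.

    n_params: parameter count (e.g. 40e9 for Falcon 40B)
    bits: bits per weight after quantization (16 = fp16, 4 = GPTQ 4-bit)
    """
    return n_params * bits / 8 / 1e9 + overhead_gb

# Falcon 40B at a few quantization levels (rough, weights-only + overhead):
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{vram_gb(40e9, bits):.0f} GB")
```

The 4-bit estimate lands near ~22GB for the weights alone, so the ~35GB observed in practice includes substantial runtime overhead on top of the quantized weights.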
https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ
Large context size is becoming less of an issue now too.
But maybe it would be good for batched inference?
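Batching does help here: autoregressive decode is usually memory-bandwidth bound, so each decode step streams the full weight matrix once regardless of batch size, and throughput scales roughly linearly with batch until you hit compute or KV-cache limits. A toy model of that (the weight size and bandwidth numbers are illustrative assumptions, not benchmarks):

```python
def decode_tokens_per_sec(batch: int, weight_bytes: float,
                          bandwidth_bytes_per_sec: float) -> float:
    """Idealized memory-bound decode throughput.

    Each step reads all weights once and produces one token per sequence
    in the batch. Ignores KV-cache reads and compute ceilings, so it
    overestimates at large batch sizes.
    """
    step_time = weight_bytes / bandwidth_bytes_per_sec  # seconds per decode step
    return batch / step_time

# e.g. ~20GB of 4-bit weights on a ~2TB/s GPU (assumed numbers):
print(decode_tokens_per_sec(1, 20e9, 2e12))   # single sequence
print(decode_tokens_per_sec(8, 20e9, 2e12))   # 8-way batch, ~8x the tokens
```

Under these assumptions, going from batch 1 to batch 8 multiplies aggregate token throughput by 8 at the same weight-streaming cost, which is why batched serving is where a big card earns its keep.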
That said, the 35GB version of Falcon is probably not something you'd want to run in production.
Also, ironically, this version of Falcon requires CUDA.
It might work on ROCm? I'm not sure about the status of GPTQ support on ROCm.
GPTQ-quantized LLaMA models work on ROCm via https://github.com/turboderp/exllama/, but Falcon inference is a different beast.