Is it possible to do inference with Falcon 40B on this type of hardware or similar?
Yes, but it would be massive overkill. Falcon 40B takes ~35GB of VRAM to load right now, and will probably need less in the future as quantization improves in llama.cpp and similar projects.
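As a rough sanity check on that ~35GB figure, here's a back-of-envelope weights-only estimate. The parameter count (40B), bit widths, and the 2GB overhead constant are my assumptions, not measured numbers; real usage runs higher because of activations, KV cache growth, and allocator fragmentation.

```python
def vram_gb(n_params: float, bits: int, overhead_gb: float = 2.0) -> float:
    """Weights-only VRAM estimate plus a fixed overhead allowance.

    n_params: parameter count (e.g. 40e9 for Falcon 40B)
    bits: bits per weight after quantization (16 = fp16, 4 = GPTQ 4-bit)
    """
    return n_params * bits / 8 / 1e9 + overhead_gb

# Falcon 40B at a few quantization levels (rough, weights-only + overhead):
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{vram_gb(40e9, bits):.0f} GB")
```

The 4-bit estimate lands near ~22GB for the weights alone, so the ~35GB observed in practice includes substantial runtime overhead on top of the quantized weights.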
https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ
Large context size is becoming less of an issue now too.
But maybe it would be good for batched inference?
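Batching does help here: autoregressive decode is usually memory-bandwidth bound, so each decode step streams the full weight matrix once regardless of batch size, and throughput scales roughly linearly with batch until you hit compute or KV-cache limits. A toy model of that (the weight size and bandwidth numbers are illustrative assumptions, not benchmarks):

```python
def decode_tokens_per_sec(batch: int, weight_bytes: float,
                          bandwidth_bytes_per_sec: float) -> float:
    """Idealized memory-bound decode throughput.

    Each step reads all weights once and produces one token per sequence
    in the batch. Ignores KV-cache reads and compute ceilings, so it
    overestimates at large batch sizes.
    """
    step_time = weight_bytes / bandwidth_bytes_per_sec  # seconds per decode step
    return batch / step_time

# e.g. ~20GB of 4-bit weights on a ~2TB/s GPU (assumed numbers):
print(decode_tokens_per_sec(1, 20e9, 2e12))   # single sequence
print(decode_tokens_per_sec(8, 20e9, 2e12))   # 8-way batch, ~8x the tokens
```

Under these assumptions, going from batch 1 to batch 8 multiplies aggregate token throughput by 8 at the same weight-streaming cost, which is why batched serving is where a big card earns its keep.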
That said, the 35GB version of Falcon is probably not something you'd want to run in production.
Also, ironically, this version of Falcon requires CUDA.
It might work on ROCm? I'm not sure about the status of GPTQ support on ROCm.
GPTQ-quantized LLaMA models work on ROCm via https://github.com/turboderp/exllama/, but Falcon inference is a different beast.