For Falcon specifically, this is easy. It's embedded here: https://huggingface.co/blog/falcon#demo or you can access the demo directly here: https://huggingface.co/spaces/HuggingFaceH4/falcon-chat

I just tested both and it's pretty zippy (faster than AMD's recent live MI300 demo).

For llama-based models, I've recently been using https://github.com/turboderp/exllama a lot. It has a Dockerfile/docker-compose.yml, so it should be pretty easy to get going. llama.cpp is the other easy one: the most recent updates put its CUDA support at only about 25% slower, and building it is generally a simple `make` with a flag for whichever GPU support you want. It has basically no dependencies.

Also, here's a Colab notebook that should let you run up to 13b quantized models (12G RAM, 80G disk, Tesla T4 16G) for free: https://colab.research.google.com/drive/1QzFsWru1YLnTVK77itW... (for Falcon, replace w/ Koboldcpp or ctransformers)
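
If you're wondering why 13b quantized is roughly the ceiling on a 16G T4, here's a rough back-of-the-envelope sketch. The numbers are my assumptions, not measurements: about 4.5 bits per weight for a typical 4-bit quantization (4-bit values plus per-block scales), and a flat 2 GiB of headroom for the KV cache and runtime buffers. The function names are made up for illustration.

```python
def quantized_weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


def fits(n_params: float, bits_per_weight: float, vram_gib: float,
         overhead_gib: float = 2.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in VRAM.

    overhead_gib is a guess covering KV cache and scratch buffers;
    real usage depends on context length and the runtime.
    """
    return quantized_weight_gib(n_params, bits_per_weight) + overhead_gib <= vram_gib


if __name__ == "__main__":
    T4_VRAM_GIB = 16  # nominal; slightly less is usable in practice
    for name, n_params in [("7b", 7e9), ("13b", 13e9), ("33b", 33e9)]:
        for label, bits in [("4-bit (~4.5 bpw)", 4.5), ("fp16", 16)]:
            size = quantized_weight_gib(n_params, bits)
            verdict = "fits" if fits(n_params, bits, T4_VRAM_GIB) else "too big"
            print(f"{name:>3} {label:<17} {size:5.1f} GiB  {verdict}")
```

By this estimate a 13b model at about 4.5 bits per weight comes to under 7 GiB of weights, while the same model in fp16 or a 33b model at 4-bit lands over 16 GiB, which lines up with "up to 13b quantized" on the free tier.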