What does HackerNews think of ggllm.cpp?
Falcon LLM ggml framework with CPU and GPU support
Language: C
llama.cpp doesn't support Falcon right now, but there's a fork that does (https://github.com/cmp-nct/ggllm.cpp/).
Pretty much anything with 32GB (?) total RAM+VRAM:
https://github.com/cmp-nct/ggllm.cpp
But it's going to be slow without at least a small Nvidia GPU (a 2060?). CPUs are really slow at prompt ingestion, and that can't be hidden with streaming.
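As a rough sanity check on that 32GB figure, here's a back-of-envelope sketch in C. The parameter count is Falcon-40B's; the ~4.5 bits/weight effective size for a q4-style GGML quant and the overhead figure are assumptions, not numbers from the thread:

    /* Back-of-envelope memory estimate for a quantized Falcon-40B. */
    #include <stdio.h>

    int main(void) {
        double params = 40e9;          /* Falcon-40B parameter count */
        double bits_per_weight = 4.5;  /* q4-style GGML quant, effective size (assumption) */
        double weights_gb = params * bits_per_weight / 8.0 / 1e9;
        double overhead_gb = 4.0;      /* KV cache + scratch buffers, rough guess */
        printf("weights ~%.1f GB, total ~%.1f GB\n",
               weights_gb, weights_gb + overhead_gb);
        return 0;
    }

That lands around 26-27GB, so 32GB of combined RAM+VRAM is roughly the floor for the 40B model; the 7B fits in a fraction of that.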
The ggllm.cpp fork seems to be the leading Falcon option for now [1]
It comes with its own GGML sub-format, "ggcv1", but there are quants available on HF [2] (there's a sketch of how these quant formats work below, after the links)
Although if you have a GPU, I'd go with the newly released AWQ quantization instead [3]; the performance is better.
(I may or may not have a mild local LLM addiction, and video cards cost more than drugs)
[1] https://github.com/cmp-nct/ggllm.cpp
[2] https://huggingface.co/TheBloke/falcon-7b-instruct-GGML
[3] https://huggingface.co/abhinavkulkarni/tiiuae-falcon-7b-inst...
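To give a sense of what these GGML-style quant formats actually do, here's a minimal C sketch of symmetric 4-bit block quantization. It's a simplified stand-in for formats like q4_0 or ggcv1, not the real on-disk layout (real GGML quants pack two weights per byte and store fp16 scales); only the block size of 32 matches GGML's convention:

    #include <stdio.h>
    #include <math.h>

    #define BLOCK 32  /* GGML quantizes weights in blocks of 32 */

    /* Quantize one block: one shared scale, one small integer per weight. */
    void quantize_block(const float *x, signed char *q, float *scale) {
        float amax = 0.0f;
        for (int i = 0; i < BLOCK; i++)
            if (fabsf(x[i]) > amax) amax = fabsf(x[i]);
        *scale = amax / 7.0f;  /* map [-amax, amax] onto the integer range [-7, 7] */
        for (int i = 0; i < BLOCK; i++)
            q[i] = (signed char)roundf(*scale > 0.0f ? x[i] / *scale : 0.0f);
    }

    float dequantize(signed char q, float scale) { return q * scale; }

    int main(void) {
        float x[BLOCK], scale;
        signed char q[BLOCK];
        for (int i = 0; i < BLOCK; i++) x[i] = sinf((float)i);  /* dummy weights */
        quantize_block(x, q, &scale);
        printf("x[3] = % .4f -> q = %d -> % .4f\n",
               x[3], q[3], dequantize(q[3], scale));
        return 0;
    }

Each weight drops from 32 bits to roughly 4 bits plus a shared per-block scale, which is how a 7B model ends up as the ~4GB files on HF. AWQ gets better quality at a similar size by picking those scales using activation statistics rather than the plain max-abs used here.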
Experimental Falcon inference via ggml (so on CPU): https://github.com/cmp-nct/ggllm.cpp
It has problems, but it does work.