What does HackerNews think of ggllm.cpp?
Falcon LLM ggml framework with CPU and GPU support
Language: C
llama.cpp doesn't support Falcon right now, but there's a fork that does (https://github.com/cmp-nct/ggllm.cpp/).
Pretty much anything with 32GB (?) total RAM+VRAM:
https://github.com/cmp-nct/ggllm.cpp
But it's going to be slow without at least a small Nvidia GPU (a 2060?). CPUs are really slow at prompt ingestion, and that can't be hidden with streaming.
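As a rough sanity check on that 32GB figure, here's a back-of-envelope sketch in C. The parameter count is Falcon-40B's; the ~4.5 bits/weight effective size for a q4-style GGML quant and the overhead figure are assumptions, not numbers from the thread:

    /* Back-of-envelope memory estimate for a quantized Falcon-40B. */
    #include <stdio.h>

    int main(void) {
        double params = 40e9;          /* Falcon-40B parameter count */
        double bits_per_weight = 4.5;  /* q4-style GGML quant, effective size (assumption) */
        double weights_gb = params * bits_per_weight / 8.0 / 1e9;
        double overhead_gb = 4.0;      /* KV cache + scratch buffers, rough guess */
        printf("weights ~%.1f GB, total ~%.1f GB\n",
               weights_gb, weights_gb + overhead_gb);
        return 0;
    }

That lands around 26-27GB, so 32GB of combined RAM+VRAM is roughly the floor for the 40B model; the 7B fits in a fraction of that.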
The ggllm.cpp fork seems to be the leading Falcon option for now [1]
It comes with its own GGML sub-format, "ggcv1", but there are quants available on HF [2] (there's a sketch of how these quant formats work below, after the links)
Although if you have a GPU, I'd go with the newly released AWQ quantization instead [3]; the performance is better.
(I may or may not have a mild local LLM addiction, and video cards cost more than drugs)
[1] https://github.com/cmp-nct/ggllm.cpp
[2] https://huggingface.co/TheBloke/falcon-7b-instruct-GGML
[3] https://huggingface.co/abhinavkulkarni/tiiuae-falcon-7b-inst...
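To give a sense of what these GGML-style quant formats actually do, here's a minimal C sketch of symmetric 4-bit block quantization. It's a simplified stand-in for formats like q4_0 or ggcv1, not the real on-disk layout (real GGML quants pack two weights per byte and store fp16 scales); only the block size of 32 matches GGML's convention:

    #include <stdio.h>
    #include <math.h>

    #define BLOCK 32  /* GGML quantizes weights in blocks of 32 */

    /* Quantize one block: one shared scale, one small integer per weight. */
    void quantize_block(const float *x, signed char *q, float *scale) {
        float amax = 0.0f;
        for (int i = 0; i < BLOCK; i++)
            if (fabsf(x[i]) > amax) amax = fabsf(x[i]);
        *scale = amax / 7.0f;  /* map [-amax, amax] onto the integer range [-7, 7] */
        for (int i = 0; i < BLOCK; i++)
            q[i] = (signed char)roundf(*scale > 0.0f ? x[i] / *scale : 0.0f);
    }

    float dequantize(signed char q, float scale) { return q * scale; }

    int main(void) {
        float x[BLOCK], scale;
        signed char q[BLOCK];
        for (int i = 0; i < BLOCK; i++) x[i] = sinf((float)i);  /* dummy weights */
        quantize_block(x, q, &scale);
        printf("x[3] = % .4f -> q = %d -> % .4f\n",
               x[3], q[3], dequantize(q[3], scale));
        return 0;
    }

Each weight drops from 32 bits to roughly 4 bits plus a shared per-block scale, which is how a 7B model ends up as the ~4GB files on HF. AWQ gets better quality at a similar size by picking those scales using activation statistics rather than the plain max-abs used here.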
Experimental Falcon inference via ggml (so on CPU): https://github.com/cmp-nct/ggllm.cpp
It has problems, but it does work.