In terms of building something usable (considering cost, speed, scale, etc.), when I compare these models to an OpenAI API call, it's difficult to see a current path where they have any viable application outside of niche scenarios.

From what I understand, even to run these locally, you or your team needs to be able to afford a machine with something like a 4090, and those are super expensive in some countries.

I played around with the smaller Llama/Alpaca models, and they weren't really viable to build anything with.

Not really seeing a use-case for fine-tuning either compared to just few-shot prompting.
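To be concrete, by few-shot prompting I mean sticking a couple of worked examples directly in the prompt instead of running a training job. A rough sketch with the 0.x-era openai Python client (the sentiment task here is made up):

    import openai  # pip install openai==0.28; reads OPENAI_API_KEY from the environment

    # Worked examples go straight into the message list; no fine-tuning run needed.
    messages = [
        {"role": "system", "content": "Classify the sentiment of each review."},
        {"role": "user", "content": "Review: The battery died within a day."},
        {"role": "assistant", "content": "negative"},
        {"role": "user", "content": "Review: Arrived early and works great."},
        {"role": "assistant", "content": "positive"},
        {"role": "user", "content": "Review: Does the job, nothing special."},
    ]

    resp = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
    print(resp["choices"][0]["message"]["content"])  # expect something like "neutral"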

Can someone fill me in on what I'm missing? It feels like I'm out of the loop.

I'm running Vicuna on a free 4-core Oracle VPS, and it's perfectly usable for a Discord bot. Responses rarely take more than 15 seconds with a max token limit under 256, and the responses are much more entertaining than GPT-3.5's. I'm not using the streaming API my server software[0] offers, but if I did, it would probably load somewhere between the speeds of GPT-3.5 and GPT-4. It's more or less the same time a human would take to compose the same message.
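For anyone curious how the bot talks to it: LocalAI exposes an OpenAI-compatible REST API, so the bot just POSTs to the VPS. A minimal sketch (the host and model name are placeholders for whatever you've configured; 8080 is LocalAI's default port):

    import requests

    resp = requests.post(
        "http://my-vps:8080/v1/chat/completions",  # placeholder host
        json={
            "model": "vicuna",  # placeholder: whatever your LocalAI config names the model
            "messages": [{"role": "user", "content": "Say something entertaining."}],
            "max_tokens": 256,  # the <256 cap mentioned above
        },
        timeout=60,
    )
    print(resp.json()["choices"][0]["message"]["content"])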

So... not exactly a serious use-case. But it's what I'm using, and now I'm saving tens of dollars on inference costs per month!

[0] https://github.com/go-skynet/LocalAI

I'm also using this for acceleration - https://cloudmarketplace.oracle.com/marketplace/en_US/adf.ta...