I'd like to put this on a modest DO droplet or Fly.io machine and expose a private/secured HTTP endpoint I can code against from somewhere else.
I heard that you can force the model to output valid JSON with a specific syntax, even more reliably than ChatGPT, and that you have to structure the prompts in a certain way to get ok-ish outputs instead of nonsense.
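For what it's worth, the "specific syntax" I've seen referenced is llama.cpp's GBNF grammars: the bundled HTTP server accepts a `grammar` field per request and will only sample tokens that keep the output matching the grammar, so well-formed JSON comes out by construction. Here's a rough sketch of both halves, with the caveat that the host name, API key, and the Req client usage are my own placeholders/assumptions, so double-check against the server README for your build:

```elixir
# Assumed server invocation on the droplet (llama.cpp's bundled server):
#   ./llama-server -m mistral-7b-instruct.Q4_K_M.gguf \
#       --host 0.0.0.0 --port 8080 --api-key "my-secret"
#
# Client side, as a quick Elixir script using the Req HTTP library.
Mix.install([{:req, "~> 0.4"}])

# Tiny GBNF grammar that only permits a {"label": "..."} object.
grammar = ~S"""
root   ::= "{" ws "\"label\"" ws ":" ws string ws "}"
string ::= "\"" [a-zA-Z0-9 _-]* "\""
ws     ::= [ \t\n]*
"""

response =
  Req.post!("http://my-droplet:8080/completion",  # hypothetical host
    auth: {:bearer, "my-secret"},
    json: %{
      prompt: "Classify this document as spam or ham: ...\nAnswer: ",
      n_predict: 64,
      grammar: grammar
    }
  )

IO.inspect(response.body["content"])
```

The `--api-key` flag makes the server reject requests without that bearer token, which covers the "private endpoint" part; since the server speaks plain HTTP by default, I'd still put a TLS-terminating proxy in front before exposing it off-box.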
I have some very easy classification/extraction tasks at hand, but a huge quantity of them (millions of documents) + privacy restrictions, so using any cloud service isn’t feasible.
Running something like Mistral as a simple microservice, or even natively via Bumblebee in my Elixir apps, would be _huge_!
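The Bumblebee route looks roughly like the sketch below. This is untested and hedged: it assumes a Bumblebee version with Mistral support and a machine with enough RAM/GPU for a 7B checkpoint, and the repo name and option values are purely illustrative:

```elixir
Mix.install([
  {:bumblebee, "~> 0.5"},
  {:exla, "~> 0.7"}
])

# Illustrative checkpoint; any architecture Bumblebee supports works here.
repo = {:hf, "mistralai/Mistral-7B-Instruct-v0.2"}

{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

# Nx.Serving groups concurrent requests up to batch_size per forward
# pass, which is what you want when churning through millions of docs.
serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 8, sequence_length: 1024],
    defn_options: [compiler: EXLA]
  )

Nx.Serving.run(serving, "Extract the invoice number from: ...")
```

In a real app you'd start the serving under your supervision tree with a name and call `Nx.Serving.batched_run/2` from your request handlers, which is how the cross-caller batching actually kicks in.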
https://github.com/LostRuins/koboldcpp
The biggest catch is that it doesn't support llama.cpp's continuous batching yet. Maybe soon?