Is there a convenience wrapper around this that can act as a drop-in replacement for the OpenAI API?

I'd like to put this on a modest DO droplet or Fly.io machine, and be able to have a private/secured HTTP endpoint to code against from somewhere else.

I heard that you can force the model to output valid JSON (even more reliably than ChatGPT) using a specific grammar syntax, and that you have to structure the prompts in a certain way to get decent outputs instead of nonsense.
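
From what I can tell, that's llama.cpp's GBNF grammar feature. A minimal sketch of what such a grammar might look like for a classification task (the label schema here is made up for illustration):

```
# Hypothetical grammar (llama.cpp GBNF): constrains output to the shape
# {"label": "positive"} / {"label": "negative"} / {"label": "neutral"} and nothing else.
root  ::= "{" ws "\"label\"" ws ":" ws label ws "}"
label ::= "\"positive\"" | "\"negative\"" | "\"neutral\""
ws    ::= [ \t\n]*
```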

I have some very easy classification/extraction tasks at hand, but a huge quantity of them (millions of documents) + privacy restrictions, so using any cloud service isn’t feasible.

Running something like Mistral as a simple microservice, or even natively via Bumblebee in my Elixir apps, would be _huge_!
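
For illustration, roughly what the Bumblebee route might look like (just a sketch; the checkpoint name and compile options are assumptions, not tested):

```elixir
# Sketch: serving a Mistral-style model with Bumblebee/Nx.
# The repo name below is an assumption; point it at whatever checkpoint you use.
repo = {:hf, "mistralai/Mistral-7B-Instruct-v0.1"}

{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 4, sequence_length: 512],
    defn_options: [compiler: EXLA]
  )

# In a real app this would live under a supervisor, e.g.
# {Nx.Serving, serving: serving, name: MyApp.LLM}
Nx.Serving.run(serving, "Classify the following document: ...")
```

Wrapping that serving in a Plug/Phoenix endpoint would presumably cover the private HTTP microservice part too.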

Koboldcpp has an OpenAI-compatible (and Kobold API) endpoint now, and supports the grammar syntax you mentioned:

https://github.com/LostRuins/koboldcpp
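
Which should mean you can point any OpenAI-style client at it. A rough Elixir sketch using Req (assuming koboldcpp is running locally on its default port, 5001, with the /v1 chat completions route):

```elixir
# Sketch: calling koboldcpp's OpenAI-compatible endpoint from Elixir with Req.
# The port, route, and prompt are assumptions based on koboldcpp's docs.
resp =
  Req.post!("http://localhost:5001/v1/chat/completions",
    json: %{
      # koboldcpp serves whichever model it was launched with,
      # so the model field is mostly informational
      model: "koboldcpp",
      messages: [%{role: "user", content: "Extract the invoice number from: ..."}]
    }
  )

resp.body |> get_in(["choices", Access.at(0), "message", "content"])
```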

The biggest catch is that it doesn't support llama.cpp's continuous batching yet. Maybe soon?