The more I work with LLMs, the more I realize that their true power lies not in the unstructured text they can generate, but in structured output. There are two approaches to achieving this:

1. Constrained decoding (LMQL, guidance, JSONformer, OP's post).

2. Fine-tuning the model to understand function calls and their (potentially JSON) schemas.

There was a comment here about OpenAI's approach (fine-tuning a model to understand function calls) that raised a good point: since fine-tuning tends to be forgetful (some of the model's previously learned knowledge gets degraded), it's not clear whether OpenAI's approach has made GPT-4 less capable than it was before. Not to mention that you're still dealing with a statistical process (an LLM), not a deterministic algorithm that generates the desired schema 100% of the time.

Which brings me to the other approach: steering the LLM's output __as it is generating tokens__, which is what LMQL does. This results in lower token usage (you don't have to send the function schema as part of your prompt/message to OpenAI) and 100% schema conformance, because the token probabilities themselves are modified (e.g., after a JSON key's closing double quote, every token except ":" gets 0% probability).
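
To make that concrete, here's a minimal, self-contained sketch of the masking step. Everything in it is an illustrative assumption (a toy vocabulary, a fixed grammar, and a stand-in for the model's forward pass; it is not LMQL's actual implementation). The point is only that tokens the grammar forbids at the current position get zero probability before sampling, so the output is valid by construction:

```python
import math
import random

# Toy vocabulary and a fixed, token-by-token "grammar" for a single-key JSON
# object such as {"name":"Alice"}. Both are illustrative assumptions.
VOCAB = ['{', '}', '"', ':', ',', 'name', 'age', 'Alice', '42', ' ']
GRAMMAR_STEPS = [
    {'{'}, {'"'}, {'name', 'age'}, {'"'}, {':'}, {'"'}, {'Alice', '42'}, {'"'}, {'}'},
]

def fake_model_logits(prefix):
    """Stand-in for the LLM forward pass: arbitrary logits over the toy vocabulary."""
    rng = random.Random(len(prefix))  # deterministic per step, just for the demo
    return [rng.uniform(-2.0, 2.0) for _ in VOCAB]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def constrained_decode():
    rng = random.Random(0)
    tokens = []
    for allowed in GRAMMAR_STEPS:
        logits = fake_model_logits(tokens)
        # The key move: drop every token the grammar forbids at this position
        # (e.g. after a key's closing quote, only ":" survives), renormalize
        # the remaining probabilities, and sample from those.
        candidates = [(tok, lg) for tok, lg in zip(VOCAB, logits) if tok in allowed]
        probs = softmax([lg for _, lg in candidates])
        tokens.append(rng.choices([tok for tok, _ in candidates], weights=probs, k=1)[0])
    return ''.join(tokens)

if __name__ == '__main__':
    print(constrained_decode())  # always a well-formed object, e.g. {"age":"42"}
```

Real implementations apply the same masking to the model's actual logits, with the allowed set derived from a schema or grammar rather than a hard-coded list.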

> Which brings me to the other approach: steering the LLM's output __as it is generating tokens__

A relevant PR:

https://github.com/ggerganov/llama.cpp/pull/1773

The plan is to support arbitrary grammar files to constrain token generation, similar to the grammar files here:

https://github.com/antlr/grammars-v4
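
To give a sense of what such a grammar file might look like, here is a small BNF-style grammar for flat JSON objects. The notation is only illustrative; the exact syntax that PR ends up supporting may differ:

```
root   ::= object
object ::= "{" ws pair ("," ws pair)* ws "}"
pair   ::= string ws ":" ws value
value  ::= string | number
string ::= "\"" [a-zA-Z0-9 ]* "\""
number ::= [0-9]+
ws     ::= [ \t\n]*
```

At each decoding step, only tokens that can continue a valid derivation of the grammar would be allowed; everything else gets masked out.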