Here's my understanding of how this works (please someone correct me if I'm getting this wrong).
Language models emit tokens one at a time, continuing from the prompt you give them.
If you have a conversation with an LLM, you can effectively think of it as one long sequence of tokens: you give it some, it generates some, you give it more, and so on.
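To make that concrete, here's a toy sketch; nothing in it is a real model (the "tokens" are just words and `fake_llm` returns canned answers), but the shape of the loop is the real thing:

```python
# Toy sketch, not a real model: fake_llm returns canned answers. The point is
# that every turn, from either side, lands in the same ever-growing token list.

tokens = "User: What is 2+2? Assistant:".split()

def fake_llm(context):
    """Stand-in for an LLM that predicts one token from everything so far."""
    return "4" if context[-1] == "Assistant:" else "<end>"

while (tok := fake_llm(tokens)) != "<end>":
    tokens.append(tok)                           # the model's tokens join the sequence

tokens += "User: And 3+3? Assistant:".split()    # so does the user's next message
print(" ".join(tokens))
# User: What is 2+2? Assistant: 4 User: And 3+3? Assistant:
```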
This grammar trick takes advantage of that by giving you much finer-grained control over the tokens. So you can do things like this:
Give me the address of the White House as JSON:
{"street": "
Then the LLM can return: 1600 Pennsylvania Ave NW"
The moment you see that closing double quote, you take over again and inject: ",
"city": "
It fills in: Washington, DC"
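And so on. Spelled out as code, the whole trick is a small loop: hand the model a prefix, let it generate until the closing quote, then append the next fixed chunk yourself. Here's a minimal sketch; `complete()` is a stand-in for whatever completion call you're actually using (it returns canned answers here so the snippet runs on its own, but the real one would ask the model to generate and stop at the first `"`):

```python
# Sketch of the "inject fixed text, let the model fill in the value" loop.
# complete() is a stand-in: a real version would call an LLM and stop at '"'.

PROMPT = 'Give me the address of the White House as JSON:\n'

CANNED = {
    PROMPT + '{"street": "': '1600 Pennsylvania Ave NW',
    PROMPT + '{"street": "1600 Pennsylvania Ave NW", "city": "': 'Washington, DC',
}

def complete(text, stop='"'):
    """Pretend LLM: return the value that belongs after `text`, up to `stop`."""
    return CANNED[text]

text = PROMPT + '{"street": "'
text += complete(text) + '", "city": "'  # model filled the street, we inject the next key
text += complete(text) + '"}'            # model filled the city, we close the object

print(text[len(PROMPT):])
# {"street": "1600 Pennsylvania Ave NW", "city": "Washington, DC"}
```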
But because this is all based on a grammar, you can do way more with it than just JSON.
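For the JSON case, the grammar itself is pretty small. llama.cpp, for instance, accepts grammars in a BNF-like format called GBNF, and a rough sketch of one that forces exactly the address shape above could look like this (rule names are arbitrary, and a real grammar would also handle string escapes):

```
root   ::= "{" ws "\"street\":" ws string "," ws "\"city\":" ws string ws "}"
string ::= "\"" [^"]* "\""
ws     ::= [ \t\n]*
```

The same machinery works for any shape you can write a grammar for: during generation, a token that would take the output somewhere the grammar can't go simply never gets sampled.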
I saw a brilliant suggestion relating to this on Twitter a while ago:
> @OpenAI should add an API argument allowing passing up a deterministic context free grammar.
> [...]
> While I think DCFL is what you want here in the short term, the really best thing is passing up a small WASM binary that simply is the sampler.
> Allow a user to pass up a few KB of WASM binary and give it a few megabytes of RAM to run. Would enable next level LLM superpowers.
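For context on that last part: "the sampler" is just the bit of code that turns the model's logits into the next token, and a grammar constraint boils down to masking out any token the grammar won't allow before you sample. A rough sketch in plain Python rather than WASM; `allowed_token_ids` here stands in for a real grammar state machine that updates as tokens are emitted:

```python
import math
import random

def sample_next_token(logits, allowed_token_ids, temperature=0.8):
    """Pick the next token id, considering only ids the grammar allows.

    logits: dict of token_id -> raw model score
    allowed_token_ids: set of ids permitted at this position (stand-in for a
    real grammar state machine)
    """
    allowed = {t: s for t, s in logits.items() if t in allowed_token_ids}
    # Temperature softmax over the surviving tokens (max-subtracted for stability).
    m = max(allowed.values())
    weights = {t: math.exp((s - m) / temperature) for t, s in allowed.items()}
    total = sum(weights.values())
    r = random.random() * total
    for token_id, w in weights.items():
        r -= w
        if r <= 0:
            return token_id
    return token_id  # guard against floating-point rounding

# Toy usage: four candidate tokens, but the grammar only allows ids 1 and 3.
print(sample_next_token({0: 2.0, 1: 1.5, 2: 0.1, 3: 1.4}, {1, 3}))
```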