Here's my understanding of how this works (please someone correct me if I'm getting this wrong).

Language models emit tokens one at a time, continuing on from the prompt that you give them.

If you have a conversation with an LLM, you can effectively think of that as you giving it a sequence of tokens, then it generating some, then you adding more, and so on.
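
For a concrete picture of "one token at a time", here's a minimal greedy-decoding loop using the Hugging Face transformers library (GPT-2 purely as a small stand-in model; it won't give a good answer, but the loop is the same for any causal LLM):

    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # The prompt is just the first chunk of the token sequence
    ids = tok('Give me the address of the White House as JSON:\n{"street": "',
              return_tensors="pt").input_ids

    with torch.no_grad():
        for _ in range(20):
            logits = model(ids).logits[0, -1]          # a score for every possible next token
            next_id = torch.argmax(logits).view(1, 1)  # greedy: take the single most likely one
            ids = torch.cat([ids, next_id], dim=1)     # append it and go around again

    print(tok.decode(ids[0]))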

This grammar trick takes advantage of that by giving you much finer-grained control over the tokens. So you can do things like this:

    Give me the address of the
    White House as JSON:
    
    {"street": "
Then the LLM can return:

    1600 Pennsylvania Ave NW"
The moment you see that closing double quote, you take over again and inject:

    ,
    "city": "
It fills in:

    Washington, DC"
And so on.
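
To make that back-and-forth concrete, here's roughly how you could drive it yourself with a completion-style API and stop sequences (this uses the pre-1.0 OpenAI Python library, from memory, so treat the details as approximate; real grammar-based sampling does the constraining inside the sampler instead, but the turn-taking pattern is the same):

    import openai  # pre-1.0 openai library, which still exposed Completion.create

    def fill(prompt):
        # Generate until the model is about to emit the closing double quote
        resp = openai.Completion.create(
            model="text-davinci-003",
            prompt=prompt,
            max_tokens=50,
            stop=['"'],
        )
        return resp["choices"][0]["text"]

    prompt = 'Give me the address of the White House as JSON:\n{"street": "'
    street = fill(prompt)                 # e.g. 1600 Pennsylvania Ave NW
    prompt += street + '",\n"city": "'    # the stop sequence eats the quote, so add it back here
    city = fill(prompt)                   # e.g. Washington, DC
    print('{"street": "%s", "city": "%s"}' % (street, city))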

But because this is all based on a grammar, you can do way more with it than just JSON.
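
For example (BNF-ish pseudo-syntax, not tied to any particular grammar implementation), you could force every reply to be a yes/no verdict plus a one-line reason, with no JSON involved at all:

    # Illustrative only: a made-up, BNF-ish grammar. The idea is that the
    # sampler never lets the model pick a token that would break these rules.
    grammar = r"""
    root    ::= verdict ": " reason "\n"
    verdict ::= "yes" | "no"
    reason  ::= [^\n]+
    """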

I saw a brilliant suggestion relating to this on Twitter a while ago:

> @OpenAI should add an API argument allowing passing up a deterministic context free grammar.

> [...]

> While I think DCFL is what you want here in the short term, the really best thing is passing up a small WASM binary that simply is the sampler.

> Allow a user to pass up a few KB of WASM binary and give it a few megabytes of RAM to run. Would enable next level LLM superpowers.

https://twitter.com/grantslatton/status/1637692033115762688
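
The "sampler" in that tweet is the piece of code that turns the model's per-token scores into the actual next token, which is exactly where constraints like this live. A toy constrained sampler (plain numpy, purely illustrative, with the allowed-token set supplied by whatever grammar or WASM logic you like) looks something like this:

    import numpy as np

    def constrained_sample(logits, allowed_token_ids):
        # Mask out every token the constraint disallows, then sample
        # from whatever probability mass is left.
        masked = np.full_like(logits, -np.inf)
        masked[allowed_token_ids] = logits[allowed_token_ids]
        probs = np.exp(masked - masked.max())
        probs /= probs.sum()
        return np.random.choice(len(logits), p=probs)

    logits = np.array([0.1, 2.0, -1.0, 0.5])
    print(constrained_sample(logits, [1, 3]))  # only ever returns token 1 or 3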

Isn’t this what Microsoft Guidance does?

https://github.com/microsoft/guidance
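
From what I remember of the 2023-era, template-style Guidance API, the White House example above comes out roughly like this (the exact syntax may be slightly off):

    import guidance

    # Pre-1.0, template-style Guidance API (from memory, so the exact
    # syntax may differ slightly)
    guidance.llm = guidance.llms.OpenAI("text-davinci-003")

    program = guidance("""Give me the address of the White House as JSON:
    {"street": "{{gen 'street' stop='"'}}", "city": "{{gen 'city' stop='"'}}"}""")

    result = program()
    print(result["street"], result["city"])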