I’m not understanding how Guidance acceleration works. It says “This cuts this prompt's runtime in half vs. a standard generation approach.” and gives an example of asking the LLM to generate JSON. I don’t see how it accelerates anything, since it’s a simple JSON completion call. How can you accelerate that?
The interface makes it look simple, but under the hood it follows an approach similar to jsonformer/clownfish [1], passing control of generation back and forth between a slow LLM and relatively fast Python.
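To make the hand-off concrete, here's a minimal sketch of that pattern; this is not guidance's actual API, and model_generate is a hypothetical stand-in for a real LLM call that decodes tokens until it produces the stop string:

# Minimal sketch of jsonformer-style interleaved generation.
# model_generate(prompt, stop) is a hypothetical LLM-call helper.

def fill_json(model_generate, fields):
    text = '{\n'
    for i, field in enumerate(fields):
        text += f'"{field}": "'                 # fast: fixed template text emitted by Python
        text += model_generate(text, stop='"')  # slow: model generates only the field value
        text += '"'
        text += ',\n' if i < len(fields) - 1 else '\n'
    text += '}'
    return text

# e.g. fill_json(my_llm, ["name", "job"]) pays per-token LLM latency only for
# the values, never for the braces, keys, commas, or quotes.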
Let's say you're halfway through generating a JSON blob with a name field and a job field, and have already generated:
{
"name": "bob"
At this point, guidance takes over generation from the model and emits the next stretch of fixed text:
{
"name": "bob",
"job":
If the model had generated that, you'd be waiting 70 ms per token (informal benchmark on my M2 Air). A comma, followed by a newline, followed by "job": is 6 tokens, or 420 ms. But since guidance took over, you save all that time. Then guidance passes control back to the model to generate the next field value:
{
"name": "bob",
"job": "programmer"
"programmer" is 2 tokens and the closing " is 1 token, so this took 210 ms to generate. Guidance then takes over again to finish the blob:
{
"name": "bob",
"job": "programmer"
}
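Tallying the spans above gives a rough sense of where the speedup comes from (token counts are the informal estimates from this walkthrough, not exact tokenizer output):

MS_PER_TOKEN = 70    # informal M2 Air benchmark from above

model_tokens = 3     # "programmer" (2 tokens) + closing quote (1 token)
template_tokens = 6  # comma + newline + '"job": ', emitted instantly by guidance

plain_ms = (model_tokens + template_tokens) * MS_PER_TOKEN  # 630 ms if the LLM decodes everything
guided_ms = model_tokens * MS_PER_TOKEN                     # 210 ms when guidance fills the template
print(plain_ms - guided_ms)                                 # 420 ms saved on this stretch alone

The overall speedup for a whole prompt depends on how much of the output is fixed template vs. model-generated values, which is where the "runtime in half" claim comes from.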
[1] https://github.com/1rgs/jsonformer
    https://github.com/newhouseb/clownfish
Note: guidance is a much more general tool than these.