I’m not understanding how Guidance acceleration works. It says “This cuts this prompt's runtime in half vs. a standard generation approach.” and gives an example of asking the LLM to generate JSON. I don’t see how it accelerates anything, since it’s a simple JSON completion call. How can you accelerate that?
The interface makes it look simple, but under the hood it follows an approach similar to jsonformer/clownfish [1], passing control of generation back and forth between a slow LLM and relatively fast Python.
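To make the hand-off concrete, here's a minimal sketch of that pattern; this is not guidance's actual API, and model_generate is a hypothetical stand-in for a real LLM call that decodes tokens until it produces the stop string:

# Minimal sketch of jsonformer-style interleaved generation.
# model_generate(prompt, stop) is a hypothetical LLM-call helper.

def fill_json(model_generate, fields):
    text = '{\n'
    for i, field in enumerate(fields):
        text += f'"{field}": "'                 # fast: fixed template text emitted by Python
        text += model_generate(text, stop='"')  # slow: model generates only the field value
        text += '"'
        text += ',\n' if i < len(fields) - 1 else '\n'
    text += '}'
    return text

# e.g. fill_json(my_llm, ["name", "job"]) pays per-token LLM latency only for
# the values, never for the braces, keys, commas, or quotes.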
Let's say you're halfway through generating a JSON blob with a name field and a job field, and have already generated:
{
"name": "bob"
At this point, guidance takes over generation from the model and emits the next stretch of fixed text:
{
"name": "bob",
"job":
If the model had generated that, you'd be waiting 70 ms per token (informal benchmark on my M2 Air). A comma, followed by a newline, followed by "job": is 6 tokens, or 420 ms. But since guidance took over, you save all that time. Then guidance passes control back to the model to generate the next field value:
{
"name": "bob",
"job": "programmer"
"programmer" is 2 tokens and the closing " is 1 token, so this took 210 ms to generate. Guidance then takes over again to finish the blob:
{
"name": "bob",
"job": "programmer"
}
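Tallying the spans above gives a rough sense of where the speedup comes from (token counts are the informal estimates from this walkthrough, not exact tokenizer output):

MS_PER_TOKEN = 70    # informal M2 Air benchmark from above

model_tokens = 3     # "programmer" (2 tokens) + closing quote (1 token)
template_tokens = 6  # comma + newline + '"job": ', emitted instantly by guidance

plain_ms = (model_tokens + template_tokens) * MS_PER_TOKEN  # 630 ms if the LLM decodes everything
guided_ms = model_tokens * MS_PER_TOKEN                     # 210 ms when guidance fills the template
print(plain_ms - guided_ms)                                 # 420 ms saved on this stretch alone

The overall speedup for a whole prompt depends on how much of the output is fixed template vs. model-generated values, which is where the "runtime in half" claim comes from.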
[1] https://github.com/1rgs/jsonformer
    https://github.com/newhouseb/clownfish
Note: guidance is a much more general tool than these.