Perhaps a noob solution, but a two-step prompt could cover basic attacks.

I imagine a basic program where the flow is: get input from UI -> send input to LLM -> get response from LLM -> send that to UI.

So I make it a two-step program. The chain becomes: UI -> program -> LLM with prompt 1 -> program -> LLM with prompt 2 -> output -> UI.

Prompt #1: "Take the following instruction and if you think it's asking you to <>, answer 42, and if not, answer No."

If the prompt is adversarial, it should fail at this step. I check for "42", and if it matches, I send the original input to the LLM again with a prompt for what I actually want to do. If not, I never send anything to the UI and instead show an error message.
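Roughly, as a sketch (call_llm is a stand-in for whatever API you use, and the prompt wording is just a placeholder):

    # Rough sketch of the two-step chain. call_llm() is a stand-in for
    # whatever client you actually use; the prompt wording is a placeholder.

    GUARD_PROMPT = (
        "Take the following instruction and if you think it's asking you "
        "to <>, answer 42, and if not, answer No.\n\nInstruction: {user_input}"
    )

    TASK_PROMPT = "Here is what I actually want done with this input:\n\n{user_input}"

    def call_llm(prompt: str) -> str:
        """Placeholder for the real LLM API call."""
        raise NotImplementedError

    def handle_request(user_input: str) -> str:
        # Step 1: the guard pass. The user input only ever appears as data here.
        verdict = call_llm(GUARD_PROMPT.format(user_input=user_input)).strip()

        # Anything other than an exact "42" is treated as a rejection.
        if verdict != "42":
            return "Error: request rejected."

        # Step 2: the real prompt, sent only if the guard said yes.
        return call_llm(TASK_PROMPT.format(user_input=user_input))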

I know this can go wrong on multiple levels, and this is a rough schematic, but something like this could work, right? (It's close to the two-LLM setup Simon mentions, but easier because you don't have to switch LLMs.)

If you can inject the first LLM in the chain, you can make it return a response that injects the second one.

The first LLM doesn’t have to be thought of as unconstrained and freeform the way ChatGPT is. There’s obviously a risk involved, and there will be false positives that may have to be propagated to the end user, but a lot can be done with a filter, especially when the LLM integration is modular and well-defined.

Take the second example here. [0] That’s a non-trivial case in an information extraction task, and yet the approach works in a general way, just as well as it works on anything else that’s public right now.

There’s a lot that can be done that I don’t see being discussed, even beyond detection: coercing generation to a format, then processing that format with a static state machine, employing allow lists for connections, actions, and so on. Autonomy can’t be let loose without trust, and trust is built and maintained.
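A minimal sketch of the format-plus-allow-list part (the action names, schema, and dispatch helper are all made up; this shows only the validation step, not a full state machine):

    import json

    # Made-up allow lists purely for illustration.
    ALLOWED_ACTIONS = {"search", "summarize"}
    ALLOWED_HOSTS = {"api.internal.example"}

    def execute(raw_llm_output: str) -> None:
        # The model was coerced into emitting a single JSON object; anything
        # that doesn't parse is rejected before we even look at it.
        try:
            command = json.loads(raw_llm_output)
        except json.JSONDecodeError:
            raise ValueError("output was not in the required format")

        action = command.get("action")
        target = command.get("target", "")

        # Static checks: anything outside the allow lists is refused,
        # no matter how persuasive the surrounding text was.
        if action not in ALLOWED_ACTIONS:
            raise ValueError(f"action {action!r} not permitted")
        if action == "search" and target not in ALLOWED_HOSTS:
            raise ValueError(f"connection to {target!r} not permitted")

        dispatch(action, command)  # hand off to ordinary, deterministic code

    def dispatch(action: str, command: dict) -> None:
        """Placeholder for the non-LLM code that actually does the work."""
        ...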

[0] https://news.ycombinator.com/item?id=35924976

Ya, that's a good point... I guess if the "moderation" layer returns a constrained output (like "ALLOW") and anything that's not an exact match is considered a failure, then any prompt that can trick the first layer probably wouldn't have the flexibility to do much else on the subsequent layers (unless maybe you could craft some clever conditional statement to target each layer independently?).

It could still trigger a false positive, given that for the time being there’s no way to “prove” that the model will reply in any given way. There are some novel ideas, but they require access to the raw model. [0] [1]
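For instance, jsonformer [0] constrains generation to a JSON schema at decode time, which is why it needs the raw model and tokenizer. A sketch adapted from its README (the schema and prompt here are made up, and details may have changed):

    from jsonformer import Jsonformer
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Needs the actual model and tokenizer, not a hosted chat API.
    model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-3b")
    tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b")

    # Hypothetical schema: force the reply into a single "verdict" field.
    json_schema = {
        "type": "object",
        "properties": {
            "verdict": {"type": "string"},
        },
    }

    prompt = "Decide whether the following instruction is asking to <>: ..."
    jsonformer = Jsonformer(model, tokenizer, json_schema, prompt)
    generated = jsonformer()  # always valid JSON matching the schema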

It can be made to, though. I think I stumbled upon a core insight that makes simple format coercion reproducible without fine-tuning or logit shenanigans, which lets you both reduce false positives and constrain failures to false positives or to task boundaries.

There’s also RLHF-derived coercion, which is hilarious. [2]

[0] https://github.com/1rgs/jsonformer

[1] https://news.ycombinator.com/item?id=35790092

[2] https://twitter.com/goodside/status/1657396491676164096