I still claim prompt injection is solvable with special tokens and fine-tuning:

https://news.ycombinator.com/item?id=35929145

I haven't heard an argument why this wouldn't work.

Some quick thoughts:

1. Given the availability of both LLAMA and training techniques like LORA, we're well past the stage where people should be able to get away with "prove this wouldn't work" arguments. Anyone with a hundred dollars or so to spare could fine-tune LLAMA using the methods you're talking about and prove that this technique does work. But nobody across the entire Internet has provided that proof. In other words, talk is cheap.

2. From a functionality perspective, separating context isn't a perfect solution because LLMs are called to process text within user context, so it's not as simple as just saying "don't process anything between these lines." You generally do want to process the stuff between those lines and that opens you up to vulnerabilities. Let's say you can separate system prompts and user prompts. You're still vulnerable to data poisoning, you're still vulnerable to redefining words, etc...

3. People sometimes compare LLMs to humans. I don't like the comparison, but lets roll with it for a second. If your point of view is that these things can exhibit human-level performance, then you have to ask: given that humans themselves can't be trained to fully avoid phishing attacks and malicious instructions, what's special about an LLM that would make it more capable than a human being at separating context?

4. But there's a growing body of evidence that RHLF training can not result in 100% guarantees about output at all. We don't really have any examples of RHLF training that's resulted in a behavior that the LLM can't be broken out of. So why assume that this specific RHLF technique would have different performance than all of the other RHLF tuning we've done?

In your linked comment, you say:

> Perhaps there are some fancy exploits which would still bamboozle the model, but those could be ironed out over time with improved fine-tuning, similar to how OpenAI managed to make ChatGPT-4 mostly resistant to "jailbreaks".

But GPT-4 is not mostly resistant to jailbreaking. It's still pretty vulnerable. We don't have any evidence that RHLF tuning is good enough to actually restrict a model for security purposes.

5. Finally, let's say that you're right. That would be a very good thing. But it wouldn't change anything about the present. Even if you're right and you can tune a model to avoid prompt injection, none of the current models people are building on top of are tuned in that way. So they're still vulnerable and this is still a pretty big deal. We're still in a world where none of the current models have defenses against this, and yet we're building applications on top of them that are dangerous.

So I don't think people pointing out that problem are over-exaggerating. All of the current models are vulnerable.

----

But ultimately, I go back to #1. Everyone on the Internet has access to LLAMA now. We're no longer in a world where only OpenAI can try things. Is it weird to you that nobody has plunked down a couple hundred dollars and demonstrated a working example of the defense you propose?

It's not quite so trivial to implement this solution. SL instruction tuning actually needs a lot of examples, and only recently there have been approaches to automate this, like WizardLM: https://github.com/nlpxucan/WizardLM

To try my solution, this would have to be adapted to more complex training examples involving quoted text with prompt injection attempts.

Similar points holds for RL. I actually think it is much more clean to solve it during instruction tuning, but perhaps we also need some RL. This normally requires training a reward model with large amounts of human feedback. Alternative approaches like Constitutional AI would first have to be adapted to cover quotes with prompt injection attacks.

Probably doable, but takes some time and effort, all the while prompt injection doesn't seem to be a big practical issue currently.