The core reason for (and thus the proper place to fix) any injection attack is the lack of a clear distinction between data and instructions or code.

Yes, language models gain flexibility by making it easy to mix instructions and data, and that has value. But if you do want to enforce a distinction, you can (and should) do it with out-of-band means: something that can't possibly be expressed (and thus also overridden) by any text content.

Instead of having some words saying "assistant, do this, this is a prompt", you can use explicit special tokens (which can't result from any user-provided data and have to be placed there by system code) as separators, or literally just add a single one-bit neuron to the vector of every token that says "this is a prompt", and train your reinforcement learning layer to ignore any instructions without that "privilege" bit set. Or add an explicit one-bit neuron to each token stating "this text came from an external source such as a webpage, email, or API call".
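To make the privilege-bit variant concrete, here's a minimal sketch (all names and shapes are hypothetical, and it bolts the flag onto an embedding layer rather than training with it from scratch, which is what you'd actually want). The whole idea is one extra channel per token that only the serving code can set:

  # Sketch of the "privilege bit": each token embedding gets one extra channel
  # that is set by system code, never derived from the text itself.
  import torch
  import torch.nn as nn

  class PrivilegedEmbedding(nn.Module):
      def __init__(self, vocab_size: int, dim: int):
          super().__init__()
          # Reserve one of the `dim` channels for the privilege flag.
          self.tok = nn.Embedding(vocab_size, dim - 1)

      def forward(self, token_ids: torch.Tensor, is_system_prompt: torch.Tensor):
          # token_ids:        (batch, seq) integer token ids
          # is_system_prompt: (batch, seq), 1 where the text was placed by system
          #                   code, 0 for anything user- or web-supplied
          emb = self.tok(token_ids)                       # (batch, seq, dim-1)
          flag = is_system_prompt.unsqueeze(-1).float()   # (batch, seq, 1)
          return torch.cat([emb, flag], dim=-1)           # (batch, seq, dim)

  embed = PrivilegedEmbedding(vocab_size=32000, dim=512)
  ids = torch.randint(0, 32000, (1, 6))
  flags = torch.tensor([[1, 1, 1, 0, 0, 0]])  # first 3 tokens are the system prompt
  x = embed(ids, flags)                       # feed x into the transformer stack

The point is that the flag never passes through the tokenizer, so no amount of cleverly crafted text in an email or webpage can set it.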

[edit] This 'just' does gloss over technical issues, such as how the bit is handled during pre-training, the need to mask it in places, and the fact that for performance reasons we want the vector widths to be multiples of specific numbers rather than an odd count, etc. But I think the concept is simple enough that these aren't real obstacles, just a reasonable engineering task.

You can see the trend of prompts getting more and more formal. One day we will have a proper programming language for LLMs.

SLLMQL - Structured LLM Query Language