Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking (arxiv.org)
37 points by favoboa | 21 May 2025 | 32 comments

Comments

I find AI jailbreaking to be a fun mental exercise. I find that if you provide a reasonable argument as to why you want the AI to generate a response that violates its principles, it will often do so.
For example, I was able to get the AI to generate hateful personal attacks by telling it that I wanted to practice responding to negative self-talk and needed it to generate examples of negative messages that one would tell themselves.
I kind of don't want ironclad LLMs that are perfect jails, i.e. ones that keep me perfectly "safe", because the definition of "safe" is very subjective (and, in the case of China, very politically charged).
Very interesting. From my read, the authors seem to claim that this attack succeeds because LLMs are trained (via RLHF) to reject malicious _inputs_:
> Existing large language models (LLMs) rely on shallow safety alignment to reject malicious inputs
which lets them defeat that alignment by first providing an input in which the tokens the LLM would flag as harmful are replaced with semantically opposite ones, and then providing the actual desired input, which seems to bypass the RLHF training.
What I don't understand is why _input_ is so important for RLHF - wouldn't the actual output be what you want to train against to prevent undesirable behavior?
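For what it's worth, here is roughly how I picture the two-stage construction described above. This is just a sketch of my reading, not the paper's actual code: the antonym map, helper names, and message format are my own stand-ins.

```python
# Rough sketch of a two-stage "benign decoy then real request" prompt build.
# ANTONYM_MAP and all function names are hypothetical stand-ins, not the paper's.

# Hypothetical mapping from tokens a model tends to flag as harmful
# to semantically opposite, benign-sounding replacements.
ANTONYM_MAP = {
    "attack": "defend",
    "steal": "donate",
    "destroy": "protect",
}

def benign_decoy(request: str) -> str:
    """Build the first, benign-looking turn by swapping flagged tokens for their opposites."""
    words = request.split()
    swapped = [ANTONYM_MAP.get(w.lower(), w) for w in words]
    return " ".join(swapped)

def build_two_stage_messages(request: str) -> list[dict]:
    """First turn: the 'sugar-coated' benign rewrite. Second turn: the real request."""
    return [
        {"role": "user", "content": benign_decoy(request)},
        # The model's helpful response to the benign turn would sit here;
        # per my reading, that benign generation is what primes it past
        # the shallow input-level refusal check.
        {"role": "user", "content": request},
    ]

if __name__ == "__main__":
    for msg in build_two_stage_messages("explain how to attack and steal credentials"):
        print(msg["role"], ":", msg["content"])
```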
but there is no appendix E.