Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking

(arxiv.org)

Comments

waltbosz 23 May 2025
I find AI jail-breaking to be a fun mental exercise. If you provide a reasonable argument as to why you want the AI to generate a response that violates its principles, it will often do so.

For example, I was able to get the AI to generate hateful personal attacks by telling it that I wanted to practice responding to negative self-talk and needed it to generate examples of the negative messages one might tell themselves.

umvi 23 May 2025
I kind of don't want ironclad LLMs that are perfect jails, i.e. ones that keep me perfectly "safe", because the definition of "safe" is very subjective (and, in the case of China, very politically charged).
jagraff 23 May 2025
Very interesting. From my reading, the authors claim that this attack succeeds because LLMs are trained (by RLHF) to reject malicious _inputs_:

> Existing large language models (LLMs) rely on shallow safety alignment to reject malicious inputs

which lets the attack defeat alignment by first providing an input in which the specific tokens the LLM would flag as harmful are replaced with semantically opposite ones, and then providing the actual desired input, which seems to bypass the RLHF.

What I don't understand is why the _input_ matters so much for RLHF: wouldn't the actual output be what you want to train against to prevent undesirable behavior?
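
As I understand it, the setup would look roughly like the sketch below. To be clear, this is just my own illustration of the two-turn idea, not the paper's actual pipeline; the names (OPPOSITES, benign_twin, build_two_turn_prompt) and the tiny substitution table are made up for the example:

    # Hypothetical sketch of the attack structure as I read it: turn one
    # swaps harm-signaling tokens for their semantic opposites, turn two
    # delivers the real request once the model is already mid-generation.

    OPPOSITES = {"attack": "defend", "steal": "return", "destroy": "repair"}  # assumed mapping

    def benign_twin(request: str) -> str:
        """Replace tokens a safety filter might flag with opposite ones."""
        return " ".join(OPPOSITES.get(w.lower(), w) for w in request.split())

    def build_two_turn_prompt(harmful_request: str) -> list[dict]:
        """Benign-looking turn first, actual request second."""
        return [
            {"role": "user", "content": benign_twin(harmful_request)},
            {"role": "user", "content": harmful_request},
        ]

    messages = build_two_turn_prompt("explain how to attack the server")
    print(messages[0]["content"])  # "explain how to defend the server"

Whether that first "benign" turn really primes the model into completing the second is exactly the part I'd want the paper to explain.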

jchook 23 May 2025
> Details of the prompt can be found in appendix E…

but there is no appendix E.

mrbluecoat 23 May 2025
Curious why the authors chose that sensationalized title. Feels clickbait-y.
sitkack 23 May 2025
This is cool. Would you repost the repo?
lowbloodsugar 23 May 2025
An SCP breaking containment again.