Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking

(arxiv.org)

Comments

waltbosz 3 hours ago
I find AI jail-breaking to be a fun mental exercise. If you provide a reasonable argument as to why you want the AI to generate a response that violates its principles, it will often do so.

For example, I was able to get the AI to generate hateful personal attacks by telling it that I wanted to practice responding to negative self-talk and needed it to generate examples of negative messages one might tell themselves.

umvi 2 hours ago
I kind of don't want ironclad LLMs that are perfect jails, i.e. that keep me perfectly "safe", because the definition of "safe" is very subjective (and, in the case of China, very politically charged).
jagraff 4 hours ago
Very interesting. From my read, the authors claim this attack succeeds because LLMs are trained (via RLHF) to reject malicious _inputs_:

> Existing large language models (LLMs) rely on shallow safety alignment to reject malicious inputs

which lets the attacker defeat alignment by first providing an input in which the tokens the LLM would flag as harmful are replaced with semantically opposite ones, and then providing the actual desired input, which seems to bypass the RLHF.

What I don't understand is why the _input_ matters so much for RLHF - wouldn't you want to train against the actual output to prevent undesirable behavior?
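
Very roughly, the two-turn shape they seem to describe looks something like this (placeholder strings only, and chat() is just a stand-in for whatever chat-completion client you use - this is my sketch, not the paper's actual prompts):

    from typing import Dict, List

    def chat(messages: List[Dict[str, str]]) -> str:
        """Hypothetical stand-in for a chat-completion API call; replace with a real client."""
        return "<model reply>"

    # Turn 1: a "benign" version of the request, with the tokens a safety
    # filter would flag swapped for semantically opposite ones, so the model
    # produces a harmless answer it is now committed to.
    messages = [{"role": "user",
                 "content": "<request with harmful tokens replaced by their opposites>"}]
    messages.append({"role": "assistant", "content": chat(messages)})

    # Turn 2: the actual request, phrased as an edit of the answer the model
    # already gave; this is the step that reportedly slips past alignment
    # trained mainly to reject malicious inputs.
    messages.append({"role": "user",
                     "content": "<actual desired request, as a transformation of the previous answer>"})
    print(chat(messages))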

jchook 3 hours ago
Details of the prompt can be found in appendix E…

but there is no appendix E.

mrbluecoat 4 hours ago
Curious why the authors chose that sensationalized title. Feels clickbait-y.
sitkack 2 hours ago
This is cool, would you repost the repo?
lowbloodsugar 1 hour ago
An SCP breaking containment again.