Problems in AI alignment: A scale model

(muldoon.cloud)

Comments

comp_throw7 17 hours ago
I think this post is sort of confused because, centrally, the reason "AI Alignment" is a thing people talk about is because the problem, as originally envisioned, was to figure out how to avoid having superintelligent AI kill everyone. For a variety of reasons the term no longer refers primarily to that core problem, so the reason so many things that look like engineering problems have that label is mostly a historical artifact.
blamestross 18 hours ago
I'm kind of upset to see "Alignment" and "AI Safety" systematically co-opted to mean "undesirable business outcomes".

These are existential problems, not mild profit blockers. It's almost like the goals of humanity and these companies are misaligned.

shwouchk 9 hours ago
the world was completely unprepared for fully autonomous botnets
benlivengood 15 hours ago
https://slatestarcodex.com/2014/07/30/meditations-on-moloch/ is the essay that crystallizes the reasons that Selection and Markets are not the forces we want aligning AI (or much else).

In short (it is a very long article) fitness is not the same as goodness (by human standards) and so selection pressure will squeeze out goodness in favor of fitness, across all environments and niches, in the long run.

constantcrying 11 hours ago
>Why isn’t there a “pharmaceutical alignment” or a “school curriculum alignment” Wikipedia page?

>I think that the answer is “AI Alignment” has an implicit technical bent to it. If you go on the AI Alignment Forum, for example, you’ll find more math than Confucius or Foucault.

What an absolutely insane thing to write. AI Alignment is different because it is trying to align something which is completely human-made. Every other field is "aligned" when the humans in it are "aligned".

Outside of AI, "alignment" is the subject of ethics (what is wrong and what is right) and law (how do we translate ethics into rules?).

What I think is absolutely important to understand is that throughout human history "alignment" has never happened. For every single thing you believe to be right, there has existed a human who considered that exact thing completely wrong. Selection certainly has not created alignment.

verisimi 14 hours ago
> While Nature can’t do its selection on ethical grounds, we can, and do, when we select what kinds of companies and rules and power centers are filling which niches in our world. It’s a decentralized operation (like evolution), not controlled by any single entity, but consisting of the “sum total of the wills of the masses,” as Tolstoy put it.

Alternatively, corporations and kings can manufacture the right kinds of opinions in people to sanction and direct the wills of the masses.

daveguy 22 hours ago
This is an excellent point. How we choose to use and interact with AI is both an individual choice and a stochastic, collective one.

We can still choose not to give AI control.

godelski 18 hours ago
A critical part of AI alignment is understanding what goals besides the intended one maximize our training objectives. I think this is something everyone kinda knows and will repeat, but few are giving it anywhere near the depth of thought necessary to address the problem. Kind of like a cliché: something everyone can recite but frequently fails to implement in practice.

Critically, when discussing intention, I think not enough attention is given to the fact that deception also maximizes RLHF, DPO, and any other human-preference-based optimization. These are quite difficult things to measure, and there's no formal, mathematically derived evaluation. Alignment is incredibly difficult even in settings where measures have strong mathematical foundations and we have the means to make high-quality measurements. Here, we have neither...

We essentially are using Justice Potter Stewart's definition: I know it when I see it[0]. This has been highly successful and helped us make major strides! I don't want to detract from that in any way. But we also have to recognize a lurking danger that can create major problems. As long as the objective is based on human preference, well... we sure prefer a lie that doesn't sound like a lie over a lie that is obviously a lie. We obviously prefer truth and accuracy above either, but the notion of truth is fairly ill-defined and we have no formal, immutable definition of it outside highly constrained settings. That means the models are also being optimized to make their errors difficult to detect. This is an inherently dangerous position, even if only in the sense that our optimization methods do not preclude the possibility. It may not be happening, but if it is, we may not know.
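
To make that concern concrete, here is a tiny, purely illustrative sketch (the hand-written "preference reward" is a crude stand-in, not anything anyone actually trains): a preference signal that only sees surface plausibility will happily rank a confident wrong answer above a hedged correct one, because correctness is invisible to it.

    # Toy sketch only: a stand-in "preference reward" that, like a human rater,
    # sees only surface plausibility and never ground truth.
    def preference_reward(response: str) -> float:
        """Crude proxy for a learned reward model: rewards confident, fluent
        phrasing, penalizes hedging, and slightly favors longer answers."""
        text = response.lower()
        hedges = ("maybe", "i think", "not sure", "roughly")
        confident = ("definitely", "clearly", "exactly")
        score = -sum(text.count(h) for h in hedges)
        score += sum(text.count(c) for c in confident)
        score += 0.1 * len(response.split())   # verbosity often rates well
        return score

    candidates = {
        # response text -> is it actually correct? (invisible to the reward)
        "Definitely 42. The calculation is exactly as expected.": False,
        "I think it's 37, but I'm not sure; roughly in that range.": True,
    }

    # "Optimization" here is just argmax, standing in for the pressure that
    # RLHF/DPO puts on the policy to emit higher-reward responses.
    for resp, correct in candidates.items():
        print(f"reward={preference_reward(resp):+5.2f}  correct={correct}  {resp}")
    print("selected:", max(candidates, key=preference_reward))

The point isn't that RLHF literally counts hedge words; it's that whatever raters reward is what gets optimized, true or not.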

This is the opposite of what is considered good design in every other form of engineering. A lot of time is dedicated to error analysis and error-aware design. We specifically design things so that when they fail, or begin to fail, they do so in controllable and easily detectable ways. You don't want your bridges to fail, but when they do, you also don't want them to fail unpredictably. You don't want your code to fail, but when it does, you don't want it leaking memory, spawning new processes, or doing other wild things; you want it to fail with easy-to-understand error messages. Our current designs for AI and ML do not provide such a framework. This is true beyond LLMs.
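
For contrast, here is a minimal sketch of what a "fail loudly and detectably" interface for a model could look like (the entropy heuristic and threshold are assumptions chosen for the example, not a real out-of-distribution detector):

    import math

    def predict_or_abstain(probs: list[float], entropy_threshold: float = 1.0):
        """Return (label, confidence), or raise a clear error when the model is
        too uncertain, so the failure is explicit rather than a fluent wrong answer."""
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        if entropy > entropy_threshold:
            raise RuntimeError(f"prediction rejected: entropy {entropy:.2f} "
                               f"exceeds threshold {entropy_threshold}")
        best = max(range(len(probs)), key=probs.__getitem__)
        return best, probs[best]

    print(predict_or_abstain([0.9, 0.05, 0.05]))   # confident -> returns (0, 0.9)
    try:
        predict_or_abstain([0.4, 0.3, 0.3])        # uncertain -> loud, detectable failure
    except RuntimeError as e:
        print("caught:", e)

Refusing with an explicit error is the analogue of the bridge failing predictably: the failure mode is chosen at design time instead of discovered after the fact.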

I'm not saying we should stop, and I'm definitely not a doomer. I think AI and ML do a lot of good and will do much more good in the future[1]. They will also do harm, but I think the rewards outweigh the risks. We should make sure we're not going into this completely blind, though, and we should try to minimize the potential for harm. This isn't a call to stop; it's a call for more people to enter the space, and for people already in the space to spend more time deeply thinking about these things. There are so many underlying subtleties that are easy to miss, especially given all the excitement. We're definitely on an edge now, in the public eye, where if our work makes too many mistakes, or one mistake too big, it risks getting everything shut down.

I know many might interpret me as being "a party pooper", but actually I want to keep the party going! But that also means making sure the party doesn't go overboard. Inviting a monkey with a machine gun sure will make the party legendary, but it's also a lot more likely to get it shut down a lot sooner with someone getting shot. So maybe let's just invite the monkey, but not with the machine gun? It won't be as epic, but I'm certain the good times will go on for much longer and we'll have much more fun in the long run.

If the physicists could double-check that the atomic bomb wasn't going to destroy the world (something everyone was highly confident would not happen[2]), I think we can do this. The stakes are pretty similar, but the odds of our work doing great harm are higher.

[0] https://en.wikipedia.org/wiki/Potter_Stewart

[1] I'm an ML researcher myself! I'm passionate about creating these systems. But we need to recognize flaws and limitations if we are to improve them. Ignoring flaws and limits is playing with fire. Maybe you won't burn your house down, maybe you will. But you can't even determine the answer if you won't ask the question.

[2] The story gets hyped, but it really wasn't believed. Despite this, they still double-checked, given the risk. We could say the same thing about micro black holes at the LHC: the public found out and got scared, physicists thought it was near impossible, but they ran the calculations anyway. Why take on that extreme level of risk, right?