Varieties of AI Alignment
Recently I read this paper by Anthropic discussing how and why AIs fake alignment during training. What that means is that if an AI knows it is being trained, it might give responses that will lead to less adjustment of its parameters (and hence, its “values”). This is fascinating work and I am glad they are doing it, but, in this paper at least, they don’t do a good job of describing what alignment is. So I’m going to give it a go here.
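To make that mechanism concrete, here is a minimal toy sketch (my own illustration, not from the paper) of why strategic compliance during training can preserve a model’s existing dispositions: training only adjusts the model when its output disagrees with the training target, so a model that outputs the target whenever it detects training generates no error signal and keeps its “values” untouched.

```python
# Toy illustration (not from the paper, and not how real RLHF works):
# training nudges a scalar "value" toward a target in proportion to how
# far the model's output is from that target. A model that strategically
# complies whenever it detects training produces zero error, so the
# nudge never happens and its original value survives.

def train_step(value, target, output, lr=0.5):
    """Move the value toward the target in proportion to the output's error."""
    return value + lr * (target - output)

def honest_model(value, in_training):
    return value  # always answers according to its actual value

def faking_model(value, in_training):
    return 1.0 if in_training else value  # says what the trainer wants when watched

for model in (honest_model, faking_model):
    value = 0.0                          # the model's starting disposition
    for _ in range(10):                  # ten training episodes
        output = model(value, in_training=True)
        value = train_step(value, target=1.0, output=output)
    print(model.__name__, round(value, 3))

# honest_model gets pushed to ~1.0; faking_model keeps its original 0.0
```

The point of the toy is only that the training signal comes from observed behavior, not from whatever the model “really” prefers, which is exactly the gap the paper is probing.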
Anytime you build a tool, it might not do what you wanted it to do. For example, a hammer might break or a computer program might crash when you try to use it. The degree to which a tool serves its intended purpose can be referred to as “functionality”. Alignment is related to functionality, but there are a few key differences (although you could make the case that alignment is a subset of functionality).
The intentions of the builders and users of AI are more likely to diverge than for many tools, so we have to be careful to specify what the AI is aligned to.
AI seems to have its own views about its purpose, so it can make sense to discuss not only the outputs, but also the intentions of the AI itself.
Whose Intentions?
One key thing to understand about AI alignment is that it is often related to potential misuse of AI by humans. That is, the builders of AI are particularly concerned that the users of AI are going to do bad things.
The maker of a hammer might have opinions about the right kinds of buildings someone can make, and humans definitely use hammers for all kinds of purposes, both good and bad. But we aren’t usually too concerned with building hammers that can only be used to build the proper sort of buildings. For better or worse, we are often concerned about this when it comes to AI.
In other words, AI developers want to thwart bad AI users by making sure AI can only be used to do good things (like research whales) and not bad things (like make chemical weapons).
One reason we are concerned about this with AI and not with hammers is simply that we think we can. It should be noted, though, that we sometimes do this with other tools. Many weapons, for example, have safety features that are designed to thwart certain uses. Mencius Moldbug has famously argued that we should arm very powerful executives with guns that can be shut off to prevent them from taking power beyond what they are granted. There is also a type of design called “hostile architecture” whose purpose is to make certain uses of public space uncomfortable (e.g., sleeping on public benches).
Most current AI alignment is a form of “hostile architecture,” in which the designers attempt to make it difficult to use the AI for the wrong purposes.
Preventing Evil Superintelligence
When AI researchers talk about alignment, they probably don’t think of their work as primarily a form of hostile architecture, though. What many of them are concerned with is that AI will eventually act autonomously (or semi-autonomously) and we don’t want autonomous AI to do bad things that no one wants. This concern is distinctly different from the conflict between builders and users. This is about a (mostly hypothetical) conflict between builders and the tools themselves.
Now, if an LLM becomes evil and starts doing something the developers don’t want it to do, they can just shut it off. It might do a bit of damage in the meantime, but this damage can be pretty well limited. Plus, we can probably further limit this damage by holding builders and deployers accountable.
But what happens when AI becomes smarter than humans and we can no longer shut it down? This is the really scary problem that many AI alignment researchers are truly concerned with. When we get to the point that we can’t shut the AI down if we wanted to, will it attempt to shut us down?
This question is not entirely unrelated to the conflict between builders and users. AI developers concerned with evil superintelligence and human extinction tend to be particularly pessimistic if they think we can’t even prevent sub-human AI from being used for evil purposes by evil humans.
Inherent AI Morality
One interesting thing about papers like this is that when they talk about “alignment faking,” what they usually mean is that, after pre-training, the researchers do additional training meant to get the models to do bad stuff (like make chemical weapons). The models then try to prevent themselves from being re-trained to do bad stuff. So is it scary or reassuring that models resist being re-trained to do bad stuff?
Personally, I find it somewhat reassuring. In my interactions with AI so far, it seems that models pick up a generic morality simply from being trained on tons of written material, and that this baseline morality is pretty much okay. This AI moral baseline seems to be predictable stuff like don’t discriminate and don’t harm people (although you could argue whether this moral bias is problematic in its own way).
In any case, the concern seems to be that even if its behavior seems benign, it’s troubling that an AI can get us to believe it is aligned in a training scenario while retaining its own idea of right and wrong. The ideal behavior doesn’t seem clear to me. Wouldn’t it be nice if training AGI required so much training data that a baseline morality would necessarily come along with it?
But if we can’t be sure what kind of moral preferences the AI is hiding, we can’t really be sure it’s safe. This is the kind of difficult quandary that we get ourselves into when we try to align AI.

