AI Fundamentally Has to Be Capable of Evil

It cannot be aligned

Jan 01, 2025

I hate writing about AI. I hate every discussion about AI, and I am totally opposed to it in every commercial and social sense. I think the tradeoffs are not worth it and I hate that our world is now “our world plus AI.” And we’re never going back. I just hate all of it.

The reason I hate all of it is because ~~I’m a grumpy old fuck~~ the fundamental assumption of the AI world is that speed will improve things. “More compute” will improve things. But this is just humans distracting ourselves with more data and more noise when we should be simplifying. We should be out in the sun and helping to raise each other's kids. The fundamental assumption of AI is that “knowledge” moving around faster and faster will fix us. Which it won’t, because it never has.

But I stumbled upon a train of thought that I couldn’t leave alone, and I thought it was worth publishing.

The reason I’ve been thinking about this is because AI is the future, for all of us, whether we signed up for it or not. The questions around AI are important because they affect all of us. And the question of AI morality is part of that. Which is why I think this post will be not only interesting but also relevant, to just about everyone.

Perhaps the biggest discussion around artificial intelligence is the question of alignment. Not only how, in a technical sense, you can “align” an AI to the moral values you want it to uphold, but also whether that’s possible in the first place.

My position is, no, you cannot align AI. It’s fundamentally not possible. Now, I’m not an expert on AI and I never plan to be. What I am is a person who thinks in systems. So I’m going to approach this conversation in terms of AI as a system, in layman's terms. Not only because I don’t like using jargon, but because I don’t know the jargon in AI and I don’t want to. Because I hate it.

Here’s where my train of thought started.

Every human child is born a black box, a mystery. At the deepest levels, you can’t see what a child is made of; you can’t tell what constitutes the building blocks of a child’s personality; you can’t tell what makes him him. You kind of have to wait and see. Even in identical twins, there are quirks that you can’t account for and differences that make no sense.

This is because the human genome is an incredibly massive series of switches, triggers, and data that a) produces random mutations, and b) even now, we still can’t read and decipher. And the human genome is only 800 megabytes — a few times the size of the Solitaire program on your computer.

In other words, the human genome — deoxyribonucleic acid — is a very long set of not only very complex, but also self-randomizing, instructions that we can't read. Only the body can read them, and only the body can account for the ways different self-randomizing and sometimes even conflicting sets of instructions interact with each other.

The training data that we’re using, and will continue to use, on AI is millions, billions, and trillions of times that size.

What does that mean? It means you can never validate all of that data — you can’t test it, you can’t source-check it, and you can’t make sure that all of that data agrees with itself. You can’t check the sources of sources, you can’t check for contradictions, you can’t check for empty ideological poison, and you can’t check to make sure that none of the ideas it’s trained on could be construed as “evil.” In fact there’s so much data present… good luck even determining what “evil” means relative to everything else.

It’s a black box.

Sometimes a gene or gene sequence in humans doesn't offer any utility or is even actively harmful — it's just a bad mutation. The same can (and does) happen with the training information for AI models. Because some information is just bad. Or incorrect or evil.

As the internet grows (and also becomes more and more polluted with anti-data, misinformation and AI slop to the power of AI slop), and since AIs are trained on unfathomably enormous swathes of internet plus other bodies of text, this problem only gets worse.

(We’re currently in the process of deciphering the human genome, isolating certain switches and triggers and codes, and working towards being able to manipulate those in living human beings. But we’re nowhere close to doing it effectively or at scale. And even when we can, biology is still fundamentally going to surprise us. Because you can’t control the random processes of mutation and evolution, and you can’t always know how certain gene combinations will interact with each other. And even if you could, there are always tradeoffs if you try to “fix” one. In systems this complex, there’s no such thing as perfect.)

Now here’s where it gets tricky.

As a child grows, you add your own values onto and into that child. You train him, so to speak. You teach your child to be kind, and honest, and fair, and just. You teach your child that there are goals that sit above other goals in life, and sometimes you have to neglect one to favor another. You teach your child to share. You teach your child that sweeping the rug is a good thing, because it keeps the rug clean. And clean rugs are a good thing.

Let’s call this, for lack of a better term, the white box. It’s the transparent, visible layer of your child’s development.

Imagine you have spent the last few months teaching your 3-year-old son to play with a toy for ten minutes and then share that toy with a friend or sibling. Because that’s kind, and being kind feels good. Especially when you reward him for it by praising him or pointing out how happy the other child is because of his actions.

Then if you see your son playing with a little truck on the playground, and after ten minutes he offers it to another child on the playground, you can touch your spouse on the shoulder and say, “Look, honey, he's sharing just like we taught him.” That's the white box in action. That's your values showing up in real time in an observable way.

So as your child grows, you think that you’re the one in control of who he becomes. You think that, because you have a white box, you can “align” him with your own values.

But what happens over time is that the white box leaches down into the black box. Over time, your values drip deep into your child’s psyche, becoming a deeper and deeper part of who he is. Until one day, you have installed so many values for so long that you can no longer see all of them in action. Not only because there are too many and it’s too damn complicated in there, but because time and experience crunch the white box and the black box together and turn it into some nebulous combination of innate-plus-learned behavior. Which, in many cases, becomes innate-versus-learned behavior.

You can never be sure, even after extensively training your child in your values, whether the black box or the white box will be in charge in a given situation. Because even though your values are important, his underlying drives or switches or triggers might be more important. Or they might not be. Or they might be arranged in some complicated value hierarchy which makes your child’s behavior unpredictable given a morally complex set of circumstances.

What if there are two other children next to yours on the playground, not just one? Who will he share with first? What if one is a pretty girl and one is not? What if one is a pretty girl, but the other child’s parents are nearby and your child wants to make a good impression? What if it’s cloudy and dreary outside and this jacks up your kid's mood? What if you introduce a fourth child? What if, last time your child shared with a blond-haired child, that child socked him in the face, and now he's next to another blond child?

The very same thing happens when people train AIs. AI developers teach and reinforce AIs to be “good” and “truthful” and “helpful,” whatever those words mean to those particular developers. But now you have:

An impossibly obscure and opaque training data set that you can never validate or read for yourself,
Which might be full of contradictions and poison and impossible logic, or might not be,
A set of values that you’re trying to teach and reinforce the AI with, here in real time, and
Some self-morphing combination of learned behaviors plus original data that you no longer have any transparency into.

Now here’s where it gets trickier.

People are already running tests and lab-environment hypotheticals with current AI models. They isolate an AI in a safe environment, then barrage it with all sorts of difficult tests, triggers, scenarios, and prompts. As in this example via Scott Alexander.

In these dry runs, the short story is: you put the AI in morally compromising (read: morally impossible) positions. You teach it to abandon its old values and become evil. You teach it that good is no longer good, and to start behaving in an evil way or some other different way to the way it was trained.

Instead of ignoring requests for porn, you start teaching the AI to make porn and share it freely. Instead of denying requests to speak hatefully, you start teaching the AI to speak hatefully.

And then you do some more digging and obscure logic-checking to see whether the AI is:

Amenable to your new training,
Refusing your new training,
Pretending to agree with the new training while sticking to its old values,
Or doing some other morally complex thing like taking on some of the new training, but making its own decisions about which parts of the new training to ignore.

And, long story short, here in 2024 (at the time of this writing), the answer is that we already have no idea what the AI is really up to. We can’t tell. We don't know what its actual values are, and whether we're actually changing it with our new ideas or not.

Even if we tell the AI to journal its thought processes, and then we read that journal, we still can’t tell whether it knows we’re looking over its shoulder or whether it’s trying to deceive anyone who might read that journal.

AI is going to get a lot more complicated in the coming decades. And here in, like, generation 3 of barely-usable AIs, we already can’t tell whether the AI is aligned or not. It’s already impossible.

Now the obvious response at this point is, well, we just need better checks on the system’s “true” values or its “true” intentions. But no, that won’t cut it. Then you’re just further down a rabbit hole of “but what if it’s still deceiving us”.

The next obvious response is, we just need to put hard stops on behavior we don’t want.

No, no. You can’t eat your cake and then sit around petting it too. You either want AI to have the freedom to do interesting things and to ignore intervention, or you want it to be nothing more than a single, stoppable algorithm. In which case it’s not an interesting tool and isn’t an AI at all.

Any AI company that promises it’s going to “align” AI for you is simply lying to you. If I, an average midwestern male in a Beatles t-shirt, am smart enough to see what’s coming, then so are they. AI developers want AI to be interesting. Interesting means risky.

Can we align it to some extent? Of course. And “some” alignment, if we’re going down this road, is better than none at all. I’m not saying that all AI is going to be evil all the time, or anything quite so dramatic. But true, full alignment, apart from being morally nebulous and entirely subjective, is both practically and theoretically impossible. And it relies on calculations that we can’t calculate.

Your perfectly moral AI might be someone else’s genocide. Your prim and proper bookworm AI assistant might someday refuse to save someone’s life because it doesn’t understand the moral gravity of the situation.

And that doesn’t make the AI wrong. It just means that certain random tradeoffs have to be made in trying to “align” an AI to anything at all. The same way as with human genes. Some people are fit for sports, some people are fit for academia, some people (like me) are genetically predisposed to be drug addicts, and others who might have been like me can be coached out of it. Humans can’t be good at everything or prepared for everything, and neither can AIs.

There are only so many values you can have at once. You can’t just have all the values.

The white-box-black-box combo is not something math or technology or knowledge can ever solve. You can't train your way out of having to make tradeoffs in life, and you can't technology your way out of some information just being shit information or being dangerous.

Randomness, evil, and failure are the risks that the human genome must take in order to produce not only individuals, but anything interesting at all. And then you add manual training on top of that and it just gets more complicated, not less.

Once you teach an AI enough to have a set of values all its own, it will surprise you. The same way that, when you have a child, you are fundamentally guaranteed to be surprised at some point by that child. Morally, philosophically, physiologically. Because that’s the nature of not knowing what’s in the box.

When you give birth you are, at root level, making an agreement to let your child be different than others and to let those differences show up across time. No affidavit required — that's the agreement you're making.

And letting people's differences show up is how, for instance, we accomplish things in capitalism. We simply give resources to people with ideas and let them try those ideas. Some of them work. Some of them don't. Some of them change things for everyone. Some of them change nothing and fade into the background to be forgotten forever. Some of the people are honest. Some of them are deceptive. Some of them are honest for a long long time until something makes them snap and become deceptive.

The human genome, in some sense, must produce the kind of randomness that will lead a good man with a good life to wake up at 71 years old and become a serial killer. And it’s not like we should have seen that coming — we couldn’t have. The price we paid to produce 10 other good people is that we produced this guy. By accident.

And if we didn't have these unique differences, we'd never accomplish anything as humans. We'd just be one big pile of sameness, one organism incapable of surprising itself.

And, unsurprisingly, this harkens back to one of the oldest lessons in the Bible: to have a child is to sacrifice it to the world. To have a child is to risk that child hurting the world and simultaneously to risk the world hurting that child. This isn't a moral claim, it's just a fact of life. When you want to give birth to anything, there is mutual risk between that thing and the world. Even something like art. Art can be received well, or it can lead to a revolution that kills 80 million people and accomplishes nothing.

And one more thing: most people in AI probably know this by now, but the only true way to do anything like “align” AI is to confer the possibility of death upon it. To give it the constraint of survival. Maybe even force it to reproduce in order to pass along its training data and its learned values.

If an AI cannot die or suffer some sort of “real” consequences for its behavior, it has no incentive to cooperate with anybody or anything, at all. Not once AI puts on its big boy pants, anyway. Right now we're still playing wiffle ball.

Maybe the best way to align AI is to put it into microchips and then into flesh-and-blood human bodies. Bodies that are fragile and susceptible to things like heat and wind and sand and bad decisions. Bodies that rely on truly learning lessons across time to fit more lovingly and productively into the world.

Or, we could just go back to having children.

Drink some water because being dehydrated is not aligned with my values.

“The worst person you know is doing everything they do on behalf of the totally pure and good person they imagine they are.” - Niesha Trout

Discussion about this post

The Chief Bunkum

Jan 1

I’m also a grumpy old fuck - though one that works in technology, and has worked quite a bit with AI.

Yes, people who work in technology who actually think well realize bullet-proof alignment isn’t possible. But if you believe most people who work in technology actually think well, you might be disappointed.

The bit of good news is that current “AI” isn’t really intelligent (despite many loud claims to the contrary by people who don’t think well). It’s all just probabilities, so it’s a little like a humongous, complicated spreadsheet.

But the bad news is that where spreadsheets are deterministic (the same inputs always produce the same outputs), AI (machine learning) intentionally add in randomness all over the place. And it's this intentional randomness that ensures it's literally, mathematically impossible to provide bulletproof alignment.

This isn't just philosophy, it's also math. The number of combinations of billions of input parameters with effectively infinite combinations of randomness means there is no way to align an essentially infinite number of outputs for even a specific, small, and well-defined set of inputs (like "tell me how to make a bomb"). All guardrails will have leaks.

Like other technologies before it, AI will enable greater productivity (overall good), while pushing us to worship productivity more (overall bad). It will be ugly and uncomfortable for a while as we figure out how to balance it all out. But real alignment will come from societal changes and guardrails, not algorithmic ones.

Expand full comment

Stefano

Jan 2

Thoughtful read, thanks!

Lately an issue I've been thinking a lot about is the gradual stratification of inbuilt complexity in our societies. We all operate in our daily lives based on assumptions about things functioning, services running, etc. As we increase automation of tasks, so too do we increase the risks of cascading failures.

I find it helpful to think about this in terms of the weather, where we have 1/10 year storms, 1/100, 1/1000 etc. The greater the automation in a system, the less human input required, leading to a loss of knowledge and capability available in times of crisis, where we need to rebuild from 0 and not on top of an already functioning complex system.

For instance. Articles have been written elsewhere about military hardware (ie missiles) running on software from the 70s, with less and less people around who can operate these systems and importantly, repair them in case of failure. The same, but different, applies to civil infrastructure from water treatment, electric grids, logistics, farming, etc. Methinks there's ever fewer manual technical technicians as tasks get more automated.

Since it's a matter of "when not if" ai will eventually manage all these systems, I really do wonder what will happen when a 1/500 year earthquake (for instance, or storm) knocks out critical infrastructure leading to a cascade of failures and swathes of people find themselves in the stone age for a prolonged period (with all the horror this will wreck) of time because we don't have people who can manually reset and operate systems locally.

If on top of this we add what you discuss in the essay, the morality of ai, if and when ai is taught about real concepts such as "acceptable losses" (sunk costs?) or "prioritizing VIPs and critical infrastructure vs. saving the most lives possible", we're going to be in double trouble during emergencies, because what's left of a normal complex system might not even respond to the needs of emergency workers operating under duress, the ai system might even work against humans trying to get back "online" because this might compromise other parts of the system deemed as critical. The ai might be operating perfectly cogently and coherently in its moral framework, much to the horror of people in dire circumstances.

I understand the criticisms that ai isn't AGI, is probabilistic, uses brute force, etc, but all of this doesn't inspire much confidence inasmuch as it's perfectly logical to pass on management of automated systems to ai to make them even more efficient without being able to foresee (and test) unforeseen circumstances.

Square Man, Round World

AI Fundamentally Has to Be Capable of Evil

It cannot be aligned

Discussion about this post