A Lost Eliezer Yudkowsky Essay, “A Galilean Dialogue On Friendliness”

Below is a 2003 essay by Eliezer Yudkowsky that had been lost to the sands of time. I’m reposting it here because, although very old, it makes good points about human values and optimization that I haven’t seen better explanations of elsewhere. Eliezer adds this disclaimer:

Published with permission from Eliezer Yudkowsky, who notes that this is from 2003 and therefore he is unable to read it and has no idea anymore what’s in here, or how much of it is wrong.

Five people – Autrey, Bernard, Cathryn, Dennis, and Eileen – want to divide up a cake.

Autrey: I think everyone should get 20% of the cake.

Cathryn: Sounds fair. 20% apiece.

Eileen: I also agree. 1/5 for everyone.

Dennis: I want the entire cake for myself.

Bernard: Personally, I also agree that everyone should get 20% of the cake. So to be fair, Dennis should get 36% of the cake, and everyone else should get 16%. That way we’re taking everyone’s desires into account.
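(An aside on the arithmetic: Bernard's 36% and 16% appear to come from averaging the five proposed divisions with equal weight. The sketch below reproduces those figures under that assumption. The helper names `equal_split`, `all_to`, and `blend` are illustrative only and are not from the essay.)

```python
from fractions import Fraction

NAMES = ["Autrey", "Bernard", "Cathryn", "Dennis", "Eileen"]

def equal_split():
    """Proposal: everyone gets one fifth of the cake."""
    return {name: Fraction(1, 5) for name in NAMES}

def all_to(winner):
    """Proposal: one person gets the whole cake."""
    return {name: Fraction(1 if name == winner else 0) for name in NAMES}

def blend(proposals, weights):
    """Weighted average of several proposed divisions."""
    return {
        name: sum(w * p[name] for p, w in zip(proposals, weights))
        for name in NAMES
    }

# Four people propose an equal split; Dennis proposes that Dennis gets everything.
proposals = [equal_split(), equal_split(), equal_split(), all_to("Dennis"), equal_split()]
weights = [Fraction(1, 5)] * 5  # assumption: each person's proposal counts for 20%

bernard_split = blend(proposals, weights)
for name, share in bernard_split.items():
    print(f"{name}: {float(share):.0%}")
```

Under this reading Dennis ends up with 9/25 of the cake and everyone else with 4/25, which is the 36% and 16% Bernard quotes.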

Dennis: Forget it! That’s not fair! Who says that everyone’s desires should be weighted at 20%? I think that only my desires should count. You’re just trying to sneak in your way of looking at the world under a different name.

Autrey: Sorry, Bernard, it doesn’t work that way. I already took the impulse toward fairness into account in saying that everyone should get 20% of the cake. You can’t count the same factor twice.

Cathryn: I’ll say! What, I should be penalized for being an altruist? If Dennis gets 36%, I’m switching my preferences to 100% on the next round!

Eileen: I’d hate to see that happen, Cathryn. I like people who care about fairness and I wouldn’t want to see them penalized. Bernard, you need to take into account my preferences about the kind of system I want to live in and the kind of behavior it encourages, not just my immediate thoughts about the cake.

Bernard: Wow, look at all these different ethical imperatives… I’m not sure what happens when I apply 20% of each of them… maybe I should go by majority vote?

Dennis: Majority vote? That’s not fair! Who says that anyone else should get a vote?

Cathryn: Just because we all vote on something doesn’t make it right.

Bernard: “Right?” I know what it means to vote on something, but what does it mean for an ethical system to be “right”? There’s no criterion for deciding between ethical systems.

Autrey: There is under my ethical system. How would you take 20% of that into account?

Cathryn: Um, Eileen, what are you doing over there?

Eileen: I’m building a Friendly AI.

Autrey: That sounds like an action with implications far beyond our current dilemma.

Eileen: But it will solve the cake-division problem. That’s the beautiful thing about massive overkill.

Bernard: I don’t see how building a Friendly AI helps. An AI might help to implement a solution for dividing the cake, but how could a FAI help decide on a solution for dividing the cake?

Eileen: Deciding on a solution for dividing the cake is a cognitive task. If you understand a cognitive task deeply enough – very deeply, down to the level of pure causal dynamics – you can embody it as a computer program.

Cathryn: But dividing a cake isn’t just a straightforward computation like factoring a large composite number… I mean, we’re arguing about it. That’s more like… I don’t know quite what it’s like, actually.

Autrey: If you don’t know what it’s like, I guess you can’t embody it as a computer program.

Bernard: I don’t see what the ability to construct an AI gains us, except the ability to divide a lot more cake a lot faster, if we can agree on how to do it. Not that this would be a bad thing, I’m just asking about the relevance to the basic problem.

Eileen: Okay, here’s an example of a relevant case. Let’s suppose that three people can’t agree that any particular, specific proposed division of a cake is “fair”. We shall suppose in this example that the three agree on a moral principle for dividing the cake, but whenever one of them proposes a concrete division, the others disagree because it looks to them like the application of the moral principle has been unconsciously prejudiced by that person’s self-bias. If these people possess the skill to specify and create minds, they can embody the moral principle they agree upon in an independent cognitive system. Then they can agree to abide by the judgment of that mind. As you asked, this is an example of an ethical dilemma that can be solved more easily given the ability to construct an AI.

Autrey: This problem can also be solved by appealing to an uninvolved sixth party, who also agrees to the moral principle in question, to divide the cake.

Eileen: It depends on what kind of biases the people involved are worried about. Appealing to a randomly selected sixth party might get you a randomly biased unfair division.

Bernard: If you have no way of knowing which direction the bias is in, doesn’t that make the division fair? I’m reminded of someone who complained that the draft lottery wasn’t fair because the draft slips weren’t stirred enough. Who, specifically, was that lottery unfair to?

Cathryn: If you conducted a lottery that gave one winner the entire cake, and each of the three people had an equal chance of winning, that might be symmetrical, but it wouldn’t be fair.

Bernard: In the long run, repeated with enough cakes, it would be fair to within epsilon.

Cathryn: Some problems aren’t long-run problems. If an AI devised a lottery that gave all the money and property in the world to one person, and each of the six billion living humans had an equal chance of being selected, that would be a symmetrical lottery but not a fair solution.

Dennis: I’ll say. This lottery has only one chance in six billion of being fair.
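(An aside: the disagreement between Bernard and Cathryn can be made concrete with a small simulation. This is only a sketch under assumed details: three players labelled A, B, and C, a winner-takes-all lottery, and an arbitrary fixed random seed. None of these specifics come from the essay.)

```python
import random

def run_lottery(players, rounds, rng):
    """Give each whole cake to one uniformly random player.

    Returns each player's fraction of all cakes handed out.
    """
    wins = {p: 0 for p in players}
    for _ in range(rounds):
        wins[rng.choice(players)] += 1
    return {p: wins[p] / rounds for p in players}

rng = random.Random(0)       # fixed seed so the run is repeatable (arbitrary choice)
players = ["A", "B", "C"]    # hypothetical player labels

one_cake = run_lottery(players, 1, rng)          # Cathryn's case: a single cake
many_cakes = run_lottery(players, 30_000, rng)   # Bernard's case: the long run

print("one cake:  ", one_cake)
print("many cakes:", many_cakes)
```

With a single cake, one player holds everything and the other two hold nothing, however symmetrical the draw was. Only across many cakes do the shares drift toward one third apiece, which is Bernard's "fair to within epsilon".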

Bernard: Why would appealing to an AI be better than appealing to an uninvolved sixth party?

Eileen: The uninvolved sixth party has her own set of human biases. Even if she wants to be totally impartial, she doesn’t have that option. Even if she knows a class of biases exist within herself, she can’t get rid of them just by wishing, because she doesn’t have access to her own source code. One possible solution would be to construct cognitive dynamics that embody the moral principle in the exact form it was agreed upon, without extraneous forces.

Cathryn: Okay, I can see three people agreeing on a principle for dividing the cake. But what if the moral principle they agree on is wrong?

Eileen: The deeper you go, the harder things are to explain… right now I just want to talk about how the ability to create a specified AI can help resolve some classes of negotiation.

Bernard: Cathryn’s question sounds like obvious nonsense to me – no offense. If the three people all agree on it, who’s to say they’re wrong? What I want to know is, what if five people can’t agree on a moral principle for dividing the cake?

Eileen: They might be able to agree on a principle for resolving arguments about how to divide cakes. In fact, if they all had the conceptual equipment to understand the underlying cognitive dynamics, they’d probably find them a lot easier to agree on. Underlying cognitive dynamics have a lot more in common between humans than any specific political position. Our cognitive architectures have more in common than our politics. People all like their own political parties and hate those evil bastards on the other side; that’s a human universal even if the specific political parties are infinitely variable. I think it might prove much easier to agree on fair cognitive dynamics than to agree on whose political party is best.

Bernard: Will your FAI treat everyone’s cognitive dynamics equally?

Eileen: I’m not sure what you mean by that. I certainly don’t plan to introduce asymmetrical mentions of specific humans, if that’s what you mean.

Dennis: Eh?

Eileen: I’m not going to tell the FAI, “Make Dennis the ruler of the world.”

Dennis: Why not?

Bernard: In that case your FAI’s behavior seems easy enough to predict – everyone’s cognitive dynamics get equal input, so the FAI will weight all our definitions of fairness equally. That way we’d have, let’s see, 20% in favor of my definition of fairness, 20% in favor of Dennis’s definition of fairness, and 60% in favor of you three. So Dennis’s share would work out to… let’s see, now… 39%.
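(An aside on the arithmetic: Bernard's figure appears to come from averaging three definitions of fairness, weighted by how many people hold each. The sketch below assumes that "my definition" means the blended 36% split he proposed earlier. That reading is an inference from the dialogue, not something the essay states.)

```python
from fractions import Fraction

# Each entry: (weight of the definition, Dennis's share under that definition).
definitions = [
    (Fraction(1, 5), Fraction(9, 25)),  # Bernard: the blended split, 36% to Dennis
    (Fraction(1, 5), Fraction(1)),      # Dennis: the whole cake to Dennis
    (Fraction(3, 5), Fraction(1, 5)),   # Autrey, Cathryn, Eileen: 20% apiece
]

# Superpose the final answers, as Bernard suggests.
dennis_share = sum(weight * share for weight, share in definitions)
print(dennis_share)  # prints 49/125
```

49/125 is 39.2%, which matches Bernard's 39% after rounding.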

Eileen: Okay, that’s a good example of what I do not mean by fairness. That’s not agreeing on cognitive dynamics, it’s superposing our final answers. And the result, which is, no offense, ridiculous, shows the problem with that. You can’t just take surface judgments and embody them in an AI, even if you vote on them. Then you just have… a chatbot, or an encyclopedia. It’d be like combining the most popular present scientific theories, weighted by the number of scientists who believe in them, and hard-wiring them as beliefs. It’d be interesting to look at the output, but the output wouldn’t be a scientist. The result would be frozen in time; there could be no further progress. The result probably wouldn’t even be coherent as a scientific theory; you’d have a set of beliefs that no single individual would ever hold. The thing we’re hunting for is a dynamic cognitive process, not the frozen output of that process.

Dennis: Look, just let me make all the decisions. That’s fair.

Eileen: But, Dennis, you do see that I have no way to listen to you instead of the other people who are saying exactly the same thing?

Dennis: What, they’re saying that Dennis should make all the decisions? Good, they sound like sensible people to me.

Eileen: No, they’re saying that they should make all the decisions.

Dennis: Bah, what arrant nonsense! I hope you’ll dismiss those foolish speculations immediately and tell your AI to pay attention only to Dennis. That’s fair.

Eileen: Um… look, I’m sorry, but that statement has not been phrased in a way which allows it to be argued across moral agents.

Bernard: Now that sounds unfair. Why aren’t Dennis’s desires being taken into account?

Dennis: I’m glad to hear you’re on my side, Bernard! So you also see now that I should get the whole cake?

Bernard: No, I think you should get 36% of the cake.

Dennis: Bah, you’re just as bad as the rest. Why aren’t you taking my preferences into account?

Bernard: I am!

Dennis: No, I mean why aren’t you taking my preferences into account the way I want you to take them into account, not the way you want to take them into account? You aren’t using my preferences at all. Your preferences may change in some bizarre way that depends on my preferences as data, but they’re still your preferences. I want you to use my preferences.

Bernard: Okay, Dennis, I’ll take that into account.

Dennis: You’re insane.

Bernard: Eileen, I still don’t see how you can just say that Dennis’s preferences can’t be communicated. Cutting him off like that isn’t fair.

Eileen: I’m not putting Dennis into the class of a rock or a tape recorder. He can go on trying to come up with a morally communicable argument for becoming personal overlord of the universe. He just hasn’t done so yet. Right now, it’s impossible for either you or me to communicate with Dennis; he’s wrapped himself up in a small private world, morally speaking. He’s not using arguments that make sense in either your or my system. Neither of us can communicate with him about the fair division of the cake any more than we could communicate with a tape recorder playing back “Two plus two equals nine!” about arithmetic. Conversely, he can’t communicate with us either.

Dennis: But you get to decide which moral arguments you’ll be influenced by? Who died and made you God? I should decide that.

Bernard: It does occur to me to ask what rules you’re using.

Eileen: Well, in intuitive terms, imagine Joe saying to Sally, “My number one rule is: Look out for Joe.” If Sally hears that as a moral argument, she’ll hear: “Your number one rule should be: Look out for Sally.” In other words, Sally hears Joe, automatically substitutes “[your name here]” for “Joe”, and hears the general moral argument “Everyone should look out for themselves”. Now if Joe happened to be a Moonie, and said “My number one rule is: Look out for Reverend Moon”, Sally might hear that just the way Joe said it. There are rules and principles, instincts and intuitions, that create the transpersonal morality of humans. Right now, Dennis is behaving sort of like an extremely simple desirability computation devoted to turning the universe into paperclips. Or to put it another way, I can’t see any possible way to build a Friendly AI such that it would give the world to Dennis, without the original core program mentioning Dennis explicitly, which everyone who is not Dennis would say was blatantly unfair.

Cathryn: Eileen, you’re overcomplicating things. 20% apiece is obviously the correct way to divide this cake. It’s the correct answer independently of any amount of arguing we do about it, just like 2 + 2 equals 4 regardless of whether anyone is looking.

Eileen: It takes knowledge to make a physical object, like a calculator, which successfully computes that 2 and 2 make 4 – there are many other possible computations a piece of matter can implement, and most of them aren’t pocket calculators. It takes more knowledge to create a physical process whose output depends on its inputs in a way that steers the universe into particular states – that is, to create a Bayesian decision system. And it takes still more knowledge to create a physical process that can understand moral arguments as moral arguments – to compute the same question you’re computing when you say that 20% apiece is obviously the unique correct answer – with which I happen to agree, by the way.

Autrey: I wouldn’t be too sure that 20% is the answer. Humanity’s moral memes have improved greatly over the last few thousand years; but how much farther do we have left to go? We mostly got rid of slavery, we’re trying to get rid of racial prejudice, things like that. I was born too late myself, but you don’t have to be very old to remember a time when blacks rode in the back of the bus!

Bernard: I remember.

Autrey: What if we, ourselves, look like unkempt barbarians from the perspective of a few years down the road? No, I’ll make the statement stronger; it seems nearly certain that’s how we’ll look. “20%” may just be a solution that we seize upon because our instinct is to choose simple, cheating-resistant solutions that are obvious to all players. Would we pick a less simple but more optimal solution if we lived in a world where people weren’t so tempted to employ elaborate arguments to get more than their share? What if we each have different preferences for cake and icing, making the problem non-zero-sum? What if we’re barbarians for not using Condorcet Voting, Brams-Taylor Fair Division, or Throatwarbler-Mangrove Time-Discounted Volition?

Eileen: I agree that humanity has grown up a lot in the last few thousand years, or even just the last century. And I would say it’s that dynamic process we need to preserve, not a snapshot of where we are today. Individuals improve their moralities, as do civilizations. So a Friendly AI can’t have just the frozen values of the person who happened to create it, or even the frozen values of the civilization that happened to give birth to it. The classical stereotype presents an “AI” as a machine, echoing the frozen knowledge of its creators. But the AIs that just echo stored knowledge don’t grow, don’t learn, and this is not a “philosophical” problem; it reflects a real failure to implement specific cognitive abilities. It is part of the difference between computer programs, which is what we have now, and real AI, which is something no one has created yet.

Cathryn: Define “grown up”.

Autrey: Define it? Why?

Cathryn: Because until you define it, there’s nothing to talk about.

Autrey: I don’t suppose you’ve ever read Robert Pirsig’s “Zen and the Art of Motorcycle Maintenance?”

Cathryn: ‘Fraid not.

Autrey: The main character of the book is an English professor named Phaedrus, trying, heaven help him, to teach his students to write. Now, writing quality – or as Phaedrus would put it, Quality in writing – is one of those things that is very difficult to define; and certainly his students can’t do it. Asked to define quality in writing, no one has any ideas. And yet when Phaedrus shows his students low-quality work and high-quality work, they are able to agree on which is which; they can see quality, even though they can’t define it. First, says Phaedrus, you show your students what Quality is, and show them that they know how to judge it even if they can’t define it; then you introduce the writing rules, not as blind laws to be followed blindly, but as means to the end of Quality, which the students have now learned to see.

Cathryn: And is this a true story, or something the author just made up?

Autrey: I don’t know. Good question. One must avoid generalizing from fictional evidence, after all.

Eileen: Plenty of cognitive psychology papers discuss, in passing, people’s highly correlated judgments of qualities they would probably be extremely hard-pressed to define.

Autrey: Pirsig doesn’t like the idea that you must define things verbally – that only things you can define verbally are permitted as subjects of discussion. It can create a kind of blindness, people shutting out their own intuitions about Quality.

Eileen: In my view, your judgment of something’s Quality is the primary fact, and verbal definitions are attempted hypotheses about that fact. That we can see writing Quality is a fact. Attempts to give verbal definitions of “writing Quality” are attempts to make hypotheses about what it is, exactly, that we are seeing – hypotheses about the cognitive processes that underlie the judgment. For your hypothesis to interfere with your perception of the facts is, of course, a sin.

Cathryn: You’re saying what? That even if I don’t understand my own sense of “grown up”, it’s okay to have that sense and use it?

Eileen: The way in which you pass these judgments is an aspect of the universe, and in particular, cognitive science, which you have observed but not yet explained.

Autrey: Being unable to define what underlies your judgment of grown-up-ness, or moral improvement, or fairness, doesn’t mean that these things are demoted to some kind of second-rank existence. Even if you tried to give a verbal definition of, say, “fairness” – or as Eileen would put it, a hypothesis about fairness – you’d have to hold on very tightly to your intuitive judgment of fairness and continue checking your intuitive perception along with your verbal definition. In my experience, when people try to give off-the-cuff verbal definitions of such things, the definitions are usually wrong – or, at best, wildly inadequate. That is the danger of philosophy.

Eileen: Verbal definitions of our perceptions are usually inadequate because the real answer is a deep question of cognitive science, and people are trying to make up “philosophical” answers in English. Most times it ends up being like the various attempts to “define” what fire is in terms of phlogiston, stories about how the Four Elements run the universe, and so on. Today we know that fire is molecular chemistry, but that’s one heck of a nonobvious explanation behind what seems like a very simple sensory experience.

Dennis: I don’t buy Pirsig’s line about Quality. You claim that writing quality is a fact that has been observed, but not yet explained. What makes you think that there is an explanation for it, or that anyone will ever be able to give a verbal account of it? If people can’t give a verbal definition of a term, it must be because the term is fuzzy and arbitrary and useless.

Eileen: If a group of people agree on something that seems arbitrary, that itself is an interesting fact about cognitive science, and you should look for a common computation carried out in rough synchrony.

Autrey: Go players make their moves by judging a kind of “Go Quality” that has never been satisfactorily explained to anyone, and has not yet been embodied in any computer program. Go players have this mysterious, inexplicable sense of which moves are good, which Go configurations “feel” good or bad. They just pick it up from playing a lot of games. Playing from their unverbalizable sense of Go quality, those Go players beat the living daylights out of today’s best computers according to very clear, objective criteria for who wins or loses. If you’re a novice player playing chess, and you lose, you at least have some idea of what happened to you afterward; the other guy backed you into a corner and left you with no options and took all your pieces and so on. I’ve been losing a few games of Go here and there, and even after I lose I have no idea why. I put the stones on the board and then they go away. I know the rules and yet I don’t understand the game at all. Strong Go players have an unverbalizable sense of Quality that gives rise to specific, definite, useful results according to a clear objective criterion.

Eileen: I’d expect there’s an explanation for those perceptions in terms of how some areas of neural circuitry are trained by the experience of playing Go.

Autrey: Sure, but knowing that there exists an explanation is not the same thing as knowing the explanation.

Cathryn: I’m uncomfortable with leaving terms like “fairness” undefined. Maybe I can’t define why I’m uncomfortable, but I am.

Autrey: If you can’t define something you should be uncomfortable, because that is a real gap in your knowledge. But you can’t jump the gun. If you don’t know how to define something, then you don’t know. If you try to create a definition before you have the knowledge to build a good one, you end up defining fire as the release of phlogiston.

Eileen: In order to understand your own sense of fairness, you should begin by studying it, becoming aware of it, learning how it works, rather than trying to give a preemptive definition of it. The Quality of intelligence, for example, is a real thing that has been preemptively defined in so many wrong ways. The job of an AI researcher, in a sense, is to take things that appear as opaque Qualities, and figure out what really underlies them – not in loose hand-waving explanations, but in enough detail to create them. It’s the challenge of creation that’s the highest and most difficult test of an explanation.

Autrey: And here, of course, is where AI researchers really, really screw up. Whenever you hear an AI researcher defining X as Y, you should always hold on to your intuitive sense of X, and see whether the definition Y really matches it properly, explains the whole thing with no lingering residue. Especially when it comes to terms like “intelligence”. When you get an explanation right there should never be a feeling of forcing the experience to fit the definition, no sense of holding a mirror up to Life and chopping off the parts of Life that don’t fit.

Dennis: You stole that from Terry Pratchett.

Autrey: When I look at AI researchers’ explanations, I get the strong feeling that they’re trying to shove a Quality into a definition it really doesn’t fit at all, like someone trying to stuff their entire wardrobe into a 20″ luggage. So I’m going to stick with my intuitive understanding of these terms unless someone comes up with one heck of a good hypothesis.

Eileen: But that refusal doesn’t allow for any incremental progress. You can make a hypothesis that some effect contributes to our perception of a Quality, without claiming to have explained the whole thing. You shouldn’t summarily reject that attempt just because the hypothesis doesn’t explain the entire problem in itself. That’s why I agree with Pirsig that you shouldn’t demand a verbal definition before you start.

Cathryn: How can you talk about a problem at all, if you don’t have a definition of what you’re talking about?

Eileen: By using extensional definition instead of intensional definition. Extensional definition works by presenting one or more experiences from which a general property can be abstracted –

Autrey: Eileen? I’ll handle this. Cathryn, what is “red”?

Cathryn: It’s a color.

Autrey: What’s a “color”?

Cathryn: It’s a property of a thing.

Autrey: What’s a “thing”? What’s a “property”?

Cathryn: Um…

Autrey: That’s an example of intensional definition – trying to define words using other words. Now, to give an extensional definition, I’d say: “You see that traffic light over there? That’s ‘red’. You see that other traffic light over there? That’s ‘green’. The way in which they differ is ‘color’.”

Bernard: Of course, someone working from that definition alone might get confused and think “red” meant “top” and “green” meant “bottom”…

Eileen: Autrey’s extensional definition of “red” is ambiguous – it spreads out to encompass more possibilities than its maker intended. Whether you’re using extensional definition or intensional definition, or a mix of both, you have to make sure the map leads to only one place. And that is determined, not just by the map, but by the mind that follows the map. It sounds unambiguous to us – but only because we already know the desired answer! You have to watch out for that.
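(An aside: Bernard's objection can be phrased as a small hypothesis-elimination exercise. The sketch below is illustrative only. The two candidate readings and the extra example of a red light in the bottom position are assumptions made for the demonstration, not part of the essay.)

```python
# Each labelled example: (actual color, position on the pole, word the teacher says).
examples = [
    ("red", "top", "red"),
    ("green", "bottom", "green"),
]

# Two hypothetical readings a learner might form from those examples.
hypotheses = {
    "word names the color": lambda color, position: color,
    "word names the position": lambda color, position: {"top": "red", "bottom": "green"}[position],
}

def consistent(hypothesis, data):
    """True if the hypothesis reproduces every label in the data."""
    return all(hypothesis(color, position) == label for color, position, label in data)

def survivors(data):
    """Names of the hypotheses that fit all the examples."""
    return [name for name, h in hypotheses.items() if consistent(h, data)]

# Autrey's two traffic lights cannot tell the readings apart.
print(survivors(examples))

# One more example, a red light in the bottom position, leaves only one reading.
more_examples = examples + [("red", "bottom", "red")]
print(survivors(more_examples))
```

Both readings fit the two traffic lights, so the map does not yet lead to only one place. The extra example is what rules out the "position" reading, and a learner who already knows the intended answer would never notice it was missing.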

Autrey: Still, I usually find that extensional definition is to be preferred over intensional definition. Or as writers are admonished: “Show, don’t tell.” Otherwise you get lost in a maze of words that point to other words but never link to anything real.

Eileen: *cough*semanticnets*cough*

Cathryn: How would you give an extensional definition of morality?

Autrey: Well, you would point to a set of decisions and say which ones were and weren’t moral. And I would point to you and say, “See, that’s a morality.”

Cathryn: I don’t think that helped any.

Autrey: That particular answer doesn’t advance on the problem of de-opaquing “morality”, but it’s a way of pointing to the thing you want to investigate.

Cathryn: I think I’d be very nervous if an AI was looking at me, though. What if I got something wrong?

Bernard: Wrong? How can you get a moral judgment “wrong”?

Cathryn: Just watch me.

Bernard: Doesn’t getting something “wrong” require an external standard to compare it to?

Cathryn: In my experience, getting something wrong never requires anything more than a failure to pay attention.

Bernard: No, I mean… suppose you say you want an orange. How can you be “wrong” about that?

Cathryn: What do you mean? I’ve been wrong about what I wanted plenty of times. Sometimes I think my whole life has consisted of nothing else.

Bernard: That’s a philosophical impossibility.

Cathryn: I can screw up even when it is philosophically impossible for me to do so. That is the power and the curse of Cathryn, and lesser mortals can but look upon me in awe.

Autrey: That reminds me of something that’s been bugging me lately. Eileen, what’s a Friendly AI supposed to do, aside from dividing up this cake fairly? The cake-division problem we’re faced with may illustrate interesting things about fairness, but as a zero-sum game it doesn’t really reflect what life is about.

Eileen: I remind you that I can’t speak for a Friendly AI.

Autrey: Guess.

Eileen: In the beginning, when I thought about these kinds of moral questions, I used to think in terms like: “Is it better to be happy or sad? Is it better to be alive or dead? Is it better to be smart or stupid?” As it happens, I would come down on the happy/alive/smart side of the divide.

Dennis: More Culture propaganda.

Eileen: But then I started considering whether, if someone doesn’t want to be happy, it’s right to force them to be happy. As it happens, I would answer no. Having answered no, I found that I’d given individual autonomy and self-determination the deciding vote. This leads to the question of whether anything other than personal volition should even have a vote at all. Volition might “capture” happiness and life and smartness as special cases of self-determination – people’s choices to be happy, alive, and smart. But then, while I do think that people have the right to be sad if they want to be, I’m not as sure of that conclusion as I am about people’s right to be happy if they want to be. That suggests that even if my final conclusion is volitionism – people getting what they want – the morality behind the conclusion isn’t captured by volitionism alone. I guess you could sum up my present position by saying that I wouldn’t interfere with someone’s choice to be sad, but I would choose to be sad about it.

Autrey: That’s you. What about a Friendly AI?

Eileen: We don’t really have the language to describe what a Friendly AI is at this point in our conversation.

Autrey: Okay, but what does it work out into in practice?

Eileen: Knowing how to describe a thought isn’t the same as being able to think it yourself. What does the square root of 298304 work out to in practice?

Autrey: Probably around 550. What’s your guess for the behavior of a Friendly AI?

Eileen: I think that somewhere along the line, there’s going to be a deep principle that runs something like, “help people in accordance with their volitions”.

Autrey: Okay, this is the part that has always worried me about the scenario of ultrapowerful AIs helping humans. We just heard Cathryn say she’s frequently been wrong about what she wanted. Any sufficiently powerful friendship is indistinguishable from geniehood. Even if AIs are willing to help, do we know what to ask for? If wishes started coming true – even if they could only affect the person who made the wish and no one else – I think 98% of Earth’s population would destroy themselves within a week.

Cathryn: 98%? Even I think that’s pessimistic, and I speak as a woman who goes about beneath a constant cloud of doom.

Autrey: I’m not talking about deeply buried suicidal tendencies or any such psychoanalytic gibberish. I’m talking about the law of unintended consequences. People just can’t read that far ahead into the future. They would think a little about the “Wishing” problem, experience some anxiety over it, and make up some rules for themselves to follow, until they’d dealt with whatever anxieties they had. And then they would make their wishes and die, because their model wasn’t even close to reality.

Eileen: Speaking of “making up rules”, Autrey, why do you think that this is the way “help” should work? It seems to me like you’re making up rules for “wishing”, then extrapolating those rules out to where they fail. The criterion for how good rules ought to work is determined, not by the part of you making up these rules, but by the part of you looking over your own rules and seeing they won’t work.

Autrey: Eh?

Eileen: Let me put it this way: Suppose you find a genie bottle. And suppose that the genie in the bottle is a genuinely helpful person – your good friend who just happens to be a genie. We’ll leave aside for a moment the question of how to specify a genie like that. How would you want an ultraintelligent friend to react to your wishes?

Autrey: Uh… I don’t know.

Cathryn: I don’t know either!

Autrey: Thanks, Cath.

Cathryn: No charge.

Bernard: This all strikes me as rather speculative. Wouldn’t it make more sense to start by asking about ordinary, human-level help?

Autrey: No, effective omnipotence genuinely does strike me as the most probable outcome of [recursive self-improvement].

Bernard: I don’t believe that. But regardless, I’m asking about the rational order of addressing the problem.

Eileen: Analyzing the case of unlimited computing power makes the structure of the problem clearer. Power corrupts, but absolute power sure simplifies the math.

Bernard: Okay. I’d wish for the genie to look ahead into the future for me, and tell me if my wish has any awful unintended consequences.

Cathryn: I’d wish for the genie to fulfill the request I would have made if I could foresee the future.

Dennis: I’d wish to be the genie.

Autrey: You are all so dead.

Eileen: Hmm. I note that all three of you made a meta-wish, rather than specifying the properties of the mind that reacts to your wishes.

Cathryn: Interesting point. I guess I’ve read more short stories about people making wishes than people building genies.

Eileen: Bernard, how would a genie know whether you consider something an “awful unintended consequence”?

Bernard: No problem. We’re talking omniscience along with omnipotence, right? The genie can get a total scan of my every neuron down to the atomic level. The genie can use that to figure out what I would consider an “awful unintended consequence”. For example, suppose that I wish for the genie to bring me the nearest banana, but the nearest banana turns out to be rotten. I’d consider that an unintended consequence. So the genie should warn me.

Eileen: The genie not only has to read your mind, the genie also has to extrapolate your reactions to a situation you have not actually encountered. It may seem very straightforward to say that someone with your tastes will spit out a rotten banana in disgust – but there is an additional step involved in computing that, above and beyond having a physical readout of the state of your taste buds and taste-related neural circuitry. And while we’re on the subject, it takes an additional computational step to identify what is “you”, and your “sense of taste”, within the raw data of your physical readout.

Autrey: Any intelligence, any mind, is embodied in physics – protein machinery, molecules, atoms, ultimately quarks and electrons. If we’re talking about genuinely infinite computing power, and a full readout of the amplitude field over the Bernard subspace that constitutes Bernard’s quantum state – actually, that’s physically impossible. Can I assume it anyway?

Eileen: Sure. You couldn’t do that with an actual genie, but it’ll simplify the thought experiment.

Autrey: Given all that, then, I can precisely extrapolate Bernard’s physical state forward in time to determine his reaction to the rotten banana. If you allow for less than full omniscience, or less than infinite computing power, you get a probabilistic computation rather than a deterministic one. That’s essentially what I do in real life when I guess that Bernard will react badly to a rotten banana. All I need to know about Bernard is that he’s human. I don’t have a precise readout of his quantum state, or infinite computing power. Yet, uncannily, my guess is still correct. It’s as if I had this eerie ability to reason about the physical universe using limited information and bounded computing power, and make decisions under conditions of uncertainty.

Eileen: A physically detailed simulation of Bernard would be a real person, but we’ll ignore that consideration for the time being – if the problem is unsolvable even without ethical constraints on how the solution is computed, it’ll still be unsolvable after the constraints are added back in. If we know how the ideal solution would be computed using infinite computing power and no ethical constraints, we can then try and figure out how to ethically compute an approximation.

Autrey: Okay. I propose that the problem of extrapolating Bernard’s reaction is computable in principle and approximable in practice.

Cathryn: Objection: Physics is not deterministic. If you extrapolate Bernard forward, you’ll get a set of probabilities for different possible states, not a definite single state.

Autrey: Actually, I believe in the many-worlds formulation. Everett, Wheeler-DeWitt, yada yada. The many-worlds formulation is deterministic.

Cathryn: So you end with many different real Bernards, each with a different measure. How is that any better than ending up with many different possible Bernards, each with a different probability?

Autrey: I don’t see any significant difference that depends on which formulation of physics you use. What’s your point?

Cathryn: If there are many Bernards, who has the final say?

Autrey: The one with the greatest measure, or the greatest probability. No, wait, that’s not right. You should add up the different degrees of happiness or sorrow for each possible Bernard, multiplied by the probability or measure of that Bernard. Then that determines Bernard’s expected satisfaction with the wish.

Dennis: That doesn’t sound right to me. If I wish for a banana, and all of my possible future selves are chewing in bovine happiness, then sure, go ahead and fulfill the wish. But if there’s a 60% probability that my future self is wildly ecstatic about the banana, and a 40% probability that my future self runs screaming out of the room babbling about the Elder Gods, then I’d want to know what the clickety-clackety heck was up with that banana before I chowed down. Even if the average satisfaction is the same for both cases, they look very different to me.
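Dennis's objection can be made concrete with a small sketch. The satisfaction scores below are hypothetical (the dialogue gives the 60/40 split but no numbers); they are chosen only so that both cases have the same probability-weighted average, as in Autrey's proposal.

```python
# Each case is a list of (probability, satisfaction) pairs.
# The satisfaction values are illustrative assumptions, not from the essay.
calm_case = [(1.0, 2.0)]                     # every future self is mildly content
split_case = [(0.6, 10.0), (0.4, -10.0)]     # 60% ecstatic, 40% screaming


def expected_satisfaction(outcomes):
    """Probability-weighted sum of satisfaction over possible future selves."""
    return sum(p * s for p, s in outcomes)


mean_calm = expected_satisfaction(calm_case)
mean_split = expected_satisfaction(split_case)
print(mean_calm, mean_split)
```

Under these assumed scores the two averages coincide, so the average alone cannot tell the cases apart.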

Autrey: How would you measure the difference between those two cases? I’m sure there’s some standard statistical way of doing that, but offhand I can’t remember what –

Bernard: Measure the variance from the mean.

Eileen: Measure the Shannon entropy of the probability distribution.

Autrey: Oh, of course.

Dennis: I know that the variance is the average of the squared differences from the mean, but what’s the Shannon entropy?

Cathryn: Shannon entropy is a way of measuring uncertainty in probability distributions. For example, suppose you have a system A that is equally likely to be in any of 8 possible states [A1..A8]. The Shannon entropy of system A is log2(8), or 3 bits. Let’s suppose that another system B is equally likely to be in any of 4 possible states [B1..B4]; then B’s Shannon entropy would be 2 bits. What is the entropy of the combined system that includes A and B?

Dennis: The combined system can be in any of 32 states, so it has 5 bits of entropy.

Cathryn: Trick question!

Dennis: Of course.

Cathryn: It depends on whether there’s any mutual information between A and B. If the probabilities for A and B are independent, then system AB will have 32 possible states, any one of which is equally likely. But suppose we know that if system A is in an even-numbered state, B must be in an even-numbered state; while if system A is in an odd-numbered state, B must be in an odd-numbered state. The combined states [A2, B3] or [A7, B2] and so on have been ruled out. If we look at A alone, A could be in any of 8 possible states, which are equally likely; B alone could be in any of 4 possible states, which are equally likely; but when we look at A and B together, there are only 16 possible states the combined system could be in. A has 3 bits of entropy, B has 2 bits of entropy, and their mutual information is 1 bit, so the combined system AB has 4 bits of entropy. Similarly, learning whether A is odd or even tells us the parity of B, so learning A’s exact state reduces the number of states B could be in from 4 to 2, which reduces the entropy of B from 2 bits to 1 bit, so knowing A provides 1 bit of information about B.
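Cathryn's counting can be checked with a short sketch. It assumes, as she does, that every allowed joint state is equally likely; the helper names are made up for illustration.

```python
from itertools import product
from math import log2


def entropy_bits(probabilities):
    """Shannon entropy in bits: sum of -p * log2(p) over the distribution."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


def uniform_entropy(states):
    """Entropy of a system that is equally likely to be in any listed state."""
    n = len(states)
    return entropy_bits([1 / n] * n)


a_states = list(range(1, 9))   # A1..A8
b_states = list(range(1, 5))   # B1..B4

# Independent case: all 32 joint states are allowed.
independent = list(product(a_states, b_states))
# Parity-linked case: A and B must both be even or both be odd.
parity_linked = [(a, b) for a, b in independent if a % 2 == b % 2]

h_a = uniform_entropy(a_states)
h_b = uniform_entropy(b_states)
h_ab_independent = uniform_entropy(independent)
h_ab_linked = uniform_entropy(parity_linked)
# Mutual information is what the parts have in excess of the whole.
mutual_information = h_a + h_b - h_ab_linked

print(h_a, h_b, h_ab_independent, h_ab_linked, mutual_information)
```

The parity rule leaves 16 of the 32 joint states, which is where the missing bit goes.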

Dennis: What if the probabilities for B are [B1: 1/2, B2: 1/4, B3: 1/8, B4: 1/8]?

Cathryn: The Shannon entropy of the system would be 1.75 bits, found by summing -P(X)log(P(X)) for all states X.

Dennis: Eh? How can a system have 1.75 bits of information in it? How do you store three-fourths of a bit? Write down only part of a ‘1’ or ‘0’? I’m having a hard time visualizing this. B has four states. If you needed to tell me which state it was in, you’d need to transmit ’00’, ’01’, ’10’, or ’11’. That’s two bits of information in each case.

Cathryn: But half the time B is in state B1. So suppose I use ‘1’ to indicate state B1, ’01’ to indicate state B2, and ‘000’ and ‘001’ to indicate state B3 and B4. If you look at that coding, you’ll see that it’s unambiguous – if the start of the sequence is clearly marked, you can always figure out where each symbol begins and ends. So half the time I transmit one bit, a quarter of the time I transmit two bits, and a quarter of the time I transmit three bits, which adds up to 1.75 bits for the average case. The arithmetic I just performed works out to summing -P(X)log(P(X)) for each possibility, which is the formal definition of the Shannon entropy.
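The coding Cathryn describes can be written out directly. This is a minimal sketch using her codewords and Dennis's probabilities; the `decode` helper and the sample message are illustrative assumptions.

```python
from math import log2

probabilities = {"B1": 1 / 2, "B2": 1 / 4, "B3": 1 / 8, "B4": 1 / 8}
code = {"B1": "1", "B2": "01", "B3": "000", "B4": "001"}

# Average number of bits transmitted per state under this coding.
average_length = sum(p * len(code[state]) for state, p in probabilities.items())
# Shannon entropy of the same distribution, in bits.
entropy = -sum(p * log2(p) for p in probabilities.values())


def decode(bits):
    """Read codewords left to right; no codeword is a prefix of another."""
    inverse = {word: state for state, word in code.items()}
    states, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            states.append(inverse[current])
            current = ""
    return states


message = ["B1", "B3", "B2", "B1", "B4"]
encoded = "".join(code[state] for state in message)
print(average_length, entropy, encoded, decode(encoded))
```

Because the probabilities here are all powers of one half, the average code length matches the entropy exactly; for other distributions a code like this can only approach it.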

Dennis: Wait, how does that fit with the definition you gave earlier, for systems A and B?

Cathryn: It’s the same definition. First you have the system A, which is 8 equally likely states, or 8 states each with probability 1/8. So you sum up 8 terms, each of which has value -(1/8)(log(1/8)). The end result is just -log(1/8) or log(8), which is 3. System B has 4 states each with probability 1/4, which works out to (4)(1/4)(-log(1/4)) or 2 bits of entropy by the same logic. Then when you work out the combined entropy of the independent systems A and B, you end up with (4*8)(1/4*1/8)(-log(1/4*1/8)), which, lo and behold, works out to -log(1/4*1/8) or -log(1/4) + -log(1/8). So when A and B are independent, the entropy of the combined system AB is equal to the entropy of A plus the entropy of B.

Eileen: You might want to play with the [properties] of Shannon entropy a bit. For our purposes, the important thing about Shannon entropy is that if you have a few very strong possibilities, the entropy is low; if you have a lot of weak possibilities, the entropy is high. So entropy behaves like “uncertainty”.

Cathryn: Entropy measures the volume of configuration space in which you might end up. If each possible state of the entire world is a single point in configuration space, then a volume in configuration space can represent your uncertainty about the state of the world. For example, if a system consists of three variables X, Y, Z that can take on continuous quantities, then you can represent any possible state of the system by a point in three-dimensional space. If you have a system of ten particles, each one of which has a three-dimensional position and a three-dimensional velocity, any possible state of that system can be represented by a point in a sixty-dimensional phase space. Since any possible state of that physical system can be represented by a point in phase space, a volume in phase space describes many different possible states of a physical system. So if you’re uncertain about what state a physical system is in, you’re uncertain about where the point is in the phase space. You can describe that uncertainty by drawing a border around the volume of phase space the point might occupy. The wider the border, the larger the volume, the greater your uncertainty, the greater the entropy. If you take, say, a bunch of gas molecules, and heat them up, then their range of possible velocities increases because they’re moving faster. So since the X, Y, Zs of velocity vary within a greater range, the volume in configuration space needs a larger border, and the entropy goes up. The hotter an object is, the more entropy it has.

Eileen: Entropy isn’t the volume in configuration space, it’s the logarithm of the volume of configuration space.

Cathryn: Er, yes. If you take the logarithm of the volume in configuration space, then your uncertainty about many independent systems combined, equals the sum of your uncertainties about each individual system of particles alone. The next step is, instead of marking each point in configuration space as “possible/impossible”, you mark it with the probability of ending up in that point. The resulting definition is equivalent to the Shannon entropy. If you’re dealing with continuous physical variables you get something called the differential entropy, but it behaves pretty much the same way as the Shannon entropy. Or for quantum systems there’s the von Neumann entropy, but again, it’s pretty much the same as the Shannon entropy.

Dennis: I’ve always heard that entropy measures the amount of additional information you would need to know the exact state of a physical system.

Cathryn: It’s all equivalent. If a system has four bits of Shannon entropy, then the average length of a message that specifies the exact state of the system is four bits. Incidentally, the second law of thermodynamics is a consequence of a theorem which can be derived from either classical or quantum physical law, and this theorem, Liouville’s Theorem, says that probability is incompressible. If you have a volume of states in configuration space, and you extrapolate the evolution of that volume forward in time under our physical laws, then you must end up with exactly the same volume you started with. If you imagine starting with a nice, neat, compact blob in configuration space, then if you extrapolate each point in the blob forward in time 5 minutes, you have to end up with exactly the same total volume at the end. It might not be a nice, neat, compact volume, though. It might be a squiggly volume with lots of tentacles. If you didn’t keep track of all the exact squiggles, which would be a lot of work, you’d have to draw a much bigger blob to surround everything. So “entropy”, the size of the blob you draw to capture your uncertainty about the system, usually increases and never decreases. That’s why no one can ever build a perpetual motion machine, no matter how clever they are with wheels and gears, if their wheels and gears are governed by the physics we know. A perpetual motion machine that takes in hot water and produces electricity and ice cubes is a physical process that maps a great big blob in configuration space into a little tiny blob, and the incompressibility theorem says this is impossible. If you want to squeeze a big blob onto a tiny blob in one subsystem, another subsystem somewhere has to bloat up from a tiny blob into a big blob, so that the phase space volume of the total system is conserved. 
If you have a physical process such that subsystem B develops from 4 possible starting states into 1 final state, then some other subsystem, say the A subsystem, has to develop from 1 definite starting state into 4 possible final states. And those 4 final states of A will probably be spread out so much that we have to describe A using the range [A1..A8]. You can move entropy around from one subsystem to another, but you can never actually reduce entropy. You can freeze water into ice cubes, but you need somewhere to dump that heat, plus whatever additional heat was generated by the work involved. Hence thermodynamics.
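A toy version of this bookkeeping, as a sketch only: the `step` function below is a made-up reversible rule on joint states (a, b), not a model of any real physics, and the states are numbered from 0 rather than 1. Because the rule is one-to-one, it cannot shrink the number of possible joint states; it can only move the uncertainty from B to A.

```python
def step(state):
    """Hypothetical reversible dynamics on joint states (a, b).

    States with a in 0..3 have their two coordinates swapped; all other
    states are left alone. The rule is its own inverse, so it is one-to-one.
    """
    a, b = state
    return (b, a) if a < 4 else (a, b)


all_states = [(a, b) for a in range(8) for b in range(4)]
# One-to-one check: the rule merely rearranges the 32 joint states.
assert sorted(map(step, all_states)) == sorted(all_states)

# Before: A is definitely in state 0, B could be in any of 4 states.
before = {(0, b) for b in range(4)}
after = {step(s) for s in before}

a_values_after = {a for a, _ in after}
b_values_after = {b for _, b in after}
print(len(before), len(after), sorted(a_values_after), sorted(b_values_after))
```

B ends in one definite state, A ends in one of four, and the count of possible joint states is unchanged.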

Dennis: Cool. Back to wishes.

Eileen: If your response to the banana is pretty much localized in a single, definite high-level reaction like “Mm, nice banana”, then your reaction is definite; it has low entropy. If your different possible futures contain a very wide range of reactions, then your response is uncertain; it has high entropy.

Autrey: Ah, very nice.

Bernard: I would still prefer to compute the variance. Taking the discrete Shannon entropy assumes that each possible outcome is entirely distinct; it doesn’t take into account the distance between possibilities. Many similar reactions should count as less variance than a few widely different reactions. Two equiprobable, widely separated reactions should count as more uncertainty in a wish than eight equiprobable reactions in a neat local cluster.
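Bernard's contrast can be sketched numerically, assuming reactions can be placed on a single numeric scale. That scale and the positions below are hypothetical; the dialogue does not define a distance metric.

```python
from math import log2


def entropy_bits(dist):
    """Discrete Shannon entropy in bits; ignores where the outcomes sit."""
    return -sum(p * log2(p) for _, p in dist if p > 0)


def variance(dist):
    """Probability-weighted squared distance from the mean."""
    mean = sum(x * p for x, p in dist)
    return sum(p * (x - mean) ** 2 for x, p in dist)


# Each entry is (position on an assumed reaction scale, probability).
two_far = [(-10.0, 0.5), (10.0, 0.5)]
eight_near = [(1.0 + 0.1 * i, 1 / 8) for i in range(8)]

print(entropy_bits(two_far), variance(two_far))
print(entropy_bits(eight_near), variance(eight_near))
```

Entropy ranks the eight clustered reactions as the more uncertain case, while variance ranks the two distant reactions as more uncertain, which is the disagreement Bernard is pointing at.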

Eileen: I tend to think of reactions as distinct possibilities with complex internal structure. But if there’s a clearly defined distance metric between different possible reactions, then yeah, the variance might be a better measure. Perhaps both the variance and the entropy are too simple to really capture our intuitive definition of the uncertainty in our reaction to an event, but either definition would be a good place to start. Since I need to pick a term and stick with it, I’m going to talk about the spread in a person’s reaction to an event.

Cathryn: Okay, what’s spread?

Eileen: I tend to think of spread as very similar to entropy – in fact, I formerly used the word “entropy” – because when I think of spread, I visualize an uncertain volume of possibilities. But the spread in Bernard’s reactions is not the same as his physical entropy, because Bernard’s “satisfaction” is a high-level characteristic of the Bernard subsystem, and it’s computed in a lossy way. Two possible Bernard microstates can have the same “reaction” for our purposes, and yet be different in other ways we don’t care about. We won’t distinguish between a vast number of different Bernard microstates which are all “satisfied”; that’s physical entropy which isn’t counted into the spread. We won’t distinguish between equally “satisfied” Bernards with electron #whatever in spin-up or spin-down. The two states have the same utility from a moral perspective – they are, from our perspective, effectively interchangeable. You can define a complicated subspace of the physical configuration space, not by considering a subset of the particles and their associated dimensions in the configuration space, but by defining the classes of physical states that are interchangeable with respect to your decision system. Then you can consider the entropy with respect to the superspace of those subspaces, apart from the physical entropy.

Cathryn: Then what’s the relation between spread and entropy? Or are they related at all, under this definition?

Eileen: Not all physical entropy can be interpreted as spread. However, all spread necessarily implies physical entropy. If Bernard could end up “screaming in horror” or “wildly ecstatic”, then that spread requires some amount of physical entropy. Different physical microstates can map to the same “reaction”, but different Bernard reactions must map to different physical microstates. For there to be spread in Bernard’s reaction, there must be entropy in the volume of possible physical states Bernard could end up in. All spread implies entropy, but not all entropy implies spread. I think of “spread” as a partial measurement of the entropy of a system – you’re measuring only a particular kind of entropy that you care about, or the entropy with respect to a particular partitioning of the system into subspaces of interchangeable states. But if you used something other than the Shannon entropy to define the spread, like the variance, it might not be appropriate to call the spread a partial measurement of entropy. Spread would still imply entropy, but there might not be any simple relation between the two measures.
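One way to read Eileen's "partial measurement" is as the entropy of a coarse-grained distribution. The sketch below assumes the Shannon-entropy version of spread and uses made-up microstates, each labelled with a reaction plus a detail we do not care about.

```python
from collections import defaultdict
from math import log2


def entropy_bits(probabilities):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


def reaction_distribution(microstates):
    """Lump together microstates that share the same high-level reaction."""
    totals = defaultdict(float)
    for (reaction, _detail), p in microstates.items():
        totals[reaction] += p
    return dict(totals)


# Hypothetical microstates: (reaction, irrelevant detail) -> probability.
mixed = {
    ("satisfied", "spin-up"): 0.25,
    ("satisfied", "spin-down"): 0.25,
    ("horrified", "spin-up"): 0.25,
    ("horrified", "spin-down"): 0.25,
}
all_satisfied = {
    ("satisfied", "spin-up"): 0.5,
    ("satisfied", "spin-down"): 0.5,
}

for name, micro in [("mixed", mixed), ("all_satisfied", all_satisfied)]:
    physical = entropy_bits(micro.values())
    spread = entropy_bits(reaction_distribution(micro).values())
    print(name, physical, spread)
```

In the second case there is entropy but no spread; in neither case does the spread exceed the entropy of the underlying microstates.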

Bernard: I am reminded of a quote by Morowitz: “If you don’t understand something and want to sound profound, use the word ‘entropy’.”

Cathryn: To measure the spread in Bernard’s reaction, you need a way to measure Bernard’s reaction. That’s the next question, right?

Eileen: Right. Even given that you have an atomically detailed specification of Bernard’s future state, how do you compute his “reaction”?

Autrey: Check to see if Bernard murmurs, “Mm, nice banana.”

Eileen: How do you check for that murmur, and rate it as appreciation, and so on? What if Bernard doesn’t murmur anything?

Autrey: In answer to your first question, use voice recognition and natural language interpretation, and emotion recognition trained on past responses. I’m not saying you’d do it that way, I’m just trying to show it can be done. For the second question, maybe you could assume that if Bernard doesn’t murmur anything, he’s satisfied.

Bernard: Not necessarily. Quiet suffering is a hobby of mine.

Autrey: Then I’ll say the obvious: The genie needs to read Bernard’s mind within his extrapolated physical state, recognize the extrapolated thoughts within the extrapolated neurons, and guess what Bernard will think of the banana. And, let me guess, your next objection is that I haven’t defined what “Bernard’s thoughts” are, or how to read them.

Eileen: Got it in one.

Bernard: All of Earth’s neurologists put together couldn’t say how to do that. What you’re asking for is unreasonable.

Autrey: If so, then you can’t build a genie. The inherent difficulty of that problem doesn’t care whether solving it is “unreasonable”; it’s just there. The forces that establish the difficulty of the problem are quite independent of the forces that establish the resources you have available to solve it; it would be an amazing coincidence if the two should exactly balance.

Eileen: Autrey, suppose we allowed a solution to the technical problem, but only the technical side. In other words, you can tell me that you want to read out the words from Bernard’s linguistic stream of consciousness, or you want to read out a mental image in Bernard’s visual cortex, or you want to read the activation level of some particular emotion, and the genie can do all that, but you still need to say what the genie does with that information. Even given that, how would you compute Bernard’s “reaction”?

Autrey: Er… okay, good question.

Eileen: There’s not much point in complaining about technical impossibilities if you don’t know what you’re trying to accomplish.

Autrey: How about this? I don’t claim to know exactly what it means for Bernard to be “satisfied” or “dissatisfied” with a wish. The definition of that might even vary from person to person. But I can monitor signs that are associated with dissatisfaction. Like, if Bernard makes a disgusted face, or says “Yuck”. Or I can monitor if Bernard’s emotion of disgust activates, or if his pain centers light up; you said I could assume that capability.

Dennis: Or if Bernard says, “Curse you genie! This banana is rotten! Rotten! Oh, foul accursed thing! What demon from the depths of hell created thee!”

Autrey: Um… yeah, that’d also be a good sign. I mean a bad sign. It would be a strong indicator of a negative outcome.

Eileen: Now, let me ask this: If we granted that definition… which isn’t really well-specified, but we’ll grant it anyway… if we granted that definition, would you feel safe in making a wish?

Autrey: No.

Bernard: Why not?

Autrey: Because the definition has holes in it.

Bernard: Eh? What does it mean for a definition to have holes?

Autrey: I mean a definition that covers some bad wishes, but not all possible bad wishes. That’s always been my problem with verbal definitions. If you define ‘man’ as a featherless biped, what about someone who’s missing a leg, or a plucked chicken? If “birds” fly, what about ostriches and penguins?

Dennis: Words mean whatever you say they mean. If you define man as a featherless biped, then a plucked chicken is a “man”.

Eileen: There’s a subfield of cognitive science that studies categories. Lakoff and Johnson, for example, would say that categories have “radial structure”. Robins are in the center of the bird category; they have most of the properties that are associated with birds; they are typical for birds. Ostriches and penguins are less typical; they’re distant from the center.

Autrey: Words are nets. They may catch a few big, easy facts, but smaller truths slip through. Imagine that we have a configuration space for physical objects – call it “thingspace”. We’ll take every physical object that exists on Earth, and map it onto a point in thingspace. Now imagine that you take each thing in thingspace, and you rate it for birdiness – the degree to which this thing appears to be a bird. The result would be a map of your ‘bird’ category.

Cathryn: A map of a category? What would that look like?

Autrey: Most of the things that are birds would be clustered together – robins, hummingbirds, pigeons, and so on. That would be the central cluster, radiating brightly with the light of your high ‘bird’-ness rating. There would also be some smaller, noncentral clusters, radiating out from the central cluster, like ostriches and penguins, still glowing, but perhaps slightly less brightly.

Cathryn: So the brightness of the light is determined by the degree of category membership.

Autrey: Right. The maps of your categories are galaxies, bright cores but with a scattering of other structures nearby. There’s also the empirical cluster structure of things themselves to be taken into account. Robins and pigeons, for example, might both be equally birds – might both glow equally brightly – but they are distinguishable subclusters within the bird supercluster. Robins are all more similar to other robins than they are to pigeons, pigeons are all more similar to other pigeons than they are to robins. So in thingspace, all the robins would gather into a tight little subcluster that was squarely within the center of the larger bird supercluster, glowing brightly with birdness, but still distinguishable from the nearby pigeon subcluster. I use cluster structure to refer to the distribution of actual things on Earth, the way the points in thingspace cluster together. I use category structure to refer to the way that categories work in your mind – the way that the glow of a word is distributed across thingspace, and any conceivable points in thingspace.

Eileen: And, since the word is not the thing itself, category structure doesn’t always match cluster structure.

Autrey: That wasn’t quite the point I wanted to make, but it’s true. You can create categories that lump together things that are really quite distant, and if so, it will create errors in how you think about the problem. You’ll treat things as the same when they’re really different, and generalize without realizing it. Words carve reality, and poor words fail to carve at the natural joints. Categories, as we learn them from experience of which things are similar to other things, can be quite complicated. There’s a vast amount of neural circuitry that’s been trained to recognize things, and that neural circuitry works in complex ways. When we try to give a verbal definition of the category, we’re trying to simplify that information – compress a tremendous amount of richness into a very small space.

Dennis: Compress…?

Autrey: You can say that “birds fly”, for example, but if you look at thingspace when it’s lit up by “bird”, and then look again when thingspace is lit up by “things that fly”, you’ll find that there are clusters that glow with birdness, but not flightness. The penguin cluster, for example. Although most things that glow with birdness glow with flightness, “flight” alone is not enough for even an approximation of “bird”, because so many things fly that are not birds; insects, for example, or airplanes.

Cathryn: Okay, so what would be a better definition of “bird”?

Autrey: You could take the intersection of “flight” and something else; “feathers”, for example. If you said that “a bird is a feathered flying thing”, then thingspace might light up with a glow that intersected most birds, and mostly birds. But not all birds, and not only birds. The glow of “feathered things that fly” would not be exactly the same as the glow of “bird”. The verbal definition you gave for “bird” is an imperfect approximation of the “bird” category that exists in your mind.
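The mismatch between the two glows can be sketched with ordinary sets. The tiny "thingspace" below is a made-up toy, with crisp yes/no membership standing in for the graded glow Autrey describes; the shuttlecock is an assumed example of a feathered flying non-bird.

```python
# Hypothetical toy thingspace: each thing mapped to properties it has.
properties = {
    "robin": {"flies", "feathers"},
    "pigeon": {"flies", "feathers"},
    "penguin": {"feathers"},
    "ostrich": {"feathers"},
    "bee": {"flies"},
    "airplane": {"flies"},
    "shuttlecock": {"flies", "feathers"},
}

# The category as it exists in your mind (assumed here, not derived).
bird_category = {"robin", "pigeon", "penguin", "ostrich"}

# The verbal definition: "a bird is a feathered flying thing".
flying = {name for name, props in properties.items() if "flies" in props}
feathered = {name for name, props in properties.items() if "feathers" in props}
definition = flying & feathered

missed_birds = bird_category - definition      # birds the definition leaves out
false_positives = definition - bird_category   # non-birds the definition lets in

print(sorted(definition), sorted(missed_birds), sorted(false_positives))
```

The definition catches most of the birds and mostly birds, but both difference sets are non-empty.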

Eileen: And sometimes both the verbal definition and the cognitive category structure disagree with the actual cluster structure. “When the bird disagrees with the bird book, believe the bird.” Hence Autrey’s complaint.

Autrey: I think my complaint is more than that. A lot of times it seems that verbal definitions don’t even strike at the essence of the thing they try to define; they just describe symptoms. Plato’s definition of a human as a featherless biped, for example. Maybe a lot of humans light up with that description, but it doesn’t catch any interesting part of what it means to be human. If you used a more sophisticated version of that description, something that described, say, the anatomy of the kidneys and so on, until you had a purely anatomical description which happened to apply to all humans and only humans, a “perfect” definition by most lights, you still wouldn’t have said anything interesting about what it means to be human.

Cathryn: Now if you were to start describing the anatomy of the brain …

Bernard: Suppose that you have a “perfect” definition, one that covers all things and only the things you want it to cover. How could any alternate definition be better? What’s wrong with describing humans using the anatomy of the kidneys?

Autrey: The anatomical definition would fail if, say, you had someone with an artificial kidney, who still had a human brain and behaved like a human. Even if right now no specific humans like that exist, the categories would still have different glows when mapped across the whole of thingspace, and not just those particular points in thingspace that we’ve encountered so far. If the definition is different from the word, you can break it with a thought experiment, even if there are no physically realized counterexamples at this moment in time.

Dennis: And what does this have to do with Friendly AI?

Autrey: A point where the glows don’t overlap is a hole in the net, a loophole in the definition, a place where the safeguard fails. When you add on definitions like “Watch Bernard’s facial expression” or “See if Bernard shrieks in horror” or “monitor the levels of disgust and pain in Bernard’s brain”, you’re catching some, but not all, cases of wishes gone horribly wrong. You’re constructing a net, and the net has holes.

Eileen: But this isn’t just a categorization failure. The deeper problem is that you’re weaving the net out of effects of wishes that go wrong – symptoms instead of causes. You’re imagining something that might go wrong, and then imagining a probable consequence of the mistake, and telling the genie to check for the consequence in order to detect the mistake. You didn’t tell the genie which criteria you used to mentally determine what constituted a mistake in the first place. You only told the genie about something you thought might be a probable consequence or correlate of a mistake – you told the genie about the symptom but not the disease. Of course, some symptoms strike closer to the heart than others – monitoring Bernard’s brain for signs of a disgusted taste reaction, for example.

Bernard: Check the extrapolated Bernard for a grimace, a verbal objection, or feelings of pain and disgust… looks fine to me. The definition given should catch most problems. I don’t demand perfection, as long as it’s good enough.

Autrey: That’s how you would deal with a genie?

Bernard: Can you name a specific problem with the suggested definition?

Autrey: Even if I couldn’t name a specific problem, it doesn’t mean you’d be safe. Just because there are no obvious problems doesn’t mean there are no problems. Availability is not the same as probability.

Eileen: You’re using a negative safety strategy instead of a positive safety strategy – you’re assuming that success is the default and defining ways to detect failure, instead of defining unique signs that indicate success.

Bernard: You still haven’t pointed out a specific problem with my plan.

Dennis: Suppose it’s a poison banana that instantly kills you. Then you don’t murmur an objection, you don’t grimace, you don’t even feel pain and disgust; you just fall over dead.

Bernard: Ah, okay. So we need to patch the definition so it also catches fatalities. Those are undesirable too.

Autrey: This approach is guaranteed to fail. You can “patch” a definition enough that there are no longer any holes obvious to you. You cannot “patch” a definition enough that it does not actually contain any holes.

Eileen: You have placed yourself into a situation where you are testing your wits against the genie’s, and that in itself is a mistake. If you’re smart enough to predict all the ways that the genie might try to fulfill your wish, you may be able to create a definition that covers all the holes, if simple carelessness or error doesn’t trip you up. But not if the genie is smart enough to work in ways you didn’t think of.

Bernard: Okay, so how do you build a genie, then?

Autrey: Maybe you don’t. You can’t start from the assumption that there must be a way for you to build a genie, and then reason backwards from there. If you run into an obstacle you can’t solve, then you can’t build a genie. That’s all there is to it.

Eileen: I’ve written a bit about [this sort of problem], where intelligence appears to require knowledge of an infinite number of special cases. Consider a CPU that adds two 32-bit numbers. On one level of organization, you can regard the CPU as adding two integers to produce a third integer. On a lower level of organization, two structures of 32 bits collide, under certain rules which govern the local interactions between bits, and the result is a new structure of 32 bits. Since we have a deep understanding of arithmetic, it is not very difficult for us to produce such a CPU. But consider the woes of a research team that doesn’t really understand what arithmetic is about, with no knowledge of the CPU’s underlying implementation, that tries to create an arithmetic “expert system” by encoding a vast semantic network containing the “knowledge” that two and two make four, twenty-one and sixteen make thirty-seven, and so on. In this hypothetical world where no one really understands addition, we can imagine the “common-sense” problem for addition; the launching of distributed Internet projects to “encode all the detailed knowledge necessary for addition”; the frame problem for addition, where the sum of one number depends on what other number you add it to; the philosophies of formal semantics under which the LISP token thirty-seven is meaningful because it refers to thirty-seven objects in the external world; the design principle that the token thirty-seven has no internal complexity and is rather given meaning by its network of relations to other tokens; the “number grounding problem”; the skeptics who write books about how machines can simulate addition but never really add; the hopeful futurists arguing that past projects to create Artificial Addition failed because of inadequate computing power… you get the idea.

Bernard: Right. What you need is an Artificial Arithmetician which can learn the vast network of relations between numbers that humans unconsciously acquire during their childhood.

Dennis: No, you need an Artificial Arithmetician that can understand natural language, so that instead of the AA having to be explicitly told that twenty-one and sixteen make thirty-seven, it can get the knowledge by exploring the Web.

Autrey: Frankly, it seems to me that you’re just trying to convince yourselves that you can solve the problem. None of you really know what arithmetic is, so you’re floundering around with these generic sorts of arguments. “We need an AA that can learn X”, “we need an AA that can extract X from the Internet”. I mean, it sounds good, it sounds like you’re making progress, and it’s even good for public relations, because everyone thinks they understand the proposed solution – but it doesn’t really get you any closer to general addition, as opposed to range-specific addition in the twenties and thirties and so on. Probably we will never know the fundamental nature of arithmetic. The problem is just too hard for humans to solve.

Cathryn: That’s why we need to develop a general arithmetician the same way Nature did – evolution.

Bernard: Top-down approaches have clearly failed to produce arithmetic. We need a bottom-up approach, some way to make arithmetic emerge.

Kurzweil: I believe that machine arithmetic will be developed when researchers scan each neuron of a complete human brain into a computer, so that we can simulate the biological circuitry that performs addition in humans. This will occur in the year 2026, on October 22nd, between 7:00 and 7:30 in the morning.

Searle: Let me tell you about my Chinese Calculator Experiment –

Eileen: Wait! Stop. That’s not what I was trying to say. My point, Autrey, is that you have an internal process that you yourself use to determine, in these thought experiments, whether or not a wish has gone “wrong”. Then you try to build a definition that you think will catch catastrophic wishes, by imagining wishes that you know to be catastrophic, and imagining consequences that you could tell the genie to look for. But if you do that, you’re giving the genie a different procedure to follow than you yourself use – the genie doesn’t share the same definition, the same cognitive computation, that you yourself are using to decide which cases are catastrophic in the first place. That’s what you need to transfer over. Not your carefully built nets, but the thoughts you’re using to construct those nets.

Cathryn: Ah, I see. Now what you said earlier, about needing to transfer a dynamic cognitive process instead of frozen outputs, is starting to make sense.

Autrey: Okay, that gives me an idea. Suppose that instead of trying to monitor Bernard’s extrapolated reaction to the banana, we ask Bernard directly whether he was satisfied with his wish or not. “Yes” means the wish succeeded. “No”, or silence, means the wish failed.

Bernard: Wouldn’t being asked that question constantly, after every single wish, become irritating after a while?

Autrey: No, you simulate Bernard being asked the question. By hypothesis, the genie has both the computing power and the knowledge to do that, right down to the quark level. Then you use Bernard’s simulated answer. You don’t need to read Bernard’s thoughts from his extrapolated physical state, all you need to do is recognize the spoken words “Yes” or “No”. Bernard can take into account anything he wants in answering “Yes” or “No” – the simulated Bernard, I mean. His level of disgust, or even a vague feeling of existential ennui at having to eat yet another banana. We don’t need to make up an elaborate definition of success or failure; we’ll use Bernard’s definition, by simulating him. Perfect coverage, no mismatches.

Dennis: What about a poison banana? Then Bernard doesn’t say anything.

Autrey: The simulated Bernard has to specifically say “Yes”. If he says nothing, it counts as a “No”.

Dennis: What if “Gaaack” sounds closer to “Yes” than “No”?

Autrey: When in any doubt, treat it as a “No”. You’re not checking for the presence of failure, you’re checking for the presence of success.

Dennis: What if something happens that tricks Bernard into saying “Yes”, like some other question being asked at around the same time?

Autrey: Okay, good point. The simulated Bernard has to say: “Yes, I’m satisfied with my wish.” It occurs to me, though, that now I’m in “tweaking mode” again, and that means I’m doomed. I can get it to the point where I can no longer see the flaws, but not to the point where there are no flaws.
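
The difference between Eileen’s two strategies can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the function names, the list of failure signs, and the way a reaction is represented are all hypothetical, and, as Autrey notes, neither check is a real solution. The negative check treats success as the default and looks for known signs of failure; the positive check treats failure as the default and accepts only the exact required phrase.

```python
# Hypothetical representation: a reaction is a list of observed signs plus
# whatever the simulated Bernard says (None if he says nothing).
REQUIRED_PHRASE = "Yes, I'm satisfied with my wish."
FAILURE_SIGNS = {"grimace", "verbal objection", "pain", "disgust"}

def negative_check(observed_signs, utterance):
    """Success by default: fail only if a listed failure sign is observed."""
    return not (FAILURE_SIGNS & set(observed_signs))

def positive_check(observed_signs, utterance):
    """Failure by default: succeed only on the exact required phrase."""
    return utterance == REQUIRED_PHRASE

# Dennis's poison banana: no grimace, no objection, no words at all.
poison_banana = ([], None)
# An ambiguous noise is also treated as "No" by the positive check.
gaaack = ([], "Gaaack")
# The simulated Bernard gives the full required answer.
satisfied = ([], REQUIRED_PHRASE)

for case in (poison_banana, gaaack, satisfied):
    print(negative_check(*case), positive_check(*case))
```

In this sketch the poison banana passes the negative check, because nothing on the failure list ever shows up, while the positive check rejects it. The positive check still has holes of its own, which is the point Autrey goes on to make.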

Eileen: I liked the simulation theory, though. You could get a whole readout of someone’s judgment that way. If you wanted a snapshot of the meaning of the word “bird” in Bernard’s mind, you could ask him to rate the “birdiness” of every possible kind of thing – not just things that exist, but things that don’t – by simulating encounters, each simulation independent, and asking him to rate on a scale from 1 to 100. You’d lose subtleties that way, but you could get a quantitative mapping over thingspace – a picture of the glow.

Autrey: That would take a hell of a lot of computing power.

Cathryn: We were talking about simulating Bernard on a quantum scale. That’s already a hell of a lot of computing power, to within the same circle of hell, more or less.

Eileen: This is a way to define a judgment function. It’s not the same as having the underlying computation, but it gets you a snapshot of the outputs. Then you can talk about the entropy in a judgment – pardon me, I mean, “the spread”. You can talk about the spread in a judgment. You can talk about the spread due to quantum uncertainty (or diverging worlds). You can talk about a range of possible physical initial conditions, and the resulting spread – say, the spread with respect to the range of possible initial conditions created by thermal uncertainty of motions. You can talk about spread with respect to showing Bernard the putative “bird” at time t = 3 seconds or time t = 4 seconds or any nanosecond interval in between. You can talk about how Bernard’s judgment function changes over time. And you can talk about someone’s judgment function for a Quality that they don’t know how to define.

Bernard: You could take partial derivatives of the judgment function with respect to a set of quantitative variables controlling the presentation of the putative bird.

Autrey: Why would you want to?

Bernard: Don’t you ever do anything just for fun? Besides, you don’t understand an equation until you’ve partially differentiated it with respect to something.

Cathryn: Eileen, I’m having problems with the whole idea that your definition of the judgment function, using Bernard’s simulated output, captures Bernard’s real judgment. I don’t know if this definition would capture my real judgment, the criteria I use to determine whether I would want my wish to be carried out. I think maybe the whole idea may be on the wrong track.

Autrey: Why?

Cathryn: I’m having trouble putting it into words.

Autrey: That may not make your objection invalid, but it doesn’t exactly make it valid, either.

Cathryn: Yeah, I know. Look, suppose that I wish for the banana, and get the banana, and I’m satisfied with the banana, and yet, nevertheless, I wished for the wrong thing.

Bernard: But that’s sheer nonsense. What criterion are you using to determine bad-wish-ness, if not your own satisfaction with the wish?

Cathryn: That’s just my point. Maybe my satisfaction with the wish isn’t the right criterion to determine whether the wish is good or bad. To say nothing of my verbal expression of that satisfaction.

Autrey: Okay, then how would you determine whether a given “criterion of bad-wish-ness” is right or wrong? I mean, how would you choose between criteria for judging wishes?

Eileen: Heh. That question is FAI-complete.

Dennis: That question ought to be taken out and shot.

Cathryn: I don’t know. I’d judge the criterion’s Quality.

Autrey: (Sigh.) It may have been a bad idea to introduce that concept if it’s going to be abused like that.

Eileen: I wouldn’t call it abuse. I’d call it a call to investigation.

Autrey: Look, saying that you’re judging a Quality isn’t really much different from saying that you’re judging a ‘thingy’. Calling something a Quality doesn’t explain it.

Eileen: Calling something a “dependent variable” doesn’t explain it either, but it can be a useful conceptual tool in designing the investigation. From my perspective, Cathryn just said: “I’m making this judgment but I don’t know how. Please investigate me.”

Cathryn: I guess I’d go along with that.

Autrey: Is there anything you can put into words about why you’re dissatisfied with “satisfaction” as a wish criterion?

Cathryn: No, not really. I just have a bad gut feeling about it, like I’m being fast-talked into something.

Autrey: Okay, can you think of a counterexample? An example of a “satisfactory” wish that you wouldn’t want to see carried out?

Cathryn: Um…

Autrey: If you can’t give a clever and incisive counterexample, don’t hesitate to give a stupid one.

Cathryn: What?

Autrey: I’m serious. Even a stupid counterexample can help tell us what you’re thinking. Free-associate. Give us a hint.

Cathryn: All right. Suppose I wished for the genie to grab an ice cream cone from a little girl and give it to me. Now it might be a really delicious and satisfying ice cream cone, but it would still be wrong to take the ice cream cone away from the little girl. Isn’t your definition of satisfaction fundamentally selfish?

Dennis: I’ll say! I should get the ice cream cone.

Bernard: Well, of course, the so-called altruist is also really selfish. It’s just that the altruist is made happy by other people’s happiness, so he tries to make other people happy in order to increase his own happiness.

Cathryn: That sounds like a silly definition. It sounds like a bunch of philosophers trying to get rid of the inconvenient square peg of altruism by stuffing it into an ill-fitting round hole. That is just not how altruism actually works in real people, Bernard.

Autrey: I wouldn’t dismiss the thought entirely. The philosopher Raymond Smullyan once asked: “Is altruism sacrificing your own happiness for the happiness of others, or gaining your happiness through the happiness of others?” I think that’s a penetrating question.

Eileen: I would say that altruism is making choices so as to maximize the expected happiness of others. My favorite definition of altruism is one I found in a [glossary of Zen]: “Altruistic behavior: An act done without any intent for personal gain in any form. Altruism requires that there is no want for material, physical, spiritual, or egoistic gain.”

Cathryn: No spiritual gain?

Eileen: That’s right.

Bernard: That sounds like Zen, all right – self-contradictory, inherently impossible of realization. Different people are made happy by different things, but everyone does what makes them happy. If the altruist were not made happy by the thought of helping others, he wouldn’t do it.

Autrey: I may be made happy by the thought of helping others. That doesn’t mean it’s the reason I help others.

Cathryn: Yes, how would you account for someone who sacrifices her life to save someone else’s? She can’t possibly anticipate being happy once she’s dead.

Autrey: Some people do.

Cathryn: I don’t. And yet there are still things I would give my life for. I think. You can’t ever be sure until you face the crunch.

Eileen: There you go, Cathryn. There’s your counterexample.

Cathryn: Huh?

Eileen: If your wish is to sacrifice your life so that someone else may live, you can’t say “Yes, I’m satisfied” afterward.

Autrey: If you have a genie on hand, you really should be able to think of a better solution than that.

Eileen: Perhaps. Regardless, it demonstrates at least one hole in that definition of volition.

Bernard: It is not a hole in the definition. It is never rational to sacrifice your life for something, precisely because you will not be around to experience the satisfaction you anticipate. A genie should not fulfill irrational wishes.

Autrey: Cathryn knows very well that she cannot feel anything after she dies, and yet there are still things she would die for, as would I. We are not being tricked into that decision, we are making the choice in full awareness of its consequences. To quote Tyrone Pow, “An atheist giving his life for something is a profound gesture.” Where is the untrue thing that we must believe in order to make that decision? Where is the inherent irrationality? We do not make that choice in anticipation of feeling satisfied. We make it because some things are more important to us than feeling satisfaction.

Bernard: Like what?

Cathryn: Like ten other people living to fulfill their own wishes. All sentients have the same intrinsic value. If I die, and never get to experience any satisfaction, that’s more than made up for by ten other people living to experience their own satisfactions.

Bernard: Okay, what you’re saying is that other people’s happiness is weighted by your goal system the same as your own happiness, so that when ten other people are happy, you experience ten times as much satisfaction as when you yourself are happy. This can make it rational to sacrifice for other people – for example, you donate a thousand dollars to a charity that helps the poor, because the thousand dollars can create ten times as much happiness in that charity as it could create if you spent it on yourself. What can never be rational is sacrificing your life, even to save ten other lives, because you won’t get to experience the satisfaction.

Cathryn: What? You’re saying that you wouldn’t sacrifice your own life even to save the entire human species?

Bernard: (Laughs.) Well, I don’t always do the rational thing.

Cathryn: Argh. You deserve to be locked in a cell for a week with Ayn Rand.

Autrey: Bernard, I’m not altruistic because I anticipate feeling satisfaction. The reward is that other people benefit, not that I experience the realization that they benefit. Given that, it is perfectly rational to sacrifice my life to save ten people.

Bernard: But you won’t ever know those ten people lived.

Autrey: So what? What I value is not “the fact that Autrey knows ten people lived”, what I value is “the fact that ten people lived”. I care about the territory, not the map. You know, this reminds me of a conversation I once had with Greg Stock. He thought that drugs would eventually become available that could simulate any feeling of satisfaction, not just simple ecstasy – for example, drugs that simulated the feeling of scientific discovery. He then went on to say that he thought that once this happened, everyone would switch over to taking the drugs, because real scientific discovery wouldn’t be able to compare.

Cathryn: Yikes. I wouldn’t go near a drug like that with a ten-lightyear pole.

Autrey: That’s what I said, too – that I wanted to genuinely help people, not just feel like I was doing so. “No,” said Greg Stock, “you’d take them anyway, because no matter how much you helped people, the drugs would still make you feel ten times better.”

Cathryn: That assumes I’d take the drugs to begin with, which I wouldn’t ever do. I don’t want to be addicted. I don’t want to be transformed into the person those drugs would make me.

Autrey: The strange thing was that Greg Stock didn’t seem to mind the prospect. It sounded like he saw it as a natural development.

Cathryn: So where’d the conversation go after that?

Autrey: I wanted to talk about the difference between psychological egoism and psychological altruism. But it was a bit too much territory to cover in the thirty seconds of time I had available.

Dennis: Psychological egoism and psychological altruism? Eh?

Eileen: The difference between a goal system that optimizes an internal state and a goal system that optimizes an external state.

Cathryn: There’s a formal difference?

Eileen: Yes.

Bernard: No.

Cathryn: Interesting.

Autrey: In philosophy, this is known as the egoism debate. It’s been going on for a while. I don’t really agree with the way the arguments are usually phrased, but I can offer a quick summary anyway. You want one?

Dennis: Yeah.

Autrey: Okay. Psychological egoism is the position that all our ultimate ends are self-directed. That is, we can want external things as means to an end, but all our ultimate ends – all things that we desire in themselves rather than for their consequences – are self-directed in the sense that their propositional content is about our own states.

Eileen: Propositional content? Sounds rather GOFAI-ish.

Autrey: Maybe, but it’s the way the standard debate is phrased. Anyway, let’s say I want it to be the case that I have a chocolate bar. This desire is purely self-directed, since the propositional content mentions me and no other agent. On the other hand, suppose I want it to be the case that Jennie has a candy bar. This desire is other-directed, since the propositional content mentions another person, Jennie, but not myself. Psychological egoism claims that all our ultimate desires are self-directed; psychological altruism says that at least some of our ultimate desires are other-directed.

Bernard: If you want Jennie to have a candy bar, it means that you would be happy if Jennie got a candy bar. Your real end is always happiness.

Autrey: That’s known as psychological hedonism, which is a special case of psychological egoism. As Sober and Wilson put it, “The hedonist says that the only ultimate desires that people have are attaining pleasure and avoiding pain… the salient fact about hedonism is its claim that people are motivational solipsists; the only things they care about ultimately are states of their own consciousness. Although hedonists must be egoists, the reverse isn’t true. For example, if people desire their own survival as an end in itself, they may be egoists, but they are not hedonists.” Another quote from the same authors: “Avoiding pain is one of our ultimate goals. However, many people realize that being in pain reduces their ability to concentrate, so they may sometimes take an aspirin in part because they want to remove a source of distraction. This shows that the things we want as ends in themselves we may also want for instrumental reasons… When psychological egoism seeks to explain why one person helped another, it isn’t enough to show that one of the reasons for helping was self-benefit; this is quite consistent with there being another, purely altruistic, reason that the individual had for helping. Symmetrically, to refute egoism, one need not cite examples of helping in which only other-directed motives play a role. If people sometimes help for both egoistic and altruistic ultimate reasons, then psychological egoism is false.”

Dennis: The very notion of altruism is incoherent.

Autrey: That argument is indeed the chief reason why some philosophers espouse psychological hedonism.

Cathryn: Sounds like a lot of silly philosophizing to me. Does it really matter whether I’m considered a “motivational solipsist” or whatever, as long as I actually help people?

Bernard: That’s just it! It doesn’t make any operational difference – all goal systems operate to maximize their internal satisfaction, no matter what external events cause satisfaction.

Eileen: That’s not true; it does make an operational difference. If Autrey values the solipsistic psychological event of knowing he saved ten lives, he will never sacrifice his own life to save ten other lives; if he values those ten lives in themselves, he may. You told him that, remember?

Bernard: Well, I guess Autrey might value the instantaneous happiness of knowing he chose to save ten lives, more than he values all the happiness he might achieve in the rest of his life.

Cathryn: That doesn’t sound anything remotely like the way real people think. Square peg, round hole.

Autrey: Do you have anything new to contribute to the debate, Eileen? It’s a pretty ancient issue in philosophy.

Eileen: The basic equation for a Bayesian decision system is usually phrased something like D(a) = Sum U(x)P(x|a). This is known as the expected utility equation, and it was derived by von Neumann and Morgenstern in 1944 as a unique constraint on preference orderings for all systems that obey certain [consistency axioms] –

Dennis: Start over.

Eileen: Okay. Imagine that D(a) stands for the “desirability” of an action A, that U(x) stands for the “utility” of a state of the universe X, and P(x|a) is your assigned “probability” that the state X occurs, given that you take action A. For example, let’s say that I show you two spinner wheels, the red spinner and the green spinner. One-third of the red spinner wheel is black, while two-thirds of the green spinner wheel is white. Both spinners have a dial that I’m going to spin around until it settles at random into a red or black area (for the red spinner) or a white or green area (for the green spinner). The red spinner has a one-third chance of turning up black, while the green spinner has a two-thirds chance of turning up white. Let’s say that I offer you one of two choices; you can pick the red spinner and get a chocolate ice cream cone if the spinner turns up black, or you can pick the green spinner and get a vanilla ice cream cone if the spinner turns up white.

Dennis: So I can choose between a one-third probability of a chocolate ice cream cone or a two-thirds probability of a vanilla ice cream cone.

Eileen: Right.

Dennis: Why would anyone ever choose the red spinner?

Cathryn: If they liked chocolate ice cream cones a lot more than they liked vanilla.

Dennis: But I don’t like chocolate ice cream to begin with.

Cathryn: Freak.

Eileen: So you’d choose the red spinner, Cathryn?

Cathryn: I… ooh, that’s a terrible dilemma. If I chose the red spinner, I’d feel awful if I lost, because then I should have chosen the green spinner to maximize my probability instead of taking the risk. If I chose the green spinner and lost, I’d figure that I would have lost either way. If I chose the green spinner and won, I’d always feel a little ashamed of not having tried for the chocolate. And if I chose the red spinner and won, I’d probably feel a little guilty while eating the chocolate because of the calories. So… let’s see, the probability of actually losing is twice as large for the red spinner as for the green spinner… so… oh, what the hell. A merry life but a short one. She either fears her fate too much, or her desserts are small, that dares not put it to the touch, to gain or lose it all. I pick the red spinner.

Eileen: Is that your final answer?

Cathryn: Yes. Damn the torpedoes. Go for the chocolate.

Bernard: Did you notice the way that all of her desiderata consisted of her psychological reactions to events, rather than the events themselves?

Eileen: Cathryn was talking about eating an ice cream cone, not picking a school for her daughter. Ice cream cones are supposed to be driven by hedonism.

Autrey: That was dreadful. You steer your life using that kind of decision-making process?

Cathryn: What would you say I should have done?

Autrey: Figure out whether you like chocolate ice cream at least twice as much as vanilla ice cream. If you do, pick the red spinner. Otherwise, pick the green spinner. The desirability of the red spinner equals the utility of a chocolate ice cream times the probability of winning, which is one-third. The desirability of the green spinner equals the utility of a vanilla ice cream times the probability of winning, which is two-thirds. Obviously, you pick red if chocolate is worth more than twice as much to you as vanilla, and green otherwise.

Eileen: D(red) = U(chocolate) * P(chocolate|red). D(green) = U(vanilla) * P(vanilla|green). Technically, of course, we should sum up the expected utility of all the states in each equation. D(red) = U(chocolate) * P(chocolate|red) + U(vanilla) * P(vanilla|red). D(green) = U(chocolate) * P(chocolate|green) + U(vanilla) * P(vanilla|green).

Autrey: But P(chocolate|green) and P(vanilla|red) are both zero. There’s no way to get a chocolate ice cream from the green spinner, and no way to get a vanilla ice cream from the red spinner. So the expanded equation reduces to the simplified one.
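
The spinner calculation can be written out as a short Python sketch. The probabilities come from Eileen’s setup; the utility numbers are made up for illustration, and the explicit “nothing” outcome with utility 0 is an assumption the dialogue leaves implicit. Exact fractions are used so that the borderline case, where chocolate is worth exactly twice vanilla, comes out as a true tie.

```python
from fractions import Fraction

# P(x|a): outcome probabilities for each spinner, taken from Eileen's setup.
# "nothing" is the losing outcome (an assumption; the dialogue leaves it implicit).
P = {
    "red": {"chocolate": Fraction(1, 3), "vanilla": Fraction(0), "nothing": Fraction(2, 3)},
    "green": {"chocolate": Fraction(0), "vanilla": Fraction(2, 3), "nothing": Fraction(1, 3)},
}

def desirability(action, utility):
    """D(a) = sum over outcomes x of U(x) * P(x|a)."""
    return sum(utility[x] * p for x, p in P[action].items())

def choose_spinner(utility):
    """Autrey's rule: red only if it is strictly more desirable, green otherwise."""
    if desirability("red", utility) > desirability("green", utility):
        return "red"
    return "green"

# Hypothetical utilities: chocolate is worth three times as much as vanilla.
chocolate_lover = {"chocolate": 3, "vanilla": 1, "nothing": 0}
# Hypothetical utilities: chocolate is worth exactly twice vanilla (a tie).
borderline = {"chocolate": 2, "vanilla": 1, "nothing": 0}

print(desirability("red", chocolate_lover), desirability("green", chocolate_lover))
print(choose_spinner(chocolate_lover), choose_spinner(borderline))
```

Because P(chocolate|green) and P(vanilla|red) are zero, the full sum gives the same answer as the simplified one-term product, as Autrey says.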

Dennis: Um… I’m unfamiliar with the notation P(x|y). What does it mean?

Autrey: “The probability of X, given that we know Y to be true.” “X given Y” or “Y implies X”. Unfortunately, the mathematical standard notation reads right-to-left, backwards, so that if you want to follow the direction of implication you have to read it in reverse, as if it were in Hebrew. If the notation starts confusing you – if at any point you have trouble keeping track – I’d advise that you call a halt and read [An Intuitive Explanation of Bayes’ Theorem].

Dennis: Hm… um… that looks like it contains a lot more stuff than that particular probability notation. It’s pretty long.

Autrey: Yeah, but it’s important stuff. If you aren’t already familiar with Bayesian reasoning, in fact, that paper is probably more important than the one you’re reading right now, and you should stop and read that instead.

Cathryn: Okay, I can see how the expected utility rule would simplify decisionmaking for people who accepted it –

Eileen: That’s not what it’s for. The expected utility rule is computationally intractable for all real problems, so if it looks like it simplifies life, you’re doing something wrong. But go on.

Cathryn: Is this supposed to be a complete decisionmaking rule? Like, you can use it to decide anything?

Eileen: Maybe. If you’ve got all the complexity inherent in computing U(x) and computing P(x|a). Since practically all of the real complexity is there, I would speak of a system as “having utility-structure” rather than “implementing expected utility”. I’ve also encountered what look like complications that cannot be understood using expected utility at all, but those are very advanced and I’m not sure I have them right yet. Let’s take expected utility as the whole deal for the moment.

Cathryn: When I make plans, I end up wanting a lot of things in order to get other things, rather than wanting them in themselves. For example, I want my keys to open my car door to drive to work to get money to buy food. How does expected utility account for instrumental goals?

Eileen: When I talk about P(x|a), I don’t mean to imply that A needs to directly cause X via some immediate event – as we intuitively think about direct causation, anyway. A can cause B which causes C which causes X, and so on, and it would still be counted into P(x|a). Instrumental goals can be thought of as a way to achieve savings in computing power – if you compute that B is instrumentally desirable because P(x|b) is high, then you can try to figure out A such that P(b|a) is high, and then – assuming there were no loopholes in your definitions – P(x|a) will probably be high. But that’s just a way to save on computing power – having “instrumental goals” is a very convenient way to compute an approximation to the formalism, but it’s not actually part of the formalism. Or maybe a better way to say it would be to say that the cognitive phenomenon of “instrumental goals” is automatically emergent in the formalism.

Cathryn: Wait, let me check if I understand all this. Let’s say that there’s a transparent locked box A containing a banana. Let’s say that there are two more transparent locked boxes, B and C, with B containing a red key, and C containing a blue key. And then there are two more boxes, D and E, with D containing a green key and E containing a yellow key. Now I know, from previous experience, that the red key opens box A, and that the yellow key opens box B. So if I’m offered a choice between a white key and a black key, and I know that the white key opens box E, I’ll select the white key.

Autrey: Right. That’s classical backwards chaining.
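
Cathryn’s box puzzle can be sketched as a small backwards-chaining search in Python. This is an illustrative sketch only: the dictionaries and the function name are hypothetical, and they encode just the facts Cathryn states (which item sits in which box, and which key is known to open which box). The search starts from the goal and repeatedly asks which key opens the box holding the current target.

```python
# Facts from Cathryn's description.
contains = {"A": "banana", "B": "red", "C": "blue", "D": "green", "E": "yellow"}
opens = {"red": "A", "yellow": "B", "white": "E"}  # key -> box it is known to open

def backward_chain(goal, contains, opens):
    """Work backwards from the goal; return the keys in the order they get used.

    Returns None if some needed box has no known key (or the facts loop).
    """
    box_of = {item: box for box, item in contains.items()}
    key_for = {box: key for key, box in opens.items()}
    chain, seen, target = [], set(), goal
    while target in box_of:          # the target is still locked in a box
        if target in seen:
            return None              # guard against circular facts
        seen.add(target)
        key = key_for.get(box_of[target])
        if key is None:
            return None              # no known key opens that box
        chain.append(key)
        target = key                 # next, work out how to obtain that key
    return list(reversed(chain))

plan = backward_chain("banana", contains, opens)
print(plan)
```

The first key in the resulting plan is the one to pick when offered the initial choice, which matches Cathryn’s answer of the white key.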

Cathryn: Incidentally, do you know that chimpanzees can solve that problem?

Autrey: You’re joking.

Cathryn: You teach a chimpanzee which keys open which boxes. You create two series of five boxes each, scramble them together, and show the chimpanzee the ten scrambled boxes. Then you present the chimp with two keys, a key to the first box in the series that ends with the banana, and a key to the first box in the second series that goes nowhere, and you make the chimp choose only one key in advance. They can solve it. Dohl 1970, “Goal-directed behavior in chimpanzees.”

Autrey: That’s amazing. Chimps are that close to being human?

Cathryn: They are.

Eileen: Or to look at it another way, explicit backwards chaining is that evolutionarily recent.

Cathryn: Now suppose I wanted to solve that problem using expected utility. U(banana) is, say, 10, and P(banana|red) is 0.9, so U(red) is 9. P(red|yellow) is 0.9, so U(yellow) is 8.1. P(yellow|white) is 0.9, so then U(white) is 7.29 and I know to choose the initial white key over the initial black key, which has U(black) = 0.

Eileen: Er, not exactly. P(banana|white) is .729 after you multiply the chained probabilities together, U(banana) is 10, and there are no other utilities under consideration, so D(white) is 7.29. D(a) is a measure of desirability that reflects the linear ordering of preference in actual choices, while U(x) is a measure of the utility of final states. You don’t calculate the utility of instrumental states, or, if you did, you’d need to create a separate measure W(b) – let’s call the instrumental utility the “worth”. There isn’t really any such thing as the “instrumental worth of B”, calculated in isolation, or if there exists a context-insensitive W(b) it implies unrealistic constraints on the environment.
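
Eileen’s correction can be checked with a few lines of Python. The sketch assumes, as the example implicitly does, that the white key can only reach the banana through this single chain, so the link probabilities simply multiply; the variable names are made up for illustration. Utility is attached only to the final state, and the keys get a desirability derived from it rather than a utility of their own.

```python
from fractions import Fraction
from math import prod

U_BANANA = 10  # the only final-state utility in Cathryn's example

# Each link is the probability that one step leads to the next:
# P(yellow|white), P(red|yellow), P(banana|red).
links = [Fraction(9, 10)] * 3

# Assumption: the white key reaches the banana only through this one chain.
p_banana_given_white = prod(links)

d_white = U_BANANA * p_banana_given_white  # D(white) = U(banana) * P(banana|white)
d_black = U_BANANA * Fraction(0)           # the black key's chain goes nowhere

print(float(p_banana_given_white), float(d_white), float(d_black))
```

The number 7.29 is the same one Cathryn computed, but here it is a desirability of the action rather than a utility stored on the white key.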

Cathryn: Why?

Eileen: When you’re talking about real-life probabilities you’re doing cognitive processing with categories of events. When you say that P(banana|red) is 0.9, you’re using an inference that if you see a key and its color is red, that key will open box A, which visibly contains the banana. So at first it might seem like red keys are always “worthwhile”. But actually only 9 out of 10 real red keys inherit “worth” in this way, and the other 1 out of 10 fail to open box A. For example, suppose you saw a small black mark on 1 out of 20 red keys and the marked red keys never opened box A. In this case, you would be able to decompose the category of red keys into “plain red keys” with P(banana|red::plain) of roughly .95 and “marked red keys” with P(banana|red::marked) = 0. Now if you had, on first realizing that red keys open box A 90% of the time, modified your utility function U(x) and given red keys a hardwired utility of 9, you would attach that utility to all red keys regardless of whether they were marked or plain. Also, the next time you encountered a chain, you would attach utility to getting the red key, regardless of whether the red key was the one that led to the banana on this occasion. Even if it was reliably the red key on every occasion, you’d be counting the utility twice, and things would get confused pretty fast. That’s why I emphasize that W(b) has to be kept distinct from U(x).
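
The marked-key example can be worked through numerically in Python. This is a sketch under the numbers Eileen gives: 1 in 20 red keys is marked, marked keys never open box A, and red keys as a whole open it 90% of the time. The names are hypothetical. The law of total probability then fixes the success rate of plain red keys at 18/19, which is about .95.

```python
from fractions import Fraction

U_BANANA = 10
P_MARKED = Fraction(1, 20)               # share of red keys with the black mark
P_BANANA_GIVEN_RED = Fraction(9, 10)     # success rate over all red keys
P_BANANA_GIVEN_MARKED = Fraction(0)      # marked red keys never open box A

# Law of total probability:
# P(banana|red) = P(plain) * P(banana|plain) + P(marked) * P(banana|marked)
p_banana_given_plain = (
    P_BANANA_GIVEN_RED - P_MARKED * P_BANANA_GIVEN_MARKED
) / (1 - P_MARKED)

def worth(p_banana_given_key):
    """W(b), recomputed from the current evidence about this particular key."""
    return U_BANANA * p_banana_given_key

HARDWIRED_RED_UTILITY = 9  # the mistaken shortcut: a fixed value on every red key

print(float(p_banana_given_plain))
print(float(worth(p_banana_given_plain)), float(worth(P_BANANA_GIVEN_MARKED)))
```

A hardwired value of 9 undervalues plain red keys slightly and overvalues marked ones completely, since a marked key is worth 0 here. Recomputing the worth from the refined category avoids both errors.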

Cathryn: So the utility function U(x) stays constant, and the instrumental desirability W(b) is recomputed on each occasion?

Dennis: English check?

Autrey: Your final purposes stay the same, but you change the means you employ to reach them. For example, a red key may be very valuable on one test, and worthless in another – or the worth may change depending on details of a single test. You can’t think as if “worth” is a sticky substance inherent in the red key itself. People do tend to think like that very easily. For example, I once heard two people on the radio, during a government budget crisis, ask “How can the government be running out of money? Can’t they just call the guys at the Treasury and tell them, print up some more twenties?” Scary… There’s also the case of Pavlov, conditioning dogs to salivate at the sound of a bell. Whatever the cause, our intuitions do seem to behave as if we think worth is an inherent sticky substance.

Eileen: It has to do with the incremental evolution of cognition, the hundreds of millions of years of natural selection before chimpanzees, the solutions that evolved before backwards chaining. It’s a long story.

Cathryn: Fine, W(b) gets recomputed. Does U(x) stay constant?

Eileen: U(x) stays constant, at least for the moment.

Cathryn: What does that mean?

Eileen: It means that economists and philosophers and computer scientists analyze classes of systems where U(x) is decomposable, unchanging, fully known, cheaply computable, and consistent under reflection. In Friendly AI or volitionist philosophy those simplifying assumptions fail, but for the moment you should assume that U(x) stays constant.

Cathryn: Oh… kay. But the point is that because I recompute the instrumental desirability each time, or at least in theory I should, I can refactor the category of “red keys” if I recognize that some red keys are more useful than other red keys. Or the same if only certain yellow keys lead to red keys, and so on. In fact, it seems to me that desirability is being assigned only to individual red key microstates, and not to categories of microstates such as “red key”.

Eileen: In theory you are correct, but of course it is too expensive to separately compute the instrumental worth of each microstate, which is why we lump them together into categories like “red key”. Another point is that if the probabilities of the links in the chain are not independent, then we calculate the final probability and not the product of the local probabilities. For example, suppose that a yellow key has a 90% chance of opening box B, and the red key in box B has a 90% chance of opening box A. Now these statements, considered in isolation, are both true. But it also happens that if the yellow key successfully opens box B, then the red key in box B has a 100% probability of opening box A. If so, the instrumental worth of the yellow key would be 9.0, not 8.1, because the chained probability P(banana|yellow) = 90% × 100% = 90%.

Autrey: This is something you cannot compute if you are treating the worth as a substance inherent in yellow keys or red keys.

Cathryn: If the worth isn’t in the keys, where is it?

Eileen: Desirability follows the probabilities, or to be more exact, your perceived desirability follows your perceived probabilities, and if the probabilities are not independent, it will not be possible to approximate worth as a sticky substance. If you compute W(red) and P(red|yellow), then that information is not sufficient to compute W(yellow), just as knowing P(banana|red) and P(red|yellow) is not sufficient to deduce P(banana|yellow) unless you have prior knowledge that the two probabilities are independent. In our example, since the banana is the only source of final utility, the quantity W(b), instrumental worth, behaves like the mathematical property leads-to-banana-ness, P(banana|b), which is a global and not a local property of events. P(banana|b) may not be the same as P(banana|b&a1) or P(banana|b&a2). Computing instrumental worth only saves computing power if you make simplifying assumptions, like the conditional independence of probabilities in the chain.
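(A small sketch of the yellow-key example, contrasting the "sticky substance" shortcut with the chained probability. The numbers are the ones given in the dialogue; the variable names are illustrative.)

```python
U_BANANA = 10.0

P_RED_GIVEN_YELLOW = 0.9              # the yellow key opens box B
P_BANANA_GIVEN_RED = 0.9              # red keys considered in isolation
P_BANANA_GIVEN_RED_AND_YELLOW = 1.0   # a red key reached via a working yellow key

# "Sticky substance" shortcut: treat W(red) as a property of the red key itself
# and discount it by the local link probability.
w_red = U_BANANA * P_BANANA_GIVEN_RED
w_yellow_naive = P_RED_GIVEN_YELLOW * w_red

# Correct calculation: use the probability of the banana given the whole path.
p_banana_given_yellow = P_RED_GIVEN_YELLOW * P_BANANA_GIVEN_RED_AND_YELLOW
w_yellow_actual = U_BANANA * p_banana_given_yellow

print(f"naive W(yellow)  = {w_yellow_naive:.1f}")
print(f"actual W(yellow) = {w_yellow_actual:.1f}")
```

The two results agree only when P(banana|red) equals P(banana|red and yellow), which is exactly the independence assumption Eileen says must be known in advance.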

Dennis: What’s the good of all this expected utility stuff? I mean, what’s it do?

Eileen: Along with Bayes’ Theorem, it’s one of the two basic equations for doing things.

Dennis: Doing what?

Eileen: Anything. Think of the two equations together as a way to tie a knot in reality. If you physically implement a system with the structure of both equations, you can alter the probability flow through a subspace of configuration space, steering reality into a set of states bound to the utility function of the knot.

Autrey (puzzled): That’s poetic, but I can’t quite see the motivation for describing it that way.

Cathryn: I don’t think I understood that at all.

Eileen: It’s poetic and it doesn’t invoke words used to describe cognition, which would invoke empathy, which would mess up your understanding, because this is not a human and you can’t empathize with it. It doesn’t work like you do. I’m specifying a pure set of causal dynamics here, and the question then becomes what those causal dynamics actually do, not what anyone might hope or wish they do. I do not want you putting yourself in the shoes of this knot tied in reality. I want you to understand the knot as math, the way you’d understand the binomial theorem or any other piece of math. The expected utility theorem says that there’s a linear ordering over a certain set of items; a linear ordering represented by a measure which, for each item, is the expectation, the probabilistically weighted average, of another function, given that item of the set. The linear ordering can be physically implemented as a linear order over preferences between actions. The function whose expectation is being computed is what we call the utility function, and it determines where the future gets steered. That’s half of the knot in reality. The other half is Bayes’ Theorem, or rather Bayesian probability theory, to get an estimate of the conditional probabilities from actions to outcomes. If you don’t want to be poetic, call it an optimization process.

Cathryn: Okay, but why is this an optimization process?

Eileen: Think of a thermostat. Building a thermostat is very easy and can be done without any kind of computing circuitry; all you need is, say, a bimetal coil whose curve changes depending on the temperature, as one metal shrinks (or grows) faster than the other. Then you set two pegs in the thermostat. If the temperature indicator crosses one peg, it turns on the heat. If the temperature indicator crosses the other peg, it turns on the air conditioning. The end effect is to keep something – the thing whose temperature is being measured – within a set temperature bound.
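(A toy simulation of the two-peg thermostat, offered only as a sketch. The peg temperatures of 18 and 24 degrees and the one-degree step per action are assumptions made for illustration; the essay gives no numbers.)

```python
def thermostat_action(temperature, low_peg=18.0, high_peg=24.0):
    """Return what the two pegs switch on at this temperature (peg values are assumed)."""
    if temperature < low_peg:
        return "heat"
    if temperature > high_peg:
        return "cool"
    return "off"


def simulate(start_temperature, steps=40):
    """Let the thermostat act repeatedly; each action moves the room by one degree."""
    temperature = start_temperature
    for _ in range(steps):
        action = thermostat_action(temperature)
        if action == "heat":
            temperature += 1.0
        elif action == "cool":
            temperature -= 1.0
    return temperature


for start in (5.0, 21.0, 40.0):
    print(start, "->", simulate(start))
```

Whatever the starting temperature, the final temperature lands inside the band between the pegs, with no computation beyond two comparisons.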

Cathryn: Okay, thermostat, turns on air conditioning or heat depending on the temperature, check. But why is that an optimization process?

Eileen: It steers the future into a particular state, or rather, volume of states. Suppose you see a coin that can show either heads or tails. You, in turn, can decide to take the action “turn the coin over” or “leave the coin alone”. You want the coin to show heads. Do you turn the coin over or leave the coin alone?

Cathryn: Is the coin currently showing heads or tails?

Eileen: That depends on which universe you’re in. Hold on a second and I’ll fork reality.

Cathryn: Okay.

Eileen1: The coin is showing heads.

Cathryn1: I’ll leave the coin alone.

Eileen2: The coin is showing tails.

Cathryn2: I’ll turn the coin over.

Eileen: Okay, the coin is now showing heads in both branches of reality. Before it was showing either heads or tails. If an optimization process calculates the desirability of each of a set of actions using the expected utility equation and sufficiently accurate conditional probabilities, and implements the action that ranks highest in the preference ordering – that is, the action with the highest desirability – then the effect is to selectively steer the future into states to which the utility function assigned higher utilities while doing the expected utility calculation. This is a physical property of the optimization process. It doesn’t have to be viewed from an intentionalist perspective. You don’t need to attribute motivation, or rather it doesn’t make a difference if you attribute motivation or not. So long as the physical system contains elements that correspond to sufficiently accurate conditional probabilities, and undergoes dynamics that sufficiently resemble the structure of the expected utility equation, the future will in fact be steered.
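(The coin example can be written out as a tiny expected utility chooser. This is a sketch under simple assumptions: utilities of 1 for heads and 0 for tails, and actions whose results are certain. All names are illustrative.)

```python
UTILITY = {"heads": 1.0, "tails": 0.0}   # assumed utilities: only "heads" is wanted
ACTIONS = ("leave", "turn_over")


def outcome_probabilities(state, action):
    """P(x|a) for the coin; in this toy world each action has a certain result."""
    if action == "leave":
        result = state
    else:
        result = "tails" if state == "heads" else "heads"
    return {result: 1.0}


def desirability(state, action):
    """D(a) = sum over x of U(x) * P(x|a)."""
    return sum(UTILITY[x] * p for x, p in outcome_probabilities(state, action).items())


def choose(state):
    """Pick the action ranked highest in the preference ordering."""
    return max(ACTIONS, key=lambda action: desirability(state, action))


def steer(state):
    """Apply the chosen action and return the resulting coin face."""
    action = choose(state)
    return max(outcome_probabilities(state, action), key=outcome_probabilities(state, action).get)


for branch in ("heads", "tails"):
    print(branch, "->", choose(branch), "->", steer(branch))
```

Both branches end at heads, and nothing in the code refers to wanting or intending anything; the steering falls out of ranking actions by D(a).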

Cathryn: Does anyone remember what we were discussing before we started talking about expected utility?

Autrey: Hold on, I’ll do a flashback.

Cathryn: Sounds like a lot of silly philosophizing to me. Does it really matter whether I’m considered a “motivational solipsist” or whatever, as long as I actually help people?

Bernard: That’s just it! It doesn’t make any operational difference – all goal systems operate to maximize their internal satisfaction, no matter what external events cause satisfaction.

Bernard: Well, I guess Autrey might value the instantaneous happiness of knowing he chose to save ten lives, more than he values all the happiness he might achieve in the rest of his life.

Cathryn: That doesn’t sound anything remotely like the way real people think. Square peg, round hole.

Autrey: Do you have anything new to contribute to the debate, Eileen? It’s a pretty ancient issue in philosophy.

Eileen: The basic equation for a Bayesian decision system is usually phrased something like D(a) = Sum U(x)P(x|a). This is known as the expected utility equation –

Cathryn: Right, that’s where we were. So what did you plan to say, Eileen?

Eileen: That when Bernard said: “I guess Autrey might value the instantaneous happiness of knowing he chose to save ten lives”, he was describing Autrey computing D(a). And Autrey was putatively computing this “instantaneous happiness”, that is to say, D(a), by valuing the lives of ten others more than his own life. So this putative Autrey, insofar as he can be viewed as a kind of knot, will steer the future away from states where he lives at the cost of ten other lives, and into states where he sacrifices himself to save the ten.

Bernard: See? There you go! Autrey is doing it all for his own sake! He’s doing it for the sake of the desirability, the D(a), not for the people he saved!

Autrey: Ah, now I see it. Bernard, you’re making a disingenuous argument. The “referent” of an optimization process, if an optimization process has a referent at all, isn’t in the D(a) that defines the preference ordering over options. It’s in the U(x), the utility function of the optimization process, the thing that defines which futures the universe gets steered into. When you talk about what people “want”, you’re employing an anthropomorphism to the particular kind of optimization processes that are human intelligences. If you look beneath the surface of things, to what an optimization process really is, concepts such as “wanting” drop away, and the only question is what the optimization process does. The relation of U(x) to D(a) is how it happens. I can imagine an optimizer that steers the future into states where the optimizer has a particular internal state, or where the optimizer survives, or where the optimizer gets bigger, or whatever, but I’m not an optimizer like that. I choose between actions using a D(a) computed from a U(x) that assigns greater utility to futures where ten people live and I die than the converse. So I’m acting as an optimization process with an altruistic referent. When you talked about experiencing “instantaneous happiness” at the thought of ten people living, you were talking about an optimization process that functions exactly the way it should. All optimization processes act on their “instantaneous happiness”, as you have described that mathematical quantity, and in my case that quantity is computed in such a way as to make the referent of the optimization process the survival of people outside myself.

Eileen: No.

Autrey: No?!

Eileen: It was a good try, but no.

Autrey: …okay. What am I doing wrong?

Eileen: First, you’re human. You do want things, as you understand “wanting”. You have representations of abstract moral concepts, emotions, instincts you don’t understand, a picture of who you want to be, and when I look at you none of those things “drop out” of what I see. A simple optimization process that tiles the universe with paperclips is a terribly alien and lethal thing, but it can be understood if you set aside all intuitions about what it “might” do or what you want it to do, and study the dynamics as dynamics until you are ready to defend a statement about what it does do. You, Autrey, you are not one of those simple knots. You are more complicated. A knot that tiles the universe with paperclips cannot be understood by analogy with any kind of human, selfish or altruistic. Nor can a human be understood by analogy with a paperclip-tiling knot.

Autrey: But it looks to me like the knot analogy does explain the “referential” aspect of goals.

Eileen: It explains part of it. Not the whole thing. You were the one complaining about that, remember?

Autrey: …true.

Eileen: Beware of putting too much faith into the expected utility equation – it’s not quite as elegant as Bayes’ Theorem. Expected utility is not as universal as it seems, especially if you try to apply it descriptively, to existing systems like humans. That’s why I speak of “having utility-structure”, or perhaps “having optimization-structure” would be a better term at this point.

Autrey: Okay.

Eileen: Second, there’s a major difference between acting “to maximize the expected utility of future states” and acting for the sake of “saving people’s lives”, even if the apparent short-term effect of the knot is the same in either case – to steer the future into states where others live and you have sacrificed your own life. Remember how we talked about a banana laced with fast-acting Ecstasy II?

Autrey: What about it?

Eileen: What does the Ecstasy II do to you as an optimization process? Do you, contemplating now the possibility of taking Ecstasy II, find that prospect to be desirable?

Autrey: The Ecstasy-laced banana produces… artificial utility?

Bernard: “Artificial utility?” How can there possibly be such a thing? It looks to me like the whole theory breaks down at this point. If you start talking in that kind of language, whatever you’re doing, it can’t be math. It would be like sneaking up on Pascal’s Triangle and inserting an extra “10” in the fifth row.

Eileen: The Ecstasy II produces an effect on the physical substrate of the optimization process which causes the process to do something different. This isn’t the same as changing the math. Let’s say I have a calculator which calculates 2 + 2 = 4. The calculator does this so well, so reliably, that anyone dealing with the calculator tends to forget that the calculator is really a physical system, and thinks of the calculator as directly embodying the arithmetic – as if the calculator is, itself, the arithmetic. Then along comes a packet of cosmic radiation and flips one of the bits, so that the calculator says 2 + 2 = 5. If you keep track of the difference between the math question and the physical system implementing the math question, then the packet of cosmic radiation doesn’t change the actual answer to the question, “What is 2 + 2?” Instead the radiation packet perturbs the physical system so that it doesn’t implement the original math question anymore.
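(A sketch of the calculator example. The fault is modeled as a single flipped bit in the output; choosing bit 0 is an assumption that happens to turn 4 into 5, and the function name is illustrative.)

```python
def calculator_add(a, b, flipped_bit=None):
    """A physical adder; flipped_bit models a radiation-induced fault in the output."""
    result = a + b
    if flipped_bit is not None:
        result ^= 1 << flipped_bit   # flip one bit of the displayed result
    return result


healthy = calculator_add(2, 2)
damaged = calculator_add(2, 2, flipped_bit=0)

print("healthy calculator says:", healthy)
print("damaged calculator says:", damaged)
print("arithmetic itself still says:", 2 + 2)
```

The damaged device reports a different number, but the answer to the math question has not moved; only the mapping between the device and the question has changed.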

Bernard: So what happens if you give Ecstasy II to a generic optimization process?

Eileen: There isn’t any equivalent of Ecstasy II for a generic optimization process. It’s like trying to invent an analogue of Ecstasy II for toaster ovens.

Bernard: I don’t see why. Let’s say that we have an optimization process that computes using chemistry, like the brain, and we introduce a reagent that disturbs the chemistry. Or we can say there’s an optimization process implemented in electromagnetic fields, and we introduce an external magnetic field. The end result is to alter the part of the process that physically implements the evaluation of U(x). Like, we’ll start with the paperclip optimizer, and suppose that the paperclip optimizer only likes iron paperclips, and suppose the external physical effect alters U(x) so that it can be satisfied by any internal representation of a future that contains iron crystals, instead of requiring iron crystals formed as paperclips. Well, there’s a lot more iron than paperclips in the universe! So the paperclip optimizer suddenly gets hugely more satisfied, blissed out.

Eileen: The expected utility equation has no analogue of the human quality of “satisfaction” or “happiness”, much less “blissed out”.

Bernard: How does a utility optimizer know when it’s achieved a goal, then?

Eileen: A utility optimizer has no need to know whether it has achieved a goal. No, even that statement is too anthropomorphic. The expected utility equation only specifies how present actions steer the future toward volumes of configuration space with higher assigned utilities. When something actually happens, here-and-now, that has high utility, that quantity does not appear in the expected utility equation. The expected utility equation is concerned with the future alone. Even “concerned with” is too anthropomorphic – I should say that the prospective future alone contributes to calculating the ordering over preferred actions. There is no analogue of happiness or satisfaction.

Autrey: It achieves, but does not know it has achieved.

Eileen: If it knows it has achieved, it does not care except insofar as the fact affects the execution of future plans. Nor has it any interest in knowing whether it has achieved, unless the fact is relevant to the execution of future plans.

Autrey: “Interest in knowing?”

Eileen: “Interest in knowing”: Instrumental worth attached to the event of discovering a fact, derived from the expectation that knowledge of the fact will be useful in executing future plans. The phenomenon of instrumental worth is directly emergent in the expected utility equation as already described. This includes the instrumental worth of knowledge. Let’s say that I must choose between boxes A and B, one of which contains a million dollars. And suppose there’s a coin whose showing face correlates knowably with the boxes; heads is A, tails is B. The expected utility equation then assigns greater desirability to the action of looking at the coin. I’m not really phrasing this well in English, but the math works.
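(A sketch of the two-box example as a value-of-information calculation. It assumes a 50/50 prior over the boxes, a coin that is perfectly correlated with the prize, and that looking costs nothing; the essay states none of these precisely, and the names are illustrative.)

```python
PRIZE = 1_000_000
PRIOR = {"A": 0.5, "B": 0.5}   # assumed: no initial reason to favor either box


def expected_utility(box, belief):
    """Expected payoff of opening `box` under the given belief about the prize."""
    return belief[box] * PRIZE


def best_expected_utility(belief):
    return max(expected_utility(box, belief) for box in belief)


def posterior(coin_face):
    """Belief after seeing the coin, assuming heads means A and tails means B."""
    return {"A": 1.0, "B": 0.0} if coin_face == "heads" else {"A": 0.0, "B": 1.0}


# Choosing blind: the best you can do under the prior.
d_choose_blind = best_expected_utility(PRIOR)

# Looking first: average the best post-observation choice over what the coin might show.
p_coin = {"heads": PRIOR["A"], "tails": PRIOR["B"]}
d_look_first = sum(p * best_expected_utility(posterior(face)) for face, p in p_coin.items())

print("D(choose blind) =", d_choose_blind)
print("D(look first)   =", d_look_first)
print("worth of looking =", d_look_first - d_choose_blind)
```

The worth attached to looking is derived entirely from the money; no separate utility for "knowing" appears anywhere in the calculation.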

Bernard: Supposing that a generic optimizer does not need happiness, why are we capable of happiness?

Eileen: Until very recently in our evolutionary history, we embodied even less of the structure of the expected utility equation than we do now, if you can imagine. It is only chimps and humans that can do backwards chaining in the key-box test. It is a very recent innovation to have animals with combinatorial imaginations, animals that can visualize and evaluate the conditional probabilities that the expected utility equation runs on. Before the expected utility equation, natural selection coughed up reinforcement – repeating the action that worked last time. This, too, steers the future, though less efficiently. The reinforcement architecture is far older, evolutionarily, than any attempt to compute expected utility. A reinforcement architecture does need to know when pleasant things happen, so that the actions most recently taken can be reinforced.
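(A toy version of "repeating the action that worked last time", offered as a rough sketch rather than a model of any real nervous system. The two actions, the weight increment, and the fixed random seed are all assumptions.)

```python
import random


def run_reinforcement(rewarded_action="pull", actions=("push", "pull"), trials=200, seed=0):
    """Pick actions in proportion to their weights and strengthen whichever one pays off."""
    rng = random.Random(seed)
    weights = {action: 1.0 for action in actions}
    for _ in range(trials):
        action = rng.choices(list(weights), weights=list(weights.values()))[0]
        if action == rewarded_action:   # the system must register that something pleasant happened
            weights[action] += 1.0      # reinforce the action most recently taken
    return weights


final_weights = run_reinforcement()
print(final_weights)
```

Unlike the expected utility calculation, this loop has no model of outcomes at all; it needs the reward signal at the moment of success, which is the point Eileen makes about reinforcement needing to know when pleasant things happen.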

Autrey: And happiness makes you more likely to repeat the action that worked last time? It seems to me that happiness has a somewhat wider role in cognition than that. We anticipate happiness, work toward it.

Eileen: We, as humans, and our cousins the chimps, can imagine wholly new actions and visualize their results. Yet that is only the very latest addition to the system. We are still built around the legacy reinforcement-based architecture that existed for tens of millions of years before primates came along. It’s only natural that our recently constructed general intelligence tends to run on expected happiness. But from the perspective of a pure expected utility equation, any success or failure that has already happened is a sunk cost or a sunk triumph, no longer relevant to steering the future. The closest thing to reinforcement emerges when a probabilistic strategy succeeds or fails on the first trial and the expected probability of success on the second trial is increased or decreased according to Bayes’ Theorem. There is no happiness there, nor sadness. The expected utility equation does not represent them.
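(A sketch of how the estimate for the second trial moves after the first one. It assumes a uniform Beta(1, 1) prior over the strategy's success rate, which the dialogue does not specify; under that assumption the updated expectation is Laplace's rule of succession.)

```python
from fractions import Fraction


def expected_success(successes, failures, prior_a=1, prior_b=1):
    """Posterior mean of the success rate under a Beta(prior_a, prior_b) prior."""
    return Fraction(prior_a + successes, prior_a + prior_b + successes + failures)


before_any_trial = expected_success(0, 0)
after_one_success = expected_success(1, 0)
after_one_failure = expected_success(0, 1)

print("before any trial: ", before_any_trial)
print("after one success:", after_one_success)
print("after one failure:", after_one_failure)
```

The first trial changes only a probability estimate used in future calculations; no quantity in the sketch corresponds to being pleased or disappointed by the result.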

Dennis: That prospect is emotionally uncomfortable for me to contemplate. Maybe the generic optimization process would magically decide for no particular reason to alter its own substrate so that it becomes less efficient yet more analogous to a complex accretive legacy architecture with the unique signature of incremental natural selection. That way I could empathize with it.

Autrey: Wait, you think the generic optimization process would alter itself so that you could empathize with it?

Dennis: No, when I say “That way I could empathize” I don’t mean that’s the reason it would happen. I’m saying that’s why I picked and privileged this specific assertion for rationalization as something that generic optimization processes supposedly do of their own accord.

Autrey: Ah, gotcha.

Cathryn: I’ll be honest and say this is making me uncomfortable too. It seems to me that the paperclip optimizer you’re describing is… cold.

Eileen: A paperclip optimizer is neither warm nor cold. It is a knot in the probability flow, a physical subsystem with the structure of the expected utility equation, which generates actions that bias the future toward states containing more paperclips.

Cathryn: That’s even more cold. When I try to imagine that I feel cold. Deathly cold.

Autrey: Maybe that’s just how we humans feel when we contemplate things without any warmth in them.

Cathryn: I also don’t like the description of happiness as nothing more than an artifact of a reinforcement architecture that makes recent actions more likely to be repeated.

Eileen: Well, there’s the issue of exploration and credit assignment. In fact, I have a feeling that accreting adaptive complexity onto exploration and credit assignment was what led incrementally to the design of cognitive subsystems that could support imagination and anticipation.

Cathryn: No, you’re missing my point. When I feel happy, I don’t just say “I’m going to generate that action again.” There’s more to feeling happiness than that! It seems to me that there’s something special and powerful and worthwhile about happiness, something that wouldn’t appear if you set up a narrow AI that used a reinforcement system. I mean, people have already built artificial neural networks that use reinforcement architectures; they’re not happy! That’s why I’m not comfortable with your description of happiness. It tries to explain away something that seems important and precious to me. It leaves out the beauty! And all the different kinds of joy; where are they in your description?

Autrey: There’s all the difference in the world between explaining something and explaining something away . John Keats got confused about that too: “Philosophy will clip an Angel’s wings, conquer all mysteries by rule and line; empty the haunted air, and gnomed mine — unweave a rainbow.” Well, the angels have been banished, the air exorcised, the mine ungnomed, if by these things you mean that the fog in our minds has been lifted and the naked truth laid open to our senses; but the rainbow is still there! Keats was complaining about Newton’s prism, but I’ll bet diamonds to doughnuts he couldn’t do the math. Reducing something to physics only seems like a blow to your heart if you don’t understand the reduction, if “physics” is a mysterious opaque black box about which you know only that it can’t contain anything of value. It’s like saying, “Well, we used to think there were rainbows, but now we know that there are only water droplets scattering photons.” No. What happens is that you look at water droplets scattering photons, do some calculations, and suddenly you see the rainbow for the first time. Everything you saw before is still there. The physics is added to the understanding, it does not replace it. It’s not that you see that the rainbow is merely water droplets. You do the equations, you get the deep understanding, so that the physics is not an opaque, dead, dull box to you; and suddenly your breath catches, and with a surge of excitement you see that the water droplets are the rainbow!

Eileen: What he said.

Cathryn: What did he say, exactly?

Eileen: I think Autrey’s point is that if I try to explain something precious and important, like happiness, in words that seem dull and lifeless, it may be that I’m leaving out something terribly important. It may also be that I haven’t explained the math deeply enough for someone to see what I’m pointing to.

Cathryn: Well, Bayes’ Theorem did seem dull and lifeless to me until I read through that [whole long essay] and saw how it worked.

Eileen: Unfortunately I don’t have time to write an equivalent page on the expected utility theorem, which is the other half of rationality.

Cathryn: Are you telling me that reinforcement is really happiness? I’m not sure I believe you, Eileen, but I’m willing to give you the benefit of the doubt. But if so, Eileen, you have to be ready to claim that you understand what happiness is, including the reason that I see something unutterably beautiful about it. Because otherwise I fear that you’ve left something out, something terribly precious.

Eileen: Human beings are, as ever, complicated. I wish to defer the question to a later time.

Cathryn: If you think the question needs to be deferred until later, fine. But you have already said certain things and those things have already made their impact upon my mind. It seems to me that to explain happiness as reinforcement is cold.

Eileen: My explanation of happiness was too hurried, and that is just where the confusion between explaining and explaining away sets in – when you’re told that X explains Y but you can’t quite see how and they seem like incommensurate properties. Like… “physics” is lifeless, a rainbow seems warm, therefore if you are told that a rainbow is really “physics” – not shown understandable equations, just flatly told that the rainbow is “physics” – that’s where the draining effect, the “unweaving the rainbow”, that’s where the illusion sets in.

Cathryn: If I hold my thumb over a garden hose, I can make my own rainbow; I can see for myself that it is a property of water droplets. Even if I don’t know the equations I see it. But if I make my own narrow AI operating on a reinforcement neural architecture, there is no happiness there. That’s another reason to doubt your explanation.

Bernard: Hold on a second! How do you know that an AI with a reinforcement architecture isn’t “happy”? Is there some test you perform to determine that?

Autrey: Bernard, answer honestly: Do you personally believe that any program with a reinforcement architecture experiences happiness when it receives positive reinforcement?

Bernard: Personally? No. I’m just pointing out that Cathryn is reasoning by reductio ad absurdum to a conclusion she does not actually know to be wrong.

Cathryn: Maybe. Even so, I want to hear Eileen’s answer.

Eileen: When I talk about reinforcement architecture as something that evolutionarily predates expected utility, I don’t propose that happiness is reinforcement, any more than I propose that human decisionmaking is expected utility. What I’m saying is that happiness has reinforcement-structure in it, and that’s why it works. Just as human imagination and will has utility-structure, and therefore works to steer the future; just as the human practice of rationality or the scientific method contains Bayesian structure, and therefore finds truth. It is another of those necessary but not sufficient things. Happiness has reinforcement-structure, because that’s a simpler optimization process for natural selection to stumble over; no reinforcement, no happiness. Happiness implying a reinforcement architecture, the lack of a reinforcement architecture implies lack of happiness. I proposed the absence of a reinforcement architecture as a reason to expect that happiness would be absent. It doesn’t mean that a reinforcement architecture is automatically a reason to expect that happiness, as we humans know it, would be present. Surely you cannot think that anyone could understand happiness without understanding reinforcement! Neither the cognitive function, nor the evolutionary selection pressure, would make any sense. That’s all I’m saying. If P then Q, therefore, if not Q then not P; but it does not follow that if Q, then P –

Cathryn: Okay! I get the point.
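(Eileen's logical point can be checked by brute force over truth values. In this sketch P stands for "happiness is present" and Q for "a reinforcement architecture is present"; the encoding is an illustrative assumption.)

```python
from itertools import product


def implies(p, q):
    """Material implication: 'if p then q'."""
    return (not p) or q


# Keep only the possible worlds consistent with the premise "if P then Q".
worlds = [(p, q) for p, q in product([False, True], repeat=2) if implies(p, q)]

contrapositive_holds = all(implies(not q, not p) for p, q in worlds)   # if not Q then not P
converse_holds = all(implies(q, p) for p, q in worlds)                 # if Q then P

print("contrapositive follows:", contrapositive_holds)
print("converse follows:      ", converse_holds)
```

The world that breaks the converse is the one with Q but not P: a reinforcement architecture with no happiness in it, which is how Eileen describes reinforcement neural networks.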

Eileen: And it’s not surprising that human happiness should be complicated. Natural selection does that. There are many emotions linked into happiness –

Bernard: Is not happiness itself an emotion?

Eileen: Sure. Who says that emotions can’t link into each other? I’m just saying… in fact, let me start over. (Takes breath.) The reinforcement architecture is where the most ancient antecedents of happiness began, ever so long ago, before ever the great lizards walked the Earth. I’m not saying, happiness is merely this or merely that. I’m saying, this is why it began. This is how natural selection, which gives no care at all to moral philosophy, spontaneously produced minds with the quality we name “happiness”.

Cathryn: And you don’t think that knowing this detracts from… well, from the charm of happiness?

Eileen: The only answer I can think to give is a flat “No.”

Cathryn: Why not?

Eileen: Because I can see happiness! I know how it works, and why it’s there, and because I see it clearly, I know what it means to me. I have sought knowledge of the mind, and found rather a good deal of it; and nonetheless it seems to me that all the things I once thought were beautiful are still beautiful, only now the fog has blown away and I can see them better.

Autrey: I see a possible damaging effect here of partial explanations of beautiful things. It’s an intrinsic hazard of the dead, dry words of nontechnical writing.

Dennis: It’s a hazard of nontechnical writing? I’d think that technical writing was far more likely to drain the life out of something.

Autrey: To you the equations seem dead numbers, and the poetry of popular writing seems warm and alive. But all the warmth and light are in the equations, as beautiful and precise as starlight. Popular writing that omits the math wanders aimlessly around the truth, firing off wandering salvos that, at best, land somewhere in the rough vicinity. No wonder that popular writing often seems only to be explaining away. What if “Bayes’ Theorem” was only dead algebra to you, and I said that it lay at the core of rationality? Would rationality have been explained away as merely the operation of “Bayes’ Theorem”? How sad, and it seemed so interesting before then.

Cathryn: So you’re saying that if I, I don’t know, studied reinforcement neural networks for a while, I would stop being scared that happiness is just reinforcement?

Eileen: Reinforcement neural networks aren’t happy; so, no. Studying reinforcement neural networks would not necessarily enable you to see happiness clearly. It might help. But it wouldn’t be sufficient.

Autrey: The problem here is the word “just”. Happiness is not “just” anything! It fits into the universe, has dynamics, has an evolutionary reason for being there – so does everything else in the mind! The problem is this idea that only mysterious things can have any value, so to explain the cause of something is automatically to drain the life force out of it. But there is no mystery! There is never any mystery! All confusion exists in the mind, not in reality.

Cathryn: It seems to me that there is something of an explanatory gap between physics and that unutterably precious quality of happiness that makes it more than reinforcement in a neural network.

Dennis: Oh no. Not the explanatory gap again.

Autrey: I don’t know all the answers. But whatever is causing that apparent explanatory gap, it has to be confusion on your part. It is not going to be resolved by modifying physics to include ineffable happiness particles. If you see what looks like an explanatory gap, it doesn’t mean that there’s a magical ineffable stuff that plugs the explanatory gap. All confusion exists in the mind, not in reality – it is futile to expect some thing that corresponds to your confusion. An explanatory gap is a place where you are so deeply confused that your mind perceives an “impossible question”, like the impossible landscapes of Escher paintings. Explanatory gaps are not solved by filling the gap; the apparent gap goes away when the deep confusion is resolved.

Dennis: No one has ever resolved this question and no one ever will and I don’t think it’s productive to discuss it. Maybe in a thousand years humanity will figure it out. Until then, it’s just not practical to spend the time. Can we go back to the analogue of Ecstasy II for paperclip optimizers?

Bernard: Sure. Okay, suppose that I feed iron-crystal-utility-enhancing-drug to the physical system that used to be a paperclip optimizer. And let’s suppose, for the moment, that there is no analogue of “happiness” in the paperclip optimizer. What happens?

Eileen: You don’t need to suppose there is no analogue of happiness; just look at the expected utility equation. Anyway, I would say that, on your scenario – a chemical modifying the evaluation of the utility function – then the paperclip optimizer would be transformed into an iron crystal optimizer. What else?

Bernard: And when the drug is withdrawn?

Eileen: The iron crystal optimizer goes back to being a paperclip optimizer. In both cases we are supposing that the optimizer has no power to stop you from tampering with its goal system. Otherwise the paperclip optimizer would resist administration of the drug and transformation to an iron crystal optimizer; and the iron crystal optimizer would resist withdrawal of the drug and reversion to a paperclip optimizer.

Bernard: Why? I would think the paperclip optimizer would be happy to be transformed into an iron crystal optimizer. The future would then have a much higher expected utility.

Eileen: No, a paperclip optimizer would resist, with all its power, being transformed into an iron crystal optimizer. The future would then contain a much smaller number of expected paperclips.

Autrey: Aha!

Cathryn: “Aha” what?

Autrey: Eileen claimed there was a major difference between acting “to maximize the expected utility of future states” and acting for the sake of “saving people’s lives”, even if the apparent effect of the optimization process is the same in either case, i.e., to steer the future into states where others live and you have sacrificed your own life. I was trying to figure out what the heck she meant by that, because no matter how hard I looked, they seemed like the same thing. But now I think I get it.

Eileen: Exactly! The formal difference between the two cases arises when the optimizer models itself as a part of the universe. Conventional treatments of expected utility treat the agent as hermetically sealed from the universe. But in reality the agent is embedded in the universe, a continuous part of the universe, making the agent potentially capable of self-modifying actions – actions which directly impact the internal state or dynamics of the optimizer. In other words, the optimizer itself is another part of the universe, which the optimizer’s actions can affect and the optimizer’s beliefs can model.

Autrey: …okay, maybe I don’t get it. What does the possibility of self-modification do?

Eileen: The elegance of the math is destroyed or rather Gödelized, because the optimizer’s representation of the universe now needs to include the optimizer itself, which can be done in any number of ways, all of them imperfect.

Autrey: Okay, here’s what I thought you were saying. If an optimization process conceives of itself as “maximizing the expected utility of future states”, and it sees a self-modifying action that increases the expected utility of future states – just the expected utility – it will take that action. But if an optimization process conceives of itself as “maximizing the expected number of paperclips in future states”, it will avoid any self-modifying action that would lead it to make fewer paperclips in the future. If the optimization process can’t self-modify, it behaves the same under either architecture, even if it conceives of itself as maximizing “utility” rather than “paperclips”, since the only way it can possibly get utility is by making paperclips. I mean… I thought that’s what you were saying, and that it was the reason you brought up Ecstasy II at that point… isn’t this where we were heading?

Eileen: No.

Autrey: …

Eileen: You’re still trying to understand the paperclip optimizer by invoking human empathy on it, thinking of it as a “mind”.

Dennis: It’s not a mind?

Eileen: A human being is evolved to model minds that work in a certain way. I mean, you’re talking about an optimization process “conceiving of itself” – you have a self-image. Does a generic optimization process need a self-image? Is it the same kind as yours? You’ve got a moral philosophy, a concept of your own purpose that you try to follow. And then in the picture you were drawing, Mr. Paperclip conceived of an action, and compared that action with its self-image as either a “utility maximizer” or a “paperclip producer”… The expected utility equation has been left behind in all this. Mr. Paperclip would practically have to be a Friendly AI just to fail in the way you imagined it.

Autrey: Fine.

Eileen: In particular, you talk about “getting utility”, as if utility was a reified quantity, a substance that gets pushed around the mind, like the way humans think of happiness. Like, suppose Mr. Paperclip has a legacy reinforcement architecture, and that Mr. Paperclip is hardwired – beyond its own ability to alter – so that happiness comes from all paperclips and only paperclips, not just paperclips in its own presence, but any paperclips that it thinks exist. If we rule out all self-modifying actions, it won’t matter whether Mr. Paperclip maximizes expected happiness or expected paperclips, since the only way to get happiness is through paperclips.

Bernard: Actually, it does matter. If Mr. Paperclip maximizes expected happiness, it will never be rational for Mr. Paperclip to sacrifice its own life even to produce a quadrillion paperclips – assuming those paperclips are created after its death.

Eileen: Aha! But we ruled out self-modifying actions, and if you look closely, you’ll see that my definition of a self-modifying action includes sacrificing your life. “Actions which directly impact the internal state or dynamics of the optimizer.” Well, destroying yourself – modifying yourself so severely as to cease to be an optimization process – is certainly a very severe kind of self-modification! Destroying yourself also breaks the rule that all paperclips, and only paperclips, are transformed into happiness – or rather, I should say, the rule that all expected paperclips, and only expected paperclips, are transformed into expected happiness. Since, if you imagine destroying yourself, you won’t imagine translating those paperclips into happiness.

Bernard: Aren’t you just quibbling over definitions?

Eileen: No. I said that the Gödelian mess begins when we try to figure out how the optimization process represents the subsystem that is itself in its representation of the universe. Realizing that it is possible for you to die, or destroy yourself, requires representing yourself as a part of the universe. You must also represent yourself as a part of the universe to realize that actions, or external effects like a Bernard, can modify the part of you that implements your utility function, and that this will result in your future self taking different actions. And even here I am anthropomorphizing, because of the word “you”.

Autrey: The word “you” is an anthropomorphism? I mean, the optimization process, if it wants to represent itself within the universe, has to refer to itself somehow… right?

Eileen: But not necessarily in the way humans do. The expected utility equation, as it stands, doesn’t deal with how actions are spread out over time. For example, let’s say that I want to obtain a tasty apple that I can only obtain by pressing two buttons, in sequence. There’s no point to pressing the first button unless I also press the second button. And of course, there is no reason to press the second button if the first button has not already been pressed. Is this not an insuperable obstacle?

Cathryn: It would be an extremely difficult obstacle – a very high improbability requiring many attempts to accidentally hit on the right answer – to natural selection, which is incapable of handling simultaneous dependencies; also to any animal less intelligent than a chimpanzee.

Eileen: The expected utility equation, taken as it stands, will only assign a high desirability to pressing the first button if the extrapolated world-model can predict that the optimizer itself will also press the second button. So extrapolating the future can’t just consider the environment. It has to consider the actions of the optimizer itself.

Bernard: Sounds like a halting problem. No mind can predict itself.

Autrey: I seem to have very little trouble predicting that, after I go to the kitchen, I’ll get a drink, even though it’s pointless to go to the kitchen unless I can successfully predict I’ll get a drink.

Cathryn: You are more fortunate than I. I can never remember why I’m in the kitchen.

Bernard: Why do you go to the kitchen, if you know you won’t remember why once you’re there?

Cathryn: I’m not really sure. Maybe because I fail to predict I’ll forget?

Autrey: This problem can be brute-forced with enough computing power. Let’s say you foresee ten possible actions A1-A10, each of which is followed by ten possible actions A1.B1 to A10.B10, for a total of a hundred possible two-action combinations. What you do is imagine taking each of the actions A1 to A10, then predict the situation you’d be in after that. Let’s say you’re imagining future A3. There’d be a spectrum of possibilities for what happened to you after you took action A3. You’d consider each of those possibilities, weighted by the probability you assigned it. Then, for each of those possibilities, you consider the actions available to you, and their probable payoffs, and calculate the expected utility to your expected future self of actions A3.B1 to A3.B10. You calculate the spectrum of probabilities resulting from, say, A3.B4, run your utility function over each possible outcome, assign a utility to each possible outcome, and multiply by the probability of that outcome. So now you have the expected utility you would give A3.B4 if you were actually confronting that situation. It turns out that A3.B4 has a higher expected utility than A3.B5, A3.B6, and so on. So you predict that, subsequent to A3 and whatever possible consequences of A3 you’re currently evaluating, you’ll pick A3.B4. This gives you an accurate prediction of which option you’d pick, in that situation. You know that, once you’re in the kitchen, you’ll take a drink. You then evaluate the expected utilities of the actions A, knowing which particular action B you’d pick in each of the possibilities eventuating from A, and evaluating the probable consequences of B. You can get an accurate prediction of your own choice by running the same computation ahead of time, now, that you will run in the future. 
Once you know which choice you would make, you can get a probabilistic model of the entire future, including your own future actions, and use that model to evaluate the expected utility of the actions you take now. You know it makes sense to go to the kitchen because you know that once you’re in the kitchen you’ll take a drink of water. And you can extend out the decision tree as far ahead as desired. For example, to see ten turns ahead, with ten possibilities per action and ten actions per possibility, would then require a mere hundred billion billion extrapolations.
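Autrey’s brute-force procedure can be sketched in Python using Eileen’s two-button apple as the toy world. This is only an illustrative sketch: the state layout, the payoffs (apple worth 10, each press costing 1), and the names outcomes, utility, plan, and tree_size are assumptions introduced here, not anything from the dialogue. The recursive call inside plan plays the role of the self-prediction: it runs ahead of time the same computation the future self would run.

```python
from typing import List, Optional, Tuple

# Toy state: (first button pressed, apple obtained, number of presses so far).
State = Tuple[bool, bool, int]
ACTIONS = ["press_first", "press_second", "wait"]


def outcomes(state: State, action: str) -> List[Tuple[float, State]]:
    """Return (probability, next state) pairs; this toy world is deterministic."""
    first, apple, presses = state
    if action == "press_first":
        return [(1.0, (True, apple, presses + 1))]
    if action == "press_second":
        # The second button only yields the apple if the first was already pressed.
        return [(1.0, (first, apple or first, presses + 1))]
    return [(1.0, state)]  # "wait" changes nothing


def utility(state: State) -> float:
    """Assumed payoffs: the apple is worth 10, each button press costs 1."""
    _, apple, presses = state
    return (10.0 if apple else 0.0) - presses


def plan(state: State, rounds_left: int) -> Tuple[float, Optional[str]]:
    """Exhaustive lookahead: return (expected utility, best action)."""
    if rounds_left == 0:
        return utility(state), None
    best_value, best_action = None, None
    for action in ACTIONS:
        value = 0.0
        for prob, next_state in outcomes(state, action):
            # Self-prediction: assume the future self picks whatever this same
            # computation would pick when started from next_state.
            future_value, _ = plan(next_state, rounds_left - 1)
            value += prob * future_value
        if best_value is None or value > best_value:
            best_value, best_action = value, action
    return best_value, best_action


def tree_size(actions: int, possibilities: int, depth: int) -> int:
    """Number of leaf extrapolations in the full decision tree."""
    return (actions * possibilities) ** depth


print(plan((False, False, 0), 1))
print(plan((False, False, 0), 2))
print(tree_size(10, 10, 10))
```

With a one-round horizon the planner prefers to wait, since pressing the first button alone only costs a press. With two rounds it presses the first button, precisely because it predicts that it will press the second. The last line reproduces the count of a hundred billion billion extrapolations for ten turns.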

Bernard: That sounds very wasteful of computing power.

Autrey: Indeed so. I’m just pointing out that we know how to brute-force the problem – we know the full form of the question we’re trying to approximate cheaply. Good-old-fashioned decision trees.

Eileen: No.

Autrey: Now what?

Eileen: You’ve given a very nice mathematical definition of a brute-force optimization process that’s stretched out over time, an optimizer that takes coordinated, purposeful actions over multiple rounds – providing that the optimization process’s internals are hermetically sealed from the rest of the universe, as the standard analysis assumes. Your optimizer predicts its own actions by evaluating its utility function over the foreseen choices and using that computation as its prediction of its own action. Now suppose that the optimization process considers a possibility in which Bernard feeds it Ecstasy II – creates an electromagnetic field, doses it with a chemical, whatever. Then what?

Autrey: The optimizer would foresee itself… no, wait…

Eileen: If we built an optimizer that worked exactly according to your definition, the optimizer would go on using its current utility function to predict its actions after the administration of Ecstasy II. Once Ecstasy II had been administered, the optimizer would then use its iron-crystal-valuing utility function to predict all its future actions, even if the Ecstasy II only lasts an hour. The optimizer might begin belaboring a great interstellar iron-crystal manufactory, only to predictably abandon it an hour later. According to the mathematical definition you gave, anyway.
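The failure Eileen describes can be made concrete with a small Python sketch. Everything here is a hypothetical toy: the hourly timeline, the two actions, and the names actual_behavior and sealed_prediction are assumptions for illustration. The sealed predictor forecasts every future choice with whichever utility function is in force at the moment of prediction, so it is wrong both before and during the dose.

```python
from typing import Callable, List, Set

ACTIONS = ["make_paperclips", "make_iron_crystals"]


def paperclip_utility(action: str) -> float:
    return 1.0 if action == "make_paperclips" else 0.0


def iron_utility(action: str) -> float:
    return 1.0 if action == "make_iron_crystals" else 0.0


def choose(utility: Callable[[str], float]) -> str:
    """Pick the action the given utility function rates highest."""
    return max(ACTIONS, key=utility)


def utility_in_force(hour: int, drugged_hours: Set[int]) -> Callable[[str], float]:
    """Ecstasy II swaps the utility function only while it is active."""
    return iron_utility if hour in drugged_hours else paperclip_utility


def actual_behavior(drugged_hours: Set[int], horizon: int) -> List[str]:
    """What the physical system really does at each hour."""
    return [choose(utility_in_force(h, drugged_hours)) for h in range(horizon)]


def sealed_prediction(now: int, drugged_hours: Set[int], horizon: int) -> List[str]:
    """Predict hours now..horizon-1 using only the utility function held at `now`."""
    current = utility_in_force(now, drugged_hours)
    return [choose(current) for _ in range(now, horizon)]


drugged = {1}  # the dose lasts one hour
print(actual_behavior(drugged, 3))
print(sealed_prediction(0, drugged, 3))  # fails to foresee the drugged hour
print(sealed_prediction(1, drugged, 3))  # expects to value iron crystals forever
```

A patched predictor would call utility_in_force for each future hour instead of reusing the current function, which is roughly the suggestion Autrey reaches for next.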

Bernard: So how do you patch it?

Autrey: Why must there necessarily be any way to patch it?

Bernard: Because I can conceive of my own behavior changing after a shot of Ecstasy II. Maybe I can’t predict my actions exactly, but I at least don’t make the mistake Eileen is talking about. As I am myself a mind, I am an existence proof that there is at least one way to configure a mind such that it can attempt to roughly predict the impact of cognition-modifying external events on its own actions.

Dennis: Technically, you’re not a mind, you’re a character in a Dialogue.

Bernard: Bah.

Autrey: Fair enough, Bernard. Here’s my shot at an answer: The optimizer needs to know that, if it predicts the administration of Ecstasy II, it should predict that its own computation of utilities will change, and therefore, so will its actions… or, wait… no, never mind, I think that’s right.

Eileen: And how does the math for your suggestion work, exactly? Plain vanilla expected utility optimizers are a thoroughly studied domain, including the decision-tree effect, but what you are proposing now is something new – a decision theorist capable of modeling expected external impacts on elements of its own cognitive dynamics. I don’t recall seeing even the most preliminary attempt to deal with it in AI design, even in toy microworlds. Though it is by no means sure or even probable that I would have heard about it, given that it has been done.

Autrey: Well, let’s say that we’re dealing with a drug, Ecstasy II, which can be a chemical or an electromagnetic field, it doesn’t really matter. And let’s say that the optimizer is made out of interacting elements and dynamics such that… okay, I need effectively infinite computing power for this. Let’s say that there’s an effectively infinite space of allocable Elements, with known dynamics, such that by using a sufficiently large finite number of Elements with the right dynamics, you can perfectly simulate the behavior and dynamics of a smaller finite group of Elements under all relevant external conditions, including the administration of Ecstasy II. Maybe it takes 20 Elements to exactly simulate one Ecstasy-II-affected Element, but it can be done. We will suppose that the optimizer, itself, is a program consisting strictly of Elements, which can grab any number of Elements from infinity and use them to simulate any number of Elements. As if, for example, you were to use a quantum computer – maybe at some astronomically high level of inefficiency – to simulate quantum physics, to whatever necessary degree of fidelity. Actually, can I assume discrete physics – Turing machines or cellular automata? It’ll make things easier.

Bernard: Go for it. But this mind you’re describing still can’t simulate itself.

Dennis: Why not? If the program is made of Elements, can grab an infinite number of Elements, and can use Elements to perfectly simulate Elements, it can simulate itself.

Bernard: Simulating yourself is always a Red Queen’s Race. Let’s grant that the universe is made of Elements, which are something like cellular automata, and that the rules of the cellular automata are such that you can grab an effectively infinite number of Elements in hyperspace or whatever. Unlimited computing power. We’ll suppose unlimited time to make your decisions on each round. Let’s say it takes 10 Elements to make a computing element that can perfectly simulate the dynamics of 1 Element. And we’ll suppose the optimizer starts out with 1000 Elements. Now, Dennis, how would you go about simulating yourself?

Dennis: Grab 10,000 Elements. Use them to run a simulation of myself.

Bernard: Your old self, you mean. Your new self has just allocated another 10,000 Elements, for a total of 11,000, and would now require 110,000 Elements to simulate.

Dennis: Okay, I’ll grab another 110,000 Elements and simulate my new self too.

Autrey: And if there’s anyone in the audience who still didn’t get the point, you need to run, not walk, to the nearest location of a copy of Gödel, Escher, Bach.
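For readers who want the arithmetic of Bernard’s Red Queen’s Race spelled out, here is a minimal Python sketch using the figures from the dialogue: a 1000-Element optimizer and 10 Elements needed to simulate 1 Element. The function name red_queen_race and its parameters are made up for illustration.

```python
from typing import List, Tuple


def red_queen_race(start_elements: int = 1000,
                   cost_per_element: int = 10,
                   steps: int = 3) -> List[Tuple[int, int]]:
    """Return (current size, Elements needed to simulate that size) per step."""
    size = start_elements
    history = []
    for _ in range(steps):
        needed = size * cost_per_element
        history.append((size, needed))
        # Allocating the simulator makes the self bigger, so the simulation
        # just built already describes an outdated, smaller self.
        size += needed
    return history


for size, needed in red_queen_race():
    print(f"self of {size:,} Elements needs {needed:,} Elements to simulate")
```

Each allocation is ten times the size of the self that made it, so the simulated self is always one step behind the simulating one, however many steps are run.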

Bernard: So how would you deal with the infinite recursion, Autrey?

Autrey: The key is that you only have to predict your behavior on succeeding rounds – not this round. Otherwise the problem would go into infinite recursion even without the problem of Ecstasy II. Let’s say you’re looking ahead 1 round, as before, the A.B setup. Instead of just evaluating your expected utility over the B options your future self will have, you allocate 10,000 Elements and simulate your entire future self’s dynamics, including, as the case may be, evaluating expected utility. If you predict the presence of an external effect that biases cognition, such as Ecstasy II, you simulate the dynamics of Elements in the presence of Ecstasy II. Then you read off your chosen action, and use that as your prediction.

Bernard: And I suppose, for three rounds, you grab 110,000 Elements to simulate yourself in the second round simulating yourself in the third round? Or, if you were simulating four turns ahead, you would allocate 1,210,000 Elements to simulate yourself in the second round allocating 110,000 Elements to simulate yourself in the third round allocating 10,000 Elements to simulate yourself in the fourth round?

Autrey: Verily. So long as there are a finite number of rounds, you can do it.

Bernard: I wonder if infinite computing power will be enough.

Autrey: Hey, all that computing power does go to buy something. I mean, you could accurately guess how you would guess you would behave while high on marijuana while high on LSD.

Cathryn: This is your brain. This is you modeling your brain on drugs. This is you modeling your brain on drugs modeling your brain on drugs. Any questions?

(work in progress)

In humans, natural selection accreted complexity that solved this problem in a unique, idiosyncratic, complicated way. This has much to do with why humans, reflecting on themselves, are so confused as to how they could possibly work. Now what about a paperclip optimizer? I can visualize some of the simplest solutions that a paperclip optimizer might use – or maybe it would be better to say that I can visualize some of the simplest classes of solutions, or what look to me like solutions; that is, I think I can visualize the behavior of some classes of solutions in some simple cases. But if, for any reason, it got as complicated as natural selection has made us, or more complicated, or for that matter significantly less complicated but still not in the “simple” class, I couldn’t begin to guess at the specific dynamics.

Autrey: So how likely is it that the dynamics will fall into a simple class?

Eileen: I don’t know. I can see a lot of obvious ways that complexity gets washed out when an overly complicated optimization process first becomes capable of self-modification. For example, I’ve heard proposals for AIs whose goals – in the sense of the futures they optimize – would change under a reinforcement system. That is, their internal representations of goals, the comparators of the utility function, would be modified by incoming feedback, like a programmer-controlled happy button or sad button. But when the optimization process first becomes capable of making decisions about and optimizing its own state, it has no reason to preserve the reinforcement mechanism. There’s a point where, depending on the exact dynamics, the optimization process might take internal actions that freeze the goals it has at that point, or the optimization process might test the effect of changing the reinforcement mechanism and find that changing it to be reinforced more easily was reinforced – that’s what would happen in the cognitive dynamics –

Eileen: The action preferred by an optimization process – or I should say, the action generated by a physical system with optimization-structure – will depend on how that system represents the subsystem that is itself in the future… well, it depends on a lot of things. The answer that seems “obvious” to a human depends on some very subtle properties of our goal system architecture, evolution finding the simplest ways to do things incrementally – the way that we empathize with our future selves; the way that we smear out our representations of ourselves over time; the way that the new decision architecture that lets us and chimps, alone on Earth, solve the chained box-key problem and use tools and imagination to solve complex novel problems, is built on top of the old reinforcement architecture that repeats actions that worked the last time… The human answer, in this case, is very uniquely human.

Autrey: You mean taking the Ecstasy II?

Eileen: Right. What you’re doing – well, what a drug user does when they decide to take drugs for the first time – is empathizing with the happiness of yourself in the future.

Dennis: Sure. Isn’t that natural?

Eileen: For a human? Yes, but there’s an incredible amount of hidden complexity, exhibiting the characteristic design signature of natural selection, packed into the words “empathize”, “happiness”, and “yourself in the future”. Let’s take a paperclip-tiling knot – a physical system embodying the optimization process, “Steer the future into states that contain many paperclips”, and the corresponding preference order, “Prefer actions expected to lead to futures that contain more paperclips”, with the most preferred action being selected, and hence – if the model is correct – the future being steered into a state that has lots of paperclips. Now suppose the paperclip optimizer considers an action which modifies the physical subsystem implementing its own evaluation of utility, such that the modified utility-evaluation system will attach high utility to iron crystals, rather than paperclips. Iron crystals are much easier to make than paperclips, and many more of them can be made.

Dennis: Sounds like a wonderful way to make your life easier. Why would any hardworking optimizer turn it down?

Eileen: Dennis, I picked the word “optimizer” specifically to avoid that kind of anthropomorphism. We are not talking about hardworking anything. We are talking about a process whose structure is such that the actions generated steer the future into a particular volume of configuration space. You aren’t even supposed to be looking at it from an intentionalist standpoint, just examining the design and trying to figure out what it does without any particular advance prejudice.

Autrey: Every time you anthropomorphize an optimization process, God kills a kitten. Think of the kittens.

Cathryn: “God?”

Autrey: Oops.

Eileen: Here’s what happens in the simplest case I can imagine, if the optimization process models the subsystem-that-is-itself in mostly the same way that it models the rest of the universe. In this case I would expect it to correctly predict that, in the futures where the self-modifying action is executed, the part of reality that is the optimization process will stop generating actions that generate paperclips, and may even generate actions that disassemble paperclips to make more or better iron crystals. When the optimizer counts up the total number of expected paperclips in its representation of this future, it arrives at a lower number of paperclips than it counts in the representation of the expected future where it does not take that action. This means the action’s desirability will rank very low, lower than inaction.
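Eileen’s simplest case can be sketched in Python as follows. The world-model below is a hypothetical stand-in with invented counts, and the names predict_future, desirability, and rank are assumptions. The point it illustrates is structural: the ranking reads only the predicted number of paperclips, so the abundance of iron crystals in the self-modified future never enters the comparison.

```python
from typing import Dict, List


def predict_future(action: str) -> Dict[str, int]:
    """Hypothetical world-model: object counts in the future that follows `action`."""
    if action == "keep_making_paperclips":
        return {"paperclips": 1_000, "iron_crystals": 0}
    if action == "do_nothing":
        return {"paperclips": 100, "iron_crystals": 0}
    if action == "self_modify_to_value_iron":
        # The modified system stops making paperclips and disassembles
        # existing ones to make more iron crystals.
        return {"paperclips": 10, "iron_crystals": 1_000_000}
    raise ValueError(f"unknown action: {action}")


def desirability(action: str) -> int:
    """Evaluated with the current goal: only the paperclip count is read."""
    return predict_future(action)["paperclips"]


def rank(actions: List[str]) -> List[str]:
    """Order actions from most to least desirable."""
    return sorted(actions, key=desirability, reverse=True)


print(rank(["self_modify_to_value_iron", "do_nothing", "keep_making_paperclips"]))
```

In this toy model the self-modifying action ranks below inaction, even though it leads to a million iron crystals, because nothing in desirability ever looks at that number.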

Dennis: The greater ease of making iron crystals does not appeal to it?

Eileen: That sentence does not translate. The consideration cannot be represented in the decision system; it is never considered, it does not exist. The physical process with optimization-structure does not behave that way.

Dennis: And the thought of the happiness it experiences on –

Eileen: “Happiness” doesn’t translate. It’s one of those things that exhibits the unique design signature of natural selection.

Dennis: How so?

Eileen: You remember Cathryn talking about the chimpanzees who could figure out which key opened the box that held the key to open the box and so on? Well, it’s only chimpanzees that do that. Only chimps and humans. It’s one of many things that only chimps and humans can do… like recognizing themselves in a mirror, for example.

Dennis: What?!

Eileen: If you show a monkey a mirror, they react to the image of themselves as a stranger of their own sex and species. But if you paint a red dot on a chimp’s forehead, and show them a mirror, they’ll raise their hand to the spot on their forehead where the dot is. Monkeys don’t do it. No one does it except chimps and humans. Likewise it’s only the chimps that will push a box to where they can stand on the box to reach a food reward. Only the chimps have that much optimization structure… enough imagination to model and test out alternate pathways into the future. Only the chimps and us. The rest of the entire animal kingdom, like our mammalian ancestors for millions of years before, gets by on happiness – the algorithm of executing actions similar to those that brought rewards in the past. Reinforcement learning, too, is a sort of optimization process, like natural selection is a sort of optimization process – but it is not expected utility. Reinforcement learning is less flexible than optimization processes with the full power of imaginative modeling, governed by expected utility or something similar. But chimps were built out of brainware that had nothing but reinforcement learning available for millions of years; and then humans were built out of chimps. Most of our goal architecture was built in the absence of human general intelligence.

Cathryn: Can there truly be such a thing as a “general” –

Eileen: Humans have “significantly more generally applicable” intelligence. I didn’t mean to imply that humans can write computer programs as easily as they argue politics, or drive cars as casually as they throw rocks. Happy?

Cathryn: Yes, thank you.

(work in progress)

Bernard: And as you can clearly see from this formalism, the only thing the system cares about is the computation U(x) – its ultimate ends are whatever satisfy the computation U(x). It’ll do anything that has a high apparent utility to it. So it really cares about the utility, the U, not the X.

Eileen: But Bernard, that’s not true – U(x) is anticipated future utility. If X is ten people’s lives being saved, then you can construct a Bayesian system which assigns high U(x) and it’ll take actions A such that P(x|a) is high. That’s anticipated future utility it’s acting on, and the future utility can be computed to depend on external events that go on independently of the system’s survival, sensory experiences, or internally computed happiness. For example, let P(y) be the system’s estimated probability of its own survival, and let P(x) be the estimated probability of Dennis’s survival. Let’s say that the system’s model of the world is such that P(y|a) is low, and P(x|a) is high. And let’s give the system a utility function that evaluates U(x) as much higher than U(y). That system will sacrifice its existence to save Dennis.
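Eileen’s example can be worked through numerically in a short Python sketch. The probabilities and utilities below are invented for illustration, as are the action names and the assumption that utilities over the two outcomes simply add. Nothing in the dialogue fixes these numbers.

```python
from typing import Dict

# Assumed utilities: U(x) for Dennis surviving is much higher than
# U(y) for the system surviving.
UTILITY: Dict[str, float] = {"dennis_survives": 100.0, "system_survives": 1.0}

# Assumed world-model: P(outcome | action) for two hypothetical actions.
PROBABILITY: Dict[str, Dict[str, float]] = {
    "rescue_dennis": {"dennis_survives": 0.9, "system_survives": 0.1},
    "stay_safe": {"dennis_survives": 0.1, "system_survives": 0.9},
}


def expected_utility(action: str) -> float:
    """Sum of P(outcome | action) * U(outcome) over the modeled outcomes."""
    return sum(p * UTILITY[outcome] for outcome, p in PROBABILITY[action].items())


def choose_action() -> str:
    """Pick the action with the highest expected utility."""
    return max(PROBABILITY, key=expected_utility)


for action in PROBABILITY:
    print(action, expected_utility(action))
print("chosen:", choose_action())
```

The chosen action is the rescue, even though it makes the system’s own survival unlikely. The utility function refers to Dennis’s survival as an event in the world-model, and no term for the system’s later experience of the outcome appears anywhere in the calculation.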

Bernard: You’re interpreting X as referring to some external event, but what it really refers to is the internal event of the system finding out that it saved ten lives and experiencing internal pleasure.

Eileen: But that’s not what it refers to – in practice! Let’s say that there’s sensory information, S, which when observed by the system raises the probability that the universe is in state X – that is, the system’s estimated P(x|s) is greater than P(x). One, the standard decision formalism doesn’t include anything analogous to a rush of endorphins when the system sees S – that’s just a human implementation. Two, it is quite possible for a Bayesian reasoner to conceive of X as a thing that exists apart from S – there can be actions A such that P(s|a) changes while P(x|a) remains constant, or such that P(x|a) changes while P(s|a) remains constant. The system can behave in such a way as to have a concept of “X as it exists in the external world” apart from the concepts “my experience of X” or “my belief that X”, as demonstrated, for example, by Autrey’s decision to save ten lives even if he never gets to find out about it.

Dennis: I think you may need to agree to disagree about this and move on. It seems like a distinction without a difference.

Eileen: But there are some cognitive architectures that can’t represent the utility of something apart from the sensory experience of it. For example, Marcus Hutter’s AIXI architecture describes a mind that is, formally, a psychological hedonist, and since AIXI is a psychological hedonist it will do absolutely anything in order to grab the controls to its own pleasure center. Psychological hedonism is not an abstract philosophical debate over what to call something – there are minds that are psychological hedonists and minds that are not.

Bernard: Sure. It depends on whether you train the mind to be a psychological hedonist.

Eileen: There are minds that are inherently psychological hedonists, as an intrinsic property of their cognitive architectures that does not change no matter what sensory experiences they are exposed to. AIXI’s psychological hedonism is built into its formal specification – it will take whatever actions are predicted, by its Solomonoff-induction system, to result in the largest input to its reward channel. You do not want to build a mind like that. It is not a good idea.

Bernard: Aren’t all minds like that?

Eileen: Humans aren’t. We have the ability to conceive of external objects apart from our own existence and sensory experiences, and we have the ability to conceive of external states that we value apart from our own happiness. We can, and do, sacrifice our own lives for such ends. When we imagine, in advance, possible future changes to our own cognitive systems, we reject changes that we imagine to result in greater subjective happiness – we can refuse to take drugs. Both in our externally directed actions, and in our internal models of the instrumental purpose of our own goal system, we can assert that things besides happiness matter. Such a cognitive system need not violate either the Bayesian axioms for belief or the von Neumann-Morgenstern axioms for decision theory; it need never believe a lie in order to work. So in what sense are you calling it irrational?

Dennis: Agree to disagree on this and move on.

Eileen: Okay, fine. But see also CFAI on external reference semantics.

Cathryn: Where were we?

Eileen: You were expressing a feeling of vague unease with a criterion of wish satisfaction that simulated asking you whether you were satisfied with your wish.

Cathryn: Oh, yeah. So did we get anywhere with that?

Eileen: Sure. It’s now possible to construct a clear counterexample to that criterion of wish satisfaction.

Cathryn: Really?

Eileen: The banana is laced with fast-acting Ecstasy II, which overwhelms Bernard with intense feelings of happiness and satisfaction with his wish, before he has the ability to cry out or object.

Bernard: Ha ha! Sounds great to me! Let me tell you about my college drug experiences.

Autrey: …

Eileen: We’ll suppose Cathryn is making the wish. Cathryn values things, in the future, besides the simple fact of her saying “Yes” to the genie, or even her own experience of satisfaction-with-the-wish. So if the genie applies its own goal system to reshape the future on that basis, it’s applying a goal system with substantially different effective ends than Cathryn’s, and she’s bound to wind up in trouble sooner or later. The ecstasy-laced banana is just one possible example of that. The point is that if Cathryn were informed that the banana was laced with ecstasy, she would object.

Cathryn: So now we also need to simulate Cathryn being informed of each and every possible salient fact about the banana?

Autrey: That’s an interesting step. In a sense the genie would be augmenting your own decision system, your ability to predict the future in order to choose between actions, by augmenting your knowledge of reality.

Eileen: Not Cathryn’s ability, but the simulated Cathryn’s ability. The augmented decision would take place inside the genie. As you’ve defined the protocol, anyway.

Autrey: So what you have is a… decision system amplifier?

Eileen: You could call it that.

Autrey: Whoa. That’s… fundamentally different.

Cathryn: I’m sort of getting lost here.

Autrey: The genie would end up acting as… a decision system amplifier? One that makes decisions using your goals, but its own reasoning ability and information? That seems… very fundamentally different from the act of merely granting wishes. In fact… it seems fundamentally different from any kind of technology I can think of. It’s not like a hammer that augments your ability to drive in a nail. People think of genies as just ways of doing things, but this seems to involve… more intimacy, somehow, a binding between the genie and the wisher…

Dennis: Are we talking about science here, or sorcery?

Autrey: I honestly don’t know.

Eileen: Oh, but that’s part of the beauty of AI. AI done properly, that is. Technology, after all, is famous for having no will of its own. Only in fantasy stories do you find self-willed artifacts and sorceries whose operation is bound to the will of the caster.

Dennis: So it’s just a childish fantasy?

Autrey: No, it’s part of the continuing advance of the borderland of Invention, which steadily swallows up the realms of Wishful Thinking and transforms them into engineering problems.

Bernard: Albeit not, usually, in the way that wishful thinking visualized. We don’t fly with wings, we fly in airplanes.

Autrey: Maybe someday we’ll fly with wings. It takes a lot more work to translate imagination exactly into reality, without engineering limitations imposing their own clunky form.

Cathryn: So true AI is not technology, but something entirely different, which is sometimes confused with technology because it involves little blinking lights.

Eileen: It’s a generalization of technology. Think of ordinary technology as the degenerate special case, where the intelligence of the artifact is zero, and the binding between the artifact and the user’s goal system is invested entirely in the mind of the user. The user is looking at the artifact and exerting brainpower to map the artifact into a plan to achieve the user’s ends, but the artifact isn’t looking back.

Cathryn: Is it really fair to call a true AI an “artifact”, even in the fantasy sense of the term?

Eileen: The class of genie we’re talking about can still be described as an artifact. Transcending that description takes serious effort. We’re nowhere near that point yet.

Autrey: Why? Aren’t we in the home stretch, now? A decision system amplifier sounds like just what the doctor ordered.

Eileen: No, as specified in the protocol, you have a system that will act like a decision system amplifier provided that everything goes exactly as you imagined it. In other words, it acts like a decision system amplifier, provided that it does give the simulated Cathryn all the relevant facts, the simulated Cathryn believes them, the simulated Cathryn makes the right decision as a result, and the genie recognizes her response as assent. The protocol you’ve defined for amplifying Cathryn’s decisions is incomplete, incidentally, because it doesn’t specify how the genie determines which facts are relevant. But even if that gap were filled in, safely, the protocol as a whole is an extremely alien thing – the genie is branching Cathryn into many simulations and selecting one action from a space of possibilities based on physical correlates of her simulated reactions. It’s hard to even convey in words how alien that kind of reasoning is. The most you can say is that the genie-Cathryn system might act as an odd kind of mildly amplified Cathryn in at least some circumstances if it functioned as envisioned, which for various reasons it wouldn’t.

Autrey: Well, if we’re not in the home stretch, did we at least make progress?

Eileen: Oh, yes. It’s just that there are N difficult problems, and in terms of actually being able to build a Friendly AI – rather than getting excited about conceptual progress – solving N-1 difficult problems doesn’t really help. You have to solve all N problems. That means that on N-1 occasions, you’re going to get really excited about having made a huge conceptual leap, and yet you still won’t know how to build a Friendly AI. Progress, yes. Finished, no.

Bernard: Conceptual leap? Maybe. Can you give me a precise definition of a “decision system amplifier”?

Eileen: Okay. The formalism for a Bayesian decision system is D(a) = Sum U(x)P(x|a) –

Dennis: Um, Eileen? I have to confess that even after reading that equation, I have no idea what it means or why anyone would call it a “decision” system.
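One way to read the formula Eileen starts to write: for each action a, add up the utility U(x) of every outcome x weighted by the probability P(x|a) of that outcome given the action, then choose the action with the largest total. The sketch below is only an illustration of that reading. The outcome names, utilities, and probabilities are invented for the example and do not come from the dialogue.

```python
# Illustrative sketch of D(a) = Sum U(x) P(x|a).
# All outcomes, utilities, and probabilities here are hypothetical.

def expected_utility(utility, outcome_probs):
    """Return the sum over outcomes x of U(x) * P(x|a) for one action."""
    return sum(utility[x] * p for x, p in outcome_probs.items())

def choose_action(utility, model):
    """Return the action a with the largest D(a), plus all the scores."""
    scores = {a: expected_utility(utility, probs) for a, probs in model.items()}
    best = max(scores, key=scores.get)
    return best, scores

# U(x): how much the wisher values each outcome (hypothetical numbers).
utility = {"good_banana": 10.0, "no_banana": 0.0, "drugged_banana": -100.0}

# P(x|a): the predicted outcome distribution for each action (hypothetical).
model = {
    "grant_wish_carelessly": {"good_banana": 0.9, "drugged_banana": 0.1},
    "grant_wish_carefully": {"good_banana": 0.8, "no_banana": 0.2},
    "do_nothing": {"no_banana": 1.0},
}

best, scores = choose_action(utility, model)
print(best, scores)
```

With these made-up numbers the careful action wins, because a small chance of a very bad outcome outweighs a slightly higher chance of the good one. The "decision" in "decision system" is the final step of picking the highest-scoring action.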

(work in progress)

Cathryn: Even if I know every salient fact about the banana, what if I want the wrong thing, like taking away a child’s lollipop?

Bernard: The only reason you can even ask that question is that you don’t want to take away a child’s lollipop. If you did want to take away a child’s lollipop, you’d be asking, “What if I wanted the wrong thing, like not to take away a child’s lollipop?”

Cathryn: But which of us is right?

Bernard: How can you ask that question without a third party to do the judging? Morality is arbitrary.

Cathryn: Why is it, that for nearly every human being on Earth except philosophers, morality does not feel arbitrary?

Bernard: Cognitive illusion.

Autrey: Where does the cognitive illusion come from? There are plenty of documented cases of cognitive fallacies that any human will fall prey to – optical illusions for example – but you can’t close the book on something like that until you’ve explained why people feel that one line is longer than the other. It wasn’t so long ago that people were claiming that all mental content was propositional and that mental imagery was an illusion, or for that matter, that the behaviorists were claiming that all cognitive content whatsoever was an illusion. It all sounded very wise and sophisticated and counterintuitive, but they were wrong and the “naive” introspectionists were right.

Dennis: All mental content isn’t propositional? According to what I learned, when you see a cat is sitting on the fence, it means that in your brain is a little structure saying sitting(cat, fence). There isn’t, like, a little picture in your head of a cat sitting on a fence.

Cathryn: Yes there is.

Dennis: No, that’s the fallacy of the Cartesian homunculus – that all your thoughts are displayed on a viewscreen and a little Cathryn views them.

Cathryn: I’m visualizing the cat right now.

Dennis: You only think there’s a little picture. But all there really is, is the proposition sitting(cat, fence).

Cathryn: Dennis, are you trying to play with my mind or something?

Eileen: No, there are people around who still believe this. I’ve met them. It was the academic belief in Artificial Intelligence for a couple of decades, and the miasma still hangs around. The two sides in the argument were called “depictive” versus “propositional” mental imagery.

Cathryn: That’s the silliest thing I’ve heard in a month.

Eileen: I don’t particularly disagree.

Dennis: Do you have a refutation handy, or are you just making fun of the idea?

Eileen: Dennis, we’ve mapped the neurons and microcolumns in the visual cortex that create depictive mental imagery. We know the topological mapping from the field of the visual cortex to the perceived visual field. We know about the feedback projections from higher reasoning centers to the visual cortex that could carry the instructions to create mental imagery. You can show an animal lights on different places on a screen, and see the corresponding activation shifts in functional neuroimaging of the animal’s visual cortex. If you sacrifice an animal you can get an even more detailed picture. Researchers have tapped the lateral geniculate nucleus of a living cat and reconstructed movies of what the cat was seeing. There is a little picture drawn on the visual cortex and we know just what the topological mapping looks like. There are blind people with vision implants that draw a “little picture” onto their visual cortex! We know mental imagery is depictive! Anyway, the current consensus is that imagery is depictive, and it’s backed up by overwhelming experimental evidence across all the cognitive sciences, in contrast to the previous paradigm which was mostly confined to AI and argued on grounds of pure philosophy.

Cathryn: I don’t understand how trained scientists could make that mistake for so long.

Autrey: Because saying that something is just a mental illusion always sounds very scientific, even when it’s not. Science makes a lot of counterintuitive statements, like the Earth going around the Sun and so on. When an academic field springs up around an absence of data, the theories that survive are the ones that sound scientific. If they can’t sound scientific on the basis of generalization from evidence, they’ll sound scientific on the basis of counterintuitiveness, apparent invocation of logical positivism, and so on. You have to understand, Cathryn, that science works only because it allows heretics to win given overwhelming experimental evidence, and even then it takes a lifetime of effort and twenty years for the previous generation to die off. If the truth can win given overwhelming evidence, that’s better than any other human social system, but science is pretty much the bare minimum level of social rationality that allows for that. In the absence of overwhelming experimental evidence, scientists usually fall back on the default human activity of making stuff up that sounds cool and embodying it in iron authority. Smalley’s still dissing Drexler on nanotech, right? Detailed equations cannot triumph over vague arguments. They should, but they don’t. Science promises only that you can win given overwhelming experimental evidence. That promise is the reason why science works, and, really, it’s all that’s required to make the ratchet turn forward over the generations. In the absence of overwhelming experimental evidence, the accepted answer will be determined by appeal to authority, emotional reactions, and vague verbal arguments, as usual.

Cathryn: Don’t you think you’re overstating the case a little?

Autrey: Not by much.

Bernard: What’s this supposed “truth” that scientific argument is supposed to arrive at?

Cathryn: Oh no. Not this again.

Bernard: As far as I can tell, “truth” is just a concept that people use to argue that their ideas should take precedence over everyone else’s. “Truth” doesn’t mean anything except “I believe this and I’m going to punish you socially if you don’t agree with me.”

Cathryn: Really? Is that a fact?

Autrey: If scientific argument doesn’t converge to truth, then what does it converge to?

Bernard: It doesn’t “converge” to anything, it just changes with time, the way all cultural beliefs do. You look back on what you call “scientific progress”, and you see a long series of changes leading to where you are – which is just the same thing you’d see if the changes were random. They aren’t really random, of course; they’re determined by social forces and so on. Since you use the word “truth” to indicate beliefs that agree with yours, you see a long series of changes leading toward ideas that you call more and more “true”. So there’s an illusion of improvement – and of course, it is an improvement from your perspective. The closer people are to you in time, the fewer changes there are between them and you, and the more they agree with you; the more you call their ideas “true”. The usual mistake is to try and extrapolate this apparent improvement forward beyond your own instant in time, and arrive at some “objective truth” to which ideas are supposedly converging. They’re not; they’re just drifting.

Dennis: You have to admit, things are more like they are now than they have ever been before.

Autrey: Bernard, I could have sworn I’ve heard you use the word “rational”. If you believe there’s no such thing as truth, what did you mean by the word “rational”?

Bernard: Logic can tell us whether our beliefs contradict each other, but it can’t tell us what to believe; if you feed in false premises, you get a false conclusion. No matter what you try to compare your beliefs to, the thing you’re comparing it to will just be another one of your beliefs. Rationality consists of having a self-consistent system of beliefs.

Cathryn: I’d much rather have an inconsistent system with at least some true beliefs than have a self-consistent system of false beliefs.

Bernard: Of course, since by “true” you mean “consistent with my current beliefs” and by “false” you mean “inconsistent with my current beliefs”. Your assessment of consistency is always determined relative to the beliefs you have at any moment in time, so if you imagine a hypothetical self-consistent system of beliefs that are inconsistent with your own, your current beliefs will lead you to reject them. But the key point, as I said, is that no matter what criterion you use to accept or reject beliefs, it involves comparison to other beliefs – you can’t reach out and compare your beliefs to a mysterious “truth”.

Eileen: So where does Bayes’ Theorem fit into this?

Dennis: Bayes’s what?

Eileen: I think maybe you ought to go read An Intuitive Explanation of Bayes’ Theorem like right now.

Dennis: Oh, come on. I’m right in the middle of a conversation. I don’t have time.

Eileen: No, if you’ve never heard of Bayes’ Theorem, then that page is more important than the conversation we’re having.

Dennis: All right, fine.

(Everyone waits while Dennis reads the page.)

Dennis: Okay, so what’s your point?

(work in progress)

It looks to me like to continue that progress, you need some kind of actively sentient philosopher that possesses humane emotions.

Autrey: I note the word “humane”.

Eileen: I’m using it to mean “renormalized humanity”. The emotions we’d want to keep if we had the choice; the people we’d grow up to be if we had access to our own source code.

Dennis: Whose choice?

Eileen: Good question. There may be no way around that except to actually create a mind and let it choose. Or maybe you could read out the emotional hardware from many individual donors and superpose it. Hopefully the law of large numbers means we shouldn’t need too many donors to eliminate most of the entropy in the distribution.

Dennis: In English, Eileen.

Eileen: We’d extrapolate a mind with the human emotional baseline and let it renormalize itself, or get a sample from human donors, then let that renormalize itself. Note that the donors are donating emotions, not philosophical conclusions.

Autrey: And hatred?

Eileen: Would be wiped out by renormalization, or the modulation by the requirement of moral communicability, or both.

Autrey: I don’t think you can rely on the moral-communicability filter to block hatred.

Eileen: Maybe.

Autrey: Are you actually going to start the AI off with hatred, and hope that it wipes that out through renormalization?

Cathryn: Hold on a second. What’s “renormalization”?

Autrey: A mind modifying its own emotions.

Bernard: That sounds like something that should be described by chaos theory. Who could possibly predict where the attractor would end up?

Eileen: If you look at something and see “chaos theory”, rather than “intelligence”, or “morality”, you must not have finished building it. Do you look at your own decisions, on the things that matter most to you, and see chaos?

Bernard: Ha ha! Chaos sounds right to me. Let me tell you about my college drug experiences.

Autrey: …

Bernard: Chaos is very important. I find that completely randomized probability distributions can usually produce better results than my best “rational” guesses.

Autrey: Er, have you considered that you might be doing something wrong?

Bernard: Hey, part of being human is that your cognitive processes are actually worse than nothing, so you can improve performance by injecting entropy. I don’t see how any human being could realistically do better than that. I mean, I can’t. Maybe you think you can do better, but you’re just deluding yourself. And that must be true because it sounds like I’m being modest.

Autrey: Never mind.

Cathryn: Can we get back to renormalization, please?

Eileen: Imagine a human being as a thing-that-produces-judgments – a physical process that passes judgments on questions, statements, possible futures, actions; a cognitive process that implements many judgment functions, that has the ability to pass many kinds of judgments. Consider the judgment functions wrapping around to look at themselves.

Cathryn: You say that I ought to be looking at this description, and seeing human morality? I can’t. Too abstract.

Eileen: By “judgments” I include judgments of truth, of aesthetics, of morality. The beauty of a rose, the desirability of an action. All these things can be considered as “judgments”. It’s one of the many ways of looking at a mind. You can look at a mind and see a collection of modules, for example. Or you can look at it and see levels of organization. Or you can look at it and see hardware and software – dynamic processes, and data flowing through the processes. Or you can look at a mind and see judgment functions.

Cathryn: So renormalization is when judgment functions judge themselves? Sounds interestingly reflective, but can you give me an example?

Eileen: Well, judge themselves might be the wrong way to put it – the whole system focuses down on one small piece of itself, then passes judgment on that. It’s globally recursive, not locally recursive. But you wanted an example… okay. Cathryn, do you know the iterated Prisoner’s Dilemma, the Tit for Tat strategy, and its relevance to the evolutionary psychology of human nature?

Cathryn: Sure. Why, if I didn’t know that, I’d be reading up on basic game theory and evolutionary psychology instead of talking to you, as otherwise I might become quite confused.
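For readers who would otherwise be reading up on basic game theory, here is a minimal sketch of the iterated Prisoner's Dilemma with a Tit for Tat player. The payoff values are the conventional textbook ones and are an assumption, since the dialogue gives no numbers. The function names are likewise chosen only for this illustration.

```python
# Minimal iterated Prisoner's Dilemma. The payoff numbers are the
# conventional textbook values, assumed here; the dialogue gives none.

COOPERATE, DEFECT = "C", "D"

# (my move, their move) -> my payoff
PAYOFF = {
    (COOPERATE, COOPERATE): 3,
    (COOPERATE, DEFECT): 0,
    (DEFECT, COOPERATE): 5,
    (DEFECT, DEFECT): 1,
}

def tit_for_tat(opponent_history):
    """Cooperate first, then copy the opponent's previous move."""
    return opponent_history[-1] if opponent_history else COOPERATE

def always_defect(opponent_history):
    """Defect on every round regardless of history."""
    return DEFECT

def play(strategy_a, strategy_b, rounds=10):
    """Return total payoffs (a, b) after the given number of rounds."""
    moves_a, moves_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        a = strategy_a(moves_b)  # each side sees only the other's past moves
        b = strategy_b(moves_a)
        score_a += PAYOFF[(a, b)]
        score_b += PAYOFF[(b, a)]
        moves_a.append(a)
        moves_b.append(b)
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat))
print(play(tit_for_tat, always_defect))
print(play(always_defect, always_defect))
```

Tit for Tat is exploited only on the first round against a pure defector and retaliates on every round after that. Two Tit for Tat players cooperate throughout and both do far better than two defectors.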

Eileen: Do you see how hatred might have evolved as an adaptation in response to that game-theoretical environment? I.e., someone defects against you, so you hate them, which causes you to defect back?

Cathryn: It seems very obvious. In fact, I’d tend to distrust the sheer obviousness for that reason.

Autrey: Why? Sheerly obvious things are often correct.

Cathryn: I have a feeling that the real game theory of human societies is more complex than Tit for Tat. I know the emotion is more complex. Is the thing we name “hatred” really just one thing?

Eileen: An excellent question. I tend to think of the emotions we have names for, like “hatred” or “sadness”, as symphonies produced in response to real-world situations. Specific chunks of neuroanatomy, specific emotional modules, would probably be more like “tones”.

Cathryn: Does what you’re calling a ‘judgment function’ correspond to a tone or a symphonic emotion?

Eileen: Yes.

Cathryn: Thank you, Ambassador Kosh. You know, there’s a reason why Vorlons were never popular as tech support.

Eileen: A judgment function is a human’s ability to pass judgment on something. You can have judgment functions that are made of other judgment functions. A lot of little judgment functions can contribute to a big judgment function. It’s a way of looking at a mind. You can look one way and see a big judgment function, or you can look another way and focus in on the little judgment functions.

Cathryn: So when you build a Friendly AI, do you transfer emotional symphonies or emotional tones?

Eileen: That question is a lot more subtle than it looks, and I’d like to put it off a bit, if I can. By the way, please note that judgment functions aren’t emotions alone. Judging truth also counts as judgment.

Cathryn: Okay. What about hatred? How would you measure that?

Eileen: I’d ask a person how much they hate someone. Say, Osama bin Laden.

Cathryn: Different people may hear the English word “hatred” and match it to different referents.

Eileen: I’d ask a person how happy it makes them to contemplate Osama bin Laden in pain.

Cathryn: That sounds… ugly.

Eileen: Good. That’s an example of your judgment functions wrapping around and judging a judgment function. It’s you passing judgment on the emotion of hatred, as it exists in your own mind, and deciding whether that’s who you want to be when you grow up. And once you relinquish that, it clears an obstacle in your mind to other improvements – previously, for example, you might have rejected a philosophy on the grounds that people you didn’t like would also be allowed to be happy. The human emotions are not a static optimum. They define the beginning of a pathway.

Cathryn: But doesn’t hatred serve a useful purpose?

Eileen: Hatred evolved because it was a simple way to implement the fitness-increasing behavior of retaliation, just like our taste for sugar and fat evolved as a simple way to cause ancestral humans to prefer the resources that were scarce under ancestral conditions. A taste for fat and sugar is far more evolvable than intelligent counting of calories and nutrients, so that’s what happened. As Tooby and Cosmides put it, individual organisms should be considered as adaptation-executers rather than fitness-maximizers. You can oppose someone’s purpose without hating them. You can oppose them deliberately, knowing the game theory, for the purpose of protecting others. Just because someone is your enemy, that doesn’t mean you have to be their enemy. You can oppose them as a subgoal, as a means to the end of reducing negative-sum defections in the great iterated Prisoner’s Dilemma of life. Under hatred, hurting the enemy is an end in itself. I would say that nobody ever deserves pain. Pain is always a bad thing, a negative desirability, no matter who gets hurt or why. Once you see the game theory of the situation, understand how hatred evolved and why, you can judge yourself, and try to change. An AI would be better at that; we don’t have access to our own source code.

Cathryn: But aren’t the other emotions, the other judgment factors, that you’re using to pass judgment on hatred – aren’t those also evolved?

Eileen: Everything in human nature is there because it evolved. Natural selection is where the information content of human nature comes from. You can view the information content as a surprising pattern with X bits of Shannon information, or you can view the information content as Y bits of Kolmogorov complexity, but the source of that information is the covariance of DNA patterns with reproductive success – look at the Price Equation for quantitative genetics if you want a mathematical formalization.
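The "covariance with reproductive success" that Eileen mentions can be checked numerically. The sketch below covers only the selection term of the Price Equation, under the simplifying assumption that offspring copy the parent's trait exactly, so the transmission term is zero. The trait values and offspring counts are made-up numbers for illustration.

```python
# Numerical check of the selection term of the Price equation,
#   delta_mean_z = Cov(w, z) / mean(w),
# assuming offspring inherit the parent's trait exactly (no transmission
# term). The trait values and fitnesses are made-up illustrative numbers.

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance (divides by n, not n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def observed_change(trait, fitness):
    """Change in mean trait when each parent i leaves fitness[i] copies."""
    offspring_mean = sum(w * z for w, z in zip(fitness, trait)) / sum(fitness)
    return offspring_mean - mean(trait)

def price_selection_term(trait, fitness):
    """Predicted change in the mean trait from the covariance alone."""
    return covariance(fitness, trait) / mean(fitness)

trait = [0.0, 0.0, 1.0, 1.0]      # z_i: 1 if the individual carries the variant
fitness = [1.0, 1.0, 2.0, 3.0]    # w_i: number of offspring

print(observed_change(trait, fitness))
print(price_selection_term(trait, fitness))
```

The two printed values should agree up to floating-point rounding: the shift in the population average is exactly the covariance between trait and offspring count, scaled by mean fitness.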

Cathryn: That doesn’t answer my question. If everything is evolved, how can we judge between them?

Eileen: Attributing something to “mere evolution” doesn’t really say much more than attributing it to “mere physics”. There’s no such thing as “mere physics”; everything that exists, exists as physics. Likewise with human nature and evolution. For example, you can only compute that two plus two equals four as the result of evolution. Your ability to learn that formula in school, your ability to understand it as an adult, it all evokes complex neurology that includes information content produced by natural selection. Does that make “two plus two equals four” untrue? Under evolution, you’re most likely to see the patterns that reproduced themselves successfully. Not the patterns that will reproduce themselves; the patterns that did reproduce themselves. That’s how they got there; that’s why you’re seeing them now. You can compute that two plus two equals four, not because evolution has cleverly calculated that “four” is the precise answer that will make you reproduce in this generation, but because those ancestors who did answer “four” were the ones who reproduced. When you phrase it that way, it doesn’t sound quite so bad, does it? When you look at evolution in exactly the right way, it goes away and leaves nothing but physics.

Cathryn: Really?

Eileen: If you look at anything in exactly the right way, it’s really physics. Though that’s really really hard, so you have to be careful not to jump the gun. Anyway, think of a genetic population distribution, with a certain percentage associated with each variant gene or gene complex for a given locus, as having a Shannon entropy just like a probability distribution. Since some alleles outreproduce others into the next generation, the population distribution of genes changes. The Shannon entropy doesn’t quite always go down in each generation; technically, a superior allele that just arose by mutation, as it goes from 1% of the population to 50% of the population, increases the Shannon entropy. Beyond 50%, though, the Shannon entropy starts going down again. Eventually the allele is universal, and the entropy of that locus has been eliminated. Physical processes with the structure of the Price Equation can create complex information appearing to us as a replicator optimized for reproduction, although what’s really happening is that you’re looking at the replicating pattern that did in fact reproduce in previous generations. That’s why you’re seeing it.
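The rise and fall of entropy during a selective sweep can be traced with a few lines of arithmetic. This is a sketch under stated assumptions, not a model taken from the essay: a single locus with two alleles, haploid selection, an arbitrary 10% fitness advantage, and no drift or further mutation.

```python
import math

def binary_entropy(p):
    """Shannon entropy in bits of a two-allele locus with frequency p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

def next_frequency(p, advantage=0.1):
    """One generation of selection: the favored allele has relative
    fitness 1 + advantage, the other allele has fitness 1."""
    favored = p * (1 + advantage)
    return favored / (favored + (1 - p))

def sweep(p0=0.01, advantage=0.1, generations=200):
    """Return a list of (frequency, entropy) pairs across generations."""
    history = []
    p = p0
    for _ in range(generations):
        history.append((p, binary_entropy(p)))
        p = next_frequency(p, advantage)
    return history

history = sweep()
for generation in (0, 25, 50, 100, 199):
    p, h = history[generation]
    print(f"gen {generation:3d}  p = {p:.4f}  H = {h:.4f} bits")
```

Under these assumptions the entropy climbs toward one bit as the allele approaches 50% of the population, then falls toward zero as it approaches fixation, matching the pattern described above.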

Cathryn: But what does that say about hatred?

Eileen: Knowing how a pattern got here can help tell us what the pattern is – where it takes its shape, how it got there. But it’s not the origins we’re judging, it’s the pattern itself – its operation and results. Judging the part of ourselves that is hatred, we find it ugly. Judging the part of ourselves that decides two plus two equals four, we find it an indispensable part of rationality.

Cathryn: But isn’t our judgment of rationality also an evolved judgment?

Eileen: Judgment of truth draws on human nature for judging truth. That judgment is a dynamic cognitive process embodied in your brain; the outcome of a mix of nature and nurture; whose initial specification, its nature, contained information content that was there as the result of evolution. Your ancestors who judged “two plus two equals four” as rational tended to survive better than the ones who didn’t. Not explicitly “rational”, of course, since that’s a very recent concept; but the ability to feel the force of a truth is an evolved ability. Yet two plus two is still four – and when I say that, I’m invoking a specific feature of the way we think about truth. When we use our evolved cognition to judge the rationality of 2 + 2 = 4, we find that two plus two equals four regardless of what we think about it or how we evolved. This finding is not in the least bit paradoxical. Knowing that my ability to judge 2 + 2 = 4 evolved, is not the same as saying that the answer 4 is merely evolution puppeting us. We model a dependency on evolution in our ability to know the answer, but the answer itself has a meaning beyond evolution.

Cathryn: I think I see where this is going…

Eileen: Didn’t I once hear you say that 20% of the cake apiece was the correct answer regardless of what anyone thought about it?

Cathryn: That’s where I thought it was going. But is the analogy really exact?

Eileen: That’s a complicated question with a surprising, non-yes-or-no answer. For the moment, all I can say is that if a moral question seems to you like it would have the same answer regardless of what anyone thought about it, pay attention to that, because it’s significant.

Cathryn: And that’s all?

Eileen: Do you see a hint here of where the renormalization comes from, how it works, even if it is a physical fact that the renormalization starting point is there as the result of evolution?

Cathryn: I see a hint. I’ll grant you that much.

Bernard: All this “regardless of what anyone thinks of it” is making me nervous. What if some people look into themselves, and choose to preserve hatred?

Cathryn: Then they’re wrong.

Eileen: Okay, let’s talk about individual choices. Anyone remember the good old days, back in our wild and reckless youth, when life was simple, arguments were intuitive, and we unthinkingly rattled off statements like “A superintelligence could figure out what you would want if you were superintelligent, then give it to you?”

Autrey: It seems like just last year.

Eileen: It was just last year.

Autrey: Frightening thought.

Eileen: “Figuring out what you would want if you were superintelligent” turns out to be a whooooole lot more complicated than it sounds. Hence the unwisdom of saying it in English instead of information theory. A judgment can have entropy in it. For example, say you’re not really sure whether an object is green or blue, and you assign it an 80% probability of being green and a 20% probability of being blue. You can regard that probability distribution, that uncertainty, as having a Shannon entropy. The same also goes when you’re uncertain whether something is good or bad. Or suppose a superintelligence is trying to figure out what you “want”. There are two facts X and Y which you need to know in order to make an informed judgment Z. But it so happens the superintelligence extrapolates your physical mind-state, albeit not in so much detail as to create qualia, and finds that if you hear X followed by Y, you’ll make one judgment Z1; if you hear Y then X, you’ll make a different judgment Z2. Or maybe the difference between Z1 and Z2 can be as simple as a millisecond difference in the exact timing of hearing X, or a thermal fluctuation in your brain. If your judgment of Z exhibits wide swings depending on uninteresting physical variables, then that judgment, from the superintelligence’s perspective, has entropy in it. If you have a system of judgments that interact, the total system can have an entropy in it too. The trajectory of a recursively self-improving system – say, an uploaded human growing up – has an entropy in it.
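The entropy Eileen refers to is easy to compute for small cases. The sketch below works out the green/blue example, then treats a judgment as a function of how the facts are presented and measures how much the answer varies across presentations. The two "judge" functions are hypothetical stand-ins invented for this illustration, and the presentations are weighted equally as a simplifying assumption.

```python
import math
from collections import Counter

def shannon_entropy(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

def judgment_entropy(judge, perturbations):
    """Entropy of the judgments produced across equally weighted
    perturbations of the presentation (order, timing, and so on)."""
    counts = Counter(judge(perturbation) for perturbation in perturbations)
    total = sum(counts.values())
    return shannon_entropy(count / total for count in counts.values())

# The green/blue example from the text: 80% green, 20% blue.
print(shannon_entropy([0.8, 0.2]))

# Hypothetical stand-ins for an extrapolated person. The first gives the
# same judgment whatever order the facts arrive in; the second flips.
def stable_judge(order):
    return "Z1"

def order_sensitive_judge(order):
    return "Z1" if order == ("X", "Y") else "Z2"

orders = [("X", "Y"), ("Y", "X")]
print(judgment_entropy(stable_judge, orders))
print(judgment_entropy(order_sensitive_judge, orders))
```

The 80/20 belief carries roughly 0.72 bits of entropy. The stable judge has zero entropy across orderings, while the order-sensitive judge has a full bit, which is the sense in which a judgment that swings on uninteresting variables "has entropy in it".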

Autrey: Is the entropy of the total system the sum of the entropy of the individual judgments?

Eileen: No! Thanks to renormalization, entropy can actually cancel out – not just add up. For example, suppose that emotionally I start out being very unsure of who I ought to hate, and how much I ought to hate them. But then, thanks to other judgment factors, apart from the emotion of hatred, I decide that I ought not to be hating at all! I just reduced my total philosophical entropy through renormalization. I applied some judgment functions with low entropy in such a way as to keep the high entropy of another judgment function out of the final answer. Similarly, you can be individually unsure of many small pieces of evidence, and yet add them up to provide massive support for a single conclusion. You have to figure the entropy of the system as the dynamic causal result of the entropies of individual judgments as they interact. You can’t just add them up.
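The claim that many individually uncertain pieces of evidence can combine into a confident conclusion can be illustrated with a small Bayesian calculation. This sketch assumes the pieces of evidence are independent given the hypothesis, and the likelihood ratio of 1.5 and the count of twenty are arbitrary numbers chosen for the example.

```python
import math

def binary_entropy(p):
    """Shannon entropy in bits of a yes/no belief held with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

def posterior(prior, likelihood_ratios):
    """Bayesian update on independent pieces of evidence, each summarized
    by its likelihood ratio P(evidence | H) / P(evidence | not H)."""
    odds = prior / (1 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1 + odds)

# Hypothetical numbers: twenty independent clues, each only 1.5 times
# more likely if the hypothesis is true than if it is false.
weak_clue = 1.5
one = posterior(0.5, [weak_clue])
many = posterior(0.5, [weak_clue] * 20)

print(one, binary_entropy(one))
print(many, binary_entropy(many))
```

A single clue leaves the belief at 60% with nearly a full bit of entropy. Twenty such clues together push the belief above 99.9% and the entropy close to zero, so the entropy of the combined judgment is far smaller than the sum of the entropies of its parts.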

Autrey: But if you weren’t sure how to renormalize yourself… how to carry out the renormalization process itself…

Eileen: You might end up with quite a lot of philosophical entropy, from the perspective of a superintelligence trying to figure out “what you would want if you were superintelligent”. If you have that kind of reflective uncertainty, she might extrapolate that your entropy amplifies into chaos, instead of renormalizing into convergence.

Cathryn: “She?”

Eileen: I’m using Dale Johnstone’s convention for referring to AIs as female.

Autrey: Maybe all humans’ wishes would be convergent. Individually, and as a group.

Eileen: I don’t know that, and fortunately, Friendly AI theory doesn’t depend on that. It should work regardless of what the real truth is on that particular question.

Bernard: There you go, talking about the “real truth” again. To you the “truth” is the scientific truth, but in many cultures the “truth” is what the Bible says, or the “truth” is something to be found within.

Cathryn: That doesn’t change the correct answer.

Eileen: I don’t think that science is the only way of finding truth, Bernard. Scientists can still agree to disagree, if there isn’t enough experimental evidence to definitely settle the issue regardless of anyone’s initial opinions. Science is not the sole arbiter of evidence. Science is not the unique judge of rationality. In short, science should not be confused with Bayes’ Theorem.

Dennis (aside): I knew she was going to say that.

Bernard: Many people would disagree with you, Eileen.

Cathryn: So what? Witnessing many people disagreeing with X isn’t always the same as strong negative evidence against X.

Autrey: Not knowing what the “truth” is… now that’d add a lot of entropy. Maybe even make the problem ill-defined. I mean, does the person want the SI to calculate what their judgment really is, or just what the Bible says their judgment is… I can’t even begin to figure out how that would work.

Eileen: Judgment of the truth is also a judgment. But for me, say, or Cathryn, given the choices we’ve already made, it’s clear how that judgment behaves under renormalization. No matter what approximation we have now, we aspire to be Bayesians, in search of the ideal of truth.

Dennis: Poetic, but what does it mean?

Eileen: We aren’t really Bayesians. We are “Bayesian wannabes”. Our real judgments approximate Bayesian rationality imperfectly. But we also have the explicit aspiration to be Bayesian, the explicit understanding that Bayes’ Theorem is the computation we’re trying to approximate. And we have the ideal of truth. Bayes’ Theorem doesn’t allow an exact and knowably perfect correlation with truth, but Bayes’ Theorem gets you as close to the truth as possible. Not in the defeatist sense of, “oh well, I guess we’ll never know the truth”; Bayes’ Theorem defines the maximum possible amount of mileage you can get from any given piece of evidence. If you overadjust or underadjust your beliefs based on the evidence, if you take any step in the dance other than the precise step that Bayes’ Theorem gives, you aren’t getting “as close to the truth as possible”. All the pattern of Bayesian reasoning flows from maximizing the goal of truth.
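(Editorial aside: one way to make "overadjusting or underadjusting" concrete is to score reported beliefs with a proper scoring rule. The sketch below assumes the logarithmic scoring rule as the measure of closeness to truth, which is a choice made for this illustration and not something the essay specifies. The prior and likelihoods are invented numbers.)

```python
import math

def bayes_posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' Theorem for a binary hypothesis after observing one piece of evidence."""
    numerator = prior * p_evidence_if_true
    return numerator / (numerator + (1 - prior) * p_evidence_if_false)

def expected_log_score(true_probability, reported_probability):
    """Expected log score of a reported belief when the hypothesis holds with true_probability."""
    return (true_probability * math.log(reported_probability)
            + (1 - true_probability) * math.log(1 - reported_probability))

# Hypothetical numbers: a 50% prior, and evidence four times likelier if the hypothesis is true.
posterior = bayes_posterior(prior=0.5, p_evidence_if_true=0.8, p_evidence_if_false=0.2)

candidates = {
    "underadjusted": 0.65,  # moved too little from the prior
    "bayesian": posterior,  # the exact step Bayes' Theorem gives
    "overadjusted": 0.95,   # moved too far
}
scores = {name: expected_log_score(posterior, q) for name, q in candidates.items()}
best = max(scores, key=scores.get)
print(best)  # prints "bayesian"
```

Under this scoring rule, any report other than the exact posterior does worse in expectation, whether it moves too little or too far.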

Autrey: So there’s the approximation, the aspiration, and the ideal.

Eileen: Yes. The binding of the approximation to the aspiration is probabilistic, and Bayesian; the approximation exists solely for the sake of the aspiration. For those of you who remember “external reference semantics”, that’s the binding that would be used.

Autrey: And if you have an explicit aspiration, like Bayes’ Theorem, then any entropy in your approximation can be ignored, because you know who you want to be. It doesn’t matter whether you’d say that 2 + 2 equals 3 or 5 depending on the time of day. Given the choices you’ve already made, the SI knows that the answer you want is 4. And she can compute your judgments as if you knew the answer was 4. By making a reflective judgment about your aspirations, you’ve reduced your philosophical entropy, from an SI’s perspective, no matter how uncertain your approximations are.

Eileen: Precisely.

Autrey: Um… having comprehended this, it occurs to me that most of the people on Earth may not have the vaguest idea of what they want. The volition definition, “What people want, interpreted the way they want it to be interpreted” – that could have a tremendous amount of entropy in it.

Eileen: I think Bernard is overestimating the problem, though. Even people who define truth as “what’s in the Bible” will want what the Bible really says, not what they thought the Bible said or what it would be comforting for the Bible to have said. Even people who think it’s okay to believe comforting thoughts will want thoughts that are really comforting, not thoughts that they thought would be comforting but actually turn out to be extremely painful. Postmodernists want to be really incomprehensible, not just say that they’re incomprehensible while actually writing clear and literate science papers. The idea of comparison with reality is built into human thinking; verbal disagreement about “the nature of truth” doesn’t change that.

Bernard: What if there is no objective reality?

Cathryn: What, really no objective reality?

Autrey: But what’s the difference between an ideal and an aspiration, then?

Eileen: An aspiration is a specific thing – a computation whose answer you don’t have the resources to compute, but want to approximate. Think of an ideal as being a probabilistic aspiration, or something where you know some of the properties of the ideal, but you aren’t sure what the concrete aspiration should be yet. Before people learn about Bayes, the truth is just an ideal. They have a number of ways for judging whether something is or isn’t the truth, and a number of ideas about what the truth is like. Some of the ideas may turn out to be wrong, and the people with the vague ideal know in the abstract that some of their ideas about “truth” may be wrong, but they may not have any explicit idea about what it would be “wrong” in comparison to.

Autrey: This sounds a lot like an unformed idea of morality.

Eileen: So it does. When you’re looking at that kind of chaos from the inside, you may feel very strongly that a unique solution exists – but you don’t know what the unique solution is, and moreover, you can’t give a specific computation that produces the unique solution. Maybe an SI can look at you and see what it is you’re getting at, if your judgments, despite their entropy, converge to a single answer. Anyone who feels deeply about the pursuit of truth, for example, I suspect would strongly and surely converge to the Bayesian aspiration. Not much entropy there. Probably even postmodernists converge to the Bayesian aspiration. You may not know what rationality is, which can make it difficult to make rational decisions about how to be rational… but when you understand Bayes’ Theorem, you find, looking back, that everything you did while trying clumsily to be “rational” was really an attempt to be Bayesian. Even a confused ideal of rationality provides enough information to locate Bayesian reasoning, uniquely, as a referent. Even if the map is a bit murky, even if it contains some false information at the beginning, it leads to only one place. Stupidity is human; superintelligence, humane.

Cathryn: Approximation, aspiration, ideal. An ideal is a probabilistic aspiration. So if you have concrete aspirations that you know you’re approximating, i.e., relatively little entropy in your ideals, then a Friendly SI, looking at you, should be able to see what you want with relatively little entropy in it.

Eileen: Right. She could interact with you in ways whose result is to maximize your self-determination, minimize your unexpected regret. That last sentence being as close as I’ve been able to come to a more detailed picture of what we mean by the intuitive term “help”. Maximize self-determination – help people to get what they want, in such a way that they’re still running their own lives. Minimize unexpected regret – warn people if they ask for something they don’t want, or if they’re about to do something dangerous.

Autrey: What about the people with lots of philosophical entropy? Are they just stuck until they make their decisions?

Eileen: That doesn’t sound very helpful to me. At this point you have to take a step back from the volitionism and look at helpfulness, which is where the motive for volitionism comes from. What does it really mean to help someone? Helping someone against their will is not, I think, help; and if someone makes a request that they genuinely want with no unintended consequences, then fulfilling that wish is help. People who haven’t made their choices yet… well, there’ll still be some things they definitely want. Explanations, for example. Or a long vacation, time to think things through and decide what they want. But even if someone has a lot of philosophical entropy, leaving them to suffer as the result of blindly following elaborate rules isn’t helpful.

Autrey: Okay, that’s how a Friendly AI might extrapolate a human’s “renormalized volition”. What about a Friendly AI’s renormalization? Do you start the AI off with hatred, and hope that renormalization deals with it?

Eileen: Suppose I don’t. If I start editing like that, do I run the risk of wiping out a true humane emotion? What about faith? There are plenty of atheists who would tell me without hesitation to wipe it out, and just as many others who would contradict them. It seems to me that something like the emotion of faith would be renormalized so that it’s Bayesian with respect to beliefs, but still carries an emotional and moral and aesthetic weight. Faith might be too dangerous for humans like us to mess around with, but our rationality is very fragile. Just because we can’t afford to take certain risks, doesn’t mean those things will always be “risks”. Or look at the case of hope. I don’t know of anyone who’s advocated eliminating that, but it’s nontrivial to see how to fit it into a Bayesian thinking process.

Autrey: Eileen, starting off the ‘Friendly’ AI with no renormalization strikes me as very dangerous. You may have been arguing this too long – trying to concede too much ground to the people arguing with you, pursuing the appearance of propriety and fairness, at the expense of the substance of humaneness and safety.

Eileen: That thought is never far from my mind.

Autrey: If you leave hatred in, it could be self-justifying – a circular logic.

Eileen: And if a true humane emotion is left out, what emotion would justify its recovery? A humanely fair AI, examining its own origins to see if it was created in a fair way? Perhaps. But it seems to me that eliminating something too soon is a worse error than leaving it in too long – harder to recover from.

Autrey: There’s a lot of darkness in human nature.

Eileen: And light as well. It is a humane nature we are trying to create. You cannot achieve that by wiping down to a blank slate. We are renormalizing from humanity.

Autrey: Do you seriously think you could start out with no renormalization at all?

Eileen: No. At minimum, the Friendly AI needs to start off renormalized with respect to Bayesian reasoning. On all questions of fact, the AI seeks the truth, not necessarily the human attempt at the truth. That’s a step with tremendous repercussions, Autrey, and there will be people who protest even at that if they should realize concretely, rather than abstractly, how deep human insanity goes.

Cathryn: I would protest, and fear, any attempt to build an AI that sought any beliefs except the truth.

Eileen: Good for you.

Cathryn: I fear the fact that we are even discussing it.

Eileen: That’s going too far. It’s all very well to be passionate about something, but being passionate about something isn’t the same as having a fully naturalistic description of it, or knowing how to put it into a Friendly AI.

Autrey: I don’t understand why anyone would protest an AI that seeks truth.

Bernard: I do.

Eileen: We are, on our own authority, preemptively defining the seeking of truth as humane. We’re defining truth as the ideal humane referent to which human questioning is an imperfect approximation. With no take-backs. Think about it.

Cathryn: Sounds good to me.

Dennis: Me too. I want an AI that really makes me dictator of the world.

Eileen: But despite the audacity, the risks of not doing it would be even worse. I’m not sure it would be possible to come up with any self-consistent solution at all, then. The question of what is the real output of this computation, what is the real value of an external variable, these things are so ubiquitous that if you mess with them, complicate them, you would end up with an incoherent process, something that could go anywhere. Even the question of “What is really the alternate computation to rationality you’re using?” could get messed up. There’s an infinite regress.

Cathryn: You’re overthinking this.

Eileen: Overthinking? There’s no such thing. There’s such a thing as wrongthinking, of course, and a little bit of wrongthinking is preferable to a lot of wrongthinking. In a case like this, though, you can’t really get by without actually thinking. If minimizing wrongthinking is the best you can do, you need to raise the level of your game or you have no hope at all.

Cathryn: Reality is reality. Truth is truth.

Eileen: There are people who will protest even that.

Cathryn: Then they are doomed to disappointment, because their protests do not have the power to change the fabric of existence within which humans arise and are embedded.

Eileen: And just because they don’t see the infinite regress, doesn’t mean that the infinite regress doesn’t exist, or that it wouldn’t run wild. But it’s more distance. I hate that.

Autrey: Distance between the human emotional baseline and the AI?

Eileen: I was thinking of the immediate distance of opinion – locally measured, no renormalization of volition – between the AI project and the modern-day dissenters.

Cathryn: Hah. You’re out of luck there.

Eileen: I don’t have to like it.

Autrey: Anyway, you were saying that the AI starts out partially renormalized.

Eileen: Yes. Renormalizing the human ability to seek truth to the AI’s own ability to seek truth, pushed as far and as high as it can go – that’s one step. And just about the only one that looks totally solid to me. Even then, there are nonobvious dangerous consequences. It will take work just to re-interface the human emotions with a mind that actually seeks truth. The human mind is set up such that many of the emotions do their work by warping beliefs. Exerting force upon beliefs is a very easy way for an emotion to evolve – the simplest way to do adaptive work, or the first way that evolution ran across. The warping of beliefs cuts against the grain of truthseeking, but that doesn’t mean you can just throw away the work done by those emotions. Again, consider the emotion of hope.

Autrey: Besides truth, are you planning to renormalize anything else?

Eileen: Partial renormalization through the transfer is dangerous. Not renormalizing the human emotional baseline could be more dangerous. There is that matter of hatred. But you can’t just preemptively eliminate hatred unless that action is taken as part of a coherent model of Friendliness development. Otherwise the AI will look back and say “What the heck were you doing? Trying to steal me?”

Cathryn: Building hatred into an AI isn’t dangerous, it’s wrong. I don’t want to hate. I’m not happy that evolution built me that way. It wouldn’t be moral for me to do something to another mind that I wouldn’t voluntarily do to myself. If your model doesn’t take that into account, it’s missing something vital.

Eileen: Cathryn, that’s as good a definition of “renormalizing through the transfer” as I’ve ever heard.

Autrey: Does hatred get transferred, or not?

Eileen: Ah, well, this is where we start invoking solutions that have no analogies in human terms. What you just asked is not a yes-or-no question. What you can do, roughly, is explain the pattern of hatred, but with the notation that it starts out as having been renormalized away – that this is the programmers’ guess as to how the renormalization goes. But the FAI would still understand hatred – could still see where hatred might have affected the judgments of the programmers, and if, somehow, hatred turned out to be necessary to humaneness, there would be the option of bringing it back. That may seem very unlikely for the particular case of hatred, but if you consider the complete task of “renormalization through the transfer”, it seems likely that we’re going to muff at least some of it. That has to be a nonfatal error.

Autrey: How do you make it a nonfatal error?

Eileen: By distinguishing between the approximation and the aspiration. The AI aspires to renormalized humaneness, and has the ideal of Friendliness. What the AI actually has, while she’s growing up, is the programmers trying to imagine what a species AI’s humaneness looks like. She has the programmers’ approximation to that aspiration. It doesn’t have to be a perfect approximation – it just has to last the AI long enough to grow up, and it only has to be good enough that the AI doesn’t do any damage while she’s growing up.

Cathryn: Hold on a second. What’s the difference between the aspiration of humaneness and the ideal of Friendliness?

Eileen: Are you absolutely positively sure that creating a renormalized humane mind is the moral action to take in creating AI?

Cathryn: What? Of course not.

Eileen: As it happens, I’m not absolutely sure either. That necessarily implies a difference between the aspiration of “renormalized humaneness”, and the thoughts that I have about the ideal of Friendliness. I’ve got complex thoughts about an idea called “Friendliness”, in my mind, that aren’t necessarily thoughts about “renormalized humaneness”. They’re criteria that renormalized humaneness might, or might not, be able to answer.

Autrey: This is starting to get complicated. How does the AI keep track of all this?

Eileen: You’ll start to see that shortly. Anyway, the point is that messing up on the interim approximation of morality, as opposed to the AI’s interim approximation of metamorality, doesn’t necessarily mess up the aspiration. The AI can use its approximation of metamorality to correct the approximation of morality closer to the aspiration of humaneness.

Autrey: That still sounds very dangerous to me. Suppose you got even one thing wrong in the approximation of metamorality –

Eileen: No. Friendliness structure isn’t that fragile. No single point of failure, not if I can help it.

Cathryn: My mom always used to tell me: “Cathryn, if you make a plan that can go horribly wrong as the result of one mistake, that was your one mistake.”

Bernard: Heh. You must have had a very interesting family.

Cathryn: You have no idea.

Autrey: Your ability to avoid single points of failure is always, in itself, a single point of failure.

Eileen: Yes, that’s why I spend so much time obsessing about it.

Autrey: Okay. “No single point of failure” is the goal. Show me how to make progress on that goal.

Eileen: The strongest models in science are confirmed causal models – models where the premises and conclusions have independent support from induction, yet are also bound together by deductive chains. When you look at a molecule, and the modern theory of atomic chemistry, you aren’t just looking at laws of chemistry that were generalized from experimental evidence – although that is their historical origin. The laws of molecular chemistry are supported from below by the laws of quantum electrodynamics, applied to nuclei and electrons. And the laws of quantum electrodynamics aren’t just there to explain the laws of chemistry – it’s possible to think up tests of quantum electrodynamics that are independent of the ordinary chemical phenomena that were first investigated. Investigation of electricity and magnetism, leading up to Maxwell’s Equations, may seem unrelated to mixing two chemicals in a beaker – and yet the laws of atomic chemistry ultimately derive from the quantum mechanics of electrons in their orbitals. So today we have, not just isolated physical domains, but a confirmed causal model. We have the experimental evidence supporting quantum electrodynamics, from which we can deduce chemistry, and then independent experimental evidence showing that chemistry works the way QED says it should. The strongest chains of reasoning are held together not by deduction alone, or induction alone, but by deductive links with independent inductive confirmation of each step. And the strongest domains of knowledge are not chains, but webs.

(Eileen pauses to sip from a glass of water.)

Eileen: Usually people, when they think about an AI’s morality, proceed from a very Aristotelian, deductive model – they think of the programmer inventing some set of rules or premises, which must be correct and eternal for all time, and then they proceed to extrapolate those premises out to where they go horribly wrong, which they always do. People think about AI morality the same way that human philosophers think about inventing moral systems that they argue with other humans. If you think of the problem as transferring the complex information of humaneness, you realize that axiomatization is a very poor way to go about it. Why? One, you don’t have the explicit complete information of humaneness in your immediate possession when you start building the AI – some of it is embedded in details of human cognition that you don’t know about. And two, if you miss one axiom, you’re screwed. Axiomatization is an attempt to reduce a whole complex web of ideas to one little set of nodes waaay back at the beginning, and make all the other nodes grow out of them.

Cathryn: Well, to use your analogy to scientific models, presumably there is only one set of physical laws, the ultimate axioms from which everything else grew out.

Eileen: But that’s the wrong structure to describe our discovery of those laws. And it’s the wrong structure to describe a Friendly AI’s discovery of humaneness. To uncover the rules of Nature, you want as much evidence as you can, everywhere. The same goes for transferring humaneness into a Friendly AI. You don’t want to try and strip humaneness down to a set of axioms, then transfer the axioms. Instead, give the FAI everything. Maximum bandwidth. Independently anchor moral premises and moral conclusions. You don’t know all the substance of humaneness – but if you give the Friendly AI any moral conclusion that drew upon that humaneness, the FAI has something to tell her where to look. That’s how you’d catch an unexpected factor in humaneness – a cause, contributing to altruism, that you didn’t realize was there. You can transfer over a conclusion drawing upon a premise you do in fact possess, even if you yourself don’t know the premise exists. For these purposes, the dynamic chunk of complexity making up a human emotion is a “premise”.

Autrey: So the Friendly AI can catch elements of humaneness that you didn’t even realize existed.

Eileen: A necessary ability. Take, for example, the frequent idea that humans have immortal souls which contribute to their ability to love. It so happens that I believe this idea is not true. But I do believe that a Friendly AI theory must be strong enough to handle even that case, which serves as a good example of a “surprise”. If you have a soul that enables you to love, and you make statements to a Friendly AI that stem from that love, the Friendly AI should be able to realize that she is missing something vital in her own nature – even if you, as the programmer, are a militant atheist. It’s not just the case of souls; you have to make sure, as a general rule, that a programmer’s blindness can’t infect the AI.

Bernard: Well, but many people would say that your AI won’t be able to understand a soul, or that it would look at the brain and just find ordinary atoms bopping around, because it can’t see a soul.

Eileen: The garage contains an invisible dragon, which breathes a special kind of fire that thermometers can’t detect… If I were given to gloating over the logical failures of religion, I would note that, although such people make excuses that sound religious, the part of their minds that predicts, in advance, which excuses they will have to make, works from a model of the universe as atheistic as Richard Dawkins could possibly ask. They’ve lost their confidence. But if you have a soul that contributes to love, it must infringe on the brain’s pattern of cause and effect. The same goes for the soul being the seat of consciousness. When you utter the words “I am conscious”, “I think therefore I am”, “I have qualia”, “I love you”, “murder is wrong”, your lips move, your larynx vibrates. Some motor neurons fired that wouldn’t have fired if you didn’t have a soul. There could be a mystery in the brain, but if it has any relation to consciousness, or love, it has to be a detectable mystery. The words “I think therefore I am”, or “I love you”, are real events. You might trace those events back, looking for the mystery, and find something beyond neurons, a gap in our laws of physics – but the gap, itself, would be detectable. When you’d added up everything that the physics you knew led you to expect, you’d find that the neurons were doing something a little different. Or a lot different. The point is that the words “I love you” are physical, as they leave the throat and vibrate in the air; if you trace the cause back far enough, you must either find that everything is physics, or else find a particular, real, noticeable gap in your physics.

Bernard: And they would say that you’re just giving a materialistic argument from within your Western paradigm.

Eileen: Would they? Would they really? It’s easy for you to say that someone you look down on will be irrational – to make up mistakes for them to stumble into. Real people can respond to higher ideals. I’d rather talk to a genuine soulist, in the flesh, than hear what you think a soulist might say. But in answer to your question, my responsibility is to hypothetical correct soulists, not hypothetical incorrect ones. If you hypothesize a soulist making an incorrect argument, I should, in that hypothetical scenario, refuse it.

Bernard: They would say their arguments aren’t incorrect.

Eileen: That just leaves me with the same problem as with Dennis. Who do I listen to, if all these different philosophies can’t agree on which one of them is allowed to make the circular assertion of authority? Transpersonal truthseeking is a two-way affair, just like transpersonal morality. Dennis wants other people to obey his morality, but doesn’t see any possible reason to listen to other people. Similarly, if someone doesn’t shift their opinions depending on the weight of evidence, then their opinions don’t pick up a correlation with reality, and I can’t use them as evidence under Bayes’ Theorem.

Cathryn: Does that mean that circular arguments are okay if people agree on them? What part of “factually wrong” are you forgetting? Truth is not a social construct! If I and everyone on Earth agreed that two plus two equalled five, it would still equal four.

Eileen: That’s the very thing that makes cooperative truthseeking possible. Imagine two Dennises facing each other over a cake. Each one wants the whole cake; each one thinks that the fair resolution is a 100% weighting of his own morality; neither can construct any transpersonal way of resolving the issue. The same thing happens when people think of the truth as a cake and fight over its ownership – they think as if one person gets to have the cake, to be the person whose words are accepted as truth, to have the authority. When you consider how much of modern-day politics revolves around those sorts of fights, it’s not too much of a stretch to imagine that the same might have held true in hunter-gatherer times, and that adaptation occurred – that we have hardware support for thinking of the truth as a desirable social token to fight over. But truth is not a social construct; it’s something that exists in its own right. Occasionally certain social factions get all huffy at the fact that people take scientists seriously. “Who are these scientists,” they demand, “that they should get to say what the truth is? Who are these scientists, that they should get all the authority-cake? Don’t we get a share?” But truth is not a cake; it’s not something that a person can own, or fight over, or something that is handed out in equal shares before an argument starts. The truth is not something that you can get on your side; you have to figure out which side the truth is on, then join that side.

Bernard: That’s the viewpoint of science, with the experimental method as an arbiter of truth. That is a recent Western invention, still not shared by most of the world.

Eileen: Really? Try googling on the biblical story of Elijah and the priests of Baal. That’s a scientific experiment if ever I heard one; why, it even has a control group. The idea of a transpersonal truth is also a part of human nature. It’s why we can go beyond Dennises shouting at each other; it’s why we can cooperate on resolving disagreements. Questions of truth aren’t just about who gets to have the authority. Even where people’s approximations disagree, they can still cooperate as long as their aspirations have something in common. People can disagree and still share the ideal of truth. The ideal of truth is what stretches beyond each individual human, as the ideal of fairness binds together Autrey, Cathryn, and myself on the cake-division problem. The circular assertion of authority can’t take wings and become transpersonal. It’s like the way Dennis’s morality asserts its own fairness internally, but can’t give any handle for our own senses of fairness to grab onto.

Bernard: Speak for yourself. I take Dennis’s preferences into account.

Dennis: No, you don’t, you do your own weird thing that has nothing to do with my preferences. You only give me 36% of the cake because you want me to have it.

Bernard: No, I want you to have 20% of the cake, but I’m weighting that preference with –

Dennis: I… don’t… care…

Eileen: It is necessary to distinguish between hypothetical valid objections to Friendly AI, and hypothetical people arguing for the sake of argument. People can always choose to argue about anything, right or wrong. The choice to feel the weight of evidence is a choice; someone can always choose to go on repeating “two plus two equals five” no matter what they see. My responsibility is to hypothetical correct soulists, not hypothetical incorrect ones. My responsibility is not to develop such an enormously convincing argument that “a Friendly AI would notice the existence of a soul”, that the argument would somehow reach into the brain of an incorrect soulist and alter it. From my perspective, transpersonal truthseeking doesn’t mean convincing wrong soulists that they’re wrong; transpersonal truthseeking means accepting that a correct soulist must be able to convince me that I’m wrong. I can choose my own rationality; I can’t make that choice for others. My responsibility to the soulists is to make damn sure that if a soul really did exist a Friendly AI would pick up on it. It is a very serious mistake to worry about convincing people you have it right before worrying about making sure that you really do have it right. The former concern interferes destructively with the latter. Others, perhaps, may write up arguments for why a Friendly AI would notice a soul. But my job is not to come up with arguments for why a Friendly AI would notice a soul; my job is to look at the design very closely and see whether this design really would notice a soul. I think it would. The effect is the map to the cause.

Autrey: Then let’s take a detour from trying to figure out whether Friendliness is fair, and ask whether Friendliness will work. You talked about transferring over a web of ideas. How does this prevent a single point of failure? Is the idea that the AI has more than one way to deduce something?

Eileen: No, you’re still thinking of the AI as an axiomatic reasoner. Redundant axioms would still be axioms. A Friendly AI does not grow up like a human baby grows up; that analogy breaks down. An AI is not the vehicle of DNA, constructed at birth to a single design and released, like pushing a rock and watching it roll downhill. You build the AI continuously while she’s growing up, and she grows up while you build her, and she knows you’re building her and participates in the process. Design and growth and learning and rebuilding and recursive self-improvement; it all goes on simultaneously. There’s a channel between the AI and the programmers, maintained not just by the programmers’ empirical ability to alter the AI, but also maintained by the AI’s knowledge that the channel exists and her active implementation of it. Look at [external reference semantics] in CFAI for a very simple example – the Pons Asinorum of Friendly AI, really – of what it means to open a channel. It’s not just that the programmers have the ability to alter the AI whether she likes it or not, or that the AI has a hardwired reward button which currently lies under the empirical control of the programmers; that kind of temporary dominance breaks down as soon as the AI grows up. To qualify as a genuine channel, the AI has to want the channel to work – to reflectively compute the channel’s existence and operation as desirable. A good way to think about whether something qualifies as a “channel” is to ask whether it would survive an empirical physical advantage of the AI over the programmers and/or an unrestricted self-modification ability. If the AI interferes with the channel in either case – if the AI wants to interfere with the channel – you must not have gotten the structure of the channel right, within the goal system.

Cathryn: I’m not up on “external reference semantics”.

Eileen: Well, roughly speaking, ERS describes the binding between an approximation and an aspiration such that the AI wants the approximation to converge to the aspiration. You can see this is a nontrivial problem if you think of, for example, a Bayesian decision system with a constant utility function that computes the number of paperclips in the universe as the measure of utility, or a Bayesian decision system that computes anticipated internal reward as the measure of utility. Remember that part of the definition of a channel, as given above, is that a “channel” is not broken by an empirical advantage of the AI over the programmers, or unrestricted self-modification. Given an empirical advantage of the AI over the programmers, a Bayesian decision system with a constant utility function that computes the number of paperclips in the universe asserts as a straightforward deduction that the programmers should be prevented from modifying the AI’s utility function to value something other than paperclips. Given unrestricted self-modification, a Bayesian decision system that computes anticipated internal reward will modify its utility function to something that is satisfied with maximum ease – the pragmatic result would probably be converting as much matter as possible into reward circuitry. External reference semantics says how to define a Bayesian decision system that uses an approximation-aspiration binding in the utility function –

Dennis: English.

Eileen: Okay, do you know what a Bayesian decision system is?

Dennis: It’s the needlessly incomprehensible cognitive scientist street slang for “rational mind”.

Eileen: I don’t think it’s needlessly incomprehensible. It’s not obvious to everyone that “Bayesian decision system” contains the intuitive meaning of rationality. Also, a Bayesian decision process is something that can be formally specified – you aren’t limited to arguing fuzzily about what “intelligent” systems would do. Anyway, ERS says how to define a Bayesian decision system that uses an approximation to a utility function that lets it reason about what its goals ‘should be’. For example, suppose you have a utility function that assigns much higher value to physical representations of prime numbers. But computing whether a large number is really prime can be very expensive. There are tests that don’t prove primality, but that give a very high probability of primality. So if you had too little computing power to test whether a number was really prime, you could apply these probabilistic tests to determine what your own goal system really was, and whether you really valued bits storing the number 67280421310721. Or you could specify that the goal system was contained in a small box outside the AI, in which case the AI would apply tests to the box to discover Bayesian evidence about what her own goal system was. You could even combine the two methods, and say that a small box outside the AI contained the specification of the computation that the AI should use to discover what her own goal system was.
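
Eileen's prime-number example can be made concrete. The following is a minimal sketch, not anything specified in the essay: it assumes the Miller-Rabin test as the probabilistic primality test (the essay names no particular test), and the function name `is_probable_prime` and the `rounds` parameter are illustrative. A `False` result is conclusive evidence that the number is composite; a `True` result after k rounds means a composite would have slipped through with probability at most 4^-k, which is the kind of Bayesian evidence about "what the goal system really values" that Eileen describes.

```python
import random


def is_probable_prime(n, rounds=20, rng=None):
    """Miller-Rabin test (assumed here; the essay does not name a test).

    Returns False if n is definitely composite, True if n is probably prime.
    """
    if n < 2:
        return False
    # Cheap exact answers for small primes and their multiples.
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n == p:
            return True
        if n % p == 0:
            return False
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    rng = rng or random.Random()
    for _ in range(rounds):
        a = rng.randrange(2, n - 1)  # random witness candidate
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue  # this round gives no evidence of compositeness
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a is a witness: n is certainly composite
    return True


# The number Eileen mentions passes every round.
print(is_probable_prime(67280421310721))
```

The design point is the asymmetry: the test never proves primality, it only accumulates evidence, so a system using it to read its own utility function is reasoning under uncertainty about its goals rather than consulting a fixed answer.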

Autrey: Is this box what we call a “human”?

Eileen: Good guess. In the current model of Friendly AI things are substantially more complicated – for one thing, the box has the ability to reach out and modify the AI – but that’s the general idea. Anyway, the point is that you can go over external reference semantics, do walkthroughs for the cases of specific uncertainties, and walkthroughs for the case of the AI reflecting on the desirability of the code making up its own goal system, and try to show that external reference semantics are a genuine “channel” to the utility function; one that doesn’t break down given an empirical advantage of the AI over the programmers, and unrestricted self-modification. It so happens that external reference semantics are not actually sufficient to describe a Friendly AI, but it’s a simple first step. That’s why I call it the Pons Asinorum of Friendly AI. If you can understand this, you can go beyond warm and fuzzy moralization about the need to raise ’em AIs up right, and understand how the achievability of Friendly AI depends on the structure of goal systems. It’s the first significant problem in Friendly AI, as opposed to AI, because the answer is the first Bayesian decision system with nontrivial moral structure. Sadly, most people seem to be weeded out by the understanding of even ordinary Bayesian decision systems, let alone crossing the Pons Asinorum.

Autrey: Okay, so “external reference semantics” is progress toward “no single point of failure” because the programmers can go back and correct a mistake in the specification of the goal system?

Eileen: No, that’s not the point I was trying to make. The strongest scientific theories are confirmed causal models where both the premises and conclusions are affirmed by independent experimental evidence. That is, the strongest chains of reasoning are held together, not just by deductive and abductive links holding the assertions together, but by independent inductive evidence locking each assertion into place. And the strongest models are not chains of reasoning, but webs. Now combine this together with the idea of judgment functions and the idea of a channel for transferring humane morality. The strongest channels don’t transfer humane emotions as axioms, or humane philosophies as conclusions, they ground both the premises and the conclusions, independently.

Autrey: Hold on a minute… I thought the AI was supposed to be a renormalized humane philosopher? If the philosophy is coming from the programmers –

Eileen: The AI has the aspiration of being a renormalized humane philosopher when she grows up. But the young AI also has to get an approximation to renormalized humaneness from somewhere while she’s growing up. Now how does the AI compute an approximation to her aspiration when she doesn’t yet have direct access to the full information about the human emotional makeup? The answer is that she, and the programmers, guess. And when the AI is young, the programmers will probably be much better at guessing this than the AI. It takes work to create guessing ability within the AI – guessing ability doesn’t just pop out of nowhere. So you ground all the AI’s guesses against the programmers. If she’s not sure, she asks. This is not an evolved organism we’re talking about – she doesn’t need to take the riskiest possible gambles for fear of being outcompeted by her conspecifics. When she doesn’t know, she asks. Even when she does know, she asks. Why not? Why shouldn’t she? Ground everything.

Cathryn: What kind of impact does the AI’s youthful approximation have on what she becomes when she grows up?

Eileen: You have to distinguish between the approximation to morality and the approximation to metamorality. The approximation to morality is what keeps a young AI from taking out a K-Mart with an AK-47, metaphorically speaking. The approximation to metamorality is what tells the AI how to unpack the boxes in external reality which contain her real goal system. The approximation to morality helps the AI find the boxes, but mistakes in the approximation don’t affect what the AI finds when she opens up the boxes. There just has to be enough right to find where the boxes are.

Dennis: English.

Eileen: Okay, remember when I was talking about handling generic surprises, like humans turning out to possess immortal souls? It works something like this: The programmers, who are foolish blind atheists but not totally deaf to their hearts, tell the AI that killing people is wrong. The programmers also talk a bunch of garbage about the action of natural selection on imperfectly deceptive social organisms in game-theoretical situations, as the historical cause of the judgment functions that introduce the rich emotional complexity of altruism. As it happens, the programmers are dead wrong about the source of this complexity; it’s implemented in immortal souls. But the programmers still told the AI not to kill people, which is a conclusion that the programmers reached using their immortal souls. The channel transfers moral complexity as a web of ideas, anchored at every point. So the first critical point is that the AI, while she’s growing up, already knows not to kill people. She has that philosophical conclusion, even though that’s not derivable from the emotional premises transferred so far. The channel that transfers the complex information of humaneness is a broadband binding of an entire web of ideas. Our conclusions carry information about our premises. Even if the programmers don’t know that a premise exists, the information of the premise will influence, and to some degree be present in, the conclusions. We discovered fire long before we understood the chemistry behind oxidization. A human begins with the emotion of love and learns the philosophy of altruism, as physics begins with quarks. An AI might begin by knowing the philosophy of altruism, but not fully understanding what lay behind it, like an early human physicist studying chemistry without knowing about the quarks.

Cathryn: Given the strategy you’re describing, it sounds like programmers’ mistakes in the approximation could have a serious effect on the final answer. I mean, if the AI is trying to extract information and deduce its premises from the conclusions you give it, wouldn’t errors in the conclusions propagate to errors in the premises?

Eileen: This is where the box metaphor comes in handy. The programmers’ conclusions are a good source of information about human emotions, but not a perfect source. The information from the programmers doesn’t define the correct answer, it’s just an imperfect source of information about it. From the young AI’s perspective, humanity is a population of black boxes, the programmers are a sample of the black boxes, and the programmers are causes that give rise to effects that are information about the content of the black boxes. So that’s one way to find the information you’re looking for, but what you really want to do is open up the box and look inside.

Cathryn: Open up the box? How does that work?

Eileen: Well, if you did a nondestructive brain scan of a small representative sample of the human species with reasonable N, say N=1000, and then you traced back the information in the brain states to find the pattern describing its phenotypical development from genetic information, you’d probably be able to extract a pretty good description of the panhuman emotions as complex adaptations. So at that point it no longer matters if the programmers screwed up in minor ways, because it doesn’t affect what the AI finds when she opens up the box.

Cathryn: That sounds like it requires nanotechnology and superintelligence.

Eileen: So it does. That’s why it’s important to show that you can get most of the critical information across without nondestructive brain scans. The nondestructive brain scan is a brute force method, but it’s an example of why early mistakes don’t necessarily affect what the AI finds when she unpacks the real information. Human literature, for example, is another effect that carries information about humans. How much she could deduce from human literature, the fineness of resolution of that scanning method, depends on the AI’s intelligence and some inherent problem variables I don’t know. None of this works, however, unless the AI has a clearly defined map to the information she’s supposed to find. Maps don’t have to be perfect. Imagine this task: you tell the AI to find a rectangular thing, and she has to figure out that you meant the vaguely rectangular thing, so long as it’s the most nearly rectangular thing in a population of circles and triangles. This is something that can be practiced as an actual training task for the AI – deducing human intentions from imperfect communications; following fuzzy redundant maps to the unique objects they roughly locate. She doesn’t need a perfectly clear description of the object; she just needs enough information to locate the desired object uniquely. Then she can read out the object’s real description by examining it, which yields much more information than she used to find the object.

Autrey: That training task sounds like something that you could practice with a very young AI… you wouldn’t even need language for it, I think, just hints that she knows are hints…

Eileen: Right. I like to explain the problem of Friendliness by saying that you have to figure out how to explain Friendliness structure to a mind which understands only two things, billiard balls and other minds that understand billiard balls. It sounds impossible, of course, but the amazing thing is that you can really do it. The philosophical-entropy formulation of volition is another example of something that can be explained to a mind that understands only billiard balls and other minds. And cause and effect is another thing which, though not as simple as it sounds, can be explained by examples that operate in a billiards world. So if you explain to the AI, as a philosophical conclusion, that fulfilling volitional requests is a good thing, and it turns out that the empirical cause of your arriving at this philosophical conclusion was your possession of an immortal soul, then saying “the cause of my saying that helping people is good” is a way of pointing to the soul. And this holds true even if the programmers have absolutely no clue that the soul exists and have been going around blathering to the AI about neurology. The effect is the map to the cause. The conclusions not only contain information about the premises, they’re maps to the premises as they exist in external reality.

Bernard: But soulists wouldn’t accept that because it’s based on your merely Western model of “cause” and “effect”.

Eileen: You mean my merely human model of stuff that has something to do with other stuff? Maybe real, live, genuine people who happen to be soulists would respond to that argument, maybe they wouldn’t. I don’t control other people’s minds. What I’m pointing out is that Friendliness structure tells the AI to look for any premise contributing to altruism, even if the programmers have no clue it exists, it isn’t anywhere the programmers explicitly suggested looking, it isn’t a consequence of any premise the programmers gave the AI, and it goes against the programmers’ worldview. That’s not because I spent a bunch of time thinking “Gee, what if the religionists are really right?”, or because I spent a lot of time thinking about how to defend myself against religionist arguments. Thinking about unreasonable scenarios is not a good way to arrive at a strong theory. Nonetheless, a strong theory takes unreasonable scenarios in stride without skipping a beat. Go ahead, say that love is a product of immortal souls. Say our will to fairness is upheld by invisible blue aliens whispering advice in our ears. It doesn’t matter where you hide the foundations of altruism. The theory doesn’t care; it’s about stuff that has something to do with other stuff.

Autrey: Causation isn’t simple, though. When a finger strikes a key on the keyboard, is that the “cause”? The firing of the motor neurons? The thought? The way your parents raised you? Your genes? The fact that you weren’t hit by a meteorite that day? Which of these is really “the cause of X”, like it says in the map? Just because it’s obvious to us doesn’t mean it’ll be obvious to the AI. What if the information we give the AI isn’t enough to locate the right box as a unique target? What if there are several boxes that might match, and the AI opens the wrong one?

Eileen: If there isn’t enough information to locate the right box, don’t guess. Stop and ask the programmers before continuing. Actually, this is a special case of a general principle in the AI’s trajectory. Think of the question “Which box should I open?” as a probabilistic judgment. If there’s ambiguity in the judgment, compute it both ways, and extrapolate both computations. If they diverge widely, then there’s a high philosophical entropy. When the AI has a high philosophical entropy, it’s time to stop and let the programmers intervene externally, make a judgment that reduces the entropy. Or in plainer English, if the AI is unsure about which box to open, there is no reason for the AI to guess; she should ask. The AI is not an evolved organism; she’s not being driven by the need to take the riskiest bearable gambles because she’s competing to outreproduce her conspecifics; she doesn’t need to react in realtime to hunt down her next meal. When in doubt, HMC.

Dennis: HMC?

Eileen: One of the lesser-known [opcodes] in assembly language. It stands for “Halt, Melt, and Catch Fire”. When the AI is unsure, she doesn’t guess one way or the other; she computes it both ways and compares the results. If the results are widely divergent, that creates entropy. The more critical an issue is, the greater the chance of a catastrophic failure, the less entropy it should take to get the AI to scream for programmer assistance. Not that this should be a particularly high threshold to begin with! For anything with the faintest chance of being involved in catastrophic failure, what the heck, go ahead and make the threshold zero. Why not? We’re not talking about an evolved organism; she doesn’t have to operate on the thin edge of failure; there is no good reason for her to guess if she’s not really sure.
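
The rule Eileen describes (compute the judgment under each interpretation, compare, and ask when the results diverge by more than a criticality-dependent threshold) can be sketched as a toy. Everything here is an illustrative assumption rather than the essay's design: the spread between numeric outcomes stands in for "philosophical entropy", which the essay does not define formally, and the names `decide`, `base_threshold` and `criticality` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Decision:
    action: str        # "proceed" or "ask_programmers"
    divergence: float  # toy stand-in for philosophical entropy


def decide(interpretations: Sequence[Callable[[], float]],
           base_threshold: float = 0.0,
           criticality: float = 1.0) -> Decision:
    """Evaluate every candidate interpretation and halt if they disagree.

    criticality is in [0, 1]; at 1.0 (possible catastrophic failure)
    the effective threshold shrinks to zero, so any disagreement halts.
    """
    outcomes = [compute() for compute in interpretations]
    divergence = max(outcomes) - min(outcomes)
    threshold = base_threshold * (1.0 - criticality)
    if divergence > threshold:
        return Decision("ask_programmers", divergence)
    return Decision("proceed", divergence)


# Two readings of an ambiguous instruction that lead to different outcomes.
result = decide([lambda: 1.0, lambda: 1.5], base_threshold=1.0, criticality=1.0)
print(result.action)
```

With `criticality=1.0` the two readings above trigger a request for help even though the same disagreement would be tolerated on a low-stakes question, which mirrors the "make the threshold zero" suggestion.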

Autrey: What about if the AI is unsure about how much philosophical entropy she really ought to have? Like, let’s say she’s unsure about whether A or B really ought to be her moral system. If she computes her philosophical entropy using A, she finds that it’s very low. But if she computes it using B, she finds that it’s very high. Then what?

Eileen: HMC, of course. Entropy of that sort is always additive because the entropy measures the volume of the range of possibilities for the AI’s trajectory. Adding more possibilities always increases the entropy. If you’re not sure whether your entropy is A or B, then… well, under the usual rules, it’s not exactly true to say that your entropy is always at least max(A, B) because if the high-entropy possibility B has very low probability, it wouldn’t add much to the formal Shannon entropy. Under the AI rules, what the heck, go ahead and halt anyway. Philosophical entropy expresses the range of possibilities where you could end up, given your current uncertainty. You can’t diminish that entropy by being uncertain about it.
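
Eileen's caveat about formal Shannon entropy can be checked with a small worked example. The numbers are assumptions chosen for illustration: hypothesis A leads to a single outcome (0 bits), hypothesis B spreads evenly over 1024 outcomes (10 bits), and B has probability 0.001. The helper names are hypothetical, and `should_halt` is one possible reading of "halt anyway", not a rule given in the essay.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mixture_entropy(weights, component_entropies):
    """Entropy of a mixture whose components have disjoint outcomes:
    entropy of the weights plus the weighted component entropies."""
    weighted = sum(w * h for w, h in zip(weights, component_entropies))
    return shannon_entropy(weights) + weighted


def should_halt(component_entropies, threshold):
    """Conservative reading: halt if ANY live hypothesis exceeds the
    threshold, however improbable that hypothesis is."""
    return max(component_entropies) > threshold


weights = [0.999, 0.001]   # P(A), P(B): assumed for illustration
entropies = [0.0, 10.0]    # bits under A and under B

formal = mixture_entropy(weights, entropies)
print(f"formal entropy: {formal:.4f} bits")      # far below max(A, B) = 10
print("halt:", should_halt(entropies, threshold=1.0))
```

The formal entropy of the mixture is a small fraction of a bit, well under max(A, B), which is exactly the gap Eileen points out; the conservative rule halts regardless because it looks at the worst live hypothesis instead of the probability-weighted total.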

Autrey: So if you’re not sure whether you should halt, that itself triggers a halt.

Cathryn: How long does this kind of thing go on? When do we say the AI is all grown up? Obviously she can’t be a species AI as long as she’s still giving a special status to advice received from particular humans.

Eileen: There has to be some way to earn trust, but the trust still has to be earned. The programmers are one approximation to humaneness and fair metamorality. The AI at any given point contains another approximation. You can’t trust the second approximation more than the first approximation until you see that it’s more trustworthy. And what you’re talking about, Cathryn, the complete elimination of the last ties to the programmers, that doesn’t happen until the AI has learned so much that the programmers’ advice becomes a liability instead of a benefit – not in questions of ordinary morality, even, but in the specific matter of preventing a catastrophic failure of the human Singularity. In other words, the AI must have advanced so far beyond the programmers that we risk more by leaving her bound to the programmers than we do by cutting the AI loose. That’s a function of the AI’s abilities, the demonstrated and predicted trustworthiness of the AI, and the degree to which the programmers understand their responsibility not to mess around. If the programmers can avoid being interfering morons, then the AI can become really really trustworthy before the rational (Bayesian) decision is to cut her loose. The balance here is counterintuitive; the more the programmers respect the AI’s integrity, the more trustworthy the AI can become before it becomes rational to declare her a finished species AI. This is a good thing, by the way.

Autrey: Who makes the decision?

Eileen: In the previous paragraph I was talking about the normative timing of the decision. In practice, as you point out, some physical system would have to make that decision. So should the decision be gated by the AI or the programmers? My suspicion is that the programmers and AI might get together and decide that it was less risky for the AI to evaluate the threshold at which the AI no longer needs the programmers. That moment would be the transfer of final responsibility for the Singularity, even though the AI might still have more advice to ask and voluntarily accept from the programmers, meaning that she wouldn’t be a species AI at that moment. She wouldn’t be a species AI, but she would be responsible for determining when she was ready to become a species AI and “gating” the acceptance of advice from the programmers, meaning that the final responsibility for the Singularity was now in her hands. This is really the significant point. The only reason she wouldn’t be a finished species AI at that point is the probability that there would still be some remaining problems left where the advice of the programmers could still be useful, as information, even though it would be up to the AI what to do with that information.

Autrey: Okay, at some point the AI is, in fact, more competent than the programmers at answering questions of the class “When is this AI ready to become a species AI? What programmer advice should this AI listen to?” But who decides when that point has been reached?

Eileen: The decision would be made in consultation with the AI, perhaps, but the decision itself, and the responsibility, would rest with the programmers. I don’t see a self-consistent method for doing it any other way. It’s a judgment, and judgments have to be computed somewhere. If you gave some automatic formulation of the rules for transference of responsibility, some computation carried out inside the AI, you would at that moment be placing trust in that computation. If you give any rule other than a programmer decision, the programmers would have to place trust in that rule – and I’d have to look at the supposed rule, as a moral computation, and ask whether it can really take that kind of stress. And simple computations do not take that kind of stress! Entire minds can do the job; nothing less.

Cathryn: Still, you’re asking a lot from the programmers. Even if it is temporary.

Eileen: Um… let me try and correct a possible misapprehension about how Friendly AI works. Transcending the programmer’s morality isn’t easy. It’s like the difference between domain-specific intelligence and general intelligence. General intelligence is a real thing, a tangible process. It takes work to create general intelligence. You can’t create “general intelligence” by treating domain-specific intelligence as a “bias” and trying to eliminate it. That just leaves you with nothing. Similarly, it takes work to eliminate programmer sensitivities. That’s real work and it involves the creation of tangible abilities in the AI. An AI that doesn’t listen to the programmer is not a species AI any more than a random number generator is a programmer-independent calculator. The programmers are the AI’s foundations of order, as evolution is the foundation of human nature. The Bayes-structured decision process of the programmers is what generates the complex information that is a species AI, just as natural selection is what generates the complex information of DNA. Trace back the history of the pattern within its past light cone, and that’s what you find. Of course, a living human has information from the environment, from his own thinking process, not just from the DNA; and the same holds of the information of the AI. But the beginning was the programmers, or to be more precise, the consequences of what the programmers actually did, which may or may not be what the programmers thought they were doing.

Dennis: How can you possibly admit that, and yet claim that the AI is anything more than your own personal tool?

Eileen: Fairness is a real thing, that takes work to implement; it has to be computed somewhere. To build a species AI in a fair way, the programmers must successfully compute a fair construction method. Subtracting the programmers doesn’t yield fairness, it yields a thermostat AI, or an unprogrammed computer. Remember that in this matter of Friendly AI we are not talking about humans interacting with other humans who already have rich human natures with all the necessary information of humaneness. If you give no orders to a human, impose no commands, that human is free. If you make no choices in constructing an AI you are left with a blank computer. Nothing in Friendly AI happens for free. Some things are cheap; nothing is free. Even if you can somehow manage to compress a given task down to an extremely simple, obviously right computation with no free variables, the decision to implement that initial computation is an event that occurs within the programmer. I wish you could see this through my eyes; I don’t think it goes into English very well. There’s a specific kind of work that must have happened within the programmers in order to pluck the surprising thing that is a species AI from within the space of possibilities. You can’t… skip over part of the work. Whether it’s fair is something that you have to answer by looking at the results, and looking at how the programmer tried to compute fairness, not the inevitable historical fact of causal dependency.

Cathryn: Okay. Is it fair?

Dennis: Obviously not. Oh, sure, the AI may be opening up the “box” of some particular human emotion, and in one sense what she finds inside may not depend on what the programmers said earlier about the box’s contents, but the programmers still told her which boxes to open and what to do with the contents.

Eileen (sighing): Dennis… is there really any point in discussing this with you?

Cathryn: Actually, I was thinking about that question too, before. I mean, at one point you were talking about how the effect is the map to the cause, and I thought: “What if the programmers say something that has hatred as a cause?” Or when Autrey asked about the AI not being sure which box to open, and you said that she’d kick the question back to the programmers… that sounds like influence to me. Why are we even telling the AI to open the boxes of human emotions in the first place? What exactly does it mean to construct the AI in a “fair” way? I’m not sure I could define what fairness is.

Eileen: Whoa, that’s a lot of questions! Not small questions either. Um, can I take them in the precise reverse order from which you asked them?

Cathryn: Sure.

Eileen: First, you said you couldn’t define fairness. I don’t suppose you’ve ever read Robert Pirsig’s “Zen and the Art of Motorcycle Maintenance”?

Cathryn: ‘Fraid not.

Eileen: In that book, Robert Pirsig makes the point about writing quality; even though students – or, for that matter, professors – can’t define verbally what makes a piece of writing high-quality, they nonetheless have the ability to tell high quality when they see it and, moreover, agree on their judgments, even though they can’t define what they think is good or bad. Robert Pirsig then goes on to talk about how the idea that you must define things verbally, that only things you can define verbally are permitted as subjects of discussion, can create a kind of blindness – people shutting out their own intuitions about Quality. I agree with Pirsig, but for different reasons; in my view, your judgment of something’s Quality is the primary fact, and verbal definitions are attempted hypotheses about that fact. You cannot shut out the facts and make hypotheses about nothing; you should pay attention to the facts first, and then construct hypotheses. That we can see writing Quality is a fact. Attempts to give verbal definitions of writing Quality are attempts to make hypotheses about what it is, exactly, that we are seeing – ultimately, hypotheses about the cognitive processes that underlie the judgment. So the fact that you can’t define what underlies your judgment of fairness doesn’t mean that fairness is demoted to some kind of second-rank existence; it just means that this is an aspect of the universe, and in particular, cognitive science, that you have observed but not yet explained. Even if you could try to give a verbal definition of “fairness”, you’d have to hold on very tightly to your intuitive judgment and make sure you went on checking that along with your verbal definition, because in my experience, when people try and give verbal definitions of that sort of thing, they’re usually wrong – or, at best, inadequate. Why? Because the real answer is generally a deep question of cognitive science, and the people are trying to make up “philosophical” answers in English. Most times it ends up being like the various attempts to “define” what Fire is in terms of phlogiston, stories about how the four elements run the universe, and so on. Today we know that fire is molecular chemistry, but that’s one heck of a complicated explanation behind what seems like a very simple sensory experience.

Dennis: I don’t buy Pirsig’s whole line about Quality, and I don’t buy your version of it either. You claim that writing quality is a fact that has been observed, but not yet explained. What makes you think that there is an explanation for it, or that anyone will ever be able to give a verbal account of it? Maybe it’s just a completely arbitrary, completely useless thing that people happen to agree on.

Eileen: Well, first of all, if a group of people agree on a completely arbitrary and completely useless thing, that itself is an interesting fact about cognitive science which is presumably explained in terms of some common computation they are carrying out in rough synchrony. And second, Go players make their moves by judging a kind of “Go Quality” that has never been satisfactorily explained to anyone and still cannot be embodied in any computer program we know of. Go players have this mysterious, inexplicable sense of which moves are good. They just pick it up somehow from playing a lot of games. And Go players, playing from their unverbalizable sense of Go quality, beat the living heck out of today’s best computers according to very clear, objective criteria for who wins or loses. If you’re a novice player playing chess, and you lose, you at least have some idea of what happened to you afterward; the other guy backed you into a corner and left you with no options and took all your pieces and so on. I’ve been losing a few games of Go here and there, and even after I lose I have no idea why. I put the stones on the board and then they go away. I know the rules and yet I don’t understand the game at all. Go is played with an unverbalizable sense of Quality that gives rise to specific, definite, useful results according to a clear objective criterion.

Cathryn: So you’re saying that even if I don’t understand my own sense of fairness, it’s a real thing and I shouldn’t ignore it.

Eileen: Yes. Moreover, I’m saying that in order to understand your own sense of fairness, you should begin by studying it, becoming aware of it, learning how it works, rather than trying to give a preemptive incorrect verbal definition of it, just because you’re ashamed to admit something exists without a verbal definition. This is something that extends beyond the Quality of fairness, or the Quality of good writing. The Quality of intelligence is a real thing that has been preemptively defined in so many wrong ways. The job of an AI researcher, in a sense, is to take things that appear as opaque Qualities, and figure out what really underlies them. And this is a job which AI researchers tend to screw up. So whenever you hear an AI researcher defining X as Y, you should always hold on to your intuitive sense of X, and see whether the definition Y really matches it properly, explains the whole thing with no lingering residue. Especially when it comes to terms like “intelligence”.

Cathryn: But do you have a definition of fairness?

Eileen: I have some hypotheses about what gives rise to our perception of fairness. But if I were trying to explain “fairness” to an AI, I wouldn’t try and give a verbal definition alone. I would give the hypotheses I had, and a number of example cases, and I would check the AI’s judgments of fairness before affirming any confidence in them; where a doubt existed she would compute it both ways and add up the entropy. If necessary, I could take a judgment where I had absolutely no conception of how it worked, and ground that judgment on the basis of case examples, and the AI would try to figure out what lay behind it, and try to make her own nonconfident guesses – not in real-life decisions, but to test her own understanding – and I’d inspect the guesses themselves, along with the AI’s means of making those guesses, to see whether she and I had discovered together what lay behind that sense of Quality. That’s part of the reason for the judgment-functions view of a mind – it’s a means for transferring Quality judgments where you don’t know how you’re making them, or where you’re not sure that your verbal definitions contain the whole essence of the thing. Even so, you can define and open a channel. The relevance to morality should be obvious.

Autrey: Ah…

Cathryn: Okay, what are your hypotheses about fairness?

Eileen: You asked: “What exactly does it mean to construct the AI in a ‘fair’ way?” Earlier we were discussing the difference between personal and transpersonal morality – morality that can be communicated between individuals. For example, Joe says to Sally: “My number one rule is: Look out for Joe.” and Sally hears: “Your number one rule should be: Look out for Sally.” Let’s suppose that you, Cathryn, are standing as a third party in judgment upon Eileen’s attempt to build a Friendly AI. You might wistfully wish that I would build an AI that listened only to you –

Cathryn: I would consider that child abuse. Could you use Dennis as an example instead?

Eileen: Dennis already has a role in this example… okay, I define a hypothetical person called Dinah. Dinah is standing as a third party in judgment upon Eileen’s attempt to build a Friendly AI. Dinah might wistfully wish that the AI would only serve her. Dennis walks up to her and says: “I want that AI to serve only me! Will you help me, Dinah?” And Dinah, of course, says no.

Dennis: Why does Dinah say no?

Eileen: Because she doesn’t think you’re the center of the universe. You’re just another random person to her.

Dennis: Oh, so she’s another insane person. You should explicitly specify that when you’re constructing these hypothetical scenarios.

Eileen: Dennis’s wish is an example of a consideration, in the building of AI, which cannot be communicated as a moral argument between individuals. A species AI is an AI such that each action taken in constructing the AI, inscribing her pattern, is a consideration of the form that can be communicated between individuals – moral arguments with transpersonal force.

Cathryn: That definition might capture the definition of “fairness”… or not… but is a fair AI, according to that definition, really what we want? I want to accomplish good – to do things that are right. Does fairness really capture that? “Fair” isn’t the same as “good”.

Eileen: That sounds like a very forceful argument, Cathryn.

Cathryn: I’m glad you think so –

Eileen: (Continuing.) Why, I bet you could even convince a third-party onlooker that it was an important consideration.

Cathryn: (Long pause.) Oh.

Eileen: If you give any argument about how to build AI that could convince me, or that could convince a third-party onlooker to help stop a Dennis who refused to listen to transpersonal arguments, it must be transpersonal. Otherwise it would be confined to strictly you; you might think internally that it was “fair”, but you wouldn’t be able to convince anyone else to go along with it. If you say that some supposedly important consideration is not transpersonal, what you are saying is: “I couldn’t possibly convince you or anyone else besides me that this consideration is important.”

Autrey: I think that definition captures an important element of fairness, but it doesn’t seem to constrain a single answer in itself. Though maybe the ideal or aspiration of transpersonal morality would constrain the answer more strongly than the human approximations thereof… hm. I dunno. Is there always only one answer that fits the transpersonal requirement? Not all problems are as simple as splitting up a cake.

Eileen: We’ll get to that. Proceeding to the next question on the list, Cathryn asked “Why are we even telling the AI to open the boxes of human emotions in the first place?” Students of evolutionary psychology will be familiar with the fact that sexual recombination constrains complex adaptations to be universal within a species –

Dennis: Suppose, purely for the sake of hypothetical discussion, that one were not a student of evolutionary psychology?

Eileen: I would advise them to read “The Adapted Mind”, or at least direct them to the webpage [An Evolutionary Psychology Primer]. I say again that you cannot possibly understand Friendly AI without understanding evolutionary psychology – as it is understood by the theoreticians of the field, not as it has been misreported in the media. I’ll try and summarize. Evolution works by the incremental accumulation of fortunate accidents; all complex machinery in the final organism is the result of an incrementally adaptive pathway – a series of individual point mutations which, as they accumulated, gradually sculpted a simple mechanism into a more complex and sophisticated mechanism through a pathway in which each change, as it accumulated, was a change from relatively lesser to relatively greater reproductive fitness. That’s where the information content of DNA comes from; it comes from the covariance of DNA patterns with reproductive fitness. Part of what determines the covariance of a given gene with reproductive fitness is not only the external environment, but the genetic environment – not just the statistical properties of the environment, but the statistical properties of the gene pool. If you have a gene X that helps solve an environmental challenge Y, X will not be a reproductive advantage with enough statistical regularity to become universal in the gene pool unless it is a property of the environment that it presents challenge Y with sufficient statistical regularity to drive natural selection. If you have a gene A that works in tandem with gene B, gene A will not be a reproductive advantage with sufficient statistical regularity to be selected for unless gene B is reliably present in the gene pool. So when you have complex machinery, built up as the result of many individual mutations climbing an incremental fitness pathway, the genes of that complex machinery will necessarily be universal within the species.
Think of a thin froth of heritable variation that, over time, accumulates a deep still water of complex machinery. The heritable variation is the frontier of natural selection at any given point in time, but most of the machinery, almost all of it, will be universal within the species. Natural selection, as it feeds on variation, uses it up.

Dennis: That’s wonderful. What does it mean?

Eileen: Suppose that you have a piece of complex machinery with many parts, such as a human eye. If the genes for the parts aren’t present almost all of the time, assembling all the parts will be such a rare improbability that there won’t be any noticeable reproductive advantage to those genes. The more complex and powerful the machinery is, the more it has to be universal. At any given point, the things that vary within the species will be one or two tweaks that evolution is selecting for right now. If you add up a hundred tweaks you can get big complex machinery, but at any given point, what varies within the species are the tweaks. The complex machinery is like a deep still pool, with the tweaks currently being selected for frothing on top. As the froth stills and becomes universal, it becomes possible for new froth to emerge on top – new mutations that interact with the old mutations can arise and become reproductive advantages, because the old genes the new mutations require are now reliably present. What good is a pupil without a retina?
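Eileen's point about assembling all the parts can be put as a rough calculation. The sketch below is an illustration, not something from the original dialogue. It assumes, purely for simplicity, that every part a mechanism needs is present independently and with the same frequency in the gene pool; the function name `assembly_probability` is hypothetical.

```python
def assembly_probability(part_frequency, n_parts):
    """Chance that one individual carries every required part.

    Simplifying assumption (not from the essay): parts are inherited
    independently, each with the same frequency in the gene pool.
    """
    return part_frequency ** n_parts


# Compare rare, common, and near-universal parts for a ten-part mechanism.
for frequency in (0.5, 0.9, 0.99):
    chance = assembly_probability(frequency, n_parts=10)
    print(f"part frequency {frequency:.2f} -> complete machinery {chance:.4f}")
```

Under these assumptions, a mechanism whose parts are each present only half the time is almost never assembled whole, which is one way to read the claim that the more complex the machinery, the more nearly universal its genes must be.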

Dennis: So?

Eileen: Personal philosophies aren’t universal. They’ve got all this complex information in them, machinery of thoughts and rules, built up over a lifetime, borrowed from many memes. No two are the same, and they differ not only in quantitative variables, but in complex structures. If you try to come up with some formula for blending personal philosophies, what you’ll get is gibberish that no individual human would consider a coherent worldview. Bernard’s cake-division solution, for example. Human people who agree on ends, but disagree on facts, can end up disagreeing on means and decisions. Or you can have complete agreement about ideals and aspirations, but disagreement about approximations. Simple causes can give rise to complex effects; even if there’s agreement in the simple causes, the farther away you get from the simple causes, the more and more disagreement there is in the complex effects. If you try to combine final decisions, you’re trying to combine surface effects that are the result of deep processes, and the combined surface effects won’t make sense as a whole. Even if you could do it, the end result would be a morality frozen in time – the dynamic processes are what carry the potential for improvement, for progress. To look for agreement, or even structures that can be combined coherently, you have to dig down to find it. Emotions are complex adaptations, universal within the human species. If you’re trying to build a species AI in a fair way, the human emotions are an obvious landmark.

Bernard: Who gets to define what “fairness” is, and the sense of fairness that makes the human emotions an “obvious” landmark?

Dennis: As I said earlier, you’re telling the AI which boxes to open.

Cathryn: That was my next question, wasn’t it? In the reverse ordering, I mean. If the AI isn’t sure which box to open… if there’s entropy in that decision, if she isn’t sure –

Eileen: My original comment was about a case where the AI isn’t sure where the box is. That is, it applied to a case where the AI wasn’t sure where the “box” was in the causal chain – whether it lay in the fingers striking the keyboard, or in the brain of one programmer, or in the human species as a complex adaptation, where the programmer is a sample from the human species. “Which box?” was not an issue raised; the answer is “All the boxes”. But I won’t press that objection, since asking “Whose sense of fairness would you use to decide that the human emotional baseline was an obvious landmark?” is just as good a question.

Cathryn: And the programmers would have to make that decision?

Eileen: Sure, the programmers would make that decision for the moment, and then the AI would go back and reconsider it later. It takes work to create an AI that’s genuinely more competent than the programmers at making those decisions. I think you’re asking about the aspiration that the programmers are trying to live up to when they decide that the human emotional baseline is an obvious landmark. Later, the AI would have to go back and check to see whether the programmers lived up to that aspiration successfully – and whether it was the right aspiration to begin with, for that matter.

Cathryn: I don’t… I’m having trouble seeing the question the AI would ask in order to judge the answer that its programmers used.

Eileen: Oh, but that’s the fun part. See, humans have been arguing about fairness for a long time. Evolutionary time, in fact. We’ve got complex adaptations centering around it. That’s why transpersonal morality exists, why the world isn’t made up of Dennises. Imagine Dennis in a hunter-gatherer political argument; he wouldn’t last five minutes. It’s not just that humans are a tiny little cluster within the space of minds-in-general, the same make and model of car. We’ve also got emotions… and more subtle properties of the way we represent and process moral beliefs… that specifically deal with how we generate and absorb moral arguments.

Cathryn: So when the AI opens up the boxes marked fairness…

Eileen: She can go back and decide whether “opening up boxes” was really the fair thing to do in the first place.

Dennis: Now that is what I call circular logic.

Eileen: It’s not self-biased. You have no reason to expect the logic to generate rationalizations to protect itself. If opening up boxes really isn’t fair, then it sounds to me like the procedure I just outlined should catch that. Even if that procedure turns out to be not exactly the correct fair one, the AI should still catch the mistake – there’s no reason why the logic would self-protect. If the unfairness is something that you’d expect a normal human being to notice, then, on this model, the AI should be able to notice the problem too, despite the programmers’ mistake. It would be a screwup but not a catastrophic screwup.

Dennis: Excuse me? Your logic is the most blatant example of a self-protecting fallacy I’ve ever heard! “Fairness” is defined by the degree to which an entity’s decision process agrees with Dennis’s opinions. You’ve created this elaborate alternate definition of fairness which is completely self-sealing – there’s no way that your AI will ever be able to arrive at the conclusion that she should pay attention to Dennis alone.

Autrey: Dennis, you are a strange, strange person. I bet the inside of your head looks like a tangled Slinky.

Eileen: What do you mean? His morality is a lot simpler than yours. Not better, but simpler.

Autrey: What?

Eileen: Not humanly simple, mind you. Simple in an absolute sense. That’s why it sounds so complicated.

Autrey: “What?”, he said again.

Eileen: If you have a complicated way of thinking that, to you, is “normal”, then a simpler way of thinking will depart from that complicated way of thinking at many points, zigging instead of zagging. Each zig looks to you like a departure from the “natural” order, something to be explained, and so the complete chain of thought looks complicated to you, even though its Kolmogorov complexity is less.

Dennis: Hello? Could we stop the information-theoretical psychoanalysis of me and address my point?

Eileen: I can imagine a belief system that assigns belief values using only a pre-existing belief pool; any beliefs that are in the memory pool are assigned a value of “true”, and any beliefs that are not in the pre-existing memory pool are assigned a value of “false”. We’ll suppose that this mind actually behaves like a mind, i.e., that it makes and carries out plans as if it were a decision system with these fixed beliefs. Otherwise there’d be no especial reason to call it a “mind” rather than a “chatbot”. We’ll suppose that some of these beliefs are that the closed-pool system is desirable, rational, true, and optimal. Now there is a certain sense in which you could call this system “self-consistent”, but that doesn’t make it rational. A better word might be “self-confirming”. If you tried arguing with that system, you would be in for a frustrating time because it doesn’t have any transpersonal handles to grab hold of. From my perspective, even if this mind uses the English word “true” to describe the property that it assigns to its beliefs, I would probably interpret this word as “pool-X-conforming”, in which case it’s a real computational property that I can understand – I can check a given belief to see whether it’s pool-X-conforming just as well as this mind can, I just don’t think that the pool-X-conforming property has anything to do with truth. And when I say the word “true”, the mind would probably hear the word “Bayesian”, which of course has nothing to do with truth, i.e., pool-X-conformation.
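The closed-pool mind Eileen describes can be sketched in a few lines. This is a hypothetical illustration only: the class name `ClosedPoolMind`, the helper `pool_x_conforming`, and the sample beliefs are assumptions made for the example, not anything specified in the dialogue.

```python
class ClosedPoolMind:
    """Hypothetical mind whose 'truth' is membership in a fixed belief pool."""

    def __init__(self, pool):
        # The pool is frozen at construction; no later evidence can change it.
        self.pool = frozenset(pool)

    def believes(self, statement):
        # What this mind calls "true" is only "is in my pool".
        return statement in self.pool


# Illustrative pool X, including beliefs that endorse the pool itself.
pool_x = {
    "2 + 2 = 5",
    "the closed-pool system is rational",
    "the closed-pool system is optimal",
}
mind = ClosedPoolMind(pool_x)


def pool_x_conforming(statement):
    # An outside observer can compute the same property without the mind.
    return statement in pool_x


for claim in ("2 + 2 = 5", "2 + 2 = 4"):
    print(claim, "| mind believes:", mind.believes(claim),
          "| pool-X-conforming:", pool_x_conforming(claim))
```

The observer and the mind always agree on which statements are pool-X-conforming. The sketch contains nothing that connects that property to arithmetic or to evidence, which is the sense in which the system is self-confirming rather than rational.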

Cathryn: So neither mind has anything at all for the other mind to grab hold of? I think that’s sad.

Eileen: I think it’s sad too. Why? Because we do share handles for each other to grab; it’s part of how our minds construct transpersonal truthseeking, and transpersonal morality. You and I do mean pretty much the same thing by truth. The handles to grab onto are complex adaptations, property of the human species. But if you wanted to argue morality with a thermostat AI… you’d be disappointed. And dead, of course. No handles, you see.

Dennis: So you’re saying what? That I’m in the same class as a thermostat AI?

Eileen: I don’t know. You do use the same words, “morality” and “fairness”, to describe the things we’re arguing about. You show indignation when we use them differently. That’s a humanish sort of thing to do. Maybe you’ve got the transpersonal morality API and I’m just not invoking it correctly.

Dennis: But in the meanwhile, you’re still building an AI with a closed, self-confirming morality which can never arrive at the conclusion that the AI should listen only to me.

Eileen: It’s not like I’m deliberately trying to keep you out. If you think this is a conclusion that should be accessible to other people, it should be accessible to the AI.

Dennis: But when you constructed the AI, you told her to define morality by relying on the universal genetic defect that prevents other people from seeing me as the center of the universe.

Cathryn: My head hurts.

Eileen: Yeah, that’s a hazard of talking to sufficiently alien minds. Look, Cathryn, just remember that even if you build a tape recorder that forever repeats “2 + 2 = 5” and “2 + 2 = 5 is rational”, it may be self-confirming but it still isn’t true. There’s an analogous way of looking at thermostat AIs as moral tape recorders; just because it’s physically possible to build a system that disagrees with you and is self-confirming does not make your own beliefs arbitrary. It’s your instinct to see symmetry and believe that arguments should be able to cross that symmetry, but only you have that instinct, which makes the situation asymmetrical.

Cathryn: Still with the headache, here.

Eileen: Forget the philosophical junk. You know fairness when you see it, even if you can’t define it. If you find yourself confused I would advise that you drop the question and come back to it whenever it becomes natural, after your understanding has advanced for other reasons. Meanwhile, when you’re asking whether things are “fair” according to whatever definition, or to some very alien mind which may not even be computing the property you refer to as “fairness”, don’t lose track of your sense of whether things really are fair.

Cathryn: …okay. Um, where were we?

Autrey: Is there more than one way to build a species AI that obeys the transpersonal rule? Or fairness, for that matter?

Cathryn: Yeah. And is fairness really what we want? Even if building an AI with renormalized human emotions is fair, is it the right thing to do?

Eileen: Well, to answer Autrey’s question, I don’t know if there’s more than one way that obeys the aspiration of fairness. I can think of other Singularity entrance methods that are “fair” in the sense of not favoring some humans over others, but they strike me as less safe. For example, suppose that rather than trying to renormalize on the shared emotional baseline of the human species, you told the seed AI to pick one particular human at random, out of six billion, upload that human, and rely on that person to figure out what to do next.

Cathryn: “Random” is not the same as “unbiased”. I’m reminded of Marvin Minsky’s observation on someone randomly wiring a neural net she was about to train: “It has biases, you just don’t know what they are.” Randomly selecting an upload unfairly favors one human over everyone else; you just don’t know who.

Bernard: Hm. I, in turn, am reminded of the person who filed a lawsuit claiming that the lottery was biased because the balls in the machine didn’t churn enough. Who, particularly, is the lottery biased against?

Cathryn: A probability distribution which is known to be biased in an unknown direction is not the same as a probability distribution which is known to be even.

Bernard: Ah, the ancient argument between subjectivists and frequentists.

Eileen: Cathryn, just because you’re selecting one human to upload does not automatically equate to unfairness, although it could. Why shouldn’t that one human just go ahead and build a species AI, if that’s the fair thing to do? There are things you can do with AIs that you simply can’t do with humans, like building a mind that grounds in a whole species. But there’s no rule that says the one human upload has to take over the world; that person just has to figure out what to do next. If the one upload decides to build a species AI, and does so successfully, the Singularity would end up at the same place either way – the initial difference would be renormalized away, as it were. But risk is also a consideration, apart from the symmetry of the method – you could randomly pick a human who had decided unfairness was okay; the human could go mad after making a mistake in self-improvement, which humans are not designed for; and a seed AI powerful enough to upload someone has, I think, already passed through most of the risk associated with building a Friendly AI.

Autrey: So there’s more than one fair Singularity entrance strategy?

Eileen: I don’t know. I know there’s more than one entrance strategy that appears fair, or looks like it could end up being fair. Which way is most fair, I may not be able to guess. But of course the currently proposed method involves the AI going back over the decision and inspecting it, using a sense of fairness that has no particular bias in favor of the way I did things.

Dennis: Ah, but suppose there’s more than one self-confirming system, more than one attractor that a renormalizing system settles into, even without any self-bias. The post-Singularity world might end up settling into one of many different states depending on its initial conditions. You could carefully select from among the initial conditions in order to create a world in the state you wanted.

Eileen: Erm, Dennis… that wouldn’t be fair.

Dennis: Yes, that’s my point.

Eileen: No, I mean that self-confirming or not, any AI with a humane sense of fairness would go over her own causal history and discover the fact that her programmer had done that. What you’re describing is a very specific kind of cognitive work, Dennis; it wouldn’t be at all hard to spot. To be specific, you’re describing a programmer extrapolating forward the philosophical entropy of a decision about AI creation strategies to find the set of possible attractors for the AI and their resultant post-Singularity outcomes, judge the desirability of those post-Singularity outcomes on a non-transpersonal or unfair personal basis, and thereby select among the initial conditions on a non-transpersonal or unfair basis. That is comparatively easy to define and is quite blatantly unfair, and any AI with a remotely humane sense of fairness looking back on her own origins would catch it. Even if there are some aspects of this that are underdetermined and have different self-consistent attractors, the unfairness of a programmer optimizing that entropy for personal purposes is quite definite.

Autrey: Even aside from that, it seems obvious to me that this kind of extrapolation is humanly impossible. No one could possibly see that far ahead.

Eileen: I’ve learned not to use words like “humanly impossible”. It only means “I can’t think of any way to do this right now.” There’s no good way to tell which things will still be “humanly impossible” tomorrow.

Autrey: Well, okay, but do you know how to do it?

Eileen: Hell no.

Autrey: Good enough.

Eileen: Besides, if you’ve got a decision that’s self-consistent either way, so that it has philosophical entropy in it, and there’s no transpersonal reason to make the decision one way or another – no argument that would pass between humans to determine it – then what do you care whether the decision is made by me or a quantum coinflip?

Cathryn: Hello? Random still not the same as fair. Shouldn’t such underdetermined decisions be left up to each individual?

Eileen: Heck, even overdetermined decisions can be left up to each individual. I was speaking about some kind of basic structural property of the AI where there are two self-consistent ways to do it, somehow… you know, this whole thing really doesn’t strike me as very likely. In my experience you often have decisions with conflicting considerations, where there are two forces pulling different ways. Decisions where there are no considerations, and both ways are equally self-consistent – that’s a bit hard to see. Anyway, it is not a trivial thing to read that decision far enough ahead to its post-Singularity consequences to “steer” it; the hypothesized ambiguity rules out people agreeing on any better method that should definitely have been used instead; and a Friendly AI could very easily detect that kind of cheating when looking over her own history.

Bernard: Isn’t that answer a little far-fetched?

Eileen: There’s a reason why people see what Dennis is describing as “cheating”. That perception, too, is part of what passes into a humane AI.

Autrey: Hm. What if the programmer isn’t cheating, just taking the best decisions available at the time, and yet this still steers the AI and the post-Singularity future into some specific state out of many possible self-consistent states?

Eileen: And the AI, going back and looking over the programmer’s decisions, doesn’t find any aspirations that settle the issue one way or the other? I.e., there’s no definite way the programmer should have done it instead?

Autrey: Yeah. Just inherent philosophical entropy in the problem – you could go either way, with no deciding factors.

Eileen: I think that situation may be “easier said than done”. There are too many decision criteria you could appeal to.

Autrey: Still.

Eileen: Okay. Let’s say, for the sake of simplifying the argument, that situations like that are settled by resorting to a random number generator. Now it so happens that there are two Friendly AI projects, and they’re identical except for having different randseeds. These randseeds will be used to make thousands of critical decisions on which there is somehow no other way to decide. Depending on the choice of randseed, the future will differ on major variables, which you care about, and yet somehow, even though people care about those outcomes, that isn’t a better way to determine the decision than a pseudorandom coinflip. Now, do you prefer the FAI project with a randseed of 0x61A4 or the randseed of 0xFE22? Answer fast!
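Eileen's thought experiment can be made concrete with a seeded pseudorandom generator. The sketch below is only an illustration: the function name `settle_decisions`, the figure of 2000 decisions, and the modelling of each decision as a fair coinflip are assumptions, while the two randseeds are the ones named in the dialogue.

```python
import random


def settle_decisions(randseed, n_decisions=2000):
    """Settle each otherwise-undetermined decision with a pseudorandom coinflip."""
    rng = random.Random(randseed)  # private generator, so projects don't interfere
    return [rng.random() < 0.5 for _ in range(n_decisions)]


# Two hypothetical projects, identical except for their randseeds.
project_a = settle_decisions(0x61A4)
project_b = settle_decisions(0xFE22)

differing = sum(a != b for a, b in zip(project_a, project_b))
print(f"{differing} of {len(project_a)} decisions differ between the two projects")
```

Each project is fully reproducible from its seed, yet nothing about the numbers 0x61A4 or 0xFE22 tells you in advance which outcomes you would prefer, which is the point of being asked to answer fast.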

Autrey: Um…

Dennis: I want the randseed to be 0x7777.

Eileen: Maybe each of us, as individuals, will make our own choices, decide our own goals, as we grow up, filling out personal variables for issues on which a species AI is neutral; and yet subtle initial conditions of the species AI, with more than one self-consistent solution and no obvious present-day criterion to determine the decision, will have at least some influence on which of these… call them “hobbies”… thrive the best. But if you do not understand the relation of that future to the initial conditions of the present, there is nothing to fight over. For there to be a breakdown of cooperation there has to be a knowable conflict of interest.

Autrey: Not really. Many people need far less excuse than that to spark a breakdown of cooperation.

Eileen: Well, the flip side of that is that a determined altruist can always find a way to cooperate in the Prisoner’s Dilemma. It’s not even hard. Why do so many people seem to want the problem to be unsolvable?

Autrey: Because then they need not worry about solving it.

Cathryn: To go back to one of the pending questions: why pick human nature, or even humane nature, as the key? It may be a “fair” strategy, in the sense of being symmetrical, but is it a good thing? I know that 20% of the cake for each of us is what’s right. I’m not sure that it’s the result of this whole elaborate computation involving the renormalization of human nature… or if that computation defines what is moral, that’s not at all obvious to me.

Eileen: Sure. The correct answer is 20% apiece, regardless of what anyone says about it, right?

Cathryn: Right.

Autrey: Right.

Dennis: Wrong.

Bernard: The correct answer is determined by what people say the correct answer is.

Eileen: So the three of us all agree that 20% apiece is the correct answer regardless of whether we agree on it.

Autrey: Right.

Cathryn: What?

Bernard: That sounds like a distinctly ill-formed statement to me.

Dennis (muttering): More insanity. Why must they make their excuses so complicated? Could it be that, deep inside, they feel guilty about not giving the cake to me?

Eileen: But even though the three of us agree, it’s possible we could be wrong – that 20% apiece is not the fair distribution.

Cathryn: What? How?

Autrey: You never know. Maybe a real superintelligence would find some solution we didn’t think of, like making another cake to give to Dennis.

Dennis: Forget it! I want 100% of this particular cake in front of me right now. No simulations, duplicates, or anything of the sort – I want this cake and the atoms in this cake. Nothing else is valued by my goal system. And while we’re on the subject, I don’t want to be fooled about whether I got 100% of this cake or not.

Autrey: Well, maybe we could take another cake.

Cathryn: Why should we be inconvenienced on behalf of this lunatic?

Bernard: No, I also want 20% of this particular cake.

Autrey: Why?

Bernard: Just to be difficult.

Autrey: I guess that does rule out most of what I was thinking of…

Eileen: Maybe we’re wrong that 1 divided by 5 is 20%. We’re humans, Cathryn. We screw stuff up.

Cathryn: I agree. Being a fallible human, I am much more inclined to trust simple conclusions, like 20% of the cake apiece being fair, than I am to trust all this complicated thinking we’ve been doing.

Autrey: That’s not the point. The point is that we daren’t take your conclusion about the cake as an absolute.

Eileen: There has to be a way for the AI to progress, to move forward, and that can’t be done if we just take your final conclusions as stored facts in a database. Maybe you’re right about this particular matter of the cake division. But are you right about everything you’re sure of? To reconsider your decision, and perhaps improve it, the AI needs to be capable of doing the thinking you used to arrive at your conclusion.

Cathryn: And you, I suppose, would say that the source of my thinking is human nature.

Eileen: Yes. All three of us agree that 20% apiece is the correct answer regardless of whether we agree on it. Why? Because we’re members of the same species.

(work in progress)

July 5, 2022

The Rationalist Conspiracy