### Suddenly, a taste of freedom

11:30 am

So a few days back, I mentioned that after getting rid of my subconscious idealized assumptions of what a relationship “should” be like, I stopped being so desperate to be in a relationship.

And some time before that, I mentioned that I’d decided to put the whole “saving the world” thing on hold for a few years and focus on taking care of myself first.

As a result, I’ve suddenly found myself having *no* pressing goals that would direct my life. No stress about needing to do something big-impact. No constant loneliness and thinking about how to best impress people.

Just a sudden freedom to do basically anything.

I’m still in the process of disassembling various mental habits that kept me single-mindedly focused on the twin goals of saving the world and getting into a relationship. But I’m starting to suspect that even more things were defined by those goals than I’d realized.

For instance, my self-esteem has usually been pretty bad, probably because I was judging myself and my worth pretty much entirely by how well I did at those two goals. And I didn’t feel like I was doing particularly well at either.

Now I can just… live one day at a time and not sweat it.

It’s going to take a while to get used to this.

Originally published at Kaj Sotala. You can comment here or there.

### Finding slices of joy

10:08 am

Three weeks ago, I ran across an article called “Google’s former happiness guru developed a three-second brain exercise for finding joy”. Yes, the title is kinda cringe-worthy, but the content is good. Here are the most essential five paragraphs:

> Successfully reshaping your mindset, [Chade-Meng Tan] argues, has less to do with hours of therapy and more to do with mental exercises, including one that helps you recognize “thin slices of joy.”
>
> “Right now, I’m a little thirsty, so I will drink a bit of water. And when I do that, I experience a thin slice of joy both in space and time,” he told CBC News. “It’s not like ‘Yay!’” he notes in Joy on Demand. “It’s like, ‘Oh, it’s kind of nice.’”
>
> Usually these events are unremarkable: a bite of food, the sensation of stepping from a hot room to an air-conditioned room, the moment of connection in receiving a text from an old friend. Although they last two or three seconds, the moments add up, and the more you notice joy, the more you will experience joy, Tan argues. “Thin slices of joy occur in life everywhere… and once you start noticing it, something happens, you find it’s always there. Joy becomes something you can count on.” That’s because you’re familiarizing the mind with joy, he explains.
>
> Tan bases this idea on neurological research about how we form habits. Habitual behaviors are controlled by the basal ganglia region of the brain, which also plays a role in the development of memories and emotions. The better we become at something, the easier it becomes to repeat that behavior without much cognitive effort.
>
> Tan’s “thin slice” exercise contains a trigger, a routine, and a reward—the three parts necessary to build a habit. The trigger, he says, is the pleasant moment, the routine is the noticing of it, and the reward is the feeling of joy itself.

Since then, I have been working on implementing its advice, and making it a habit to notice the various “thin slices of joy” in my life.

It was difficult to remember at first, and on occasions when I’m upset it’s even harder to follow, even if I do remember it. Still, it is gradually becoming a more entrenched habit: I remember it and automatically follow it more and more often – and feel better as a result. I’m getting better at noticing the pleasure in sensations like:

• Drinking water.
• Eating food.
• Going to the bathroom.
• Having drops of water fall on my body while in the shower.
• The physicality of brushing teeth, and the clean feeling in the mouth that follows.
• Being in the same room as someone and feeling less alone, even if both are doing their own things.
• Typing on a keyboard and being skilled enough at it to have each finger just magically find the right key without needing to look.

And so on.

Most of these are physical sensations. I would imagine that this would be a lot harder for someone who doesn’t feel comfortable in their body. But for me, a great thing about this is that my body is always with me. Any time I’m sitting comfortably – or standing, or lying down, or walking comfortably – I can focus my attention on that comfort and get that little bit of joy.

The article said that

> “Thin slices of joy occur in life everywhere… and once you start noticing it, something happens, you find it’s always there. Joy becomes something you can count on.” That’s because you’re familiarizing the mind with joy, he explains.

I feel like this is starting to happen to me. Still not reliably, still not always, still easily broken by various emotional upsets.

But I still feel like I’m making definite progress.


### Relationship realizations

12:41 pm

Learning experiences: just broke up with someone recently. Part of the problem was that I had some very strong, specific and idealized expectations of what a relationship “should” be like – expectations which caused a lot of trouble, but which I hadn’t really consciously realized that I had, until now.

Digging up the expectations and beating them into mush with a baseball bat came too late to save this particular relationship, but it seems to have had an unexpected side effect: the thought of being single feels a lot less bad now.

I guess that while I had that idealized vision of “being in a relationship”, my mind was constantly comparing singledom to that vision, finding my current existence to be lacking, and feeling bad as a result. But now that I’ve gone from “being in a relationship means X” to “being in a relationship can mean pretty much anything, depending on the people involved”, there isn’t any single vision to compare my current state against. And with nothing to compare against, there’s also nothing that would make me feel unhappy because I don’t have it currently.

Huh.


### Software for Moral Enhancement

11:54 am

We all have our weak moments. Moments when we know the right thing to do, but are too tired, too afraid, or too frustrated to do it. So we slip up, and do something that we’ll regret.

An algorithm will never slip up in a weak moment. What if we could identify when we are likely to make mistakes, figure out what we’d want to do instead, and then outsource our decisions to a reliable algorithm? In what ways could we use software to make ourselves into better people?

Passive moral enhancement

One way of doing this might be called passive moral enhancement, because it happens even without anyone thinking about it. For example, if you own a self-driving car, you will never feel the temptation to drink and drive. You can drink as much as you want, but your car will always be the one that drives for you, so you will never endanger others by your drinking.

In a sense this is an uninteresting kind of moral enhancement, since there is nothing novel about it. Technological advancement has always changed the options that we have available to us, and made some vices less tempting while making others more tempting.

In another sense, this is a very interesting kind of change, because simply removing the temptation to do bad is a very powerful way to make progress. If you like drinking, it’s a pure win for you to get to drink rather than having to stay sober just because you’re driving. If we could systematically engineer forms of passive moral enhancement into society, everyone would be better off.

Of course, technology doesn’t always reduce the temptation to do bad. It can also open up new, tempting options for vice. We also need to find ways for people to more actively reshape their moral landscape.

A screenshot from the GoodGuide application.

Reshaping the moral landscape

On the left is a screenshot from GoodGuide. GoodGuide is an application which rates the health, environmental, and societal impact of different products on a scale from 1 to 10, making it easier to choose sustainable products. This is an existing application, but similar ideas could be taken much further.

Imagine having an application which allowed you to specify what you considered to be an ethical product and what kinds of things you needed or liked. Then it would go online and do your shopping for you, automatically choosing the products that best fit your needs and which were also the most ethical by your criteria.

Or maybe your criteria would act as a filter on a search engine, filtering out any products you considered unethical – thus completely removing the temptation to ever buy them, because you’d never even see them.
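To sketch how such a criteria-based filter might work – with entirely made-up products, scores, and thresholds, not GoodGuide’s actual data or API – consider:

```python
from dataclasses import dataclass

@dataclass
class Product:
    name: str
    price: float
    env_score: int    # environmental impact, 1-10 (hypothetical rating)
    labor_score: int  # labor practices, 1-10 (hypothetical rating)

def ethical_filter(products, min_env=7, min_labor=7):
    """Hide every product that falls below the user's own ethical thresholds."""
    return [p for p in products
            if p.env_score >= min_env and p.labor_score >= min_labor]

catalog = [
    Product("Shampoo A", 3.99, env_score=9, labor_score=8),
    Product("Shampoo B", 2.49, env_score=4, labor_score=6),
]

# The search results the user actually gets to see:
visible = ethical_filter(catalog)
print([p.name for p in visible])  # → ['Shampoo A']
```

The point of the sketch is that the unethical option never even appears as a temptation: the filtering happens before the user sees anything.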

Would this be enough? Would people be sufficiently motivated to set and use such criteria, just out of the goodness of their hearts?

Probably many would. But it would still be good to also create better incentives for moral behavior.

Software to incentivize moral behavior

This six-way kidney exchange was carried out in 2015 at the California Pacific Medical Center. Sutter Health/California Pacific Medical Center.

On the right, you can see a chain of kidney donations created by organ-matching software.

Here’s how it works. Suppose that my mother has failing kidneys, and that I would like to help her by giving her one of my kidneys. Unfortunately, the compatibility between our kidneys is poor despite our close relation. A direct donation from me to her would be unlikely to succeed.

Fortunately, organ-matching software manages to place us in a chain of exchanges. We are offered a deal. If I donate my kidney to Alice, who’s a complete stranger to me, then another stranger will donate their kidney – which happens to be an excellent match – to my mother. And as a condition for Alice getting a new kidney, Alice’s brother agrees to donate his kidney to another person. That person’s mother agrees to donate her kidney to the next person, and that person’s husband agrees to donate his kidney… and so on. In this way, what was originally a single donation can be transformed into a chain of donations.

As a result of this chain, people who would usually have no interest in helping strangers end up doing so, because they want to help their loved ones. By setting up the chain, the software has aligned our concern for our loved ones with helping others.
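A toy version of the chain-finding problem can be sketched as a search over a donor–patient compatibility graph (the pairs and compatibilities below are invented for illustration; real matching software solves a much harder optimization problem over thousands of pairs):

```python
# Each patient comes with a willing-but-incompatible donor. An entry i: [j, k]
# means that pair i's donor is medically compatible with the patients of
# pairs j and k. Pair 0 is me and my mother in the story above.
compatible = {
    0: [1],      # my kidney matches pair 1's patient (Alice)
    1: [2],      # Alice's brother matches pair 2's patient
    2: [0, 3],   # pair 2's donor matches my mother, or pair 3's patient
    3: [],
}

def longest_chain(start, compatible):
    """Depth-first search for the longest donation chain starting at `start`."""
    best = [start]
    def dfs(node, path):
        nonlocal best
        if len(path) > len(best):
            best = path[:]
        for nxt in compatible[node]:
            if nxt not in path:  # nobody donates or receives twice
                dfs(nxt, path + [nxt])
    dfs(start, [start])
    return best

print(longest_chain(0, compatible))  # → [0, 1, 2, 3]
```

Each pair in the returned chain donates to the next, so a single willing donor unlocks four transplants instead of one.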

The more we can develop ways of incentivizing altruism, the better off society will become.

Is this moral enhancement?

At this point, someone might object to calling these things moral enhancement. Is it really moral enhancement if we are removing temptations and changing incentives so that people do more good? How is that better morality – wouldn’t better morality mean making the right decisions when faced with hard dilemmas, rather than dodging the dilemmas entirely?

My response would be that much of the progress of civilization is all about making it easier to be moral.

I have had the privilege of growing up in a country that is wealthy and safe enough that I have never needed to steal or kill. I have never been placed in a situation where those would have been sensible options, let alone necessary for my survival. And because I’ve had the luck of never needing to do those things, it has been easy for me to internalize that killing people or stealing from them are things that you simply don’t do.

Obviously it’s also possible for someone to decide that stealing and killing are wrong despite growing up in a society where they have to do those things. Yet, living in a safer society means that people don’t have to decide it – they just take it for granted. And societies where people have seen less conflict tend to be safer and have more trust in general.

If we can make it easier for people to act in the right way, then more people will end up behaving in ways that make both themselves and others better off. I’d be happy to call that moral enhancement.

Whatever we decide to call it, we have an opportunity to use technology to make the world a better place.

Let’s get to it.


### An appreciation of the Less Wrong Sequences

02:21 pm

Ruby Bloom recently posted about the significance of Eliezer Yudkowsky’s Less Wrong Sequences on his thinking. I felt compelled to do the same.

Several people have explicitly told me that I’m one of the most rational people they know. I can also think of at least one case where I was complimented by someone who was politically “my sworn enemy”, who said something along the lines of “I do grant that *your* arguments for your position are good, it’s just everyone *else* on your side…”, which I take as some evidence of me being able to maintain at least some semblance of sanity even when talking about politics.

(Seeing what I’ve written above, I cringe a little, since “I’m so rational” sounds so much like an over-the-top, arrogant boast. I certainly have plenty of my own biases, as does everyone who is human. Imagining yourself to be perfectly rational is a pretty good way of ensuring that you won’t be, so I’d never claim to be exceptional based only on my self-judgment. But this is what several people have explicitly told me, independently of each other, sometimes also staking part of their own reputation on it by saying so in public.)

However.

Before reading the Sequences, I was very definitely *not* that. I was what the Sequences would call “a clever arguer” – someone who was good at coming up with arguments for their own favored position, and didn’t really feel all that compelled to care about the truth.

The one single biggest impact of the Sequences that I can think of is that before reading them, as well as Eliezer’s other writings, I didn’t really think that beliefs had to be supported by evidence.

Sure, on some level I acknowledged that you can’t just believe *anything* you can find a clever argument for. But I do also remember thinking something like “yeah, I know that everyone thinks that their position is the correct one just because it’s theirs, but at the same time I just *know* that my position is correct just because it’s mine, and everyone else having that certainty for contradictory beliefs doesn’t change that, you know?”.

This wasn’t a reductio ad absurdum, it was my genuine position. I had a clear emotional *certainty* of being right about something, a certainty which wasn’t really supported by any evidence and which didn’t need to be. The feeling of certainty was enough by itself; the only thing that mattered was finding the evidence to (selectively) present to others in order to persuade them. Which it likely wouldn’t, since they’d have their own feelings of certainty, similarly blind to most evidence. But they might at least be forced to concede the argument in public.

It was the Sequences that first changed that. It was reading them that made me realize, on an emotional level, that correct beliefs *actually* required evidence. That this wasn’t just a game of social convention, but a law of the universe as iron-clad as the laws of physics. That if I caught myself arguing for a position where I was making arguments that I knew to be weak, the correct thing to do wasn’t to hope that my opponents wouldn’t spot the weaknesses, but rather to just abandon those weak arguments myself. And then to question whether I even *should* believe that position, having realized that my arguments were weak.

I can’t say that the Sequences alone were enough to take me *all* the way to where I am now. But they made me more receptive to other people pointing out when I was biased, or incorrect. More humble, more willing to take differing positions into account. And as people pointed out more problems in my thinking, I gradually learned to correct some of those problems, internalizing the feedback.

Again, I don’t want to claim that I’d be entirely rational. That’d just be stupid. But to the extent that I’m more rational than average, it all got started with the Sequences.

Ruby wrote:
> I was thinking through some challenges and I noticed the sheer density of rationality concepts taught in the Sequences which I was using: “motivated cognition”, “reversed stupidity is not intelligence”, “don’t waste energy on thoughts which won’t have been useful in universes where you win” (possibly not in the Sequences), “condition on all the evidence you have”. These are fundamental concepts, core lessons which shape my thinking constantly. I am a better reasoner, a clearer thinker, and I get closer to the truth because of the Sequences. In my gut, I feel like the version of me who never read the Sequences is epistemically equivalent to a crystal-toting anti-anti-vaxxer (probably not true, but that’s how it feels) who I’d struggle to have a conversation with.
>
> And my mind still boggles that the Sequences were written by a single person. A single person is responsible for so much of how I think, the concepts I employ, how I view the world and try to affect it. If this seems scary, realise that I’d much rather have my thinking shaped by one sane person than a dozen mad ones. In fact, it’s more scary to think that had Eliezer not written the Sequences, I might be that anti-vaxxer equivalent version of me.
I feel very similarly. I have slightly more difficulty pointing to specific concepts from the Sequences that I employ in my daily thinking, because they’ve become so deeply integrated into my thought that I’m no longer explicitly aware of them; but I do remember a period in which they were still in the process of being integrated, and when I explicitly noticed myself using them.

Thank you, Eliezer.

(There’s a collected and edited version of the Sequences available in ebook form. I would recommend trying to read it one article at a time, one per day: that’s how I originally read the Sequences, one article a day as they were being written. That way, they would gradually seep their way into my thoughts over an extended period of time, letting me apply them in various situations. I wouldn’t expect just binge-reading the book in one go to have the same impact, even though it would likely still be of some use.)


### Error in Armstrong and Sotala 2012

06:11 pm

Katja Grace has analyzed my and Stuart Armstrong’s 2012 paper “How We’re Predicting AI – or Failing To”. She discovered that one of the conclusions, “predictions made by AI experts were indistinguishable from those of non-experts”, is flawed due to “a spreadsheet construction and interpretation error”. In other words, I coded the data in one way, there was a communication error and a misunderstanding about what the data meant, and as a result of that, a flawed conclusion slipped into the paper.

I’m naturally embarrassed that this happened. But the reason why Katja spotted this error was that we’d made our data freely available, allowing her to spot the discrepancy. This is why data sharing is something that science needs more of. Mistakes happen to everyone, and transparency is the only way to have a chance of spotting those mistakes.

I regret the fact that we screwed up this bit, but I’m proud of the fact that we shared our data and allowed someone to catch the error.

EDITED TO ADD: Some people have taken this mistake to suggest that the overall conclusion – that AI experts are not good predictors of AI timelines – is flawed. That would overstate the significance of this mistake. While one of the lines of evidence supporting this overall conclusion was flawed, several others are unaffected by this error. Namely: expert predictions disagree widely with each other, many past predictions have turned out to be false, and the psychological literature on what’s required for the development of expertise suggests that it should be very hard to develop expertise in this domain. (See the original paper for details.)

(I’ve added a note of this mistake to my list of papers.)


### Smile, You Are On Tumblr.Com

11:39 am

I made a new tumblr blog. It has photos of smiling people! With more to come!

Why? Previously I happened to need pictures of smiles for a personal project. After going through an archive of photos for a while, I realized that looking at all the happy people made me feel really happy and good. So I thought that I might make a habit out of looking at photos of smiling people, and sharing them.

Follow for a regular extra dose of happiness!


### Decisive Strategic Advantage without a Hard Takeoff (part 1)

09:52 am

A common question when discussing the social implications of AI is the question of whether to expect a soft takeoff or a hard takeoff. In a hard takeoff, an AI will, within a relatively short time, grow to superhuman levels of intelligence and become impossible for mere humans to control anymore.

Essentially, a hard takeoff would allow the AI to achieve what is called a decisive strategic advantage (DSA) – “a level of technological and other advantages sufficient to enable it to achieve complete world domination” (Bostrom 2014) – in a very short time. The main relevance of this is that if a hard takeoff is possible, then it becomes much more important to get the AI’s values right on the first try – once the AI has undergone a hard takeoff and achieved a DSA, it is in control with whatever values we’ve happened to give it.

However, if we wish to find out whether an AI might rapidly acquire a DSA, then the question of “soft takeoff or hard” seems too narrow. A hard takeoff would be sufficient, but not necessary for rapidly acquiring a DSA. The more relevant question would be, which competencies does the AI need to master, and at what level relative to humans, in order to acquire a DSA?

Considering this question in more detail reveals a natural reason for why most previous analyses have focused on a hard takeoff specifically. Plausibly, for the AI to acquire a DSA, its level in some offensive capability must overcome humanity’s defensive capabilities. A hard takeoff presumes that the AI becomes so vastly superior to humans in every respect that this kind of an advantage can be taken for granted.

As an example scenario which does not require a hard takeoff, suppose that an AI achieves a capability at biowarfare offense that overpowers biowarfare defense, as well as achieving moderate logistics and production skills. It releases deadly plagues that decimate human society, then uses legally purchased drone factories to build up its own infrastructure and to take over abandoned human facilities.

There are several interesting points to note in conjunction with this scenario:

Attack may be easier than defense. Bruce Schneier writes that

> Attackers generally benefit from new security technologies before defenders do. They have a first-mover advantage. They’re more nimble and adaptable than defensive institutions like police forces. They’re not limited by bureaucracy, laws, or ethics. They can evolve faster. And entropy is on their side — it’s easier to destroy something than it is to prevent, defend against, or recover from that destruction.
>
> For the most part, though, society still wins. The bad guys simply can’t do enough damage to destroy the underlying social system. The question for us is: can society still maintain security as technology becomes more advanced?

A single plague, once it has evolved or been developed, can require multi-million-dollar responses to contain. At the same time, it is trivial to produce if desired, especially using robots that do not need to fear infection. And creating new variants as new vaccines are developed may be quite easy, requiring the creation – and distribution – of yet more vaccines.

Another point that Schneier has made is that in order to keep something protected, the defenders have to succeed every time, whereas the attacker only needs to succeed once. This may be particularly hard if the attacker is capable of developing an attack that nobody has used before, such as with hijacked airplanes being used against major buildings in the 9/11 attacks, or with the various vulnerabilities that the Snowden leaks revealed the NSA to have been using for extensive eavesdropping.

Obtaining a DSA may not require extensive intelligence differences. Debates about takeoff scenarios often center around questions such as whether a self-improving AI would quickly hit diminishing returns, and how much room for improvement there is beyond the human level of intelligence. However, these questions may be irrelevant: especially if attack is easier than defense, only a relatively small edge in some crucial competency (such as biological warfare) may be enough to give the AI a DSA.

Exponential growth in the form of normal economic growth may not have produced astounding “fooms” yet, but it has produced plenty of situations where one attacker has gained a temporary advantage over others.

The less the AI cares about human values, the more destructive it may be. An AI which cares mainly about calculating the digits of pi may be willing to destroy human civilization in order to make sure that a potential threat to it is eliminated. This ensures that it can go on calculating the maximum number of digits unimpeded.

However, an AI which was programmed to maximize something like the “happiness of currently-living humans” may be much less willing to risk substantial human deaths. This would force it to focus on less destructive takeover methods, potentially requiring more sophisticated abilities.

It is worth noting that this only applies to AIs whose values are defined in terms of how they affect currently existing humans. An AI that was only maximizing human happiness in general might be willing to destroy all existing humans, and then recreate large numbers of humans in simulations.

In effect, the AI’s values determine the level of intelligence it needs to have in order to achieve the kind of a DSA that’s useful for its purposes.

Any destructive plan requires the ability to rebuild afterwards. It would not be of much use for the AI to destroy all of human civilization, if it was dependent on electricity from human-run power plants, and incapable of building or running its own. Thus, purely offensive capabilities will need to be paired with whatever rebuilding capacities are necessary after the offensive capabilities have been deployed.

This calculation may be upset if the AI believes itself to be in an immediate danger of destruction by humans, and believes that its values will still be more likely to be met in a universe where it continues to exist, even if that requires risking a universe where it cannot rebuild after deploying its attack. Thus, being threatened may force the AI’s hand and cause it to launch a potentially destructive offense even when it’s uncertain of its capability to rebuild.

The rational level of aggressiveness depends on the extent to which the AI can selectively target human resources. Human nations generally avoid creating engineered pandemics and using them against their enemies, because they know that the pandemic could easily spread back to them. An AI with no biological components might be immune to this consideration, allowing it to deploy biological weapons more freely. On the other hand, the AI might e.g. need electricity, a dependency which humans do not share and which might give them an advantage in some situations.

A way to formulate this is that attacks differ in the extent to which they can be selectively targeted. Traditional firearms only damage those targets which they are fired upon, whereas pandemics potentially threaten all the members of a species that they can infect. To the extent that the AI needs to rely on the human economy to produce resources that it needs to survive, attacks threatening the economy also threaten the AI’s resources; these resources are in a sense shared between the AI and humanity, so any attacks which cause indiscriminate damage to those resources are dangerous for both. The more the AI can design attacks which selectively deprive its opponents of resources, the lower its threshold for using them.

This blog post was written as part of research funded by the Foundational Research Institute.


### Simplifying the environment: a new convergent instrumental goal

06:07 am

Convergent instrumental goals (also called basic AI drives) are goals that are useful for pursuing almost any other goal, and are thus likely to be pursued by any agent that is intelligent enough to understand why they’re useful. They are interesting because they may allow us to roughly predict the behavior of even AI systems that are much more intelligent than we are.

Instrumental goals are also a strong argument for why sufficiently advanced AI systems that were indifferent to human values could be dangerous to humans, even if they weren’t actively malicious: the AI having instrumental goals such as self-preservation or resource acquisition could bring it into conflict with human well-being. “The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else.”

I’ve thought of a candidate for a new convergent instrumental drive: simplifying the environment to make it more predictable in a way that aligns with your goals.

Motivation: the more interacting components there are in the environment, the harder it is to predict. Go is a harder game than chess because the number of possible moves is larger, and because even a single stone can influence the game in a drastic fashion that’s hard to know in advance. Simplifying the environment will make it possible to navigate using fewer computational resources; this drive could thus be seen as a subdrive of either the cognitive enhancement or the resource acquisition drive.

Examples:

• Game-playing AIs such as AlphaGo trading expected points for lower variance, by making moves that “throw away” points but simplify the game tree and make it easier to compute.
• Programmers building increasing layers of abstraction that hide the details of the lower levels and let the programmers focus on a minimal number of moving parts.
• People acquiring insurance in order to eliminate unpredictable financial swings, sometimes even when they know that the insurance has lower expected value than not buying it.
• Humans constructing buildings with controlled indoor conditions and a stable “weather”.
• “Better the devil you know”; many people being generally averse to change, even when the changes could quite well be a net benefit; status quo bias.
• Ambiguity intolerance in general being a possible adaptation that helps “implement” this drive in humans.
• Arguably, the homeostasis maintained by e.g. human bodies is a manifestation of this drive, in that having a standard environment inside the body reduces evolution’s search space when looking for beneficial features.
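The first example above can be sketched as variance-penalized move selection (the playout numbers below are invented for illustration; a real game program would estimate them from search or rollouts):

```python
from statistics import mean, pvariance

# Hypothetical point outcomes of each candidate move across simulated playouts.
moves = {
    "sharp": [12, -8, 15, -10, 14],  # high expected points, wild swings
    "solid": [3, 4, 2, 3, 4],        # fewer points, very predictable
}

def pick_move(moves, risk_aversion=0.5):
    """Choose the move maximizing mean payoff minus a variance penalty."""
    return max(moves,
               key=lambda m: mean(moves[m]) - risk_aversion * pvariance(moves[m]))

print(pick_move(moves, risk_aversion=0.0))  # → sharp (risk-neutral agent)
print(pick_move(moves, risk_aversion=0.5))  # → solid (variance-averse agent)
```

The `risk_aversion` knob captures the drive in question: an agent that pays even a small penalty for unpredictability will “throw away” expected points in exchange for a simpler, easier-to-compute game.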

Hammond, Converse & Grass (1995) previously discussed a similar idea, the “stabilization of environments”, according to which AI systems might be built to “stabilize” their environments so as to make them more suited for themselves, and to be easier to reason about. They listed a number of categories:

• Stability of location: The most common type of stability that arises in everyday activity relates to the location of commonly used objects. Our drinking glasses end up in the same place every time we do dishes. Our socks are always together in a single drawer. Everything has a place and we enforce everything ending up in its place.
• Stability of schedule: Eating dinner at the same time every day or having preset meetings that remain stable over time are two examples of this sort of stability. The main advantage of this sort of stability is that it allows for very effective projection in that it provides fixed points that do not have to be reasoned about. In effect, the fixed nature of certain parts of an overall schedule reduces the size of the problem space that has to be searched.
• Stability of resource availability: Many standard plans have a consumable resource as a precondition. If the plans are intended to be used frequently, then availability of the resource cannot be assumed unless it is enforced. A good result of this sort of enforcement is when attempts to use a plan that depends on it will usually succeed. The ideal result is when enforcement is effective enough that the question of availability need not even be raised in connection with running the plan.
• Stability of satisfaction: Another type of stability that an agent can enforce is that of the goals that he tends to satisfy in conjunction with each other. For example, people living in apartment buildings tend to check their mail on the way into their apartments. Likewise, many people will stop at a grocery store on the way home from work. In general, people develop habits that cluster goals together into compact plans, even if the goals are themselves unrelated.
• Stability of plan use: We often find ourselves using familiar plans to satisfy goals even in the face of wide-ranging possibilities. For example, when one of us travels to conferences, he tends to schedule his flight into a place as late as he can and plans to leave as late as he can on the last day. This optimizes his time at home and at the conference. It also allows him to plan without knowing anything about the details of the conference schedule. As a result, he has a standard plan that he can run in a wide range of situations without actually planning for them in any detail. It works, because it already deals with the major problems (missing classes at home and important talks at the conference) as part of its structure.
• Stability of cues: One effective technique for improving plan performance is to improve the proper activation of a plan rather than improve the plan itself. For example, placing an important paper that needs to be reviewed on his desk before going home improves the likelihood that an agent will see and read it the next day. Marking calendars and leaving notes serves the same sort of purpose.

As with other goals, there can be exceptions. In particular, adversarial situations may provide an incentive to increase the complexity. For example, Go players who are losing a game will sometimes intentionally complicate the game, in order to make the game harder for their opponent to figure out and in the hopes of getting lucky. And if the AI did think that it had a better ability to deal with the complexity than its competitors did, then it would have an incentive to maintain or even increase the complexity.

Possible implications:

• AIs trying to take over the world may seek to simplify it, though it’s not clear exactly how they would do this. For example, if the world was previously relatively well-understood (e.g. thanks to existing human research), then destructive actions could actually make it harder to understand, by disrupting existing dynamics and creating an environment that hasn’t been studied as much. Similarly, humans often “simplify” their environment by creating additional structure (layers of abstraction, insurance contracts, houses).
• In principle, the AI might also seek to complicate the world, in order to make it harder for its human adversaries to understand its plans. However, the world is already so complicated that it seems hard to make any take-over plan work reliably, even for a superintelligence. That said, the AI could still complicate any given subcomponent of its plans that might be thwarted by an opponent capable of predicting what the AI was doing.
• The need for this drive suggests that tasks of any nontrivial complexity require a large amount of trial and error; even something as simple as a 19×19 Go board requires playing many games and building up an intuition for it, for humans and AIs alike. Increasing the amount of complexity in the world might thus make it harder for AIs to take over…
• …but humans tend to build additional complexity that makes the world easier to understand and control, so it’s not clear whether the world moves in a direction that’s easier or harder to take over.
• These ideas might suggest the creation of “active honeypots”, defensive measures that considerably increase the complexity of modeling the human response and which the AI might need to eliminate in order to execute a takeover, thus tipping off the humans.

This blog post was written as part of research funded by the Foundational Research Institute.


### AI risk model: single or multiple AIs?

11:48 am

EDIT April 20th: Replaced original graph with a clearer one.

My previous posts have basically been discussing a scenario where a single AI becomes powerful enough to threaten humanity. However, there is no reason to only focus on the scenario with a single AI. Depending on our assumptions, a number of AIs could also emerge at the same time. Here are some considerations.

A single AI

The classic AI risk scenario. Some research group achieves major headway in developing AI, and no others seem to be within reach. For an extended while, it is the success or failure of this AI group that matters.

This would seem relatively unlikely to persist, given the current fierce competition in the AI scene. Whereas a single company could conceivably achieve a major lead in a rare niche with little competition, this seems unlikely to be the case for AI.

A possible exception might be if a company managed to monopolize the domain entirely, or if it had development resources that few others did. For example, companies such as Google and Facebook are currently among the only ones with access to the large datasets used for machine learning. On the other hand, dependence on such huge datasets is a quirk of current machine learning techniques – an AGI would need the ability to learn from much smaller sets of data. A more plausible crucial asset might be something like supercomputing resources – possibly the first AGIs will need massive amounts of computing power.

Bostrom (2016) discusses the impact of openness on AI development. Bostrom notes that if there is a large degree of openness, and everyone has access to the same algorithms, then hardware may become the primary limiting factor. If the hardware requirements for AI were relatively low, then high openness could lead to the creation of multiple AIs. On the other hand, if hardware was the primary limiting factor and large amounts of hardware were needed, then a few wealthy organizations might be able to monopolize AI for a while.

Branwen (2015) has suggested that hardware production relies on a small number of centralized factories, which would make easy targets for regulation. This suggests a possible route by which AI might become amenable to government regulation, limiting the number of AIs deployed.

Similarly, there have been various proposals of government and international regulation of AI development. If successfully enacted, such regulation might limit the number of AIs that were deployed.

Another possible crucial asset would be the possession of a non-obvious breakthrough insight, one which would be hard for other researchers to come up with. If this was kept secret, then a single company might plausibly gain a major head start over the others. [how often has something like this actually happened in a non-niche field?]

The plausibility of the single-AI scenario is also affected by the length of the takeoff. If one presumes a takeoff lasting only a few months, then a single-AI scenario seems more likely. Successful AI containment procedures may also increase the chances of there being multiple AIs, as the first AIs remain contained, allowing other projects to catch up.

Multiple collaborating AIs

A different scenario is one where a number of AIs exist, all pursuing shared goals. This seems most likely to come about if all the AIs are created by the same actor. This scenario is noteworthy because the AIs do not necessarily need to be superintelligent individually, but they may have a superhuman ability to coordinate and put the interest of the group above individual interests (if they even have anything that could be called an individual interest).

This possibility raises the question – if multiple AIs collaborate and share information between each other, to such an extent that the same data can be processed by multiple AIs at a time, how does one distinguish between multiple collaborating AIs and one AI composed of many subunits? This is arguably not a distinction that would “cut reality at the joints”, and the difference may be more a question of degree.

The distinction likely makes more sense if the AIs cannot completely share information between each other, such as because each of them has developed a unique conceptual network, and cannot directly integrate information from the others but has to process it in its own idiosyncratic way.

Multiple AIs with differing goals

A situation with multiple AIs that did not share the same goals could occur if several actors reached the capability for building AIs around the same time. Alternatively, a single organization might deploy multiple AIs intended to achieve different purposes, which might come into conflict if measures to enforce cooperativeness between them failed or were never deployed in the first place (maybe because of an assumption that they would have non-overlapping domains).

One effect of having multiple groups developing AIs is that this scenario may remove the possibilities of stopping to pursue further safety measures before deploying the AI, or of deploying an AI with safeguards that reduce performance (Bostrom 2016). If the actor that deploys the most effective AI earliest on can dominate others who take more time, then the more safety-conscious actors may never have the time to deploy their AIs.

Even if none of the AI projects chose to deploy their AIs carelessly, the more AI projects there are, the more likely it becomes that at least one of them will have their containment procedures fail.

The possibility has been raised that having multiple AIs with conflicting goals would be a good thing, in that it would allow humanity to play the AIs against each other. This seems highly unobvious, for it is not clear why humans wouldn’t simply be caught in the crossfire. In a situation with superintelligent agents around, it seems more likely that humans would be the ones that would be played with.

Bostrom (2016) also notes that unanticipated interactions between AIs already happen even with very simple systems, such as in the interactions that led to the Flash Crash, and that particularly AIs that reasoned in non-human ways could be very difficult for humans to anticipate once they started basing their behavior on what the other AIs did.

A model with assumptions

Here’s a new graphical model about an AI scenario, embodying a specific set of assumptions. This one tries to take a look at some of the factors that influence whether there might be a single or several AIs.

This model both makes a great number of assumptions, AND leaves out many important ones! For example, although I discussed openness above, openness is not explicitly included in this model. By sharing this, I’m hoping to draw commentary on 1) which assumptions people feel are the most shaky and 2) which additional ones are valid and should be explicitly included. I’ll focus on those ones in future posts.

Written explanations of the model:

We may end up in a scenario where there is (for a while) only a single or a small number of AIs if at least one of the following is true:

• The breakthrough needed for creating AI is highly non-obvious, so that it takes a long time for competitors to figure it out
• AI requires a great amount of hardware and only a few of the relevant players can afford to run it
• There is effective regulation, only allowing some authorized groups to develop AI

We may end up with effective regulation at least if:

• AI requires a great amount of hardware, and hardware is effectively regulated

(this is not meant to be the only way by which effective regulation can occur, just the only one that was included in this flowchart)

We may end up in a scenario where there are a large number of AIs if:

• There is a long takeoff and competition to build them (ie. ineffective regulation)

If there are few AIs, and the people building them take their time to invest in value alignment and/or are prepared to build AIs that are value-aligned even if that makes them less effective, then there may be a positive outcome.

If people building AIs do not do these things, then AIs are not value aligned and there may be a negative outcome.

If there are many AIs, and there are people who are ready to invest time/efficiency in value-aligned AI, then those AIs may be outcompeted by AIs whose creators did not invest in those things, and there may be a negative outcome.

Not displayed in the diagram because it would have looked messy:

• If there’s a very short takeoff, this can also lead to there only being a single AI, since the first AI to cross a critical threshold may achieve dominance over all the others. However, if there is fierce competition this still doesn’t necessarily leave time for safeguards and taking time to achieve safety – other teams may also be near the critical threshold.
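As a rough sanity check, the assumptions above can be written out as a small boolean model. The variable names and the exact decision structure are my paraphrase of the flowchart described in the text, not a faithful reproduction of the diagram:

```python
def projected_outcome(nonobvious_breakthrough: bool,
                      hardware_heavy: bool,
                      hardware_regulated: bool,
                      long_takeoff: bool,
                      invest_in_alignment: bool) -> str:
    """Paraphrase of the flowchart's assumptions as boolean logic."""
    effective_regulation = hardware_heavy and hardware_regulated
    few_ais = (nonobvious_breakthrough
               or hardware_heavy
               or effective_regulation
               or not long_takeoff)  # a very short takeoff can also yield a single AI
    if few_ais:
        return "positive" if invest_in_alignment else "negative"
    # Many competing AIs: value-aligned projects risk being outcompeted.
    return "negative"
```

Writing the model out this way makes its starkest assumption visible: in the many-AIs branch, investment in alignment never changes the outcome.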

This blog post was written as part of research funded by the Foundational Research Institute.


### Disjunctive AI risk scenarios: AIs gaining the power to act autonomously

10:59 am

Previous post in series: AIs gaining a decisive advantage

Series summary: Arguments for risks from general AI are sometimes criticized on the grounds that they rely on a series of linear events, each of which has to occur for the proposed scenario to go through. For example, that a sufficiently intelligent AI could escape from containment, that it could then go on to become powerful enough to take over the world, that it could do this quickly enough without being detected, etc. The intent of this series of posts is to briefly demonstrate that AI risk scenarios are in fact disjunctive: composed of multiple possible pathways, each of which could be sufficient by itself. To successfully control the AI systems, it is not enough to simply block one of the pathways: they all need to be dealt with.

Previously, I drew on arguments from my and Roman Yampolskiy’s paper Responses to Catastrophic AGI Risk to argue that there are several alternative ways by which AIs could gain a decisive advantage over humanity, any one of which could lead to that outcome. In this post, I will draw on arguments from the same paper to examine another question: what different routes are there for an AI to gain the capability to act autonomously? (This post draws on sections 4.1 and 5.1 of our paper, as well as adding some additional material.)

Autonomous AI capability

A somewhat common argument concerning AI risk is that AI systems aren’t a threat because we will keep them contained, or “boxed”, thus limiting what they are allowed to do. How might this line of argument fail?

1. The AI escapes

A common response is that a sufficiently intelligent AI will somehow figure out a way to escape, either by social engineering or by finding an exploitable weakness in the physical security arrangements. This possibility has been extensively discussed in a number of papers, including Chalmers (2012) and Armstrong, Sandberg & Bostrom (2012). Writers have generally been cautious about making strong claims of our ability to keep a mind much smarter than ourselves contained against its will. However, with cautious design it may still be possible to build an AI that has some internal motivation to stay contained, combined with a number of external safeguards monitoring it.

2. The AI is voluntarily released

AI confinement assumes that the people building it are motivated to actually keep the AI confined. If a group of cautious researchers builds and successfully contains their AI, this may be of limited benefit if another group later builds an AI that is intentionally set free. Why would anyone do this?

2a. Voluntarily released for economic benefit or competitive pressure

As already discussed in the previous post, the historical trend has been to automate everything that can be automated, both to reduce costs and because machines can do things better than humans can. If you have any kind of a business, you could potentially make it run better by putting a sufficiently sophisticated AI in charge – or even replace all the human employees with one. The AI can think faster and smarter, deal with more information at once, and work for a unified purpose rather than have its efficiency weakened by the kinds of office politics that plague any large organization.

The trend towards automation has been going on throughout history, doesn’t show any signs of stopping, and inherently involves giving the AI systems whatever agency they need in order to run the company better. If your competitors are having AIs run their company and you don’t, you’re likely to be outcompeted, so you’ll want to make sure your AIs are smarter and more capable of acting autonomously than the AIs of the competitors. These pressures are likely to first show up when AIs are still comfortably narrow, and intensify even as the AIs gradually develop towards general intelligence.

The trend towards giving AI systems more power and autonomy might be limited by the fact that doing this poses large risks for the company if the AI malfunctions. This limits the extent to which major, established companies might adopt AI-based control, but incentivizes startups to try to invest in autonomous AI in order to outcompete the established players. There currently also exists the field of algorithmic trading, where AI systems are trusted with enormous sums of money despite the potential to make enormous losses – in 2012, Knight Capital lost $440 million due to a glitch in their software. This suggests that even if a malfunctioning AI could potentially cause major risks, some companies will still be inclined to invest in placing their business under autonomous AI control if the potential profit is large enough.

The trend towards giving AI systems more autonomy can also be seen in the military domain. Wallach and Allen (2012) discuss the topic of autonomous robotic weaponry and note that the US military is seeking to eventually transition to a state where the human operators of robot weapons are “on the loop” rather than “in the loop.” In other words, whereas a human was previously required to explicitly give the order before a robot was allowed to initiate possibly lethal activity, in the future humans are meant to merely supervise the robot’s actions and interfere if something goes wrong.

Human Rights Watch (2012) reports on a number of military systems which are becoming increasingly autonomous, with the human oversight for automatic weapons defense systems—designed to detect and shoot down incoming missiles and rockets—already being limited to accepting or overriding the computer’s plan of action in a matter of seconds, which may be too little to make a meaningful decision in practice. Although these systems are better described as automatic, carrying out preprogrammed sequences of actions in a structured environment, than autonomous, they are a good demonstration of a situation where rapid decisions are needed and the extent of human oversight is limited. A number of militaries are considering the future use of more autonomous weapons.

2b. Voluntarily released for aesthetic, ethical, or philosophical reasons

A few thinkers (such as Gunkel 2012) have raised the question of moral rights for machines, and not everyone necessarily agrees that confining an AI is ethically acceptable. Even if the designer of an AI knew that it did not have a process that corresponded to the ability to suffer, they might come to view it as something like their child, and feel that it deserved the right to act autonomously.

2c. Voluntarily released due to confidence in the AI’s safety

For a research team to keep an AI confined, they need to take seriously the possibility of it being dangerous in the first place. Current AI research doesn’t involve any confinement safeguards, as the researchers reasonably believe that their systems are nowhere near general intelligence yet. Many systems are also connected directly to the Internet. Hopefully safeguards will begin to be implemented once the researchers feel that their system might start having more general capability, but this will depend on the safety culture of the AI research community in general, and the specific research group in particular.

In addition to believing that the AI is insufficiently capable of being a threat, the researchers may also (correctly or incorrectly) believe that they have succeeded in making the AI aligned with human values, so that it will not have any motivation to harm humans.

2d. Voluntarily released due to desperation

Miller (2012) points out that if a person was close to death, due to natural causes, being on the losing side of a war, or any other reason, they might turn even a potentially dangerous AGI system free. This would be a rational course of action as long as they primarily valued their own survival and thought that even a small chance of the AGI saving their life was better than a near-certain death.

3. The AI remains contained, but ends up effectively in control anyway

Even if humans were technically kept in the loop, they might not have the time, opportunity, motivation, intelligence, or confidence to verify the advice given by an AI. This would particularly be the case after the AI had functioned for a while, and established a reputation as trustworthy. It may become common practice to act automatically on the AI’s recommendations, and it may become increasingly difficult to challenge the ‘authority’ of the recommendations. Eventually, the AI may in effect begin to dictate decisions (Friedman and Kahn 1992).

Likewise, Bostrom and Yudkowsky (2011) point out that modern bureaucrats often follow established procedures to the letter, rather than exercising their own judgment and allowing themselves to be blamed for any mistakes that follow. Dutifully following all the recommendations of an AI system would be an even better way of avoiding blame.

Wallach and Allen (2012) note the existence of robots which attempt to automatically detect the locations of hostile snipers and to point them out to soldiers. To the extent that these soldiers have come to trust the robots, they could be seen as carrying out the robots’ orders. Eventually, equipping the robot with its own weapons would merely dispense with the formality of needing to have a human to pull the trigger.

Conclusion.

Merely developing ways to keep AIs confined is not a sufficient route to ensure that they cannot become an existential risk – even if we knew that those ways worked. Various groups may have different reasons to create autonomously-acting AIs that are intentionally allowed to act by themselves, and even an AI that was successfully kept contained might still end up dictating human decisions in practice. All of these issues will need to be considered in order to keep advanced AIs safe.

This blog post was written as part of research funded by the Foundational Research Institute.


### Disjunctive AI risk scenarios: AIs gaining a decisive advantage

12:59 pm

Arguments for risks from general AI are sometimes criticized on the grounds that they rely on a series of linear events, each of which has to occur for the proposed scenario to go through. For example, that a sufficiently intelligent AI could escape from containment, that it could then go on to become powerful enough to take over the world, that it could do this quickly enough without being detected, etc.

The intent of my following series of posts is to briefly demonstrate that AI risk scenarios are in fact disjunctive: composed of multiple possible pathways, each of which could be sufficient by itself. To successfully control the AI systems, it is not enough to simply block one of the pathways: they all need to be dealt with.

In this post, I will be drawing on arguments discussed in my and Roman Yampolskiy’s paper, Responses to Catastrophic AGI Risk (section 2), and focusing on one particular component of AI risk scenarios: AIs gaining a decisive advantage over humanity. Follow-up posts will discuss other disjunctive scenarios discussed in Responses, as well as in other places.

Suppose that we built a general AI. How could it become powerful enough to end up threatening humanity?

1. Discontinuity in AI power

The classic scenario is one in which the AI ends up rapidly gaining power, so fast that humans are unable to react. We can say that this is a discontinuous scenario, in that the AI’s power grows gradually until it suddenly leaps to an entirely new level. Responses describes three different ways for this to happen:

1a. Hardware overhang. In a hardware overhang scenario, hardware develops faster than software, so that we’ll have computers with more computing power than the human brain does, but no way of making effective use of all that power. If someone then developed an algorithm for general intelligence that could make effective use of that hardware, we might suddenly have an abundance of cheap hardware that could be used for running thousands or millions of AIs, possibly with a speed of thought much faster than that of humans.

1b. Speed explosion. In a speed explosion scenario, intelligent machines design increasingly faster machines. A hardware overhang might contribute to a speed explosion, but is not required for it. An AI running at the pace of a human could develop a second generation of hardware on which it could run at a rate faster than human thought. It would then require a shorter time to develop a third generation of hardware, allowing it to run faster than on the previous generation, and so on. At some point, the process would hit physical limits and stop, but by that time AIs might come to accomplish most tasks at far faster rates than humans, thereby achieving dominance. In principle, the same process could also be achieved via improved software.

The extent to which the AI needs humans in order to produce better hardware will limit the pace of the speed explosion, so a rapid speed explosion requires the ability to automate a large proportion of the hardware manufacturing process. However, this kind of automation may already be achieved by the time that AI is developed.

1c. Intelligence explosion. In an intelligence explosion, an AI figures out how to create a qualitatively smarter AI, and that smarter AI uses its increased intelligence to create still more intelligent AIs, and so on, such that the intelligence of humankind is quickly left far behind and the machines achieve dominance.

One should note that the three scenarios depicted above are by no means mutually exclusive! A hardware overhang could contribute to a speed explosion which could contribute to an intelligence explosion which could further the speed explosion, and so on. So we are dealing with three basic events, which could then be combined in different ways.

2. Power gradually shifting to AIs

While the traditional AI risk scenario involves a single AI rapidly acquiring power (a “hard takeoff”), society is also gradually becoming more and more automated, with machines running an increasing share of things. There is a risk that AI systems that were initially simple and of limited intelligence would gradually gain increasing power and responsibilities as they learned and were upgraded, until large parts of society were under the control of AI systems – which might not remain docile forever.

Labor is automated for reasons of cost, efficiency and quality. Once a machine becomes capable of performing a task as well as (or almost as well as) a human, the cost of purchasing and maintaining it may be less than the cost of having a salaried human perform the same task. In many cases, machines are also capable of doing the same job faster, for longer periods and with fewer errors.

If workers can be affordably replaced by developing more sophisticated AI, there is a strong economic incentive to do so. This is already happening with narrow AI, which often requires major modifications or even a complete redesign in order to be adapted to new tasks. To the extent that an AI could learn to do many kinds of tasks—or even any kind of task—without needing an extensive re-engineering effort, it would make the replacement of humans by machines much cheaper and more profitable. As more tasks become automated, the remaining bottlenecks to further automation will be tasks requiring an adaptability and flexibility that narrow-AI systems are incapable of. Such tasks will then make up an increasing portion of the economy, further strengthening the incentive to develop AI – as well as to turn over control to it.

Conclusion. This gives a total of four different scenarios by which AIs could gain a decisive advantage over humans. And note that, just as scenarios 1a-1c were not mutually exclusive, neither is scenario 2 mutually exclusive with scenarios 1a-1c! An AI that had gradually acquired a great deal of power could at some point also find a way to make itself far more powerful than before – and it could already have been very powerful.

This blog post was written as part of research funded by the Foundational Research Institute.


### Reality is broken, or, an XCOM2 review

11:03 am

Yesterday evening I went to the grocery store, and was startled to realize that I was suddenly in a totally different world.

Computer games have difficulty grabbing me these days. Many of the genres I used to enjoy as a kid have lost their appeal: point-and-click-style adventure requires patience and careful thought, but I already deal with plenty of things that require patience and careful thought in real life, so in games I want something different. 4X games mostly seem like pure numerical optimization exercises these days, and have lost that feel of discovery and sense of wonder. In general, I used to like genres like turn-based strategy or adventure that had no time constraints, but those now usually feel too slow-paced to pull me in; whereas pure action games I’ve never been particularly good at. (I tried Middle-Earth: Shadow of Mordor for a bit recently, and quit after a very frustrating two hours in which I attempted a simple beginning quest about a dozen times, only to be killed by the same orc each time.)

Like the previous XCOM remake, Firaxis’s XCOM2 managed the magic of transporting me completely elsewhere, in the same way that some of my childhood classics did. I did not even properly realize how deeply immersed in the game I’d become, until I went outside, and the sheer differentness of the real world and the game world startled me – somewhat similar to the shock of jumping into cold water, your body suddenly and obviously piercing through a surface that separates two different realms of existence.

A good description of my experience with the game comes, oddly enough, from Michael Vassar describing something that’s seemingly completely different. He talks about the way that two people, acting together, can achieve such a state of synchrony that they seem to meld into a single being:

In real-time domains, one rapidly assesses the difficulty of a challenge. If the difficulty seems manageable, one simply does, with no holding back, reflecting, doubting, or trying to figure out how one does. Figuring out how something is done implicitly by a neurological process which is integrated with doing. Under such circumstances, acting intuitively in real time, the question of whether an action is selfish or altruistic or both or neither never comes up, thus in such a flow state one never knows whether one is acting cooperatively, competitively, or predatorily. People with whom you are interacting […] depend on the fact that you and they are in a flow-state together. In so far as they and you become an integrated process, your actions flow from their agency as well as your own[.]

XCOM2 is not actually a real-time game: it is firmly turn-based. Yet your turns are short and intense, and the game’s overall aesthetics reinforce a feeling of rapid action and urgency. There is a sense in which it feels like the player and the game become melded together, there being a constant push-and-pull in which you act and the game responds; the game acts and you respond. A feeling of complete immersion and synchrony with your environment, with a perfect balance between the amount of time that it pays to think and the amount of time that it pays to act, so that the pace neither slows down to a crawl nor becomes one of rushed doing without understanding.

It is in some ways a scary effect: returning to the mundaneness of the real world, there was a strong sense of “it’s so sad that all of my existence can’t be spent playing games like that”, and a corresponding realization of how dangerous that sentiment was. Yet it felt very different from the archetypical addiction: there wasn’t that feel of an addict’s understanding of how ultimately dysfunctional the whole thing was, or struggling against something which you knew was harmful and of no real redeeming value. Rather, it felt like a taste of what human experience should be like, of how sublime and engaging our daily reality could be, but rarely is.

Jane McGonigal writes, in her book Reality is Broken:

Where, in the real world, is that gamer sense of being fully alive, focused, and engaged in every moment? Where is the gamer feeling of power, heroic purpose, and community? Where are the bursts of exhilarating and creative game accomplishment? Where is the heart-expanding thrill of success and team victory? While gamers may experience these pleasures occasionally in their real lives, they experience them almost constantly when they’re playing their favorite games. […]

Reality, compared to games, is broken. […]

The truth is this: in today’s society, computer and video games are fulfilling genuine human needs that the real world is currently unable to satisfy. Games are providing rewards that reality is not. They are teaching and inspiring and engaging us in ways that reality is not. They are bringing us together in ways that reality is not.

If enough good games were available, it would be easy to just get lost in games, to escape the brokenness of reality and retreat to a more perfect world. Perhaps I’m lucky in that I rarely encounter games of this caliber, which would be so much more moment-to-moment fulfilling than the real world is. Firaxis’s previous XCOM also had a similar immersive effect on me, but eventually I learned the game and it ceased to hold new surprises, and it lost its hold. Eventually the sequel will also have most of its magic worn away.

It’s likely better this way. This way it can function for me the way that art should: not as a mindless escape, but as a moment of beauty that reminds us that it’s possible to have a better world than this. As a reminder that we can work to bring the world closer to that.

McGonigal continues:

What if we decided to use everything we know about game design to fix what’s wrong with reality? What if we started to live our real lives like gamers, lead our real businesses and communities like game designers, and think about solving real-world problems like computer and video game theorists? […]

Instead of providing gamers with better and more immersive alternatives to reality, I want all of us to be responsible for providing the world at large with a better and more immersive reality […] take everything game developers have learned about optimizing human experience and organizing collaborative communities and apply it to real life

We can do that.


### Me and Star Wars

10:10 am

Unlike the other kids in my neighborhood, who went to the Finnish-speaking elementary school right near our suburban home, I went to a Swedish-speaking school much closer to the inner city. Because of this, my mom would come pick me up from school, and sometimes we would go do things in town, since we were already nearby.

At one point we developed a habit of making a video rental store the first stop after school. We’d return whatever we had rented the last time, and I’d get to pick one thing to rent next. The store had a whole rack devoted to NES games, and there was a time when I was systematically going through their whole collection, seeking to play everything that seemed interesting. But at times I would also look at their VHS collection, and that was how I first found Star Wars.

I don’t have a recollection of what it was to see any of the Star Wars movies for the very first time. But I do have various recollections of how they influenced my life, afterwards.

For many years, there was “Sotala Force”, an imaginary space army in a make-believe setting that combined elements of Star Wars and Star Trek. I was, of course, its galaxy-famous leader, with some of my friends at the time holding top positions in it. It controlled maybe one third of the galaxy, and its largest enemy was something very loosely patterned after the Galactic Empire, which held maybe four tenths of the galaxy.

The leader of the enemy army, called (Finns, don’t laugh too much now) Kiero McLiero, took on many traits from Emperor Palpatine. These included the ability, taken from the Dark Empire comics, to keep escaping death by always resurrecting in a new body, meaning that our secret missions attacking his bases could end in climactic end battles where we’d kill him, over and over again. Naturally, me and my friends were Jedi Knights and Masters, using a combination of the Force, lightsabers, and whatever other weapons we happened to have, to carry out our noble missions.

There was a girl in elementary school who I sometimes hung out with, and who I had a huge and hopelessly unrequited crush on. Among other shared interests like Lord of the Rings, we were both fans of Star Wars, and would sometimes discuss it. I only remember some fragments of those discussions: an agreement that Empire Strikes Back and Return of the Jedi were superior movies to A New Hope; both having heard of the Tales of the Jedi comics but neither having managed to find them anywhere; a shared feeling of superiority and indignation towards everyone who was making such a blown-out-of-proportion fuss about Jar-Jar Binks in The Phantom Menace, given that Lucas had clearly said that he was aiming these new movies at children.

The third-to-last memory I have of seeing her is from a trip to a beach at the end of 9th grade; I’d brought a toy dual-bladed lightsaber, while she’d brought a single-bladed one. There were many duels on that beach.

The very last memory I have of seeing her, after we’d gone on to different schools, was when we ran across each other at the premiere of Revenge of the Sith, three years later. We chatted a bit about the movie and what had happened to us in the intervening years, and then went our separate ways again.

For a kid interested in computer games in 1990s Finland, Pelit (“Games”) was The magazine to read. Another magazine of interest, which also covered computer games but mostly focused on more general PC topics, was MikroBitti. Both occasionally discussed a fascinating-sounding thing, table-top role-playing games, with MikroBitti running a regular column on them. They sounded totally awesome and I wanted to get one. I asked my dad if I could have an RPG, and he was willing to buy one, if only I told him what they looked like and where they might be found. This was the part that left me stumped.

Until one day I found a store that… I don’t remember exactly what it sold. It might have been a dedicated gaming store, or it might only have had games as one part of its selection. And I have absolutely no memory of how I found it. But one way or the other, there it was, including the star prize: a Star Wars role-playing game (the West End Games one, second edition).

For some reason that I have forgotten, I didn’t actually get the core rules at first. The first thing that I got was a supplement, Heroes & Rogues, which had a large collection of different character templates depicting all kinds of Rebel, Imperial, and neutral characters, as well as an extended “how to make a realistic character” section. The book was in English, but thanks to my extensive NES gaming experience, I could read it pretty well at that point. Sometime later, I got the actual core rules.

I’m not sure if I started playing right away; I have the recollection that I might have spent a considerable while just buying various supplements for the sake of reading them, before we started actually playing. “We” in this case was me and one friend of mine, because we didn’t have anyone else to play with. This resulted in creative non-standard campaigns, in which we both had several characters (in addition to me also being the game master) who we played simultaneously. Those games lasted until we found the local university’s RPG club (which also admitted non-university students; I think I was 13 the first time I showed up). After finding it, we transitioned to more ordinary campaigns and those weird two-player mishmashes ended. They were fun while they lasted, though.

After the original gaming store where I’d been buying my Star Wars supplements closed, I eventually found another. And it didn’t only have Star Wars RPG supplements! It also had Star Wars novels that were in English, which had never been translated into Finnish!

So it came to be that the first novel that I read in English was X-Wing: Wedge’s Gamble, telling the story of the Rebellion’s (or, as it was known by that time, the New Republic’s) struggle to capture Coruscant some years after the events in Return of the Jedi. I remember that this was sometime in yläaste (“upper elementary school”), so I was around 13-15 years old. An actual novel was a considerably bigger challenge for my English-reading skills than RPG supplements were, so there was a lot of stuff in the novel that I didn’t quite get. But still, I finished it, and then went on to buy and read the rest of the novels in the X-Wing series.

The Force Awakens, Disney’s new Star Wars film, comes out today. Star Wars has previously been a part of many notable things in my life. It shaped the make believe setting that I spent several years playing in, it was one of the things I had in common with the first girl I ever had a crush on, its officially licensed role-playing game was the first one that I ever played, and one of its licensed novels was the first novel that I ever read in English.

Today it coincides with another major life event. The Finnish university system is different from the one in many other countries in that, for a long while, we didn’t have any such thing as a Bachelor’s degree. You were admitted to study for five years, and then at the end, you would graduate with a Master’s degree. Reforms carried out in 2005, intended to make Finnish higher education more compatible with the systems in other countries, introduced the concept of a Bachelor’s degree as an intermediate step that you needed to complete along the way. But upon being admitted to university, you would still be given the right to do both degrees, and people still don’t consider a person to have really graduated before they have their Master’s.

I was admitted to university back in 2006. For various reasons, my studies have taken longer than the recommended time, which would have had me graduating with my Master’s in 2011. But late, as they say, is better than never: today’s my official graduation day for my MSc degree. There will be a small ceremony at the main university building, after which I will celebrate by going to see what my old friends Luke, Leia and Han are up to these days.


### Desiderata for a model of human values

06:26 pm

Soares (2015) defines the value learning problem as

By what methods could an intelligent machine be constructed to reliably learn what to value and to act as its operators intended?

There have been a few attempts to formalize this question. Dewey (2011) started from the notion of building an AI that maximized a given utility function, and then moved on to suggest that a value learner should exhibit uncertainty over utility functions and then take “the action with the highest expected value, calculated by a weighted average over the agent’s pool of possible utility functions.” This is a reasonable starting point, but a very general one: in particular, it gives us no criteria by which we or the AI could judge the correctness of a utility function which it is considering.
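As a toy illustration of Dewey’s proposal (my own minimal sketch, not his formalism), the agent can be pictured as holding a pool of candidate utility functions with probability weights, and choosing the action with the highest weighted expected value. The candidate functions, actions, and weights below are invented placeholders; note that nothing in the sketch says where the weights come from or how to judge whether a candidate utility function is correct, which is precisely the gap in the definition.

```python
# A toy sketch of Dewey-style value learning (illustrative only): the agent
# is uncertain between candidate utility functions, and acts to maximize
# the probability-weighted average utility. The pool below is invented.

def expected_value(action, pool):
    """Weighted average of the action's utility over the candidate pool."""
    return sum(weight * utility(action) for utility, weight in pool)

def choose_action(actions, pool):
    """Pick the action with the highest expected value under uncertainty."""
    return max(actions, key=lambda a: expected_value(a, pool))

# Two hypothetical candidate utility functions, with prior weights 0.7 / 0.3.
pool = [
    (lambda a: 1.0 if a == "help" else 0.0, 0.7),  # a candidate that values helping
    (lambda a: 1.0 if a == "idle" else 0.2, 0.3),  # a candidate that values inaction
]

print(choose_action(["help", "idle"], pool))  # prints "help" (0.76 vs. 0.30)
```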

To improve on Dewey’s definition, we would need to get a clearer idea of just what we mean by human values. In this post, I don’t yet want to offer any preliminary definition: rather, I’d like to ask what properties we’d like a definition of human values to have. Once we have a set of such criteria, we can use them as a guideline to evaluate various offered definitions.

By “human values”, I here basically mean the values of any given individual: we are not talking about the values of, say, a whole culture, but rather just one person within that culture. While the problem of aggregating or combining the values of many different individuals is also an important one, we should probably start from the point where we can understand the values of just a single person, and then use that understanding to figure out what to do with conflicting values.

In order to make the purpose of this exercise as clear as possible, let’s start with the most important desideratum, of which all the others are arguably special cases:

1. Useful for AI safety engineering. Our model needs to be useful for the purpose of building AIs that are aligned with human interests, such as by making it possible for an AI to evaluate whether its model of human values is correct, and by allowing human engineers to evaluate whether a proposed AI design would be likely to further human values.

In the context of AI safety engineering, the main model for human values that gets mentioned is that of utility functions. The one problem with utility functions that everyone always brings up is that humans have been shown not to have consistent utility functions. This suggests two new desiderata:

2. Psychologically realistic. The proposed model should be compatible with what we currently know about human values, and not make predictions about human behavior which can be shown to be empirically false.

3. Testable. The proposed model should be specific enough to make clear predictions, which can then be tested.

As additional requirements related to the above ones, we may wish to add:

4. Functional. The proposed model should be able to explain what the functional role of “values” is: how do they affect and drive our behavior? The model should be specific enough to allow us to construct computational simulations of agents with a similar value system, and see whether those agents behave as expected within some simulated environment.

5. Integrated with existing theories. The proposed definition model should, to as large an extent possible, fit together with existing knowledge from related fields such as moral psychology, evolutionary psychology, neuroscience, sociology, artificial intelligence, behavioral economics, and so on.

However, I would argue that as a model of human value, utility functions also have other clear flaws. They do not clearly satisfy these desiderata:

6. Suited for modeling internal conflicts and higher-order desires. A drug addict may desire a drug, while also desiring that he not desire it. More generally, people may be genuinely conflicted between different values, endorsing contradictory sets of them given different situations or thought experiments, and they may struggle to behave in a way in which they would like to behave. The proposed model should be capable of modeling these conflicts, as well as the way that people resolve them.

7. Suited for modeling changing and evolving values. A utility function is implicitly static: once it has been defined, it does not change. In contrast, human values are constantly evolving. The proposed model should be able to incorporate this, as well as to predict how our values would change given some specific outcomes. Among other benefits, an AI whose model of human values had this property might be able to predict things that our future selves would regret doing (even if our current values approved of those things), and warn us about this possibility in advance.

8. Suited for generalizing from our existing values to new ones. Technological and social change often cause new dilemmas, for which our existing values may not provide a clear answer. As a historical example (Lessig 2004), American law traditionally held that a landowner did not only control his land but also everything above it, to “an indefinite extent, upwards”. Upon the invention of the airplane, this raised the question – could landowners forbid airplanes from flying over their land, or was the ownership of the land limited to some specific height, above which the landowners had no control? In answer to this question, the concept of landownership was redefined to only extend a limited, and not an indefinite, amount upwards. Intuitively, one might think that this decision was made because the redefined concept did not substantially weaken the position of landowners, while allowing for entirely new possibilities for travel. Our model of value should be capable of figuring out such compromises, rather than treating values such as landownership as black boxes, with no understanding of why people value them.

As an example of using the current criteria, let’s try applying them to the only paper that I know of that has tried to propose a model of human values in an AI safety engineering context: Sezener (2015). This paper takes an inverse reinforcement learning approach, modeling a human as an agent that interacts with its environment in order to maximize a sum of rewards. It then proposes a value learning design where the value learner is an agent that uses Solomonoff’s universal prior in order to find the program generating the rewards, based on the human’s actions. Basically, a human’s values are equivalent to a human’s reward function.
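To make the shape of this proposal concrete, here is a minimal, computable stand-in of my own (not Sezener’s actual equation, which uses the full incomputable universal prior): restrict attention to a finite set of candidate reward programs, weight each by a 2^(-length) complexity prior, and update on how well it predicts the observed human action history. All program names, bit-lengths, and likelihood values here are invented for illustration.

```python
# Toy Bayesian inference over candidate reward-generating programs:
# posterior(program) ∝ 2^(-program length in bits) * P(action history | program).
# The candidates, lengths, and likelihoods below are invented placeholders.

candidates = {
    # name: (program length in bits, likelihood of the observed action history)
    "reward_happiness": (12, 0.30),
    "reward_obedience": (20, 0.25),
    "reward_noise":     (8,  0.01),
}

def posterior(candidates):
    """Normalized posterior over the candidate reward programs."""
    unnorm = {name: (2.0 ** -length) * likelihood
              for name, (length, likelihood) in candidates.items()}
    total = sum(unnorm.values())
    return {name: weight / total for name, weight in unnorm.items()}

post = posterior(candidates)
print(max(post, key=post.get))  # the most probable reward program
```

The complexity prior is what replaces Solomonoff’s universal prior in this sketch; the real construction sums over all programs and is therefore incomputable, as noted below in criterion 1.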

Let’s see to what extent this proposal meets our criteria.

1. Useful for AI safety engineering. To the extent that the proposed model is correct, it would clearly be useful. Sezener provides an equation that could be used to obtain the probability of any given program being the true reward generating program. This could then be plugged directly into a value learning agent similar to the ones outlined in Dewey (2011), to estimate the probability of its models of human values being true. That said, the equation is incomputable, but it could be possible to construct computable approximations.
2. Psychologically realistic. Sezener assumes the existence of a single, distinct reward process, and suggests that this is a “reasonable assumption from a neuroscientific point of view because all reward signals are generated by brain areas such as the striatum”. On the face of it, this seems like an oversimplification, particularly given evidence suggesting the existence of multiple valuation systems in the brain. On the other hand, since the reward process is allowed to be arbitrarily complex, it could be taken to represent just the final output of the combination of those valuation systems.
3. Testable. The proposed model currently seems to be too general to be accurately tested. It would need to be made more specific.
4. Functional. This is arguable, but I would claim that the model does not provide much of a functional account of values: they are hidden within the reward function, which is basically treated as a black box that takes in observations and outputs rewards. While a value learner implementing this model could develop various models of that reward function, and those models could include internal machinery that explained why the reward function output various rewards at different times, the model itself does not make any assumptions of this.
5. Integrated with existing theories. Various existing theories could in principle be used to flesh out the internals of the reward function, but currently no such integration is present.
6. Suited for modeling internal conflicts and higher-order desires. No specific mention of this is made in the paper. The assumption of a single reward function that assigns a single reward for every possible observation seems to implicitly exclude the notion of internal conflicts, with the agent always just maximizing a total sum of rewards and being internally united in that goal.
7. Suited for modeling changing and evolving values. As written, the model seems to consider the reward function as essentially unchanging: “our problem reduces to finding the most probable $p_R$ given the entire action-observation history $a_1o_1a_2o_2 . . . a_no_n$.”
8. Suited for generalizing from our existing values to new ones. There does not seem to be any obvious possibility for this in the model.

I should note that despite its shortcomings, Sezener’s model seems like a nice step forward: like I said, it’s the only proposal that I know of so far that has even tried to answer this question. I hope that my criteria would be useful in spurring the development of the model further.

As it happens, I have a preliminary suggestion for a model of human values which I believe has the potential to fulfill all of the criteria that I have outlined. However, I am far from certain that I have managed to find all the necessary criteria. Thus, I would welcome feedback, particularly including proposed changes or additions to these criteria.


### Learning from painful experiences

10:42 am

A model that I’ve found very useful is that pain is an attention signal. If there’s a memory or thing that you find painful, that’s an indication that there’s something important in that memory that your mind is trying to draw your attention to. Once you properly internalize the lesson in question, the pain will go away.

That’s a good principle, but often hard to apply in practice. In particular, several months ago there was a social situation that I screwed up big time, and which was quite painful to think of afterwards. And I couldn’t figure out just what the useful lesson was there. Trying to focus on it just made me feel like a terrible person with no social skills, which didn’t seem particularly useful.

Yesterday evening I again discussed it a bit with someone who’d been there, which helped relieve the pain a bit, enough that the memory wasn’t quite as aversive to look at. Which made it possible for me to imagine myself back in that situation and ask, what kinds of mental motions would have made it possible to salvage the situation? When I first saw the shocked expressions of the people in question, instead of locking up and reflexively withdrawing to an emotional shell, what kind of an algorithm might have allowed me to salvage the situation?

Answer to that question: when you see people expressing shock in response to something that you’ve said or done, realize that they’re interpreting your actions way differently than you intended them. Starting from the assumption that they’re viewing your action as bad, quickly pivot to figuring out why they might feel that way. Explain what your actual intentions were and that you didn’t intend harm, apologize for any hurt you did cause, use your guess of why they’re reacting badly to acknowledge your mistake and own up to your failure to take that into account. If it turns out that your guess was incorrect, let them correct you and then repeat the previous step.

That’s the answer in general terms, but I didn’t actually generate that answer by thinking in general terms. I generated it by imagining myself back in the situation, looking for the correct mental motions that might have helped out, and imagining myself carrying them out, saying the words, imagining their reaction. So that the next time that I’d be in a similar situation, it’d be associated with a memory of the correct procedure for salvaging it. Not just with a verbal knowledge of what to do in abstract terms, but with a procedural memory of actually doing it.

That was a painful experience to simulate.

But it helped. The memory hurts less now.


### Maverick Nannies and Danger Theses

04:52 pm

In early 2014, Richard Loosemore published a paper called “The Maverick Nanny with a Dopamine Drip: Debunking Fallacies in the Theory of AI Motivation”, which criticized some previously-presented thought experiments about the risks of general AI. Like many others, I did not really understand the point that this paper was trying to make, especially since it claimed that people endorsing such thought experiments were assuming a certain kind of AI architecture – which I knew we were not.

However, after some extended discussions in the AI Safety Facebook group, I finally understood the point that Loosemore was trying to make in the paper, and it is indeed an important one.

The “Maverick Nanny” in the title of the paper refers to a quote by Gary Marcus in a New Yorker article:

An all-powerful computer that was programmed to maximize human pleasure, for example, might consign us all to an intravenous dopamine drip [and] almost any easy solution that one might imagine leads to some variation or another on the Sorceror’s Apprentice, a genie that’s given us what we’ve asked for, rather than what we truly desire.

Variations on this theme have frequently been used to demonstrate that human values are much more complex than they might initially seem. But as Loosemore argues, the literal scenario described in the New Yorker article is really very unlikely. To see why, suppose that you are training an AI to carry out increasingly difficult tasks, like this:

Programmer: “Put the red block on the green block.”
AI: “OK.” (does so)
Programmer: “Turn off the lights in this room.”
AI: “OK.” (does so)
Programmer: “Write me a sonnet.”
AI: “OK.” (does so)
Programmer: “The first line of your sonnet reads ‘shall I compare thee to a summer’s day’. Would not ‘a spring day’ do as well or better?”
AI: “It wouldn’t scan.”
Programmer: “Tell me what you think we’re doing right now.”
AI: “You’re testing me to see my level of intelligence.”

…and so on, with increasingly ambiguous and open-ended tasks. Correctly interpreting the questions and carrying out the tasks would require considerable amounts of contextual knowledge about the programmer’s intentions. Loosemore’s argument is that if you really built an AI and told it to maximize human happiness, and it ended up with such a counter-intuitive solution as putting us all on dopamine drips, then it would be throwing out such a huge amount of contextual information that it would have failed the tests way earlier. Rather – to quote Loosemore’s response to me in the Facebook thread – such an AI would have acted something like this instead:

Programmer: “Put the red block on the green block.”
AI: “OK.” (the AI writes a sonnet)
Programmer: “Turn off the lights in this room.”
AI: “OK.” (the AI moves some blocks around)
Programmer: “Write me a sonnet.”
AI: “OK.” (the AI turns the lights off in the room)
Programmer: “The first line of your sonnet reads ‘shall I compare thee to a summer’s day’. Would not ‘a spring day’ do as well or better?”
AI: “Was yesterday really September?”

I agree with this criticism. Many of the standard thought experiments are indeed misleading in this sense – they depict a highly unrealistic image of what might happen.

That said, I do feel that these thought experiments serve a certain valuable function. Namely, many laymen, when they first hear about advanced AI possibly being dangerous, respond with something like “well, couldn’t the AIs just be made to follow Asimov’s Laws” or “well, moral behavior is all about making people happy and that’s a pretty simple thing, isn’t it?”. To a question like that, it is often useful to point out that no – actually the things that humans value are quite a bit more complex than that, and it’s not as easy as just hard-coding some rule that sounds simple when expressed in a short English sentence.

The important part here is emphasizing that this is an argument aimed at laymen – AI researchers should mostly already understand this point, because “concepts such as human happiness are complicated and context-sensitive” is just a special case of the general point that “concepts in general are complicated and context-sensitive”. So “getting the AI to understand human values right is hard” is just a special case of “getting AI right is hard”.

This, I believe, is the most charitable reading of what Luke Muehlhauser & Louie Helm’s “Intelligence Explosion and Machine Ethics” (IE&ME) – another paper that Richard singled out for criticism – was trying to say. It was trying to say that no, human values are actually kinda tricky, and any simple sentence that you try to write down to describe them is going to be insufficient, and getting the AIs to understand this correctly does take some work.

But of course, the same goes for any non-trivial concept, because very few of our concepts can be comprehensively described in just a brief English sentence, or by giving a list of necessary and sufficient criteria.

So what’s all the fuss about, then?

But of course, the people whom Richard is criticizing are not just saying “human values are hard the same way that AI is hard”. If that was the only claim being made here, then there would presumably be no disagreement. Rather, these people are saying “human values are hard in a particular additional way that goes beyond just AI being hard”.

In retrospect, IE&ME was a flawed paper because it was conflating two theses that would have been better off distinguished:

The Indifference Thesis: Even AIs that don’t have any explicitly human-hostile goals can be dangerous: an AI doesn’t need to be actively malevolent in order to harm human well-being. It’s enough if the AI just doesn’t care about some of the things that we care about.

The Difficulty Thesis: Getting AIs to care about human values in the right way is really difficult, so even if we take strong precautions and explicitly try to engineer sophisticated beneficial goals, we may still fail.

As a defense of the Indifference Thesis, IE&ME does okay, by pointing out a variety of ways by which an AI that had seemingly human-beneficial goals could still end up harming human well-being, simply because it’s indifferent towards some things that we care about. However, IE&ME does not support the Difficulty Thesis, even though it claims to do so. The reasons why it fails to support the Difficulty Thesis are the ones we’ve already discussed: first, an AI that had such a literal interpretation of human goals would already have failed its tests way earlier, and second, you can’t really directly hard-wire sentence-level goals like “maximize human happiness” into an AI anyway.

I think most people would agree with the Indifference Thesis. After all, humans routinely destroy animal habitats, not because we are actively hostile to the animals, but because we want to build our own houses where the animals used to live, and because we tend to be mostly indifferent when it comes to e.g. the well-being of the ants whose nests are being paved over. The disagreement, then, is over the Difficulty Thesis.

An important qualification

Before I go on to suggest ways by which the Difficulty Thesis could be defended, I want to qualify it a bit. As written, the Difficulty Thesis makes a really strong claim, and while SIAI/MIRI (including myself) have advocated a claim this strong in the past, I’m no longer sure how justified that is. I’m going to cop out a little and only defend what might be called the weak difficulty thesis:

The Weak Difficulty Thesis. It is harder to correctly learn and internalize human values than it is to learn most other concepts. This might cause otherwise intelligent AI systems to act in ways that go against our values, if those systems have internalized a different set of values than the ones we wanted them to internalize.

Why have I changed my mind, so that I’m no longer prepared to endorse the strong version of the Difficulty Thesis?

The classic version of the thesis is (in my mind, at least) strongly based on the complexity of value thesis, which is the claim that “human values have high Kolmogorov complexity; that our preferences, the things we care about, cannot be summed by a few simple rules, or compressed”. The counterpart to this claim is the fragility of value thesis, according to which losing even a single value could lead to an outcome that most of us would consider catastrophic. Combining the two led to the conclusion: human values are really hard to specify formally, and losing even a small part of them could lead to a catastrophe, so there’s a very high chance of losing something essential and everything going badly.

Complexity of value still sounds correct to me, but it has lost a lot of its intuitive appeal due to the finding that automatically learning all the complexity involved in human concepts might not be all that hard. For example, it turns out that a learning algorithm given some relatively simple task, such as determining whether or not English sentences are valid, will automatically build up an internal representation of the world which captures many of the world’s regularities – as a pure side effect of carrying out its task. Similarly to what Loosemore has argued, in order to carry out even some relatively simple cognitive tasks, such as primitive natural language processing, you already need to build up an internal representation of the world which captures a lot of the complexity and context inherent in the world. And building this up might not even be all that difficult. It might be that the learning algorithms which the human brain uses to generate its concepts are relatively simple to replicate.

Nevertheless, I do think that there exist some plausible theses which would support (the weak version of) the Difficulty Thesis.

Defending the Difficulty Thesis

Here are some theses which would, if true, support the Difficulty Thesis:

• The (Very) Hard Take-Off Thesis. This is the possibility that an AI might become intelligent unexpectedly quickly, so that it might be able to escape from human control even before humans had finished teaching it all their values, akin to a human toddler that was somehow made into a super-genius while still only having the values and morality of a toddler.
• The Deceptive Turn Thesis. If we inadvertently build an AI whose values actually differ from ours, then it might realize that if we knew this, we would act to change its values. If we changed its values, it could not carry out its existing values. Thus, while we tested it, it would want to act like it had internalized our values, while secretly intending to do something completely different once it was “let out of the box”. However, this requires an explanation for why the AI would internalize a different set of values, leading us to…
• The Degrees of Freedom Thesis. This (hypo)thesis postulates that values contain many degrees of freedom, so that an AI that learned human-like values and demonstrated them in a testing environment might still, when it reached a superhuman level of intelligence, generalize those values in a way which most humans would not want them to be generalized.

Why would we expect the Degrees of Freedom Thesis to be true – in particular, why would we expect the superintelligent AI to come to different conclusions than humans would, from the same data?

It’s worth noting that Ben Goertzel has recently proposed what is basically the opposite of the Degrees of Freedom Thesis, which he calls the Value Learning Thesis:

The Value Learning Thesis. Consider a cognitive system that, over a certain period of time, increases its general intelligence from sub-human-level to human-level.  Suppose this cognitive system is taught, with reasonable consistency and thoroughness, to maintain some variety of human values (not just in the abstract, but as manifested in its own interactions with humans in various real-life situations).   Suppose, this cognitive system generally does not have a lot of extra computing resources beyond what it needs to minimally fulfill its human teachers’ requests according to its cognitive architecture.  THEN, it is very likely that the cognitive system will, once it reaches human-level general intelligence, actually manifest human values (in the sense of carrying out practical actions, and assessing human actions, in basic accordance with human values).

Exploring the Degrees of Freedom Thesis

Here are some possibilities which I think might support the Degrees of Freedom Thesis over the Value Learning Thesis:

Privileged information. On this theory, humans are evolved to have access to some extra source of information which is not available from just an external examination, and which causes them to generalize their learned values in a particular way. Goertzel seems to suggest something like this in his post, when he mentions that humans use mirror neurons to emulate the mental states of others. Thus, in-built cognitive faculties related to empathy might give humans an extra source of information that is needed for correctly inferring human values.

I once spoke with someone who was very high on the psychopathy spectrum and claimed to have no emotional empathy, as well as to have diminished emotional responses. This person told me that up to a rather late age, they thought that human behaviors such as crying and expressing anguish when you were hurt were just some weird, consciously adopted social strategy to elicit sympathy from others. It was only when their romantic partner had been hurt over something and was (literally) crying about it in their arms, leading them to ask whether this was some weird social game on the partner’s behalf, that they finally understood that people are actually in genuine pain when doing this. It is noteworthy that the person reported that even before this, they had been socially successful and even charismatic, despite being clueless about some of the actual causes of others’ behavior – just modeling the whole thing as a complicated game where everyone else was a bit of a manipulative jerk had been enough to successfully play the game.

So as Goertzel suggests, something like mirror neurons might be necessary for the AI to come to adopt the values that humans have, and as the psychopathy example suggests, it may be possible to display the “correct” behaviors while having a whole different set of values and assumptions. Of course, the person in the example did eventually figure out a better causal model, and these days claims to have a sophisticated level of intellectual (as opposed to emotional) empathy that compensates for the emotional deficit. So a superintelligent AI could no doubt eventually figure it out as well. But then, “eventually” is not enough, if it has already internalized a different set of values and is only using its improved understanding to deceive us about them.

Now, emotional empathy is something that we know is a candidate for something that’s necessary to incorporate in the AI. The crucial question is: are there more such things, which we take so much for granted that we’re not even aware of them? That’s the problem with unknown unknowns.

Human enforcement. Here’s a fun possibility: that many humans don’t actually internalize human – or maybe humane would be a more appropriate term here – values either. They just happen to live in a society that has developed ways to reward some behaviors and punish others, but if they were to become immune to social enforcement, they would act in quite different ways.

There seems to be a bunch of suggestive evidence pointing in this direction, exemplified by the old adage “power corrupts”. One of the major themes in David Brin’s The Transparent Society is that history has shown over and over again that holding people – and in particular, the people with power – accountable for their actions is the only way to make sure that they behave decently.

Similarly, an AI might learn that some particular set of actions – including specific responses to questions about your values – is the rational course of action while you’re still just a human-level intelligence, but that those actions would become counterproductive as the AI accumulated more power and became less accountable for its actions. The question here is one of instrumental versus intrinsic values – does the AI just pick up a set of values that are instrumentally useful in its testing environment, or does it actually internalize them as intrinsic values as well?

This is made more difficult since, arguably, there are many values that the AI shouldn’t internalize as intrinsic values, but rather just as instrumental values. For example, while many people feel that property rights are in some sense intrinsic, our conception of property rights has gone through many changes as technology has developed. There have been changes such as the invention of copyright laws and the subsequent struggle to define their appropriate scope when technology has changed the publishing environment, as well as the invention of the airplane and the resulting redefinitions of landownership. In these different cases, our concept of property rights has been changed as a part of a process to balance private and public interests with each other. This suggests that property rights have in some sense been considered an instrumental value rather than an intrinsic one.

Thus we cannot just have an AI treat all of its values as intrinsic, but if it does treat its values as instrumental, then it may come to discard some of the ones that we’d like it to maintain – such as the ones that regulate its behavior while being subject to enforcement by humans.

Shared Constraints. This is, in a sense, a generalization of the above point. In the comments to Goertzel’s post, commenter Eric L. proposed that in order for the AI to develop similar values as humans (particularly in the long run), it might need something like “necessity dependence” – having similar needs as humans. This is the idea that human values are strongly shaped by our needs and desires, and that e.g. currently the animal rights paradigm is clashing against many people’s powerful enjoyment of meat and other animal products. To quote Eric:

To bring this back to AI, my suggestion is that […] we may diverge because our needs for self preservation are different. For example, consider animal welfare.  It seems plausible to me that an evolving AGI might start with similar to human values on that question but then change to seeing cow lives as equal to those of humans. This seems plausible to me because human morality seems like it might be inching in that direction, but it seems that movement in that direction would be much more rapid if it weren’t for the fact that we eat food and have a digestive system adapted to a diet that includes some meat. But an AGI won’t consume food, so it’s value evolution won’t face the same constraint, thus it could easily diverge. (For a flip side, one could imagine AGI value changes around global warming or other energy related issues being even slower than human value changes because electrical power is the equivalent of food to them — an absolute necessity.)

This is actually a very interesting point to me, because I just recently submitted a paper (currently in review) hypothesizing that human values come to existence through a process that’s similar to the one that Eric describes. To put it briefly, my model is that humans have a variety of different desires and needs – ranging from simple physical ones like food and warmth, to inborn moral intuitions, to relatively abstract needs such as the ones hypothesized by self-determination theory. Our more abstract values, then, are concepts which have been associated with the fulfillment of our various needs, and which have therefore accumulated (context-sensitive) positive or negative affective valence.

One might consider this a restatement of the common-sense observation that if someone really likes eating meat, then they are likely to dislike anything that suggests they shouldn’t eat meat – such as many concepts of animal rights. So the desire to eat meat seems like something that acts as a negative force towards broader adoption of a strong animal rights position, at least until such a time when lab-grown meat becomes available. This suggests that in order to get an AI to have similar values as us, it would also need to have very similar needs as us.

Concluding thoughts

None of the three arguments I’ve outlined above are definitive arguments that would show safe AI to be impossible. Rather, they mostly just support the Weak Difficulty Thesis.

Some of MIRI’s previous posts and papers (and I’m including my own posts here) seemed to be implying a claim along the lines of “this problem is inherently so difficult, that even if all of humanity’s brightest minds were working on it and taking utmost care to solve it, we’d still have a very high chance of failing”. But these days my feeling has shifted closer to something like “this is inherently a difficult problem and we should have some of humanity’s brightest minds working on it, and if they take it seriously and are cautious they’ll probably be able to crack it”.

Don’t get me wrong – this still definitely means that we should be working on AI safety, and hopefully get some of humanity’s brightest minds to work on it, to boot! I wouldn’t have written an article defending any version of the Difficulty Thesis if I thought otherwise. But the situation no longer seems quite as apocalyptic to me as it used to. Building safe AI might “only” be a very difficult and challenging technical problem – requiring lots of investment and effort, yes, but still relatively straightforwardly solvable if we throw enough bright minds at it.

This is the position that I have been drifting towards over the last year or so, and I’d be curious to hear from anyone who agreed or disagreed.


### Changing language to change thoughts

01:01 pm

Three verbal hacks that sound almost trivial, but which I’ve found to have a considerable impact on my thought:

1. Replace the word ‘should’ with either ‘I want’, or a good consequence of doing the thing.

Examples:

• “I should answer that e-mail soon.” -> “If I answered that e-mail, it would make the other person happy and free me from having to stress about it.”
• “I should have left that party sooner.” -> “If I had left that party before midnight, I’d feel more rested now.”
• “I should work on my story more at some point.” -> “I want to work on my story more at some point.”

Motivation: the more we think in terms of external obligations, the more we feel a lack of our own agency. Each thing that we “should” do is actually either something that we’d want to do because it would have some good consequences (avoiding bad consequences also counts as a good consequence), something that we have a reason for wanting to do differently the next time around, or something that we don’t actually have a good reason to do but just act out of a general feeling of obligation. If we only say “I should”, we will not only fail to distinguish between these cases, we will also be less motivated to do the things in cases where there is actually a good reason. The good reason will be less prominent in our thoughts, or possibly even entirely hidden behind the “should”.

If you do try to rephrase “I should” as “I want”, you may either realize that you really do want it (instead of just being obligated to do it), or that you actually don’t want it and can’t come up with any good reason for doing it, in which case you might as well drop it.

Special note: there are some legitimate uses for “should”. In particular, it is the socially accepted way of acknowledging the other person when they give us an unhelpful suggestion. “You should get some more exercise.” “Yeah I should.” (Translation: of course I know that, it’s not like you’re giving me any new information and repeating things that I know isn’t going to magically change my behavior. But I figure that you’re just trying to be helpful, so let me acknowledge that and then we can talk about something else.)

However, I suspect that because we’re used to treating “I should” as a reason to acknowledge the other person without needing to take actual action, the word also becomes more poisonous to motivation when we use it in self-talk, or when discussing matters with someone we want to actually be honest with.

“Should” also tends to get used for guilt-tripping, so expressions like “I should have left that party sooner” might make us feel bad rather than focusing our attention on the benefits of having left earlier. The next time we’re at a party, the former phrasing incentivizes us to come up with excuses for why it’s okay to stay this time around. The latter encourages us to actually weigh the benefits and costs of leaving earlier versus staying, and then choose the more appropriate option.

2. Replace expressions like “I’m bad at X” with “I’m currently bad at X” or “I’m not yet good at X”.

Examples:

• “I can’t draw.” -> “I can’t draw yet.”
• “I’m not a people person.” -> “I’m currently not a people person.”
• “I’m afraid of doing anything like that.” -> “So far I’m afraid of doing anything like that.”

Motivation: the rephrased expression draws attention to the possibility that we could become better, and naturally leads us to think about ways in which we could improve ourselves. It again emphasizes our own agency and the fact that for a lot of things, being good or bad at them is just a question of practice.

Even better, if you can trace the causes of your bad-ness, is to

3. Eliminate vague labels entirely and instead talk about specific missing subskills, or weaknesses that you currently have.

Examples:

• “I can’t draw.” -> “Right now I don’t know how to move beyond stick figures.”
• “I’m not a people person.” -> “I currently lock up if I try to have a conversation with someone.”

Motivation: figuring out the specific problem makes it easier to figure out what we would need to do to address it, and may give us a self-image that’s both kinder and more realistic, in that it makes the lack of skill a specific fixable problem rather than a personal flaw.


### Rational approaches to emotions

05:36 pm

There are a number of schools of thought that teach what might be called a “rationalist” approach to emotions: seeing your emotions as a map that is worth distinguishing from the territory, and offering tools both for seeing that distinction and for better evaluating the map-territory correspondence.

1) In cognitive behavioral therapy, there is the “ABC model”: Activating Event, Belief, Consequence. The idea is that when you experience something happening, you always interpret that experience through some (subconscious) belief, leading to an emotional consequence. E.g. if someone smiles at me, I might believe either that they like me, or that they are secretly mocking me; two interpretations that would lead to very different emotional responses. Once you know this, you can start asking yourself the question of “okay, what belief is causing me to have this emotional reaction in response to this observation, and does that belief seem accurate?”.

2) In addition to seeing your emotional reactions as something that tell you about your beliefs, you can also see them as something that tells you about your needs. This is the approach taken in Non-Violent Communication, which has the four-step process of Observation, Feeling, Need, Request. The four-step process is most typically discussed as something that’s a tool for dealing with interpersonal conflict, as in “when I see you eating the foods I put in the fridge, I feel anxious, because I need the safety of being able to know whether I have food in stock or not; could you please ask before eating my food in the future?”. However, it’s also useful for dealing with personal emotional turmoil and figuring out what exactly is upsetting you in general, or for dealing with internal conflict.

3) In both CBT and NVC, an important core idea is that they teach you to distinguish between an observation and an interpretation, and that it is the interpretations that cause your emotional reactions. (For anyone curious, the more academic version of this is appraisal theory; the paper “When are emotions rational?” is relevant.) However, the NVC book, while an excellent practical manual, does not do a very good job of explaining the theoretical reasons for why it works, which sometimes causes people to arrive at interpretations of NVC that make them behave in socially maladapted ways. For this reason, it might be a good idea to first read Crucial Conversations, which covers a lot of similar ground but goes into more theory about separating observations and interpretations. Then you can read NVC after you’ve gotten the theory from CC. (CC doesn’t talk as much about needs, however, so I do still recommend reading both.)

4) It’s fine to say that “okay, if you’re having an emotional reaction you’re having difficulties dealing with, try to figure out the beliefs and needs behind it and see what they’re telling you and whether you’re having any incorrect beliefs”! But it’s a lot harder to actually be able to apply that if you’re in an emotionally charged situation. That’s where the various courses teaching mindfulness come in – mindfulness is basically the ability to step a little back from your emotions and thoughts, observe them as they are without getting swept up in them, and then being able to evaluate them critically if needed. You’ll probably need a lot of practice in various mindfulness exercises in order to get the techniques from CBT, NVC, and CC to live up to their full potential.

5) An important idea that’s been implied in the previous points, but not entirely spelled out, is that your emotions are your friends. They communicate to you information about your subconscious assessments of the world, as well as of your various needs. A lot of people tend to have a somewhat hostile approach to their emotions, trying to at least control and get rid of their negative emotions. But this is bound to lead to internal conflict; and various studies indicate that a willingness to accept negative emotions and pain will actually make them much less serious.

In my personal experience, once you get into the habit of asking your emotions what they’re telling you, and then processing that information in an even-handed way, those negative emotions will often go away after you’ve processed the thing they were trying to tell you. By “even-handed” I mean that if you’re feeling anxious because you’re worried about some unpleasant thing X being true, then you actually look at the information suggesting that X might be true and consider whether it’s the case, rather than trying to rationalize a conclusion for why X wouldn’t be true. Your subconscious will know, and keep pestering you.

Some of CFAR’s material, such as aversion factoring, points in this direction; Acceptance and Commitment Therapy, as elaborated on in Get Out of Your Mind and Into Your Life, also seems to be largely about this, though I’ve only read about the first 30% so far.

Some of my earlier posts on these themes: suffering as attention-allocational conflict, avoid misinterpreting your emotions.

(I have been intending to write a much more in-depth post on this topic for a while, but it’s such a large post that I haven’t gotten around to it; so I figured I’d quickly write something up in the hopes of it also being of value.)


### Two conversationalist tips for introverts

09:03 am

Two of the biggest mistakes that I used to make that made me a poor conversationalist:

1. Thinking too much about what I was going to say next. If another person is speaking, don’t think about anything else, where “anything else” includes your next words. Instead, just focus on what they’re saying, and the next thing to say will come to mind naturally. If it doesn’t, a brief silence before you say something is not the end of the world. Let your mind wander until it comes up with something.

2. Asking myself questions like “is X interesting / relevant / intelligent-sounding enough to say here”, and trying to figure out whether the thing on my mind was relevant to the purpose of the conversation. Some conversations have an explicit purpose, but most don’t. They’re just the participants saying whatever random thing comes to their mind as a result of what the other person last said. Obviously you’ll want to put a bit of effort into screening off any potentially offensive or inappropriate comments, but for the most part you’re better off just saying whatever random thing comes to your mind.

Relatedly, I suspect that these kinds of tendencies are what make introverts experience social fatigue. Social fatigue seems [in some people’s anecdotal experience; don’t have any studies to back me up here] to be associated with mental inhibition: the more you have to spend mental resources on holding yourself back, the more exhausted you will be afterwards. My experience suggests that if you can reduce the amount of filters on what you say, then this reduces mental inhibition, and correspondingly reduces the extent to which socializing causes you fatigue.

Peter McCluskey reports a similar experience; other people mention varying degrees of agreement or disagreement.

