### Desiderata for a model of human values

06:26 pm

Soares (2015) defines the value learning problem as

By what methods could an intelligent machine be constructed to reliably learn what to value and to act as its operators intended?

There have been a few attempts to formalize this question. Dewey (2011) started from the notion of building an AI that maximized a given utility function, and then moved on to suggest that a value learner should exhibit uncertainty over utility functions and then take “the action with the highest expected value, calculated by a weighted average over the agent’s pool of possible utility functions.” This is a reasonable starting point, but a very general one: in particular, it gives us no criteria by which we or the AI could judge the correctness of a utility function which it is considering.
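Dewey’s proposal can be sketched in a few lines of code. Here is a toy illustration: the actions, utility functions, and probability weights are made-up placeholders of my own, not anything from the paper.

```python
# Toy sketch of Dewey's (2011) proposal: the agent holds a pool of candidate
# utility functions with associated probabilities, and takes the action with
# the highest probability-weighted expected value.

def best_action(actions, utility_pool):
    """utility_pool: list of (probability, utility_function) pairs."""
    def expected_value(action):
        return sum(p * u(action) for p, u in utility_pool)
    return max(actions, key=expected_value)

# Two rival (invented) hypotheses about what the operator values:
pool = [
    (0.7, lambda a: {"make_tea": 1.0, "make_coffee": 0.2}[a]),
    (0.3, lambda a: {"make_tea": 0.1, "make_coffee": 0.9}[a]),
]
print(best_action(["make_tea", "make_coffee"], pool))  # → make_tea
```

Note that nothing in this loop says where the pool of candidate utility functions comes from, or how we would know whether any of them is correct, which is exactly the gap discussed above.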

To improve on Dewey’s definition, we would need to get a clearer idea of just what we mean by human values. In this post, I don’t yet want to offer any preliminary definition: rather, I’d like to ask what properties we’d like a definition of human values to have. Once we have a set of such criteria, we can use them as a guideline to evaluate various offered definitions.

By “human values”, I here basically mean the values of any given individual: we are not talking about the values of, say, a whole culture, but rather just one person within that culture. While the problem of aggregating or combining the values of many different individuals is also an important one, we should probably start from the point where we can understand the values of just a single person, and then use that understanding to figure out what to do with conflicting values.

In order to make the purpose of this exercise as clear as possible, let’s start with the most important desideratum, of which all the others are arguably special cases:

1. Useful for AI safety engineering. Our model needs to be useful for the purpose of building AIs that are aligned with human interests, such as by making it possible for an AI to evaluate whether its model of human values is correct, and by allowing human engineers to evaluate whether a proposed AI design would be likely to further human values.

In the context of AI safety engineering, the main model for human values that gets mentioned is that of utility functions. The one problem with utility functions that everyone always brings up is that humans have been shown not to have consistent utility functions. This suggests two new desiderata:

2. Psychologically realistic. The proposed model should be compatible with what we currently know about human values, and not make predictions about human behavior which can be shown to be empirically false.

3. Testable. The proposed model should be specific enough to make clear predictions, which can then be tested.

As additional requirements related to the above ones, we may wish to add:

4. Functional. The proposed model should be able to explain what the functional role of “values” is: how do they affect and drive our behavior? The model should be specific enough to allow us to construct computational simulations of agents with a similar value system, and see whether those agents behave as expected within some simulated environment.

5. Integrated with existing theories. The proposed model should, to as large an extent as possible, fit together with existing knowledge from related fields such as moral psychology, evolutionary psychology, neuroscience, sociology, artificial intelligence, behavioral economics, and so on.

However, I would argue that as a model of human value, utility functions also have other clear flaws. They do not clearly satisfy these desiderata:

6. Suited for modeling internal conflicts and higher-order desires. A drug addict may desire a drug, while also desiring that he not desire it. More generally, people may be genuinely conflicted between different values, endorsing contradictory sets of them given different situations or thought experiments, and they may struggle to behave in a way in which they would like to behave. The proposed model should be capable of modeling these conflicts, as well as the way that people resolve them.

7. Suited for modeling changing and evolving values. A utility function is implicitly static: once it has been defined, it does not change. In contrast, human values are constantly evolving. The proposed model should be able to incorporate this, as well as to predict how our values would change given some specific outcomes. Among other benefits, an AI whose model of human values had this property might be able to predict things that our future selves would regret doing (even if our current values approved of those things), and warn us about this possibility in advance.

8. Suited for generalizing from our existing values to new ones. Technological and social change often cause new dilemmas, for which our existing values may not provide a clear answer. As a historical example (Lessig 2004), American law traditionally held that a landowner controlled not only his land but also everything above it, to “an indefinite extent, upwards”. Upon the invention of the airplane, this raised the question – could landowners forbid airplanes from flying over their land, or was the ownership of the land limited to some specific height, above which the landowners had no control? In answer to this question, the concept of landownership was redefined to only extend a limited, rather than an indefinite, amount upwards. Intuitively, one might think that this decision was made because the redefined concept did not substantially weaken the position of landowners, while allowing for entirely new possibilities for travel. Our model of value should be capable of figuring out such compromises, rather than treating values such as landownership as black boxes, with no understanding of why people value them.

As an example of using the current criteria, let’s try applying them to the only paper that I know of that has tried to propose a model of human values in an AI safety engineering context: Sezener (2015). This paper takes an inverse reinforcement learning approach, modeling a human as an agent that interacts with its environment in order to maximize a sum of rewards. It then proposes a value learning design where the value learner is an agent that uses Solomonoff’s universal prior in order to find the program generating the rewards, based on the human’s actions. Basically, a human’s values are equivalent to a human’s reward function.

Let’s see to what extent this proposal meets our criteria.

1. Useful for AI safety engineering. To the extent that the proposed model is correct, it would clearly be useful. Sezener provides an equation that could be used to obtain the probability of any given program being the true reward-generating program. This could then be plugged directly into a value learning agent similar to the ones outlined in Dewey (2011), to estimate the probability of its models of human values being true. The equation itself is incomputable, but it may be possible to construct computable approximations.
2. Psychologically realistic. Sezener assumes the existence of a single, distinct reward process, and suggests that this is a “reasonable assumption from a neuroscientific point of view because all reward signals are generated by brain areas such as the striatum”. On the face of it, this seems like an oversimplification, particularly given evidence suggesting the existence of multiple valuation systems in the brain. On the other hand, since the reward process is allowed to be arbitrarily complex, it could be taken to represent just the final output of the combination of those valuation systems.
3. Testable. The proposed model currently seems to be too general to be accurately tested. It would need to be made more specific.
4. Functional. This is arguable, but I would claim that the model does not provide much of a functional account of values: they are hidden within the reward function, which is basically treated as a black box that takes in observations and outputs rewards. While a value learner implementing this model could develop various models of that reward function, and those models could include internal machinery that explained why the reward function output various rewards at different times, the model itself does not make any assumptions of this.
5. Integrated with existing theories. Various existing theories could in principle be used to flesh out the internals of the reward function, but currently no such integration is present.
6. Suited for modeling internal conflicts and higher-order desires. No specific mention of this is made in the paper. The assumption of a single reward function that assigns a single reward for every possible observation seems to implicitly exclude the notion of internal conflicts, with the agent always just maximizing a total sum of rewards and being internally united in that goal.
7. Suited for modeling changing and evolving values. As written, the model seems to consider the reward function as essentially unchanging: “our problem reduces to finding the most probable $p_R$ given the entire action-observation history $a_1o_1a_2o_2 . . . a_no_n$.”
8. Suited for generalizing from our existing values to new ones. There does not seem to be any obvious possibility for this in the model.
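To make the shape of Sezener’s proposal more concrete, here is a deliberately crude, computable caricature of it: instead of a Solomonoff prior over all programs, we enumerate a small hand-picked set of candidate reward functions, weight each by 2^(-description length) as a stand-in for the universal prior, and update on how well each explains the observed action-observation history. The candidates, the likelihood model, and all the numbers are my own toy assumptions, not anything from the paper.

```python
import math

def posterior(candidates, history):
    """candidates: list of (description_length_bits, reward_fn) pairs.
    history: list of (action, observation) pairs."""
    scores = []
    for length, reward_fn in candidates:
        prior = 2.0 ** -length              # shorter "program" = more probable
        likelihood = 1.0
        for action, obs in history:
            # Crude likelihood model: a reward-maximizing human is likelier
            # to take steps that the candidate reward function rates highly.
            likelihood *= math.exp(reward_fn(action, obs))
        scores.append(prior * likelihood)
    total = sum(scores)
    return [s / total for s in scores]

candidates = [
    (1, lambda a, o: 1.0 if a == "eat" else 0.0),  # short: "the human values eating"
    (5, lambda a, o: 1.0),                         # longer: "everything is rewarded"
]
history = [("eat", "apple"), ("eat", "bread")]
print(posterior(candidates, history))  # most weight on the simpler hypothesis
```

Both candidates explain the observed actions equally well here, so the complexity prior does all the work, favoring the shorter hypothesis; in the real proposal the hypothesis space is all programs and the prior is uncomputable.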

I should note that despite its shortcomings, Sezener’s model seems like a nice step forward: like I said, it’s the only proposal that I know of so far that has even tried to answer this question. I hope that my criteria would be useful in spurring the development of the model further.

As it happens, I have a preliminary suggestion for a model of human values which I believe has the potential to fulfill all of the criteria that I have outlined. However, I am far from certain that I have managed to find all the necessary criteria. Thus, I would welcome feedback, particularly including proposed changes or additions to these criteria.

Originally published at Kaj Sotala. You can comment here or there.

### DeepDream: Today psychedelic images, tomorrow unemployed artists

04:26 pm

One interesting thing that I noticed about Google’s DeepDream algorithm (which you might also know as “that thing making all pictures look like psychedelic trips”) is that it seems to increase the image quality. For instance, my current Facebook profile picture was run through DD and looks sharper than the original, which was relatively fuzzy and grainy.

Me, before and after drugs.

If you know how DD works, this is not too surprising in retrospect. The algorithm, similar to the human visual system, works by first learning to recognize simple geometric shapes, such as (possibly curvy) lines. Then it learns higher-level features combining those lower-level features, like learning that you can get an eyeball by combining lines in a certain way. The DD algorithm looks for either low- or high-level features and strengthens them.

Lines in a low-quality image are noisy versions of lines in a high-quality image. The DD algorithm has learned to “know” what lines “should” look like, so if you run it on the low-level setting, it takes anything that could possibly be interpreted as a high-quality (possibly curvy) line and makes it one. Of course, what makes this fun is that it’s overly aggressive and also adds curvy lines that shouldn’t actually be there, but it wouldn’t necessarily need to do that. With the right tweaking, you could probably make it into a general purpose image quality enhancer.
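The core DeepDream move – gradient *ascent* on a chosen feature detector’s response – can be shown in a one-dimensional toy. Here the “network” is just a hand-coded edge detector rather than a trained convnet layer, so this illustrates only the mechanism, not the real algorithm:

```python
import numpy as np

def dream_step(x, lr=0.1):
    a = x[1:] - x[:-1]        # "layer activation": response of an edge detector
    grad = np.zeros_like(x)   # gradient of 0.5 * sum(a**2) with respect to x
    grad[1:] += a
    grad[:-1] -= a
    return x + lr * grad      # ascent: strengthen whatever the detector sees

x = np.array([0.0, 0.1, 0.4, 0.9, 1.0])   # a blurry one-dimensional "edge"
for _ in range(10):
    x = dream_step(x)
# The central step is now much steeper: the edge feature has been amplified,
# along with some overshoot elsewhere -- a 1-D analogue of DeepDream's
# tendency to exaggerate features that are only barely there.
```

Swapping the hand-coded detector for a deep layer of a trained network, and a 1-D signal for a photo, gives you the actual DeepDream loop.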

A very good one, since it wouldn’t be limited to just using the information that was actually in the image. Suppose you gave an artist a grainy image of a church, and asked them to draw something using that grainy picture as a reference. They could use that to draw a very detailed and high-quality picture of a church, because they would have seen enough churches to imagine what the building in the grainy image should look like in real life. A neural net trained on a sufficiently large dataset of images would effectively be doing the same.

Suddenly, even if you were using a cheap and low-quality camera to take your photos, you could make them all look like high-quality ones. Of course, the neural net might be forced to invent some details, so your processed photos might differ somewhat from actual high-quality photos, but it would often be good enough.

But why stop there? We’ve already established that the net could use its prior knowledge of the world to fill in details that aren’t necessarily in the original picture. After all, it’s doing that with all the psychedelic pictures. The next version would be a network that could turn sketches into full-blown artwork.

Just imagine it. Maybe you’re making a game, and need lots of art for it, but can’t afford to actually pay an artist. So you take a neural net and feed it a large dataset of the kind of art you want. Then you start making sketches that aren’t very good, but are at least recognizable as elven rangers or something. You give that to the neural net and have it fill in the details and correct your mistakes, and there you go!

If NN-generated art always had a distinctive, recognizable style, it’d probably quickly come to be seen as cheap and low status, especially if it wasn’t good at filling in the details. But it might not acquire that signature style, depending on how large of a dataset was actually needed for training it. Currently deep learning approaches tend to require very large datasets, but as time goes on, possibly you could do with less. And then you could get an infinite amount of different art styles, simply by combining any number of artists or art styles to get a new training set, feeding that to a network, and getting a blend of their styles to use. Possibly people might get paid doing nothing but just looking for good combinations of styles, and then selling the trained networks.

Using neural nets to generate art would be limited to simple 2D images at first, but you could imagine it getting to the point of full-blown 3D models and CGI eventually.

And yes, this is obviously going to be used for porn as well. Here’s a bit of a creepy thing: nobody will need to hack the iCloud accounts of celebrities in order to get naked pictures of them anymore. Just take the picture of any clothed person, and feed it to the right network, and it’ll probably be capable of showing you what that picture would look like if the person was naked. Or associated with one of any number of kinks and fetishes.

It’s interesting that for all the talk about robots stealing our jobs, we were always assuming that the creative class would basically be safe. Not necessarily so.

How far are we from that? Hard to tell, but I would expect at least the image quality enhancement versions to pop up very soon. Neural nets can already be trained on text corpuses and generate lots of novel text that almost kind of makes sense. Magic cards, too. I would naively guess image enhancement to be an easier problem than actually generating sensible text (which is something that seems AI-complete). And we just got an algorithm that can take two images of a scene and synthesize a third image from a different point of view, to name just the latest fun image-related result from my news feed. But then I’m not an expert on predicting AI progress (few if any people are), so we’ll see.

EDITED TO ADD: On August 28th, less than two months after the publication of this article, the news broke of an algorithm that could learn to copy the style of an artist.


### Applied cognitive science: learning from a faux pas

01:12 pm
Yesterday evening, I pasted to two IRC channels an excerpt of what someone had written. In the context of the original text, that excerpt had seemed to me like harmless if somewhat raunchy humor. What I didn't realize at the time was that by removing the context, the person writing it came off looking like a jerk, and by laughing at it I came off looking like something of a jerk as well.

Two people, both of whom I have known for many years now and whose opinions I value, approached me by private message and pointed out that that may not have been the smartest thing to do. My initial reaction was defensive, but I soon realized that they were right and thanked them for pointing it out to me. Putting on a positive growth mindset, I decided to treat this event as a positive one, as in the future I'd know better.

Later that evening, as I lay in bed waiting to fall asleep, the episode replayed itself in my mind. I learnt long ago that trying to push such replays out of my mind would just make them take longer and make them feel worse. So I settled back to just observing the replay and waiting for it to go away. As I waited, I started thinking about what kind of lower-level neural process this feeling might be a sign of.

Artificial neural networks use what is called the backpropagation algorithm to learn from mistakes. First the network is provided some input, then it computes some value, and then the obtained value is compared to the expected value. The difference between the obtained and expected value is the error, which is then propagated back from the end of the network to the input layer. As the error signal works its way through the network, neural weights are adjusted so as to produce a different output the next time.
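The loop described above looks roughly like this in code. This is a minimal sketch with one hidden unit, one output, sigmoid activations, and a single made-up training example, not a serious implementation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w1, w2 = 0.5, 0.5            # input-to-hidden and hidden-to-output weights
x, target, lr = 1.0, 0.0, 1.0

for _ in range(100):
    h = sigmoid(w1 * x)      # forward pass
    y = sigmoid(w2 * h)
    error = y - target       # compare obtained value to expected value
    # Backward pass: push the error through each layer and adjust weights.
    d_y = error * y * (1 - y)
    d_h = d_y * w2 * h * (1 - h)
    w2 -= lr * d_y * h
    w1 -= lr * d_h * x

print(sigmoid(w2 * sigmoid(w1 * x)))   # much closer to the target 0 than at start
```

The key step is that the output-layer error `d_y` is reused, scaled by the weight it traveled through, to compute the hidden-layer error `d_h`: that is the "propagating back" part.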

Backprop is known to be biologically unrealistic, but there are more realistic algorithms that work in a roughly similar manner. The human brain seems to be using something called temporal difference learning. As Roko described it: "Your brain propagates the psychological pain 'back to the earliest reliable stimulus for the punishment'. If you fail or are punished sufficiently many times in some problem area, and acting in that area is always preceded by [doing something], your brain will propagate the psychological pain right back to the moment you first begin to [do that something]".
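The "propagating pain backwards" idea corresponds to the standard TD(0) update, V(s) ← V(s) + α(r + γV(s') − V(s)). A toy chain of states (the state names are of course my own invention) shows the punishment at the end of the episode seeping back to the earliest state as the episode is replayed:

```python
# Toy TD(0) illustration; states, rewards, and parameters are made up.
states = ["start_pasting", "paste_quote", "get_rebuked"]
V = {s: 0.0 for s in states + ["end"]}
alpha, gamma = 0.5, 1.0

for _ in range(50):                  # replay the unpleasant episode many times
    for i, s in enumerate(states):
        nxt = states[i + 1] if i + 1 < len(states) else "end"
        r = -1.0 if s == "get_rebuked" else 0.0   # punishment only at the end
        V[s] += alpha * (r + gamma * V[nxt] - V[s])   # the TD(0) update

print(V["start_pasting"])   # ≈ -1.0: the earliest state now predicts the pain
```

After enough replays, even the earliest state in the chain carries the full negative value, which is exactly the "earliest reliable stimulus" effect in the quote.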

As I lay there in bed, I couldn't help the feeling that something similar to those two algorithms was going on. The main thing that kept repeating itself was not the actual action of pasting the quote to the channel or laughing about it, but the admonishments from my friends. Being independently rebuked for something by two people I considered important: a powerful error signal that had to be taken into account. Their reactions filling my mind: an attempt to re-set the network to the state it was in soon after the event. The uncomfortable feeling of thinking about that: negative affect flooding the network as it was in that state, acting as a signal to re-adjust the neural weights that had caused that kind of an outcome.

I tend to be somewhat amused by the people who go about claiming that computers can never be truly intelligent, because a computer doesn't genuinely understand the information it's processing. I think they're vastly overestimating how smart we are, and that a lot of our thinking is just relatively crude pattern-matching, with various patterns (including behavioral ones) being labeled as good or bad after the fact, as we try out various things.

On the other hand, there would probably have been one way to avoid that incident. We do have the capacity for reflective thought, which allows us to simulate various events in our heads without needing to actually undergo them. Had I actually imagined the various ways in which people could interpret that quote, I would probably have relatively quickly reached the conclusion that yes, it might easily be taken as jerk-ish. Simply imagining that reaction might then have provided the decision-making network with a similar, albeit weaker, error signal and taught it not to do that.

However, there's the question of combinatorial explosions: any decision could potentially have countless consequences, and we can't simulate them all. (See the epistemological frame problem.) So in the end, knowing the answer to the question of "which actions are such that we should pause to reflect upon their potential consequences" is something we need to learn by trial and error as well.

So I guess the lesson here is that you shouldn't blame yourself too much if you've done something that feels obviously wrong in retrospect. That decision was made by an earlier version of you. Although it feels obvious now, that version of you might literally have had no way of knowing that it was making a mistake, as it hadn't been properly trained yet.

### The Concepts Problem

11:24 pm
Cross-posted to Less Wrong.

A mind can only represent a complex concept X by embedding it into a tightly interwoven network of other concepts that combine to give X its meaning. For instance, a "cat" is playful, four-legged, feline, a predator, has a tail, and so forth. These are the concepts that define what it means to be a cat; by itself, "cat" is nothing but a complex set of links defining how it relates to these other concepts. (As well as a set of links to memories about cats.) But then, none of those concepts means anything in isolation, either. A "predator" is a specific biological and behavioral class, the members of which hunt other animals for food. Of that definition, "biological" pertains to "biology", which is a "natural science concerned with the study of life and living organisms, including their structure, function, growth, origin, evolution, distribution, and taxonomy". "Behavior", on the other hand, "refers to the actions of an organism, usually in relation to the environment". Of those words... and so on.
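As a toy illustration of the point, a concept can be represented as nothing but a dictionary of links to other concepts, each of which is again just links; the names and relations below are obviously simplistic placeholders:

```python
# "cat" has no standalone content here: it is only a bundle of links,
# and everything it links to is itself just another bundle of links.
concepts = {
    "cat":      {"is_a": ["predator", "feline"], "has": ["tail", "four legs"]},
    "predator": {"is_a": ["biological class"], "does": ["hunt other animals"]},
    "feline":   {"is_a": ["biological class"]},
}

def describe(name, depth=0, seen=None):
    """Unfold a concept into the concepts it links to, recursively."""
    seen = set() if seen is None else seen
    if name in seen or name not in concepts:
        return
    seen.add(name)
    for relation, targets in concepts[name].items():
        for target in targets:
            print("  " * depth + f"{name} --{relation}--> {target}")
            describe(target, depth + 1, seen)

describe("cat")   # "cat" unfolds into links, which unfold into further links
```

The unfolding bottoms out here only because the toy network is tiny; in a realistic network, following the links just leads to ever more concepts.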

It does not seem likely that humans could preprogram an AI with a ready-made network of concepts. There have been attempts to build knowledge ontologies by hand, but any such attempt is both hopelessly slow and lacking in much of the essential content. Even given a lifetime during which to work and countless assistants, could you ever hope to code everything you knew into a format from which it was possible to employ that knowledge usefully? An even worse problem is that the information would need to be in a format compatible with the AI's own learning algorithms, so that any new information the AI learnt would fit seamlessly into the previously-entered database. It does not seem likely that we can come up with an efficient language of thought that can be easily translated into a format that is intuitive for humans to work with.

Indeed, there are existing plans for AI systems which make the explicit assumption that the AI's network of knowledge will develop independently as the system learns, and the concepts in this network won't necessarily have an easy mapping to those used in human language. The OpenCog wikibook states that:

Some ConceptNodes and conceptual PredicateNode or SchemaNodes may correspond with human-language words or phrases like cat, bite, and so forth. This will be the minority case; more such nodes will correspond to parts of human-language concepts or fuzzy collections of human-language concepts. In discussions in this wikibook, however, we will often invoke the unusual case in which Atoms correspond to individual human-language concepts. This is because such examples are the easiest ones to discuss intuitively. The preponderance of named Atoms in the examples in the wikibook implies no similar preponderance of named Atoms in the real OpenCog system. It is merely easier to talk about a hypothetical Atom named "cat" than it is about a hypothetical Atom (internally) named [434]. It is not impossible that a OpenCog system represents "cat" as a single ConceptNode, but it is just as likely that it will represent "cat" as a map composed of many different nodes without any of these having natural names. Each OpenCog works out for itself, implicitly, which concepts to represent as single Atoms and which in distributed fashion.

Designers of Friendly AI seek to build a machine with a clearly-defined goal system, one which is guaranteed to preserve the highly complex values that humans have. But the nature of concepts poses a challenge for this objective. There seems to be no obvious way of programming those highly complex goals into the AI right from the beginning, nor to guarantee that any goals thus preprogrammed will not end up being drastically reinterpreted as the system learns. We cannot simply code "safeguard these human values" into the AI's utility function without defining those values in detail, and defining those values in detail requires us to build the AI with an entire knowledge network. On a certain conceptual level, the decision theory and goal system of an AI are separate from its knowledge base; in practice, it doesn't seem like the two could actually be kept separate.

The goal might not be impossible, though. Humans do seem to be pre-programmed with inclinations towards various complex behaviors which might suggest pre-programmed concepts to various degrees. Heterosexuality is considerably more common in the population than homosexuality, though this may have relatively simple causes such as an inborn preference towards particular body shapes combined with social conditioning. (Disclaimer: I don't really know anything about the biology of sexuality, so I'm speculating wildly here.) Most people also seem to react relatively consistently to different status displays, and people have collected various lists of complex human universals. The exact method of their transmission remains unknown, however, as does the role that culture serves in it. It also bears noting that most so-called "human universals" are actually cultural as opposed to individual universals. In other words, any given culture might be guaranteed to express them, but there will always be individuals who don't fit into the usual norms.
