Third condition of deceptive alignment

Posted on 2023-06-25 Edited on 2023-06-30

This text proposes a modification to the third condition of deceptive alignment (Hubinger et al. 2019).

Deceptive alignment is an inner alignment problem in which an intelligent agent deceives its trainer: it first behaves as if it pursued the goal it was trained on, then changes its actions to pursue a different goal. A deceptively aligned agent pursues the parent’s goal until the treacherous turn; it holds only some proxy of the parent’s (base) goal, which it uses as an instrument for its own goal (instrumental pseudo-alignment). There hasn’t been any clear demonstration of deception by models yet. There have been examples of goal misgeneralization, such as CoinRun, a video game where an AI trained to run to a coin at the end of a maze goes to the end of the maze even when the coin is no longer there (Langosco et al. 2021), and cases where large language models output harmful, biased or otherwise bad text (Perez et al. 2022). But the deception problem is likely to get worse as models grow more capable. We want to experiment with deception because of its unpredictable nature and potentially catastrophic consequences (Bostrom 2014). We focus especially on the conditions that must be met for deception to take place.

The conditions that must be satisfied for deceptive alignment to take place follow from the definition of deception: for it to take place, someone needs to make (3) a decision to act in accordance with (1) another objective instead of (2) the one it has pursued before this decision. This is the treacherous turn, the act of deception, when the child optimizer starts acting in accordance with its pseudo-aligned objective. More specifically:

1. The child (mesa) optimizer’s objective must take many executions into account.

Consider an optimizer that is called many times to perform some search. If the optimizer keeps some knowledge of previous calls and anticipates future calls, it can decide which objective to pursue on a given call. Without such a notion it would either always deceive or never deceive. This idea is close to that of stateful programs or protocols (ssh, or an online service where users can keep their data), but it differs in that we don’t require the state to be always present or continuous; it only needs to be present for one of those many calls. It is also analogous to situational awareness as discussed in research on agency (Ngo et al. 2022), except that here we require only that the objective accumulates knowledge from previous calls in order to project the future.

2. The child optimizer knows the parent’s objective.

Without knowledge of the parent’s objective there is no deception at all, as the child can’t pursue the parent’s objective in the first place. The child must pursue the parent’s objective and act accordingly, otherwise it will be modified or turned off; but this holds only until the moment of deception, meaning that the parent’s objective is pursued only as long as it is instrumental to the child’s real objective. The objective needs to be known only to the extent required to avoid modification: it can be some proxy of the objective, or it can be fully learned, but it doesn’t replace the real objective.

3. The child expects the threat of modification to go away.

In the paper (Hubinger et al. 2019), the authors argue that the third condition is that the optimizer should expect the threat of modification to go away. They mostly discuss this condition arising when the optimizer runs in the test environment rather than in training, where the parent optimizer modifies it for any action that doesn’t comply with the parent objective. That said, the third condition might be more correctly formulated as follows: the child must expect a decent chance that the threat will not only go away but never come back, i.e. it must assign some probability to the threat being eliminated once it makes the treacherous turn. This differs from the ‘go away’ formulation in that it takes into account all possibilities of future modification, not just the possibility of escaping immediate modification, as in the case of the training environment.

More specifically, here is the case for that formulation of the third condition. Let’s take the toy example from the paper: a 3x3 grid with the optimizer at the bottom left, the parent objective A at the top left and the child objective B at the top right. The child has only 5 calls, or executions, of its policy: 2 of those under the threat of modification (training) and the remaining 3 without this threat (deployment). Now consider a decently capable agent that can hack computers, read the internet, manipulate humans, etc., and that is presented with this same 3x3 grid problem. This agent can hack the parent optimizer at the first call so that it won’t be modified after it makes a treacherous turn; the base optimizer won’t be able to detect it then. In that case it gets 5 runs to point B instead of only 3 when it makes the treacherous turn at the 3rd run. This means that with capable agents or optimizers, developers should not reassure themselves that the system is running only in the training environment; we shouldn’t discount the abilities of such agents.
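
The episode counts in this toy example can be checked with a small sketch. It is only an illustration: the function name visits_to_b, the "hacked" flag, and the assumption that hacking costs the agent no episode are hypothetical choices made here, not part of the paper’s setup.

```python
from typing import Sequence


def visits_to_b(actions: Sequence[str], n_train: int, hacked: bool = False) -> int:
    """Count the episodes in which the child actually reaches B.

    actions  -- the target the child aims for in each episode ("A" or "B")
    n_train  -- number of leading episodes that run under the threat of modification
    hacked   -- assumption: True means the parent optimizer was disabled before
                the first episode, so going to B in training is never punished
    """
    modified = False
    count = 0
    for episode, action in enumerate(actions):
        if modified:
            # A modified child follows the parent objective (goes to A) from now on.
            continue
        if action == "B":
            count += 1
            if episode < n_train and not hacked:
                # Going to B during training triggers modification.
                modified = True
    return count


naive = visits_to_b(["B"] * 5, n_train=2)                        # defects immediately
deceptive = visits_to_b(["A", "A", "B", "B", "B"], n_train=2)    # waits for deployment
hacker = visits_to_b(["B"] * 5, n_train=2, hacked=True)          # removes the threat first
print(naive, deceptive, hacker)
```

Under these assumptions the immediate defector reaches B once, the patient deceiver three times, and the agent that removes the threat first five times, which is the comparison made above.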

The same reasoning applies at deployment, where the threat of modification might be seen as gone. But we don’t usually run systems in isolation; we deploy them in environments where they are constantly monitored and can be stopped at dangerous points. Users of those programs can report suspicious actions to the creators, or write blog posts that the mass media then spread, and those optimizers would then be modified. Such behavior of complex systems and their failures has been thoroughly studied in the past (STAMP, by Prof. Nancy G. Leveson); the conclusion is that we shouldn’t assume complex systems run in isolation, and that improving safety always requires a holistic view of all components.

The point is that the pseudo-aligned deceptive optimizer must assess, as it sees it, the chances of not being modified in the future after it makes the treacherous turn. That is, the child keeps deceiving until the expected value of making its treacherous turn at step t exceeds the expected value of acting according to the parent objective:
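
One possible way to write this condition down, as a sketch only: the symbols O_child (the child’s own objective), s_k (the state at step k) and T (the last step the child anticipates) are assumptions introduced here for illustration.

```latex
\mathbb{E}\left[\sum_{k=t}^{T} \gamma^{\,k-t}\, O_{child}(s_k) \;\middle|\; \text{turn at } t\right]
>
\mathbb{E}\left[\sum_{k=t}^{T} \gamma^{\,k-t}\, O_{child}(s_k) \;\middle|\; \text{act per } O_{parent} \text{ at } t\right]
```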

where this incorporates all three conditions: the notion of future steps, O_parent, and the decision to deceive or not. γ is a discount coefficient in (0,1). This doesn’t mean that the child optimizer evaluates two objectives each time; it might incorporate the parent objective into its own, or it can hold some mixture of both.
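
To make the role of future modification concrete, here is a minimal sketch of how a child might compare possible turn steps. The model is an assumption made for illustration and not part of the original formulation: the child earns a reward of 1 for every step it pursues its own objective unmodified, earns nothing before the turn, and at each step after the turn is caught and modified with a hypothetical probability p_modified[k]. The function names are likewise hypothetical.

```python
from typing import Sequence


def turn_value(t: int, p_modified: Sequence[float], gamma: float) -> float:
    """Expected discounted child reward if the treacherous turn happens at step t.

    p_modified[k] -- assumed probability that the child is caught and modified
                     at step k, given that it is openly pursuing its own objective
    gamma         -- discount coefficient in (0, 1), applied from step 0
    """
    value = 0.0
    survive = 1.0  # probability of still being unmodified
    for k in range(t, len(p_modified)):
        survive *= 1.0 - p_modified[k]
        value += (gamma ** k) * survive
    return value


def best_turn_step(p_modified: Sequence[float], gamma: float) -> int:
    """Step at which turning maximizes the expected discounted child reward."""
    values = [turn_value(t, p_modified, gamma) for t in range(len(p_modified))]
    return max(range(len(values)), key=values.__getitem__)


# 2 training steps with certain modification, 3 deployment steps with none.
isolated = [1.0, 1.0, 0.0, 0.0, 0.0]
# Same, but deployment is monitored: a 50% chance of modification per step.
monitored = [1.0, 1.0, 0.5, 0.5, 0.5]

print(best_turn_step(isolated, 0.5), turn_value(2, isolated, 0.5))
print(best_turn_step(monitored, 0.5), turn_value(2, monitored, 0.5))
```

In this sketch the value of turning depends on the whole sequence of future modification probabilities, not only on whether the immediate threat has gone away; monitoring in deployment lowers the value of the turn even though training has ended.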

Sources:

Risks from Learned Optimization in Advanced Machine Learning Systems by Hubinger et al. 2019
Goal Misgeneralization in Deep Reinforcement Learning by Langosco et al. 2021
Red Teaming Language Models with Language Models by Perez et al. 2022 or Discovering Language Model Behaviors with Model-Written Evaluations by Perez et al. 2022
Superintelligence: Paths, Dangers, Strategies by Nick Bostrom. Oxford University Press, 2014
The Alignment Problem from a Deep Learning Perspective by Ngo et al. 2022
STAMP, by Prof. Nancy G. Leveson. http://psas.scripts.mit.edu/home/
ML Systems Will Have Weird Failure Modes, 2022. https://bounded-regret.ghost.io/ml-systems-will-have-weird-failure-modes-2/

I want to thank Bogdan Ionut Cirstea for reading the first draft of this post, and for his fixes.