About PAIS and Risks from Learned Optimization

This is a summary and critique of two publications on AI safety: Pragmatic AI Safety (PAIS) and Risks from Learned Optimization. Both are introductory and aim to direct research.

The Pragmatic AI Safety sequence

Posted on the EA Forum: https://forum.effectivealtruism.org/s/8EqNwueP6iw2BQpNo

Complex Systems for AI Safety. PAIS 3

The article introduces complex systems theory and applies it to the AI safety field by naming the factors that contribute to safety. It addresses the third pillar of the PAIS sequence (the systems view).


  • Complex systems (CS) found in AI safety: neural nets, RL agents, AI labs, governments and businesses, and others.
  • CSs have been well studied in Systems Theory (STAMP, cybernetics, causal models, and others):
    • A CS:
      • has emergent properties not found in any of its parts;
      • is highly interconnected (loops, indirect effects, many causal paths, interdependencies, etc.);
      • may consist of reliable parts while the system as a whole is unreliable;
      • can be modeled as a control problem (process and controller).
    • Risk management in a CS aims to handle the factors that contribute to safety, not only cause-effect chains (finding a root cause and fixing it).
    • Systems Theory is a superset of the chain-of-events model (finding a root cause), not a substitute for it.
    • The safety factors include safety culture (the most important), the cost of safety features, sound alarm systems, regulations, and others.
  • CS theories applied to AI Safety:
    • Safety culture should be widespread through direct and indirect measures (from policy to everyday tools).
    • Researchers and organizations should invest in safety.
    • We should address the causes of safety neglect: company politics, profit incentives (the cost-benefit of safety measures needs to improve), philosophical views (faith in technology), x-risks, and others.
    • Other lessons from Systems Theory:
      • A complex system develops its own goals (different from those given at the beginning) and pursues them. For AI, these are instrumental goals (such as not being shut down).
      • We won’t be able to find where a system will fail, or where its crucial parts are, just by observing its internal structure.
      • We should make simpler AI systems safe first, but that doesn’t guarantee that larger AI systems will be safe.
      • We can’t rely on humans to make systems reliable, as humans themselves are unreliable.
    • Overall research effort should be diversified, while each researcher shouldn’t diversify widely (each should be an expert in one area).


  • Takes a top-level view of the field and shows the weak spots that need to be addressed to make it safer.
  • Builds on complex systems theory, which rests on decades of research and real-life cases, and applies it to AI safety. For example, Systems Theory described systems developing their own internal goals long before this was observed in AI systems (mesa-optimization).

Discussion points:

  1. It is unclear how we can apply a safety culture in the current highly competitive environment (Google vs Facebook, China vs USA). What concretely should the incentives or policies for adopting a safety culture be? And who enforces them? If one party adopts it, another gains a competitive advantage by spending more on capabilities and then ‘kills you’ (Yudkowsky, AGI Ruin).
  2. Extremely high stakes, i.e. x-risk. While systems theory was developed for dangerous, mission-critical systems, it never dealt with systems that might disempower all of humanity forever. We don’t get a second try. So is systems theory of no use here? It assumes an iterative process, but wouldn’t a misaligned AI kill us on the first wrong try?
  3. Systems Theory was developed for systems built by humans for humans, and humans have a limited level of intelligence. Why should we expect Systems Theory to apply to an intelligence above the human level?
  4. Systems Theory implies that a better entity controls a worse one: a government issues policies to control organizations, an AI lab stops research on a dangerous path, an electrician complies with instructions, etc. Now, isn’t AGI a better entity to give control to? Does that imply humanity’s disempowerment? In particular, when we introduce a moral parliament (discussed in PAIS #4), won’t it mean that this parliament is now in power, not humanity?

Perform Tractable Research While Avoiding Capabilities Externalities. PAIS 4

How can we perform long-tail but safe AI research? This addresses the first (ML research precedents) and second (capabilities externalities) pillars of the PAIS sequence.

  • Aim for research with long-tail impact (low probability, high stakes): no contributing variable should be zero, because a long-tail process multiplies its variables rather than adding them; research should target tractable but unknown issues at the edge of what is known; start early; prepare for critical, dangerous moments beforehand.
  • Research that reasons about variables near infinity or about asymptotic limits (superintelligence is an example) is problematic and should be avoided because it has longer feedback loops. Instead, aim for shorter feedback loops (low-hanging fruit, the best in cost-benefit terms).
  • Discusses misinterpretations and misuses of Goodhart’s law. The mistake is to read it as saying that every measure, once it becomes a target, will inevitably cease to be a good measure, and hence that every optimizing system will be misaligned because it pursues its objective through some proxy measurement (such as the number of smiles). It is more accurate to say that such measures tend to fail. For AI safety, this means we should be able to introduce counteracting agents that keep another AI agent aligned.
  • Because the stakes are high (human extinction), safety research should avoid advancing capabilities wherever possible.
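
The point about long-tail processes multiplying rather than adding their variables can be illustrated with a small simulation. This is only a sketch under assumptions of mine, not something from the PAIS sequence: the number of factors, the uniform range, and the helper names `simulate`, `top_share`, and `impact` are all hypothetical.

```python
import math
import random


def impact(factors):
    """Multiplicative model: overall impact is the product of all factors."""
    return math.prod(factors)


def simulate(n_factors=10, n_samples=10_000, seed=0):
    """Draw random factor sets and combine them additively and multiplicatively.

    Each factor is drawn uniformly from [0, 2] (an arbitrary, illustrative choice).
    """
    rng = random.Random(seed)
    additive, multiplicative = [], []
    for _ in range(n_samples):
        factors = [rng.uniform(0.0, 2.0) for _ in range(n_factors)]
        additive.append(sum(factors))
        multiplicative.append(impact(factors))
    return additive, multiplicative


def top_share(values, fraction=0.01):
    """Share of the total held by the top `fraction` of samples."""
    ranked = sorted(values, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)


if __name__ == "__main__":
    additive, multiplicative = simulate()
    print("top 1% share, additive:      ", round(top_share(additive), 3))
    print("top 1% share, multiplicative:", round(top_share(multiplicative), 3))
    # A single zero factor wipes out the whole product.
    print("impact with one zero factor: ", impact([1.0, 1.5, 0.0]))
```

In the multiplicative case a small fraction of samples holds most of the total, and a single zero factor brings the product to zero, which is the reason for the "no zero variables" advice.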

Risks from Learned Optimization


A parent optimization process might intentionally or unintentionally create another, child optimization process. This is more likely when the parent optimizer cannot fully solve the problem itself or there are too many policies to create; when the parent is small (the smaller the parent, the more likely a child optimizer is to appear, and vice versa); and when the parent is biased towards simplicity. It is less likely if the parent has less state or not enough power to find a child. Examples of parent and child optimizers are evolution and the human brain, a teacher and a student, society and a corporation, a programmer and a program that searches for solutions to an NP problem, and so on.

A child is likely to have goals different from the parent’s when it is free from the parent’s control or deployed in another environment. In that case the child’s goals might be totally different from the parent’s, close to them, or suboptimal. Even if the child’s objective is totally different from the parent’s, the child might still advance the parent’s objective, provided that this doesn’t interfere with its own goals or that it can pursue its own goals afterwards. It is unclear how to match the child’s objective, which is learned indirectly, to the parent’s (inner alignment). Especially risky is deception by the child, which happens if the child can plan many steps ahead, knows the parent’s objective, and faces a threat of shutdown or modification. Deception becomes more likely when we create proxy objectives or do not train well enough before deployment.

It is likely that without child optimization, systems will be much less useful, but with it in place the existential risk from AI increases.

Discussion points:


In the glossary:

An optimizer is a system that internally searches through some space…

But shouldn’t it be a process and not a system? Shouldn’t it be only a computational process? To my mind, the optimizer in this work is a computational process because it performs a calculation to find an optimal solution. It is arguable whether evolution is a computational process, but I think it is, in accordance with the Church-Turing thesis, since optimization is a closed-ended, physical process. In that sense, it is hard to imagine a pile of bricks or a rock being an optimizer, though they are systems.


It would be good to have some predictions so that the claims are falsifiable, as the paper is largely theoretical.


At the same time, base optimizers are generally biased in favor of selecting learned algorithms with lower complexity. Thus, all else being equal, the base optimizer will generally be incentivized to look for a highly compressed policy.

What experiment might prove this true or false? Also, deception seems to be a trickier algorithm than simply following the base objective. So what experiment might find the threshold at which a learned algorithm would arrive at a more complex strategy such as deception?

Also, the paper seems to lack a list of architectures (current or future) that might exhibit mesa-optimization, or a hypothesis about which architectures might produce it.


I’m not sure that the mesa optimizer’s objective can’t be set directly. Why?


Machine learning practitioners have direct control over the base objective function—either by specifying the loss function directly or training a model for it—but cannot directly specify the mesa-objective developed by a mesa-optimizer.

Why can’t we make the base optimizer inject some objective of ours into the mesa-optimizer’s objective, so that it becomes a mix of goals? This might be some high-level restrictions, such as not lying or not taking dangerous actions. That is, at each step, when the base optimizer finishes with the current mesa-optimizer, why don’t we mix it with another model or strategy? Or why can’t we modify the final mesa-optimizer using reverse engineering and inject another strategy? An example might be a language model whose output channel leads into a command line, where it executes commands and Python programs. We can hijack its output channel and insert our own Python code.


I agree that misaligned goals pose a great risk. If we can’t control the mesa-objective, then it is likely to pursue some proxy objective such as increasing the number of smiles or clicks. And, because of Goodhart’s law, such proxies tend to fail as objectives. The canonical example is King Midas, who wished that everything he touched would turn to gold.
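
The proxy failure described above can be made concrete with a toy example. It is a minimal sketch with made-up numbers: the three policies, their `smiles` and `wellbeing` scores, and the `optimize` helper are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str
    smiles: int      # proxy measurement that the optimizer can observe
    wellbeing: int   # intended objective that the proxy stands in for


# Hypothetical candidate policies with illustrative scores.
POLICIES = [
    Policy("help people with their problems", smiles=5, wellbeing=5),
    Policy("tell jokes all day", smiles=7, wellbeing=2),
    Policy("force everyone to smile", smiles=10, wellbeing=-10),
]


def optimize(policies, objective):
    """Return the policy that scores highest on the given objective."""
    return max(policies, key=objective)


chosen_by_proxy = optimize(POLICIES, lambda p: p.smiles)
chosen_by_intent = optimize(POLICIES, lambda p: p.wellbeing)

if __name__ == "__main__":
    print("optimizing smiles picks:   ", chosen_by_proxy.name)
    print("optimizing wellbeing picks:", chosen_by_intent.name)
```

In this toy setup the proxy and the intended objective agree on mild policies and come apart only under strong optimization, which matches the reading that such measures tend to fail rather than fail inevitably.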


I agree that mesa-optimization is more likely to be present as the complexity of the problem space increases, as we see in evolution. Still, it is unclear how many iterations, and how much time, must pass before that moment. Evolution shows that complex systems appeared later than simpler ones. So we might settle for simple models and avoid complex ones. That is important because it is directly connected to x-risk and the disempowerment of humanity by more intelligent systems.


I agree that it is easier to find a misaligned optimizer than an aligned one, because an aligned optimizer should be more complex. Simpler solutions are more reachable because adding features to the objective doesn’t make the target quicker to find; it usually makes it harder. An example is the task of finding a barber in some town versus finding a blond barber in the same town. The probabilities might be equal if all barbers are blond, but the probability cannot be higher in the second case.
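
The barber example is an instance of the conjunction rule, P(A and B) ≤ P(A). A small check over an invented town illustrates it; the residents listed in `town` and the `probability` helper are hypothetical.

```python
from fractions import Fraction


def probability(population, predicate):
    """Fraction of the population that satisfies the predicate."""
    matches = sum(1 for person in population if predicate(person))
    return Fraction(matches, len(population))


# A made-up town of five residents.
town = [
    {"job": "barber", "hair": "blond"},
    {"job": "barber", "hair": "dark"},
    {"job": "baker", "hair": "blond"},
    {"job": "baker", "hair": "dark"},
    {"job": "teacher", "hair": "dark"},
]


def is_barber(person):
    return person["job"] == "barber"


def is_blond_barber(person):
    return is_barber(person) and person["hair"] == "blond"


p_barber = probability(town, is_barber)
p_blond_barber = probability(town, is_blond_barber)

# Adding a condition can only keep the probability equal or lower it.
assert p_blond_barber <= p_barber

if __name__ == "__main__":
    print("P(barber)       =", p_barber)
    print("P(blond barber) =", p_blond_barber)
```

If every barber in `town` were blond, the two probabilities would be equal, which is the boundary case mentioned in the text.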


I agree that the larger a mesa-optimizer is, the better aligned it is, but the less likely the base optimizer is to find it, because aligning powerful and critical optimizers should require a larger amount of information. An example is self-driving cars, which might easily learn road signs but will require more capacity to learn how to react to various objects on the road (humans, dogs, etc.).


I agree that deception is an especially hard problem, but it looks to me as if the paper lacks a discussion of model honesty or truthfulness, which might be a solution to mesa-optimizer deception. Consider detecting deception at training time: suppose there is a way to ask a mesa-optimizer during training whether it will give different results at test or deployment time. If we have an honest mesa-optimizer that can somehow tell us this, then we can detect that a treacherous turn is coming and modify its objective during training.


It seems to me that more research is needed on the risk from collusion among deceptive agents or optimizers that might disempower humanity. Perhaps the authors could propose an experiment that would estimate the likelihood of simultaneous deception by several models.


Regarding the conclusion. I’m not sure about the 3rd case in the consequences:

Furthermore, in such a scenario, some parts of the outer alignment problem may not need to be solved either …

It looks like we should still solve the whole outer alignment problem in that case: what if the base optimizer nevertheless finds a way to mesa-optimization, because it has the ability and we did not manage to prevent it? Then we might again face the inner alignment problem or even deception. There seems to be a small risk of this. Otherwise we need clarification of what “preventing mesa-optimizers from arising” means. Does it mean that mesa-optimization might occasionally arise but is then disabled, i.e. the possibility still holds? I say this because it is widely known how unstable and buggy our software currently is. There is even a ‘law’ for this, Murphy’s law, which says that anything that can go wrong will go wrong.


It seems that the paper lacks concrete examples of how mesa-optimization might occur in an optimizer that was not designed for it, i.e. unintended optimization. Here is an example derived from “The alignment problem from a deep learning perspective” by Richard Ngo. Suppose we have one large optimizer that actively interacts with the world: some giant model constantly updating its weights and running on a large GPU farm. It interacts with the world by emitting actions for a given state, and the policy (a mapping from states to actions) is learned via self-supervised learning (it is told or shown what the next action should be for a given input). So where is the mesa-optimization here? How exactly does it map states to actions? Is it like a transformer that is fine-tuned for tasks? Then this fine-tuning would be like a mesa-optimizer: there is a giant base optimizer that seems to be aligned with human values, but when it is fine-tuned for a specific task (say, composing Python programs) it creates a mesa-optimizer.