May 28, 2023

Fundamental AGI alignment problems

Davidad offers a list of fundamental problems that need to be solved to align an AGI; Siméon takes a stab at translating them into nontechnical language:

Davidad: Value is fragile and hard to specify

Siméon: Human values are hard to describe and write down, and many small imprecisions in the way we characterize them might have catastrophic consequences if a powerful agent (e.g. an AGI) tries very hard to achieve those values.

This is a bit like how small flaws in a law are heavily exploited by big corporations, leading to highly undesirable consequences.

Davidad: Corrigibility is anti-natural

Siméon: Without specific countermeasures, AIs beyond a certain level of capabilities will refuse to be modified or shut down.

Davidad: Pivotal processes require dangerous capabilities

Siméon: To make sure that the first aligned AGIs are not outcompeted by unaligned AGIs a few months later, the first aligned AGIs will have to take some actions (here called "pivotal processes") that require dangerous capabilities to be executed well.

D: Goals misgeneralize out of distribution

S: By default, the goals that an AI will learn during its training won't generalize in a way that satisfies humans once this AI is exposed (in the real world) to situations it has never seen during its training.

D: Instrumental convergence

S: Most large-scale objectives can be better achieved if an AI acquires resources (political power, money, intelligence, influence over individuals, etc.).

By default, as AIs become more capable, they will reach a point where they are able to successfully execute strategies that de facto overpower humanity. Preventing them from doing so, even though it is (a) optimal for their own goals and (b) within their abilities, will require effort.

D: Pivotal processes likely require incomprehensibly complex plans

S: The plans (e.g. pivotal processes) that allow the world to reach a point of stability (i.e. the chance that humanity goes extinct becomes extremely low again) will probably be incomprehensible to humans.

D: Superintelligence can fool human supervisors

S: Self-explanatory.

D: Superintelligence can hack software supervisors

S: Superintelligence can hack (i.e. mislead, break or manipulate) the other AIs that humans have put in place to supervise the superintelligence.

D: Humans cannot be first-class parties to a superintelligence value handshake

S: (I'm not literate enough to interpret the scriptures here); [Zvi: To take a first non-technical shot at #9: Superintelligent computer programs are code, so they can see each other’s code and provably modify their own code, in order to reliably coordinate actions and agree on a joint prioritization of values and division of resources. Humans can’t do this, risking us being left out.]

D: Humanlike minds/goals are not necessarily safe

S: It might be contingent that humans don't cause destruction around them. It might be feasible to build minds with human capabilities that would do so.

D: Someone else will deploy unsafe superintelligence first (possibly by stealing it from you)

S: Avoiding human extinction will require a certain velocity in execution OR a global governance regime which prevents any race between different actors building their own AGIs. Otherwise, there is always a risk that unsafe actors deploy unsafe superintelligence first (possibly by stealing it from safe actors).

D: Unsafe superintelligence in a box might figure out what’s going on and find a way to exfiltrate itself by steganography and spearphishing

S: Even if contained with very advanced measures, a superintelligent mind might be able to communicate with the external world in ways that are not understandable to humans (i.e. steganography) and to use advanced manipulative techniques.

D: We are ethically obligated to propose pivotal processes that are as close as possible to fair Pareto improvements for all citizens, both by their own lights and from a depersonalized well-being perspective. (Eliezer may disagree with this one)

S: There is a moral obligation to carry out a pivotal process (i.e. an action which prevents any rogue actor from building a misaligned AGI) that is as close as possible to good and fair for everyone.

Examples of such processes could be: (i) enforce perfect monitoring of high-end compute in a way which prevents people from building unaligned AGI; (ii) make huge amounts of money and buy all the competitors that are less safety-conscious to prevent them from building an unaligned AGI.

Zvi distills further, down to five core problems:

1) We don’t know how to determine what an AGI’s goals or values would be.

2) We don’t know what goals or values would result in good outcomes if given to an AGI, and once chosen we won’t know how to change them.

3) Things that are smarter than you will outsmart you in ways you don’t anticipate, and the world they create won’t have a meaningful or productive place for us.

4) Coordination is hard; competition and competitive pressures are ever-present.

5) Getting out of the danger zone requires capabilities from the danger zone.