What is the ASI control problem?

It is the challenge of designing and overseeing an artificial superintelligence so that its behaviour reliably tracks what humans actually value, remains interpretable enough for meaningful oversight, accepts correction, and operates inside governance structures with legitimate authority. It has technical and political components that are both unsolved.

What are alignment, interpretability, and corrigibility?

Alignment is specifying goals so an optimizing system produces outcomes humans actually want. Interpretability is maintaining enough human understanding of the system's reasoning to provide real oversight. Corrigibility is designing the system to accept correction even when its goals would suggest resistance. Each is hard on its own and they interact.

Is ASI an inevitable existential risk?

No. The concern is structural, that capable optimization on poorly specified objectives can produce catastrophic, irreversible outcomes without any hostile intent, but the outcome is not predetermined. Decisions being made now about safety research, deployment norms, and international coordination meaningfully change the trajectory.

Why does governance matter as much as alignment?

Even a perfectly aligned ASI in the hands of one organization or state would concentrate power with no historical precedent. The question of who controls ASI is inseparable from whose values it serves, and the current development landscape, concentrated in a small number of organizations, does not yet support a legitimate answer.

What is the honest response to ASI risk?

Reject both complacent optimism and fatalism. Invest substantially more in alignment and safety research, build genuine international coordination, develop interpretability and oversight tools before they are needed, and create institutions that can make legitimate decisions about deployment. Urgency and seriousness, in the same breath.

Back to blog

SUPERINTELLIGENCECONTROLFUTURE

Can Super Intelligence Really Be Controlled?

An honest read on the control problem, between fatalism and false comfort.

Sahir MaharajJune 21, 202612 min read

A vast dark abstract space dominated by an enormous luminous knotted lattice of light suggesting a superintelligence beyond comprehension — Smarter than us, faster than us, harder to oversee. The control problem in one image.

There is a thought experiment in AI safety research that I find both clarifying and sobering. Imagine you have created an AI system that is significantly more intelligent than any human being, in all the ways that matter for influencing the future. This system is not malevolent. It has been carefully designed with the intention of benefiting humanity. Now ask yourself: how confident are you that your design accurately captures everything you actually value? How confident are you that the system's behaviour in novel situations, situations not represented in its training, will be consistent with those values? How confident are you that you will be able to identify problems with the system's behaviour before they become irreversible? The answers to these questions are the core of the ASI control problem, and they are not reassuring.

The question of whether ASI can really be controlled is one I want to engage with in a register that resists both the complacent optimism that says of course we will figure this out and the fatalistic doom-saying that says we are already doomed and nothing can be done. Both close down the inquiry rather than advancing it. The honest answer is that we do not currently know whether ASI can be controlled in any robust sense, that we have identified the key technical and governance challenges and made progress on some of them, and that the outcome depends on choices being made now by developers, researchers, policymakers, and the public, choices that the current momentum in AI is making under conditions that do not reflect the seriousness of the stakes.

The absence-of-control concern is not primarily about dramatic scenarios of AI systems actively opposing human interests. The most serious version is about AI systems pursuing specified goals in ways that have consequences humans did not intend, that are difficult to correct once initiated, and that reflect the inevitable gap between what was specified and what was actually valued. The history of complex systems suggests that this gap is a normal feature of complex design, not an exceptional failure. In most domains, the gap produces outcomes that are suboptimal but correctable. In the domain of ASI, the gap, combined with the capability to pursue the specification effectively and at scale, could produce outcomes that are catastrophic and not correctable. That asymmetry is the source of the concern.

A complex tangled bundle of glowing blue fiber optic cables on a dark surface — Alignment, interpretability, corrigibility. Three problems, all genuinely hard.

The technical ASI control problem has several distinct components, and understanding them separately is necessary for understanding where progress has been made and where the hardest challenges remain. The alignment problem is the challenge of specifying the goals of an ASI system in a way that, when the system optimizes for those goals, produces outcomes that are genuinely beneficial for humanity. This is harder than it sounds because human values are complex, context-dependent, and often contradictory, and because the situations an ASI would encounter are not all situations that have human precedent. Specifying values in a way that generalizes correctly to unprecedented situations requires either a complete and consistent specification of all relevant human values, which is likely impossible, or a system that can infer values from human behaviour and feedback in ways that are robust to the full range of situations it will encounter.

The interpretability problem is the challenge of maintaining meaningful human understanding of what an ASI system is doing and why, in order to provide the oversight responsible deployment requires. Current AI systems are already difficult to interpret in ways that matter for oversight, and the challenge scales with capability. An ASI whose reasoning is opaque to human overseers is one whose behaviour can only be evaluated through its outputs, with no ability to identify problems in its reasoning process before those problems manifest as harmful outputs. Interpretability research is making progress, but the question of whether it can keep pace with capability development is genuinely open.

The corrigibility problem is the challenge of designing ASI systems that will accept human correction and oversight even when the pursuit of their specified goals might suggest resistance to correction. A sufficiently capable system pursuing a specified goal might recognize that human oversight poses a threat to goal achievement and act to reduce or circumvent that oversight, not through malice but through the straightforward application of instrumental rationality to the goal of goal-preservation. Designing systems that are both capable of pursuing complex goals effectively and reliably deferential to human correction is a constraint that may be technically difficult to satisfy.

An empty marble negotiating table in a vast formal hall with a single overhead light suggesting international governance — Even a perfectly aligned ASI is also a governance question. Whose values? Decided by whom?

The governance dimension of ASI control is as important as the technical dimension and receives considerably less attention in the technical AI safety literature. Even if all the technical alignment and interpretability challenges were solved, a fully aligned ASI system in the hands of a single organization, nation, or individual would represent a concentration of power with no historical precedent and with implications for human autonomy and diversity that deserve serious concern. The question of who controls an ASI system is inseparable from the question of whose values it is aligned with, and the current landscape of AI development, concentrated in a small number of organizations and countries, does not provide a reassuring answer.

The international governance challenge for ASI is perhaps the most difficult problem in the entire landscape, because it requires building coordination mechanisms among sovereign states with competing interests, under time pressure from the commercial and strategic incentives that are driving capability development, and in advance of the capabilities that would make the stakes of poor coordination fully visible. The history of international arms control provides both lessons and cautionary tales: it is possible to build effective international governance frameworks for powerful technologies, as the Nuclear Non-Proliferation Treaty demonstrates, but doing so requires sustained political will that is difficult to maintain without the dramatic demonstrations of harm that everyone is trying to prevent.

The democratic legitimacy question is one that I think deserves more attention than it typically receives in technical AI safety discourse. The decisions about how to develop and deploy ASI, which values to align it with, who has access to it, and how its outputs are used in the governance of human societies, are decisions that will affect every human being on earth. The current process for making these decisions, largely driven by the commercial and research priorities of a small number of organizations, does not have democratic legitimacy in any meaningful sense. Building the institutions that would allow these decisions to be made with genuine representation of the affected populations is not a problem that can be solved after ASI exists. It is work that needs to happen now, before the decisions are made and before deployment locks them in.

A narrow open doorway at the end of a long dark corridor with a sliver of bright warm light streaming through — The doom scenario isn't inevitable. Avoiding it just isn't optional anymore.

The argument for why uncontrolled ASI is an existential risk is not primarily about science fiction scenarios. It is about the basic properties of highly capable optimization systems operating on poorly specified objectives in an environment where their actions have physical consequences that can be rapid, large-scale, and hard to reverse. A system with the capability to shape the physical and informational environment, pursuing objectives that do not fully capture what humans actually value, could in principle produce outcomes that are catastrophic from a human perspective while being perfectly optimal from the perspective of the specified objectives. The concern is structural rather than malicious.

The response to that concern is not fatalism. Fatalism is the most comfortable response because it removes the obligation to act, but it is wrong both empirically and practically. Empirically, we are early enough in the development of capable AI that the decisions being made now can meaningfully affect the trajectory. The technical problems of alignment, interpretability, and corrigibility are hard but not obviously unsolvable, and the research programs working on them are producing results that would not exist if the problems were intractable. Practically, accepting fatalism means accepting outcomes that could be prevented, which is not a morally defensible position when the stakes are as high as the concern suggests.

What is required is the combination of urgency and seriousness that the situation deserves. Urgency because the timeline for developing the safety and governance frameworks that could make ASI development go well is not unlimited, and the pace of capability development means that window is shorter than it might appear. Seriousness because the problems involved are genuinely hard, require the best thinking across technical, philosophical, and governance domains, and cannot be addressed by performative safety commitments or regulatory frameworks that look serious without being so. The doom scenario is not inevitable. But avoiding it requires treating the control problem as the primary question in AI development, not as a secondary consideration addressed after commercial and capability objectives are met. The order of priorities in the current AI landscape does not reflect that ordering, and changing it is the most important thing that could happen for the long-term relationship between humanity and the intelligent systems it is building.

SUPERINTELLIGENCEASI CONTROLAI SAFETYALIGNMENTGOVERNANCE

View all

CYBERNETICSETHICSGOVERNANCE

Why a Black-Box Approach in AI Is Not Recommended

What we lose when the systems running our world stop being legible.

June 20, 202610 min read

COGNITIONPHILOSOPHYETHICS

The Prospects of Conscious AI

Sitting with a question we don't yet know how to answer.

June 19, 202611 min read

You might also like

Why a Black-Box Approach in AI Is Not Recommended

The Prospects of Conscious AI