How important is AI hacking as LLMs advance?

Offensive technologies will likely become more advanced while defensive ones lag behind. AI hacking capabilities might be the second most important problem after AI alignment.

Epistemic status: Exploratory / brainstorming. 

How language models will affect the offensive/defensive symmetry

I imagine that if language models continue to improve, this symmetry won’t be stable but will fluctuate like a highly volatile stock market. That might lead to more fragmentation of the web (because state actors might want to protect themselves from more advanced ones, as we see in China, Russia, North Korea, and Iran), widespread use of cryptography, two-factor and/or biometric authentication (like Worldcoin), and abandoning older versions of protocols or dropping them altogether (e.g. HTTP, SMTP) in favor of more secure ones. Overall, technologies would become less and less easy to use; see “Nova DasSarma on Why Information Security May Be Critical to the Safe Development of AI Systems” (n.d.). If models improve further or become able to self-improve, this might lead to a global blackout of financial, communication, military, and other systems, with consequences as severe as human extinction. However, it looks like there will be some slowdown in model releases or in public access to weights (Llama 2 70B is perhaps the last of the publicly available large models).

Perhaps there will be improvements in encryption and decryption algorithms with the help of neural networks, as we saw for matrix multiplication [Fawzi et al. (2022)]. However, these won’t spread instantly. Users don’t update their software regularly and run old versions for a long time, as recent malware outbreaks showed: the WannaCry ransomware affected thousands of computers worldwide [“Shadow Brokers Threaten to Release Windows 10 Hacking Tools” (2017)].

It looks like open source software will become less and less widespread, as the bottleneck might be lazy users who don’t want to update their software. Their software won’t really be theirs, but will increasingly be downloaded, updated, installed, and uninstalled by third parties. Or there will be only services, with terminals for end users. Open source software might instead see more use through global cooperation to improve it (as a public good), as in Linux distributions with rolling updates, where all software is updated on a regular basis rather than through individual updates.

We already see fragmentation of the web in countries like China (the Great Firewall) and Russia. For now it is possible to break through with a VPN or Tor, but it seems that firewall technologies will improve further, or there will be only a few gateways accessible from within those states. This might lead to a more unstable world (the next global war, etc.) as per the unilateralist’s curse model [Bostrom et al. (2016)], where someone might make a wrong estimate (my AGI will be aligned, or it is better to strike first, etc.).

Overall, I agree with the offensive-then-defensive development pattern described by Garfinkel and Dafoe (2019):

Simple models of ground invasions and cyberattacks that exploit software vulnerabilities suggest that, in both cases, growth in investments will favor offense when investment levels are sufficiently low and favor defense when they are sufficiently high.
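
To make the shape of this claim concrete, here is a minimal toy sketch. It is my own illustration and not Garfinkel and Dafoe's actual model: the function names, the number of vulnerabilities, and the per-unit discovery rate are all assumptions chosen for illustration. The sketch assumes a system with a fixed number of vulnerabilities, that attacker and defender each independently discover a given vulnerability with a probability that grows with investment, and that the attack succeeds if the attacker finds at least one vulnerability the defender has not found and patched.

```python
def discovery_prob(investment: float, per_unit_rate: float = 0.05) -> float:
    """Chance that one party finds a given vulnerability (assumed form)."""
    return 1.0 - (1.0 - per_unit_rate) ** investment


def attack_success_prob(
    attacker_investment: float,
    defender_investment: float,
    n_vulns: int = 20,            # hypothetical number of vulnerabilities
    per_unit_rate: float = 0.05,  # hypothetical discovery rate per unit invested
) -> float:
    """Chance the attacker holds at least one vulnerability the defender missed."""
    p_attacker = discovery_prob(attacker_investment, per_unit_rate)
    p_defender = discovery_prob(defender_investment, per_unit_rate)
    # A vulnerability is exploitable if the attacker found it and the defender did not.
    p_exploitable = p_attacker * (1.0 - p_defender)
    return 1.0 - (1.0 - p_exploitable) ** n_vulns


if __name__ == "__main__":
    # Scale both sides' investment together and watch the attacker's odds.
    for level in (0, 1, 5, 10, 50, 100, 200):
        print(level, round(attack_success_prob(level, level), 4))
```

Under these assumptions, scaling both investments together first raises the attacker's chance of success (there is more for the attacker to find than the defender manages to close) and then lowers it toward zero (the defender eventually finds nearly everything). That matches the qualitative pattern in the quote, but the exact numbers mean nothing beyond the toy setup.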

How important is hacking in an AI takeover attempt

In some scenarios, we might de-prioritize hacking, but in others, hacking is required or is a highly likely capability that AI will improve beyond what we have now. Humans can hack, or AI can hack alone. Both scenarios are threatening. I will focus on human hacking first. Though the question assumes that some AI would do the hacking that leads to a takeover, I think human hacking increases the risk too. The basic scenario here is that some group uses hacking to steal model weights and then recklessly enhances the model, which results in an AI takeover.

Hacking includes the following, each of which can be offensive or defensive: automated network attacks, automated program patching, automated vulnerability detection, attacks on humans (phishing sites, fake accounts), attacks on neural nets (adversarial attacks on black-box models), and others. So hacking is highly important because humans can improve it using AI. Let’s briefly look into some scenarios. To quote from [Hendrycks and Mazeika (2022)] (see citations there):

Hacking currently requires specialized skills, but if state-of-the-art ML models could be fine-tuned for hacking, then the barrier to entry for hacking may decrease sharply. Since cyberattacks can destroy valuable information and even destroy critical physical infrastructure such as power grids and building hardware, these potential attacks are a looming threat to international security.

Hacking increases the risk that powerful AI models are stolen and fine-tuned for the various tasks we discuss below, such as advanced destructive capabilities. That is a likely scenario, and such thefts already happen now and again; see [Patel (2023)].

Now, let’s look at the different capabilities a powerful AI would use to take over, and how they relate to hacking. I took these from [Carlsmith (2022)]. It is tempting to say that AI will always use hacking, as various movies show. I will therefore try to emphasize scenarios where AI can take over without hacking, trying to show that it is not as important as it seems.

Hacking vs secrecy. Here secrecy means that models would act without being noticed by important actors or the public. This can be developing weapons in secret places using human labor, paid with stolen money or by other means. Or it can be models training in the cloud in secret. In our digital era, these likely entail illegal actions, fake accounts, money laundering, etc., and thus hacking of some sort. So hacking is more important than secrecy, as one would obtain secrecy by using hacking. With a slow takeoff, secrecy might play a less important role.

Hacking vs social influence. Hacking seems to be an important tool here, for example for faking influencers or celebrities or faking messages from official organizations, because those messages or sites need to carry all the official credentials, so one needs to gain control of those [Golshan (2016)]. That said, I can imagine an AI takeover scenario with only social influence, like steering research in some favorable direction, persuading engineers to use specific tools that the AI wants (like using only a particular Node.js library), or moving public opinion to vote for specific candidates. That means hacking is not a necessary step for social influence and thus is not as important here as it is for secrecy, for example.

Hacking vs coordination. Coordination here means the ability of many AI agents to direct their efforts or actions in some specific direction. But coordination by itself can’t lead to humanity’s disempowerment and looks less important than the capacity to hack into systems and gain control of them. Therefore, hacking is more important than this capability. Still, coordination might be a vital capability in the long run or in a slow takeoff (years or decades), because AI agents would need to resist human efforts to detect the hazards they pose or to modify them. Also, coordination will perhaps be required in some scenarios of takeover via hacking. I mean here the scenario where an AI breaks out and makes copies of itself all over the internet [Ord (2020)]. There are already cases of computer viruses spreading all over the internet and creating millions of copies of themselves. The largest known botnet was Bredolab, which contained more than 30 million computers [“BredoLab Downed Botnet Linked with” (2010)].

Hacking vs destructive capacity. Possible things here: biological, chemical, or nuclear weapons; drones/robots; advanced weaponry; ubiquitous monitoring and surveillance; attacks on (or sufficient indifference to) background conditions of human survival (food, water, air, energy, habitable climate); and so on. I can imagine this being done without hacking. I am not sure AI would do this alone; a more probable scenario is that AI would deceive some state or corporation into building all of this, or they might build it on their own and AI might then use it for the takeover. In either case, hacking might not play any role. Perhaps with a slow takeoff such AI weaponization will be more noticeable to human actors and will require some hacking and secrecy. And with a fast takeoff hacking won’t matter much; say, it would be easier to wage a nuclear war than to go to the effort of hacking into systems.

Hacking vs tech development. Technological development means the ability to research and develop any technology (computer hardware, vehicles, biotechnologies, etc.). Hacking seems to be just one technology among many and is not as important. Also, an AI’s ability to improve its own capabilities doesn’t require hacking.

Overall, it seems that none of the mentioned capabilities acts alone; some mixture of them is more likely. And hacking still plays a role even when we try to imagine a scenario without it (and other scenarios seem to require hacking, e.g. breaking into the systems of a nuclear plant or a bio laboratory to take control of them).

Conclusion. In several of the listed comparisons, AI hacking capability is highly dangerous and is likely to be developed by AI itself if it attempts a takeover. Similarly, it is almost certain that malicious human actors will develop AI hacking (we already see some examples, e.g. [Palisade research] on GPT-4’s hacking abilities). As the world is full of valuable and dangerous tools, making sure that we handle hacking safely (so that defense doesn’t lag behind offense) is perhaps the second most important problem in AI safety after alignment itself.


Bostrom, Nick, Thomas Douglas, and Anders Sandberg. 2016. “The Unilateralist’s Curse and the Case for a Principle of Conformity.” Social Epistemology 30 (4): 350–71.

“BredoLab Downed Botnet Linked with” 2010. Infosecurity Magazine. November 1, 2010.

Carlsmith, Joseph. 2022. “Is Power-Seeking AI an Existential Risk?”

Fawzi, Alhussein, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, et al. 2022. “Discovering Faster Matrix Multiplication Algorithms with Reinforcement Learning.” Nature 610 (7930): 47–53.

Garfinkel, Ben, and Allan Dafoe. 2019. “How Does the Offense-Defense Balance Scale?” Journal of Strategic Studies 42 (6): 736–63.

Golshan, Tara. 2016. “How John Podesta’s Email Got Hacked, and How to Not Let It Happen to You.” Vox. October 28, 2016.

Hendrycks, Dan, and Mantas Mazeika. 2022. “X-Risk Analysis for AI Research.”

“Nova DasSarma on Why Information Security May Be Critical to the Safe Development of AI Systems.” n.d. 80,000 Hours.

Ord, Toby. 2020. The Precipice: Existential Risk and the Future of Humanity. London; New York: Bloomsbury Academic.

Patel, Dylan. 2023. “Nvidia Hacked - A National Security Disaster.” August 28, 2023.

“Shadow Brokers Threaten to Release Windows 10 Hacking Tools.” 2017. The Express Tribune. May 31, 2017.


I want to thank Jeffrey Ladish for organizing this track at the MATS program and Evan Ryan Gunter for his feedback. Produced as an application to the ML Alignment & Theory Scholars Program.