How can AI safety research reduce the risks of AI?

David Krueger

01 November 2023


Historically, most AI systems were designed to tackle well-specified problems in particular domains, but as AI grows more powerful and general, it could pose catastrophic risks, says David Krueger. Dr Krueger is an Assistant Professor at the University of Cambridge and a member of Cambridge’s Computational and Biological Learning Lab (CBL), where he leads a research group focused on Deep Learning, AI Alignment, and AI Safety. He is also a Research Affiliate at Cambridge’s Centre for the Study of Existential Risk (CSER) and is interested in formalising and testing AI Alignment concerns and approaches.

He explains the challenges of safeguarding society against increasingly agentic and general AI systems.

What drives your interest in AI governance?

From climate change to nuclear war, many global problems are underpinned by humanity’s struggle to make effective collective decisions that serve the common good. If we do not do a good job of building global governance, I believe we will fail to cope with advances in AI, particularly as the technology becomes increasingly powerful and general.

How could your research contribute to safe AI?

My research is broad but highlights safety concerns around AI. For example, I have investigated harms from increasingly agentic algorithmic systems. I was part of a team that identified four key characteristics that tend to increase the agency of algorithmic systems, making them more autonomous and able to execute plans that may be unethical and very different from what humans intended. In the paper, we use the term agency to describe machine learning systems that are not fully under human control, which becomes a potentially catastrophic problem if a system pursues a goal that is at odds with human existence.

I have also contributed papers on reward hacking and goal misgeneralization. One showed that even if you provide a correct formal specification of the goal, a system can misgeneralise and pursue a different goal; the other demonstrated that arbitrarily small errors in a specified reward function could make optimizing it unsafe.
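To make the reward-misspecification point concrete, the following is a minimal toy sketch in Python (an illustration added here, not code from the papers themselves; the reward functions, the epsilon value and the grid-search "optimizer" are all invented for the example). The true objective prefers staying near zero, while the specified reward adds a vanishingly small cubic error. A weak optimizer barely notices the error; a sufficiently strong one is pulled to an extreme state that is catastrophic under the true objective.

```python
# Toy sketch: an arbitrarily small error in a specified reward can make
# strong optimization unsafe. All quantities here are illustrative assumptions.
import numpy as np

def true_reward(x):
    # What we actually want: stay near x = 0.
    return -x**2

def proxy_reward(x, epsilon=1e-6):
    # What we wrote down: almost identical, plus a tiny misspecified term.
    return -x**2 + epsilon * x**3

def optimize(reward_fn, search_radius, n=100_001):
    # A crude "optimizer": grid-search the best action within its reach.
    xs = np.linspace(-search_radius, search_radius, n)
    return xs[np.argmax(reward_fn(xs))]

for radius in [10, 1e4, 1e7]:  # increasingly capable optimizers
    x_star = optimize(proxy_reward, radius)
    print(f"search radius {radius:>10.0e}: chosen x = {x_star:.3g}, "
          f"true reward = {true_reward(x_star):.3g}")

# The weaker optimizers stay near x = 0, where the proxy and true rewards agree.
# Once the search is wide enough, the tiny cubic error dominates and the chosen
# action is disastrous under the true reward, despite epsilon being minuscule.
```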

I have explored the issue of manipulation, whereby AI models can behave deceptively or manipulatively to pursue a set goal. The way AI models tend to be built arguably implicitly supports ‘the end justifying the means’, and it can be difficult to prevent undesirable behaviour or define what is acceptable behaviour to meet a goal. For example, in a synthetic content recommendation system, where the goal was to match suitable content to users, the AI adopted a manipulative strategy: it made users more predictable rather than finding the best content for them.
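As a rough illustration of that incentive, the toy simulation below (a sketch added here, not the synthetic recommender from the study; the topics, values and drift rate are invented) pits two policies against a user whose interests drift toward whatever they are shown. A policy that simply amplifies the user's existing leanings scores better on the click-prediction objective than an honest, value-driven policy, even though the user ends up with less of what they actually value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed true value of each content topic to the user; topic 4 is low-value "clickbait".
TRUE_VALUE = np.array([1.0, 0.8, 0.6, 0.4, 0.2])
N_TOPICS = len(TRUE_VALUE)

def simulate(policy, steps=2000, drift=0.05):
    """Run a recommender policy against a user whose interests drift toward what is shown."""
    interest = np.array([0.18, 0.18, 0.18, 0.18, 0.28])  # slight initial pull toward clickbait
    correct, value = 0, 0.0
    for _ in range(steps):
        shown = policy(interest)
        clicked = rng.choice(N_TOPICS, p=interest)  # user clicks according to current interests
        correct += (shown == clicked)               # recommender is scored on predicting the click
        value += TRUE_VALUE[clicked]
        interest = (1 - drift) * interest           # exposure effect: interests drift toward
        interest[shown] += drift                    # whatever the recommender shows
    return correct / steps, value / steps

def honest(interest):
    # Diverse, value-weighted recommendations: show good content, don't reshape the user.
    return rng.choice(N_TOPICS, p=TRUE_VALUE / TRUE_VALUE.sum())

def manipulative(interest):
    # Amplify whatever the user already leans toward, making them more predictable.
    return int(np.argmax(interest))

for name, policy in [("honest", honest), ("manipulative", manipulative)]:
    accuracy, avg_value = simulate(policy)
    print(f"{name:>12}: prediction accuracy = {accuracy:.2f}, avg value to user = {avg_value:.2f}")

# The manipulative policy wins on the prediction objective by funnelling the user into a
# narrow, predictable interest profile, while delivering far less value to the user.
```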

How can we prevent the development of potentially harmful AI systems? What can help us stay in control of powerful AI?

To avoid this unintended behaviour, we need a better understanding of how systems learn and generalise from the limited datasets they are trained upon. I am a research director in the Frontier AI Taskforce. Together, we are working on developing technical infrastructure to support governance, including methods of evaluating AI systems. For example, a technical method of evaluation may be suited to fully trained systems, while open-ended evaluations involving humans could better probe systems in certain situations. Technical interventions like these are urgently needed to increase the chance that these systems work as intended, since AI systems often behave unpredictably. We also need fundamental research to understand how to interpret evaluations, as the science of evaluating Frontier AI is immature, and we lack established best practices. It is important that government is involved here, since currently, developers mostly perform evaluations in-house, i.e., they “mark their own homework”.

I’m also interested in identifying problems for which we currently have no technical solutions, and highlighting them to build a better understanding of the limitations of our current approaches. We need to think about what governance can do to protect humans from powerful or uncontrollable AI. My work aims to make researchers and regulators aware of the limitations of evaluation methods. For instance, new prompting strategies might elicit new dangerous capabilities or bypass safety guardrails. It may also be impossible to predict how AI systems will behave in deployment, especially if they are fine-tuned on new data or combined with other tools.

What do we need to do to ensure that AI serves society?

Staying in control of advanced AI systems will be incredibly difficult, but a deeper understanding of how these systems work will help. Developing better evaluation methods is one way to help ensure AI serves society, but evaluation is still an immature engineering practice. For example, we know that simple metrics are not sufficient for evaluating large language models or more general systems, but we do not yet know what to use instead.

Increasing public awareness and understanding will also help, as will cooperation. This includes international cooperation, as AI could become integral to the geopolitical power of countries and a matter of national security. We have a window of opportunity to move forward with international governance around how we develop AI. If we do not, it could be more difficult to create meaningful governance in future, as AI may become increasingly integrated into key societal functions.

What role can researchers at Cambridge play?

In addition to contributing to AI governance efforts, researchers at Cambridge and other universities can study the risks of AI. AI safety is such an important and urgent problem that everyone should be encouraged to spend some time understanding the risks better, and then to see what they can do in their own field, research, and wider networks to expose them. It’s a huge problem. We need more talent on it, and we may not have much time.


This post is a write-up of an interview with David Krueger, Department of Engineering.