How the AI Safety Institute is approaching evaluations
The AI Safety Institute (AISI) was established as part of DSIT, positioned as the first state-backed organisation focused on advanced AI safety for the public benefit. The AISI has three core functions: developing and conducting evaluations of advanced AI systems, driving foundational AI safety research, and facilitating information exchange.
The AISI has provided more clarity on their approach to evaluations in a recently published insight.
Understanding the AISI evaluations
The AISI is beginning to put its ethics principles into practice, with a first milestone of building an evaluation process for assessing the capabilities of the next generation of advanced AI systems, in a nascent and fast-developing field of science.
These evaluations will assess the capabilities of systems using techniques including: red teaming, where experts interact with a model and test its capabilities by trying to break its safeguards; human uplift evaluations, which assess how bad actors could use systems to carry out real-world harms; and AI agent evaluations, which test AI agents’ ability to operate semi-autonomously and use tools such as external databases to take actions in the world.
Prior to the AI Safety Summit, the government published a paper on key risk areas of Frontier AI: misuse, societal impacts, autonomous systems and safeguards. These are the areas of focus for the pre-deployment testing, although the AISI is continuously surveying and scoping other risks.
The AISI is an independent evaluator, and the details of their methodology will be kept confidential to prevent manipulation; however, they will publish select portions of the evaluation results, with restrictions on proprietary, sensitive or national security-related information.
It is important to note that the AISI does not intend for these evaluations to act as stamps of ‘safe’ or ‘unsafe’ for a system, but instead as early warning signs of potential harm, describing themselves as a ‘supplementary layer of oversight.’ Ultimately, the AISI is not a regulator, and the decision to release systems will remain with the parties developing them.
Aside from evaluations, the AISI is focused on furthering technological advances and is therefore launching foundational AI safety research efforts across areas such as capabilities elicitation, jailbreaking, explainability and novel approaches to AI alignment.
AISI Criteria for Selecting Models to Evaluate
Models will be selected for evaluation based on the estimated risk of a system’s harmful capabilities in relation to national security and societal impacts, including how accessible the system is. Varying access controls will not exempt companies from evaluation: the AISI will evaluate systems that are openly released as well as those that are not. During the Global AI Safety Summit, several AI companies committed to government evaluations of their models.
Next Steps
As the AISI continues to research the transformative potential of the responsible adoption of advanced AI systems for the UK’s economic growth and public services, it is encouraging to see the associated risks being addressed through the Institute’s evaluations. While detailed evaluation results and the AISI’s methodologies will not be publicly disclosed, to prevent manipulation risks, periodic updates like this are crucial in highlighting the AISI’s activities.
The Institute’s progress towards developing and deploying evaluations gives companies insight into which risks the Institute is prioritising. Recognising and incorporating these insights into development processes could enhance safety measures and promote responsible AI adoption. We have produced insights on the ambitions of the Institute as well as on their first, second and third progress reports for further reading.
techUK will continue to monitor AISI’s evaluation work to keep members informed of the latest developments in this field. To find out more and get involved in techUK’s programme work on AI Safety, please contact [email protected].
The original release can be referenced here.