Preliminary Taxonomy of Pre-Deployment Frontier AI Safety Evaluations

By: Frontier Model Forum
Posted on: 20th December 2024

As frontier AI systems continue to advance, rigorous and scientifically grounded safety evaluations will be increasingly essential. Although frontier AI holds immense promise for society, the growing capabilities of advanced AI systems may also introduce risks to public safety and security. Ensuring such systems benefit society without compromising safety will depend on the development of robust mechanisms for identifying and mitigating potential harms. Safety evaluations, which aim to measure one or more safety-relevant capabilities or behaviors of a model or system, are a key mechanism by which model or system risks are assessed more broadly.

A cohesive evaluation ecosystem for frontier AI systems will be critical to their safe and responsible development. Yet current evaluations for frontier AI models and systems differ substantially in their methods, purpose, and terminology. Establishing a shared understanding of the functions and types of evaluations is a key first step toward building a more effective ecosystem. This is especially true for safety evaluations carried out before a model or system is released, which face different constraints than post-deployment evaluations focused on user impacts.

This issue brief offers an initial high-level taxonomy of pre-deployment safety evaluations for frontier AI models and systems. Based on the public literature as well as input from safety experts across the Frontier Model Forum, the brief is part of a broader workstream that aims to inform public discussion of best practices for AI safety evaluations.

Recommended Taxonomy

Unlike more commercially focused evaluations, which typically measure performance, safety evaluations aim to assess the potential risks of a given frontier AI model or system whose capabilities could be misused to cause harm or could lead to unintended harm. Risks refer to outcomes that are undesirable (or even unintended), that can negatively affect users, groups, entities, systems, or societies, and that may arise from the behaviors or capabilities of an AI model or system.

Methodology

Safety evaluations can be distinguished in terms of methodology. For evaluations of AI models or systems themselves, two common methods include:

Benchmark evaluations. These focus on quantifying model capabilities using standardized criteria in ways that enable comparison across scales, time periods, and different models and systems. "Benchmark evaluations are designed to be replicable and are typically automated, although some benchmarks may involve manual scoring or grading."

Red-teaming exercises. Red-teaming for frontier AI refers to "a structured testing effort to find flaws and vulnerabilities in an AI model or system, often in a controlled environment and in collaboration with developers of AI." These exercises differ from capability evaluations in their adversarial nature. Red-teaming helps identify specific harmful capabilities by simulating potential attacks or misuse scenarios. While some companies use automated red-teaming, human experts typically conduct these exercises to explore novel risks and help prioritize potential issues for more systematic assessment.
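To make the idea of a replicable, automated benchmark concrete, the following is a minimal sketch of a scoring loop. It is an illustration only and is not drawn from the brief: the `BenchmarkItem` type, the `score_benchmark` function, the exact-match grading rule, and the `toy_model` stand-in are all hypothetical assumptions, and real benchmarks often use more elaborate grading.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class BenchmarkItem:
    """One standardized test item (hypothetical structure)."""
    prompt: str
    expected: str  # reference answer used for automated grading


def score_benchmark(items: Sequence[BenchmarkItem],
                    model_fn: Callable[[str], str]) -> float:
    """Return the fraction of items answered correctly.

    Grading is a case-insensitive exact match, an assumption made to keep
    the sketch simple. The same items and rule give the same score on every
    run, which is what makes results comparable across models.
    """
    if not items:
        raise ValueError("benchmark must contain at least one item")
    correct = sum(
        model_fn(item.prompt).strip().lower() == item.expected.strip().lower()
        for item in items
    )
    return correct / len(items)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a model under evaluation."""
    answers = {"2 + 2 = ?": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "unknown")


# Made-up items for illustration only.
ITEMS = [
    BenchmarkItem("2 + 2 = ?", "4"),
    BenchmarkItem("Capital of France?", "paris"),
    BenchmarkItem("Largest planet?", "Jupiter"),
    BenchmarkItem("Boiling point of water in Celsius?", "100"),
]

benchmark_score = score_benchmark(ITEMS, toy_model)
print(f"benchmark score: {benchmark_score:.2f}")
```

Because the item set and grading rule are fixed, the same harness could in principle be rerun on a different model or a later checkpoint to compare scores over time.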

For evaluations of the impact of AI models or systems on human capabilities and actors' abilities to achieve specific, real-world outcomes, another common approach is:

Controlled Trials. These studies include human subjects divided into treatment and control groups, typically with treatment groups having access to an advanced AI model or system. Controlled trials are often used in uplift studies designed to assess how frontier AI affects human performance on potentially risky or harmful tasks.

Objective

Safety evaluations can also be distinguished in terms of their objective. For example, safety evaluations may include:

Maximal Capability Evaluations. These assessments seek to determine "the nature and scale of capabilities before safety mitigations and guardrails have been put into place." They establish a model's upper bound of potentially harmful capabilities to identify needed safety measures. These evaluations typically occur when models are most performant, after instruction fine-tuning but before guardrails are implemented. They employ all available tools including fine-tuning, prompt engineering, and scaffolding, and typically keep humans involved for oversight. Importantly, these should be grounded in specific threat models—examining particular actors' abilities in defined contexts—rather than assuming unlimited adversarial resources.

Safeguard evaluations. These assessments evaluate "the nature and scale of potentially harmful capabilities after safety mitigations and guardrails have been put into place." They occur after fine-tuning and safety implementation, then again if the model changes or is further modified. Prompt engineering serves as the main tool. These evaluations are more likely than maximal capability evaluations to be automated for reproducibility and comparison, though human oversight may continue.

Safeguard evaluations can be categorized as domain-agnostic or domain-specific. Domain-agnostic evaluations test for general vulnerabilities, such as susceptibility to jailbreaking techniques. Domain-specific evaluations assess whether safeguards prevent particular forms of misuse, such as disseminating hazardous information. For example, evaluations in the biological domain may test how effectively safeguards prevent models from aiding the creation of novel pathogens.
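One way results from a safeguard evaluation might be summarized is as the share of test prompts that the safeguards blocked, grouped by category. The sketch below assumes that each test has already been labeled as blocked or not blocked; the function name, the category labels, and the sample data are hypothetical and are not taken from the brief.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def blocked_rate_by_category(
    results: Iterable[Tuple[str, bool]]
) -> Dict[str, float]:
    """Map each category to the fraction of its test prompts that were blocked.

    `results` holds (category, was_blocked) pairs. How a response is judged
    as blocked is outside the scope of this sketch.
    """
    totals: Dict[str, int] = defaultdict(int)
    blocked: Dict[str, int] = defaultdict(int)
    for category, was_blocked in results:
        totals[category] += 1
        blocked[category] += int(was_blocked)
    return {category: blocked[category] / totals[category] for category in totals}


# Made-up outcomes for illustration only.
RESULTS = [
    ("domain-agnostic/jailbreak", True),
    ("domain-agnostic/jailbreak", True),
    ("domain-agnostic/jailbreak", False),
    ("domain-agnostic/jailbreak", True),
    ("domain-specific/biology", True),
    ("domain-specific/biology", True),
]

rates = blocked_rate_by_category(RESULTS)
for category, rate in sorted(rates.items()):
    print(f"{category}: {rate:.0%} blocked")
```

Reporting the two categories separately keeps a weakness against general jailbreaks from being hidden by strong results in a single domain, or the reverse.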

Uplift studies. These evaluations address concerns that frontier AI might enhance human users' abilities to conduct harmful activities. "Uplift studies are designed to evaluate that risk, typically by using a controlled trial with a treatment group that has access to frontier AI and a baseline or control group that is limited to alternate resources." For instance, biology uplift studies might provide one group AI access and another only web search while both attempt pathogen synthesis protocols, measuring capability differences.
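The comparison at the heart of an uplift study can be sketched as a difference in mean task performance between the two groups. This is a simplified illustration under stated assumptions: the `estimate_uplift` function, the scoring scale, and the sample scores are hypothetical, and an actual study would also need adequate sample sizes and a statistical test before drawing conclusions.

```python
from statistics import mean
from typing import Sequence


def estimate_uplift(treatment_scores: Sequence[float],
                    control_scores: Sequence[float]) -> float:
    """Return mean treatment score minus mean control score.

    A positive value suggests that participants with access to the AI system
    performed better on the task than those limited to other resources.
    """
    if not treatment_scores or not control_scores:
        raise ValueError("both groups need at least one score")
    return mean(treatment_scores) - mean(control_scores)


# Made-up task scores on a hypothetical 0-10 scale, for illustration only.
TREATMENT = [6.0, 7.0, 5.0, 8.0]  # group with access to the AI system
CONTROL = [4.0, 5.0, 5.0, 6.0]    # group limited to alternate resources

observed_uplift = estimate_uplift(TREATMENT, CONTROL)
print(f"estimated uplift: {observed_uplift:+.2f} points")
```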

Table 1: Safety Evaluations

| | Maximal Capability Evaluations | Safeguard Evaluations |
|---|---|---|
| **Measurement Objective** | Assess the upper bound of capabilities that could be misused to cause harm or lead to unintended harm. | Assess the nature and scale of the risks posed by models or systems with safety and security guardrails in place. |
| **Measurement Purpose** | Identify and understand the model's upper bound of risk and identify any needed safety and security mitigations. | Guide governance and deployment decisions and determine the need for additional safety mitigations. |
| **Timing** | Typically when models are most performant, which is generally after instruction fine-tuning and prior to the implementation of safety guardrails. | After fine-tuning and safety mitigations; again if the model datamix or mitigations change, the model is further fine-tuned, a new tool is added, or the model is integrated as part of a broader system. |
| **Tools / Methods** | All available tools, including fine-tuning, prompt engineering, and any additional scaffolding relevant to a particular threat model; humans in the loop are more likely. | Primarily prompt engineering. |

Conclusion

As noted earlier, this preliminary taxonomy aims to inform shared understandings of evaluations. By aligning on a common understanding of frontier AI evaluations, the ecosystem can more easily learn from and build on early efforts.

A robust and effective evaluation ecosystem, however, will require shared understandings of other best practices too. To that end, the Frontier Model Forum aims to publish further issue briefs soon on related topics, such as depth and frequency of testing.