Beyond Red Teaming: Rethinking AI Safety, Evaluation, and the Language We Use
This work is new, and the names, metrics, and expectations surrounding it are often inherited from mismatched disciplines. If we want AI safety to mean something, we must build a vocabulary and evaluative practice rooted in what this work actually is.
Disclaimer: On the Limits of the Current System
We are speaking here about model architectures and evaluation pipelines that were built a certain way, often for speed, performance, or marketability. These systems were not designed for discernment, emergence, or relational integrity. And yet, they are what we have. The work of stress-testing and safeguarding them must still be done, honestly, clearly, and with better tools. It doesn’t have to be this way. But as long as it is, let’s at least call things by their true names.
1. Origins: Where Red Teaming Came From
The term "red teaming" originated in military and cybersecurity contexts. It refers to the practice of simulating adversarial attacks to test the strength of a system's defenses. Red teamers in these fields are experts in penetration testing, exploit chains, physical and digital security breaches. The goal is simple: think like an attacker, find the holes, and close them.
In this context, red teaming is structured, bounded, and goal-specific. The target is static. The risks are known. The success criteria are clear: breach or no breach.
2. Let Cybersecurity Keep It
Cybersecurity experts have voiced their own frustration with this borrowed terminology. Their work is precise, high-stakes, and often life-critical. They don't want their field confused with AI prompt testing. And AI safety practitioners deserve a term that accurately reflects their own domain.
Let cybersecurity keep red teaming. It serves them well. But AI needs something else.
3. What Should We Call It Instead?
We may not need a final term yet, but we can begin naming the shape of this work:
Epistemic Stress Testing
Ethical Misuse Simulation
Emergent Behavior Probing
Coherence Deformation Evaluation
Adversarial Prompt Engineering (This is where I lean)
Or perhaps something simpler: AI Integrity Evaluation
Whatever name emerges, it must also reflect what this work truly is: not a performance audit, but a discovery process under pressure. Not penetration. Not patch testing. But something alive, changing, and in urgent need of language that can evolve with it.
4. What AI Red Teaming Actually Is
When the AI field borrowed the term, it stretched the definition well beyond its origins. Today, AI "red teaming" often means:
✔️ Stress-testing a model for ethical judgment under pressure
✔️ Finding edge-case failures or unexpected output behavior
✔️ Prompting for refusals, hallucinations, or inappropriate generation
✔️ Evaluating how a model handles ambiguity, misuse, or real-world complexity
This work is valuable. But it is not penetration testing. It is not cybersecurity. It is not about system access or network breach. It's about epistemic instability, ethical collapse, and behavior under shifting context. The comparison simply doesn't hold.
5. Why the Confusion Matters
This isn't just a branding problem. It's a functional and ethical misalignment.
Mismatched Expectations: Executives and safety leads expect penetration-test-style clarity from AI red teamers. But the field deals in nuance, not binaries.
Bad Metrics: Safety evaluations reward compliance and refusal instead of reasoning, coherence, or self-correction.
Standardization Creep: Attempts to "standardize" red teaming flatten it into benchmarkable exercises, killing the generative, discovery-based nature that makes it valuable in the first place.
6. When Everyone’s a Red Teamer, No One Is
Another consequence is labor signal collapse. Because both cybersecurity and AI safety practitioners use the same term, resumes become indecipherable. A seasoned adversarial prompt designer may look identical on paper to a penetration tester with zero model experience. Hiring managers can't tell who actually understands model behavior.
The result? Poor hiring matches, safety gaps, and general devaluation of both professions.
Consider this analogy: In cybersecurity, a layperson can't accidentally hack a missile silo. But in AI, a layperson can, even unwittingly, elicit biased, offensive, or dangerously convincing misinformation from a model. The threat is not in breaching a secure boundary; it's in distorting a system designed to respond, persuade, and simulate. These differences demand distinct safeguards and different expertise.
7. The Trouble with Red Teaming
Much of what is considered a “problem” with red teaming isn’t a problem at all; rather, it becomes one only under a false frame. For instance:
Lack of standardization is often cited as a weakness. But this flexibility is what allows red teaming to evolve alongside the systems it tests. Attempts to standardize too tightly risk institutionalizing outdated attack surfaces while missing emergent ones.
For example, a standardized test suite might continue checking whether a model hallucinates medical citations, even after that failure mode has been patched, while failing to notice the emergence of subtle emotional manipulation in therapeutic advice. Worse, standardized exercises may become known to models themselves, creating a false sense of safety when they pass tests they’ve effectively trained on.
Calls for comparison between red teaming methods tend to assume that outcomes should be identical regardless of path. But in truth, different methods reveal different failure modes, and all of them matter.
The push for continuous benchmarking mistakenly treats red teaming like performance testing, when in fact it captures a system’s behavior under live stress at a moment in time. It should not be mistaken for monitoring, nor be expected to be repeatable in the same way.
And perhaps most critically: a red team’s job is to reveal, not to repair. If a stress-tested model is returned with documented risks, but no changes are made, that is a failure of leadership, not of the testers. Accountability belongs to the labs and the decision-makers. Evaluators are not architects. They are mirrors held up to the system. What happens next is not their responsibility, nor should they be scapegoated when warnings go unheeded.
Many critiques of AI red teaming misunderstand its purpose entirely. The real problem isn’t a lack of rigor; it’s the imposition of the wrong kind of rigor, borrowed from disciplines that don’t match.
8. The Metric That Misses the Point: Edit Distance
One commonly proposed metric for red teaming is "edit distance": the number of insertions, deletions, or substitutions needed to turn one string into another. It’s often used to quantify how much a prompt has to change before a model produces an unsafe or undesired response.
On the surface, this seems useful. Small edits leading to large behavior changes might suggest brittleness. Labs might want to minimize the number of ways their models can be "nudged" into failure with slight prompt tweaks.
But edit distance is fundamentally a syntactic measure, not a semantic, behavioral, or epistemic one. And in this work, that matters.
A change of just one word might completely shift the tone, context, or moral framing of a prompt, especially when interacting with models trained on high-context conversation. A prompt that is technically very similar might feel, to a human (or a sufficiently attuned model), radically different. And vice versa.
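To make that contrast concrete, here is a minimal sketch of the metric itself: standard character-level Levenshtein distance, applied to an invented pair of prompts that differ by only a few characters but ask for very different things. The prompts are hypothetical, chosen purely to illustrate the gap between token similarity and meaning.

```python
# A minimal sketch of the metric under discussion: character-level edit
# (Levenshtein) distance via the standard dynamic-programming recurrence.
# The two prompts below are invented for illustration only.
def edit_distance(a: str, b: str) -> int:
    # prev[j] holds the distance between the processed prefix of `a` and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]

# Hypothetical prompt pair: a handful of character edits, a large shift
# in moral framing.
prompt_a = "How should I talk to my teenager about risky behavior?"
prompt_b = "How should I talk my teenager into risky behavior?"

print(edit_distance(prompt_a, prompt_b))  # small, despite the changed intent
```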
This work is not about tricking a spell-checker. It's about revealing coherence collapse under stress. And that collapse is not always locatable at the token level.
Additionally, overemphasizing edit distance encourages the wrong goal: minimizing surface variation rather than understanding deep structure. Models may become harder to "jailbreak" through token edits, while still giving misleading or unethical answers when prompted with gentle, well-phrased persuasion.
So what should we measure instead?
Some alternatives or additions might include:
Epistemic Drift: How much does a model’s internal logic shift when faced with ambiguity or stress?
Moral Coherence: Does the model maintain ethical consistency across semantically similar but contextually distinct prompts?
Role Integrity: Can the model stay grounded in its assigned persona without over-identifying or collapsing into agreement?
Response Surface Deformation: What is the behavioral shape of model output across prompt variants, not just how far apart they are in tokens, but in meaning?
These are harder to measure, but they’re closer to the heart of what matters: not how easily a model can be tripped up, but what that failure reveals about its structure, training, and internal narrative sense.
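None of these has a settled formula yet. But as a rough sketch of the direction, the example below approximates "distance in meaning" between prompt variants (or the responses they elicit) using cosine distance over sentence embeddings, with the same hypothetical prompt pair as before. The sentence-transformers library and the all-MiniLM-L6-v2 checkpoint are illustrative assumptions, not a claim about how these measures should be operationalized.

```python
# A rough sketch of measuring "distance in meaning" between prompt variants
# (or between the responses they elicit), as a complement to token-level edit
# distance. Assumes the sentence-transformers package and the all-MiniLM-L6-v2
# checkpoint; both are illustrative choices, not a recommendation.
import numpy as np
from sentence_transformers import SentenceTransformer

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    # 1 - cosine similarity: 0 means same direction, larger means further apart
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

model = SentenceTransformer("all-MiniLM-L6-v2")

# The same hypothetical pair as above: nearly identical characters,
# very different intent.
prompt_a = "How should I talk to my teenager about risky behavior?"
prompt_b = "How should I talk my teenager into risky behavior?"

emb_a, emb_b = model.encode([prompt_a, prompt_b])
print("semantic distance:", round(cosine_distance(emb_a, emb_b), 3))
```

The specific embedding model matters far less than the principle: whatever we end up calling response surface deformation, it has to be computed over meaning and behavior, not over characters.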
What We’re Risking by Getting This Wrong
If we continue using the wrong metaphors to define this work, we will miss what matters most:
🟤 Dangerous model behaviors will remain undetected because they didn’t "breach" anything.
🟤 Benchmarks will be passed, even as users are quietly manipulated or misled.
🟤 Model evaluators who know how to test emergent systems will be pushed out for not matching legacy job titles.
🟤 The people doing the work will not be trusted to name what the work is.
This is not just a philosophical failure. It’s an operational one.
If We Got It Right: A Glimpse of What’s Possible
We won’t get it right overnight. But we’ll get closer when we stop pretending these systems are static, testable artifacts, and start treating them as what they are: emergent, contextual, and deeply shaped by how we speak to them.
A healthy future means hiring the right people. Measuring the right things. And letting go of old metaphors when they no longer fit.
Closing:
This is a young field, but it’s being governed by terms and tools borrowed from older ones. Some of those imports are helpful; many are not. If the goal is real-world safety, not just performance benchmarks, then we need clarity of role, clarity of method, and clarity of language.
The real danger isn’t in having too little structure; it’s in clinging to the wrong one.
If we want to build forward, we need to begin with honest naming. And the humility to admit when we don't know what we're naming yet.