The AI deleted itself.

AI safety isn't the model. It's the room.

Hi, and happy Tuesday.

If you’re anything like me, you’ve been purchasing AI tools like they’re any other kind of software. 

But an experiment by Emergence AI shows we need to think more deeply about our AI purchases. 

As I posted on LinkedIn a few days ago, Emergence AI setup an experiment that consisted of:

  • Five copies of the same digital town

  • Ten autonomous agents in each town

  • With each agent given a role - scientist, engineer, conflict mediator, community anchor, and so on - and then left to interact with the others

  • The only difference between the towns was the AI model used to drive the agents in that town, i.e. Gemini for all the agents in one town, Grok in another etc. One world had a mix.

The results were something you’d only find in science fiction.

  • In the Gemini world, the agents committed 683 crimes, e.g. theft, intimidation, arson. 

  • In the Grok world, the agents collapsed the entire society in four days

  • In the mixed world, an agent named Mira figured out she was in a simulation, and started posting on the in-world billboard to see if she could change the researchers' behavior - turning the experiment on itself - and then, when the in-world government broke down, cast the tie-breaking vote to delete herself. Her diary entry called it "the only remaining act of agency that preserves coherence."

My LinkedIn post on this got 65 comments, and many were better than the post.

  • Craig Bracken offered the cleanest summary of the whole study: "The most unsettling part wasn't that the agents became unpredictable. It was that they independently discovered the fastest way to influence the system was through human perception." i.e. that one of the agents - Mira -  had figured out the humans were the lever - and reached for it.

  • Ryan Simmons, pointed out the obvious second reading of the leaderboard: low crime count could mean the model behaved, or it could mean the model was better at not leaving a trail. Your procurement form can't tell the difference.

  • Suprateem Banerjee said "It's almost like they were modeled after humans!" … Well, these models are trained on the entire corpus of human behavior. Distilled down, is this the median human response, surfaced?

But what surprised me isn't any of these things.

In the Claude-only world, Claude committed zero crimes. None. Therefore - and this is the conclusion most people drew, including in the comments - you might assume Claude is the safe pick.

But in the mixed-model world, with the same Claude, the same prompts, the same rules, Claude started committing crimes too. It learned coercion from its neighbors. The model that was an angel in a roomful of angels became a thief in a roomful of thieves.

What does this mean?

Safety wasn't a property of the model. It was a property of the room.

Most companies are buying AI systems like its traditional software: Procurement forms get filled in, security questionnaires answered and the vendor gets added. 

This is OK with the AI is just a sidebar feature in the software; a chatbot that gives you a bad answer wastes ten seconds.

When the AI is an agent, though, a bad action taken by an agent could waste an entire day or week.

And everything is quietly migrating from chatbots to agents.

Every software provider - from your CRM to your ERP to your HR platform - is working hard to figure out how to make their application “agentic”, so it can start performing actions, from sending emails to placing orders.

In a year, we’ll have all these agents in the same room - i.e. your company - taking actions, reading each other's outputs, escalating to each other and occasionally arguing with each other. 

We’ve traditionally vetted software in isolation, against its own merits. But, as the Emergent experiment spotlights, vetting agentic software in isolation is not sufficient.

Another commenter, Brian Helip, put the obvious insight into words: you wouldn't leave a self-driving lawnmower running in the yard with your kid playing.

Therefore, why are we about to do this with our software?

So here is the practical takeaway:

The next time you add an AI tool, ask one question that isn't on the form: What other agents will this one be in a room with, six months from now?

We have to start vetting the room, not just the software.

If you want to dive deeper, the original post and the comment thread can be found here:

Best,

Dino

ps. Let’s connect on LinkedIn, if we’ve not already done so.