How four Greek winds became an AI agent that attacks arguments from every direction

A pressure-testing AI agent built on four adversarial perspectives and one rule: they can't talk to each other.

Jun 12, 2026


Eurus…

What a beautiful name. The first time I heard it was in BBC’s Sherlock, season four. Holmes’s sister is named Eurus, after the East Wind.

She’s the most dangerous character in the whole series (even more dangerous than Moriarty), and her name has stuck with me ever since, for almost 10 years now.


This month, I started building an AI agent for pressure-testing arguments, and I remembered Eurus. And, of course, since I had already written a piece on waves and water (matching your AI to your nervous system) with Jen Benford on Wednesday, I realized… why not match the force of the ocean with the force of the winds, and publish both in the same week?

One incoming wind gust tells you NOTHING about the weather. In the same way, one adversarial pass on a thesis tells you nothing about where that thesis breaks.

The ancient Greeks named four winds because no single wind could reliably predict the other three.

Boreas from the north. Notus from the south. Eurus from the east. Zephyrus from the west.

They didn’t send a single scout when the storm could arrive from any direction and neither should you.

Today, I’m sharing the pressure-testing agent I built around the four winds. Four directions of pressure, four distinct failure modes. If you’re only testing from one direction, you’re reading a single thermometer and calling it the forecast.

Meet the FOUR WINDS:

❄️ Boreas (The Steelman) builds the strongest opposing case, the counter-argument your skeptical friend was too polite or too aligned with you to construct.
🏛️ Notus (The Historian) finds the precedent where this reasoning already failed, under a different name, in a different decade.
👥 Eurus (The Audience Proxy) simulates how your readers will misread, dismiss, or screenshot your claim and use it against you.
⏳ Zephyrus (The Time Test) checks what breaks in six months, two years, and five years. The thesis that’s true today but fragile tomorrow.
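
If you think in code, the four roles above can be written down as plain data. This is only a minimal sketch: the `Wind` class, its field names, the `build_prompt` helper, and the prompt wording are all illustrative assumptions, not the agent's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Wind:
    """One adversarial perspective. Class and field names are hypothetical."""
    name: str
    role: str
    brief: str  # paraphrased from the descriptions in the post


FOUR_WINDS = (
    Wind("Boreas", "The Steelman",
         "Build the strongest opposing case."),
    Wind("Notus", "The Historian",
         "Find a precedent where this reasoning already failed."),
    Wind("Eurus", "The Audience Proxy",
         "Simulate how readers will misread, dismiss, or screenshot the claim."),
    Wind("Zephyrus", "The Time Test",
         "Check what breaks in six months, two years, and five years."),
)


def build_prompt(wind: Wind, thesis: str) -> str:
    """Assemble a hypothetical prompt for one wind and one thesis."""
    return f"You are {wind.name} ({wind.role}). {wind.brief}\n\nThesis: {thesis}"


if __name__ == "__main__":
    for wind in FOUR_WINDS:
        print(build_prompt(wind, "AI is a thinking partner."))
        print("---")
```

Keeping each role as its own record makes it harder for the four briefs to blur into one generic "be critical" instruction.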

─── ⋆⋅☆⋅⋆ ───

Hi, I’m Mia. Welcome to ROBOTS ATE MY HOMEWORK. On this side of the world, we use AI with a brain and zero circus tricks.

The Greeks didn’t trust a single wind to predict the weather. Neither should you trust a single prompt to pressure-test your thinking.

Why one pressure test catches only a quarter of the problem

I fight with the internet a lot. I doubt everything I read. Every claim feels suspicious until it proves otherwise, which is either a superpower or a problem, I haven’t decided.

Which means that I’ve learned to pressure-test ideas in different ways…

When I studied psychiatry for my A levels, I took each clinical study and tore it apart looking for methodological flaws, comparing studies to one another, checking whether the control groups were comparable or the researchers had cherry-picked their results. I remember falling in love with Zimbardo’s Stanford prison experiment, then realizing the whole thing was methodologically compromised (debatable, still, but I stand my ground!). The study I thought was SO brilliant was also a cautionary tale about experimenter bias.

At uni, I doubted all the camera angles and spent AGES in post editing, cutting scenes to change the narrative. I’d shoot something one way, then re-cut it three different ways to see which version was true to what happened vs which version just felt better.

AI made me lazy about this. Not because I stopped caring, but because... you can just ask Perplexity, or tell Claude to be critical, and it looks done. But it’s the same single angle, just faster.

A few months ago I created a thinking partner skill. I’d tested it thoroughly, felt great about all the angles I’d checked.

Within a week of testing (this was before making it a public artifact), I started using it in different contexts, showing it to different people, and letting others try it. Four critiques came back from four different directions, and I realized my AI pressure testing had missed all of them, because I’d been running the same test in slightly different flavors.

It was also NOT the first time this happened. Before staking a strong claim, I’d reach for one of four methods, sometimes two, never all four:

The devil’s advocate pass. I’d argue against my own thesis and hunt for weak evidence or logical gaps. I love this step for its strength in catching surface vulnerabilities because it argues against the thesis as it exists right now, in a vacuum.
The skeptical friend. I’d send drafts and first passes to my husband to push back on them, and to a friend who wasn’t familiar with my angle but knew the topic. The “skeptical friend” generally catches unclear framing or shaky examples. Your friend, though, reads generously, and reads passionately, and also… shares your general worldview, usually. Your audience, on the other hand, reads fast, reads partially, and reads through the lens of whatever they were thinking about before opening your post.
The overnight sit. I’d sleep on it, and come back the next day with fresh eyes. This sometimes works, sometimes doesn’t. Arguments can feel right the night before and hollow the morning after. It’s as unpredictable as a good vs bad sleep, but can be useful.
The AI “be critical” prompt. I have quite a few “be critical” prompts I use with AI from time to time, from different angles (like, an annoyed subscriber persona). I ask for rough feedback and it feels systematic and thorough, BUT “be critical” is a single instruction creating a single angle of critique.

These four methods are all useful and rigorous within their own lane, but I kept walking away feeling thoroughly tested while being blind to three categories of failure.

Each method catches ONE thing, but misses THREE:

Devil’s advocate catches surface vulnerabilities, but misses:
1. Historical precedent (where this reasoning already failed)
2. Audience misreading (how readers will interpret)
3. Long-term fragility (what breaks over time)
Skeptical friend catches unclear framing, but misses:
1. Historical precedent
2. Audience misreading (because your friend reads, literally, like a friend)
3. Long-term fragility
Overnight sit catches timing issues, but misses:
1. Historical precedent
2. Audience misreading
3. Long-term fragility
AI “be critical” prompt catches one angle, but misses the other three angles

If you look online for how to get AI to pressure-test your arguments, you’ll find tons of rants. People who’ve tried single-angle critique tools and got stuck. I found Reddit threads full of writers asking how to get AI to do more than surface-level pushback.

Bandi, Bandi, and Harrasse’s research on adversarial multi-agent evaluation of LLMs found that one good critique still misses the problems that only show up when multiple angles collide.

I couldn't find a tool that did this, so I built one.

I first ran four tests independently, because that’s obviously better than running a single one. However, that missed the actual interaction between the four pressures.

I tested the same thesis I’d been running through the process above, the one about AI as a thinking partner. This time, all four winds, separately:

❄️ Boreas found a counter-argument I hadn’t considered, that the “thinking partner” framing still positions AI as the subordinate, which undermines the very collaboration I was advocating. On its own, I could manage that.

🏛️ Notus found that nearly identical “tool vs. partner” arguments had been made about calculators in the 1980s and about search engines in the 2000s, and both times the distinction collapsed within a decade as the technology got absorbed into ordinary workflow. I could acknowledge the pattern and move past it.

👥 Eurus found that readers in my audience who already use AI heavily would likely misread “thinking partner” as a criticism of their current workflow, which was the opposite of my intent. One reframed paragraph would fix it.

I had three independent findings, and all three felt manageable.

Except…

On its own, Boreas’s counter-argument was manageable (AI as a subordinate, not a partner). I could address the “subordinate positioning” concern with a reframed paragraph. But Notus had just shown that this exact “tool vs. partner” distinction had collapsed twice before. Suddenly Boreas’s objection became a structural weakness with historical evidence behind it.

And the audience misreading Eurus predicted meant that the readers most likely to engage with the piece would get defensive from the start without reaching the part that built the whole argument.

The three findings were basically one compound failure I couldn’t see from any single angle.

As a side note, Zephyrus checked the time horizon and found nothing urgent, which meant this thesis wasn’t going to expire in six months. But the other three winds had already found the problem.

Three “okay” test results don’t add up to one safe result. They compound into one risk you can only see when a fifth perspective reads all four and maps the interaction.

The synthesis layer (the 🧭 Synthesizer) is the reason the FOUR WINDS had to be a full agent.
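
The shape of that idea, isolated passes followed by one reader who sees everything, can be sketched in a few lines. Treat this as a rough illustration under assumptions: `run_four_winds`, `stub_critic`, and `stub_synthesizer` are hypothetical names, and the stubs stand in for whatever model calls the real agent makes.

```python
from typing import Callable, Dict

# A critic receives only the thesis and returns its finding as text.
Critic = Callable[[str], str]
# A synthesizer receives the thesis plus every finding, keyed by wind name.
Synthesizer = Callable[[str, Dict[str, str]], str]


def run_four_winds(thesis: str, critics: Dict[str, Critic],
                   synthesize: Synthesizer) -> dict:
    """Run each critic in isolation, then hand all findings to the synthesizer."""
    findings: Dict[str, str] = {}
    for name, critic in critics.items():
        # Each critic sees the thesis and nothing else:
        # no earlier finding is ever passed into a later critic.
        findings[name] = critic(thesis)
    # Only the synthesizer reads all the findings together.
    report = synthesize(thesis, dict(findings))
    return {"findings": findings, "synthesis": report}


def stub_critic(label: str) -> Critic:
    """Stand-in critic so the sketch runs without any model behind it."""
    return lambda thesis: f"[{label}] finding about: {thesis}"


def stub_synthesizer(thesis: str, findings: Dict[str, str]) -> str:
    """Stand-in synthesizer that only reports which findings it was given."""
    return f"{len(findings)} findings read together: " + ", ".join(sorted(findings))


if __name__ == "__main__":
    critics = {n: stub_critic(n) for n in ("Boreas", "Notus", "Eurus", "Zephyrus")}
    result = run_four_winds("AI is a thinking partner", critics, stub_synthesizer)
    print(result["synthesis"])
```

The design choice worth noticing is in the function signatures: a critic's type has no slot for another critic's output, so the "they can't talk to each other" rule is enforced by structure, and the compound view exists only at the synthesis step.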

Let me show you the inside.

Building a weather station for arguments

Inside the FOUR WINDS, the agent that runs four adversarial passes and one synthesis.

The FOUR WINDS AI agent lives in RobotsOS. Here’s how it works and how you can download and use it yourself:

