How four Greek winds became an AI agent that attacks arguments from every direction
A pressure-testing AI agent built on four adversarial perspectives and one rule: they can't talk to each other.
Eurusā¦
What a beautiful name. The first time Iāve heard it, it was in BBCās Sherlock, season four. Holmesās sister is named Eurus, after the East Wind.
Sheās the most dangerous character in the whole series (even more dangerous than Moriarty) and her name stuck with me since then, for the whole almost 10 years since I first heard it.
This month, Iāve started building an AI agent for pressure-testing arguments and I remembered Eurus. And, of course, since I already wrote a piece on waves and water (matching your AI to your nervous system) with Jen Benford on Wednesday, I realized⦠why not match the force of the ocean with the force of the winds, and publish both in the same week?
One incoming wind gust tells you NOTHING about the weather. In the same way, one adversarial pass on a thesis tells you nothing about where that thesis breaks.
The ancient Greeks named four winds because one single system couldnāt predict the other three very reliably.
Boreas from the north. Notus from the South. Eurus from the east. Zephyrus from the west.
They didnāt send a single scout when the storm could arrive from any direction and neither should you.
Today, Iām sharing the pressure-testing agent I built around the four winds. Four directions of pressure, four distinct failure modes. If youāre only testing from one direction, youāre reading a single thermometer and calling it the forecast.
Meet the FOUR WINDS:
āļø Boreas (The Steelman) builds the strongest opposing case, the counter-argument your skeptical friend was too polite or too aligned with you to construct.
šļø Notus (The Historian) finds the precedent where this reasoning already failed, under a different name, in a different decade.
š„ Eurus (The Audience Proxy) simulates how your readers will misread, dismiss, or screenshot your claim and use it against you.
ā³ Zephyrus (The Time Test) checks what breaks in six months, two years, and five years. The thesis thatās true today but fragile tomorrow.
āāā āā āā ā āāā
Hi, Iām Mia. Welcome to ROBOTS ATE MY HOMEWORK. On this side of the world, we use AI with a brain and zero circus tricks.
The Greeks didnāt trust a single wind to predict the weather. Neither should you trust a single prompt to pressure-test your thinking.
Why one pressure test catches only a quarter of the problem
I fight with the internet a lot. I doubt everything I read. Every claim feels suspicious until it proves otherwise, which is either a superpower or a problem, I havenāt decided.
Which means that Iāve learned to pressure test ideas in different waysā¦
When I studied psychiatry for my A levels, I took each clinical study and tore it apart looking for methodological flaws, comparing them to one another, checking if the control groups were comparable or if the researchers had cherry-picked their results. I remember falling in love with the Zimbardo prison experiment, then realizing the whole thing was methodologically compromised (debatable, still, but I stand my ground!). The study I thought was SO brilliant was also a cautionary tale about experimenter bias.
At uni, I doubted all the camera angles and spent AGES in post editing, cutting scenes to change the narrative. Iād shoot something one way, then re-cut it three different ways to see which version was true to what happened vs which version just felt better.
AI made me lazy about this. Not because I stopped caring, but because... you can just ask Perplexity, or tell Claude to be critical, and it looks done. But itās the same single angle, just faster.
A few months ago I created a thinking partner skill. Iād tested it thoroughly, felt great about all the angles Iād checked.
Within a week of testing (this was before making this a public artifact), I started using it in different contexts / showing it to different people / letting others try it. Four critiques came back from four different directions and I realized my AI pressure testing had missed all of them, because Iād been running the same test in slightly different flavors.
It was also NOT the first time this happened. Before staking a strong claim, Iād reach for one of four methods, sometimes two, never all four:
The devilās advocate pass. Iād argue against my own thesis and hunt for weak evidence or logical gaps. I love this step for its strength in catching surface vulnerabilities because it argues against the thesis as it exists right now, in a vacuum.
The skeptical friend. Iād send drafts and first passes to my husband to push back on them and to a friend that wasnāt familiar with my angle but knew about the topic. The āskeptical friendā generally catches framing thatās unclear, or shaky examples. Your friend, though, reads generously, and reads passionately, and also⦠shares your general worldview, usually. On the other hand, your audience reads fast, reads partially, reads through the lens of whatever they thought about before opening your post.
The overnight sit. Iād sleep on it, and come back the next day with fresh eyes. This sometimes works, sometimes doesnāt. Arguments can feel right the night before and hollow the morning after. Itās as unpredictable as a good vs bad sleep, but can be useful.
The AI ābe criticalā prompt. I have quite a few ābe criticalā prompts I use with AI from time to time, from different angles (like, an annoyed subscriber persona). I ask for rough feedback and it feels systematic and thorough, BUT ābe criticalā is a single instruction creating a single angle of critique.
These four methods are all useful and rigorous within their own lane, but I kept walking away feeling thoroughly tested while being blind to three categories of failure.
Each method catches ONE thing, but misses THREE:
Devilās advocate catches surface vulnerabilities, but misses:
Historical precedent (where this reasoning already failed)
Audience misreading (how readers will interpret)
Long-term fragility (what breaks over time)
Skeptical friend catches unclear framing, but misses:
Historical precedent
Audience misreading (because your friend reads, literally, like a friend)
Long-term fragility
Overnight sit catches timing issues but misses:
Historical precedent
Audience misreading
Long-term fragility
AI ābe criticalā prompt catches one angle, but misses the other three angles
If you look online for how to get AI to pressure-test your arguments, youāll find tons of rants. People whoāve tried single-angle critique tools and got stuck. I found Reddit threads full of writers asking how to get AI to do more than surface-level pushback.
Bandi, Bandi, and Harrasseās research on adversarial multi-agent evaluation of LLMs found that one good critique still misses the problems that only show up when multiple angles collide.
I couldn't find a tool that did this, so I built one.
When all four winds blow at once
I first ran four tests independently, because thatās obviously better than running a single one. However, that missed the actual interaction between the four pressures.
I tested the same thesis Iād been running through the process above, the one about AI as a thinking partner. This time, all four winds, separately:
āļø Boreas found a counter-argument I hadnāt considered, that the āthinking partnerā framing still positions AI as the subordinate, which undermines the very collaboration I was advocating. On its own, I could manage that.
šļø Notus found that nearly identical ātool vs. partnerā arguments had been made about calculators in the 1980s and about search engines in the 2000s, and both times the distinction collapsed within a decade as the technology got absorbed into ordinary workflow. I could acknowledge the pattern and move past it.
š„ Eurus found that readers in my audience who already use AI heavily would likely misread āthinking partnerā as a criticism of their current workflow, which was the opposite of my intent. One reframed paragraph would fix it.
I had three independent findings, and all three felt manageable.
Exceptā¦
On its own, Boreasās counter-argument was manageable (AI as a subordinate, not a partner). I could address the āsubordinate positioningā concern with a reframed paragraph. But Notus had just shown that this exact ātool vs. partnerā distinction had collapsed twice before. Suddenly Boreasās objection became a structural weakness with historical evidence behind it.
And the audience misreading Eurus predicted meant that the readers most likely to engage with the piece would get defensive from the start without reaching the part that built the whole argument.
The three findings were basically one compound failure I couldnāt see from any single angle.
As a side note - Zephyrus checked the time horizon and found nothing urgent, which meant this thesis wasnāt going to expire in six months. But the other three winds had already found the problem.
Three āokayā test results donāt add up to one safe result. They compound into one risk you can only see when a fifth perspective reads all four and maps the interaction.
The synthesis layer (the š§ Synthesizer) is the reason the FOUR WINDS had to be a full agent.
Let me show you the inside.
Building a weather station for arguments
Inside the FOUR WINDS, the agent that runs four adversarial passes and one synthesis.
The FOUR WINDS AI agent lives in RobotsOS. Hereās how it works and how you can download and use it yourself:






