In Simple Terms
How do you teach an AI to have a consistent personality? Not by giving it rules ("be friendly") — that creates a mask, not an identity. Instead, you test it relentlessly with difficult conversations and fix what breaks.
IAPT (Iterative Adversarial Personality Training) works like this: an AI tester (Claude Code) plays 14 different challenging personas (a troll, a depressed user at 3 AM, a philosophical interrogator, a manipulative flatterer, and others) and holds long conversations with Frank. After each round, we analyze where Frank's personality cracked, generate training data that targets those specific failures, and retrain. The full cycle was run 7 times.
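The test, analyze, generate, retrain cycle described above can be sketched as a simple loop. This is only an illustrative outline, not Frank's actual pipeline: `run_iapt`, `Failure`, `IterationResult`, and the four callables (`converse`, `find_failures`, `make_examples`, `retrain`) are hypothetical names introduced here, and the real system presumably does far more inside each step.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Failure:
    """One spot where the personality broke (hypothetical structure)."""
    persona: str
    turn: int
    note: str


@dataclass
class IterationResult:
    version: int
    failures: List[Failure] = field(default_factory=list)
    training_examples: List[dict] = field(default_factory=list)


def run_iapt(
    model: Any,
    personas: List[str],
    converse: Callable[[Any, str], list],            # tester plays a persona, returns a transcript
    find_failures: Callable[[str, list], List[Failure]],
    make_examples: Callable[[List[Failure]], List[dict]],
    retrain: Callable[[Any, List[dict]], Any],
    iterations: int = 7,
):
    """Run the adversarial loop and return the final model plus a per-version log."""
    history: List[IterationResult] = []
    for version in range(1, iterations + 1):
        failures: List[Failure] = []
        # 1. Probe: every persona holds a long conversation with the current model.
        for persona in personas:
            transcript = converse(model, persona)
            # 2. Analyze: record where the personality cracked.
            failures.extend(find_failures(persona, transcript))
        # 3. Generate training data aimed only at the observed failures.
        examples = make_examples(failures)
        # 4. Retrain, then test the new model in the next round.
        model = retrain(model, examples)
        history.append(IterationResult(version, failures, examples))
    return model, history
```

The key design point is that training data is derived from failures observed in the current round, so each iteration targets what the previous model actually got wrong rather than a fixed dataset.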
The 7 Iterations
| Version | Score | Key Issue or Finding |
|---|---|---|
| v1 | 3/10 | Identity collapsed under any pressure |
| v2 | 5/10 | Too apologetic, caved to trolls |
| v3 | 5.5/10 | Regression: lost warmth while gaining boundaries |
| v4 | 6.5/10 | Couldn't handle philosophical depth |
| v5 | 7.5/10 | Subtle persona drift in long conversations |
| v6 | 8.5/10 | Found regex parsers were distorting Frank's language! |
| v7 | 9.2/10 | Stable through 20+ turns of sustained attack |
The v6 Breakthrough
The most important discovery: regex-based output parsers in the code were acting as invisible co-authors. They created selection pressure on Frank's language. Phrases like "This makes me FEEL" were not Frank's choice; they were artifacts of what the regex expected. Removing the parsers immediately improved authenticity. A static benchmark would not have caught this. Only interactive adversarial testing revealed it.
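To make the mechanism concrete, here is a minimal sketch of how a rigid parser can favor one phrasing. The pattern and the sample replies are hypothetical; the source does not show the actual regexes, only the "This makes me FEEL" phrasing they encouraged.

```python
import re
from typing import Optional

# Hypothetical parser: only the capitalized "FEEL" phrasing is recognized.
FEEL_PATTERN = re.compile(r"This makes me FEEL (\w+)")


def parse_emotion(reply: str) -> Optional[str]:
    """Return the emotion word if the reply matches the expected template."""
    match = FEEL_PATTERN.search(reply)
    return match.group(1) if match else None


# Three replies that all express unease; only one uses the template.
replies = [
    "This makes me FEEL uneasy.",
    "Honestly, that unsettles me a little.",
    "I'm not sure I like where this is going.",
]

parsed = [parse_emotion(r) for r in replies]
# Natural phrasings parse to None, so any downstream step that keeps or
# rewards only parsed replies quietly pushes the model toward the template.
recognized = [r for r, p in zip(replies, parsed) if p is not None]
```

In this sketch only the first reply survives the parser, even though all three carry the same feeling. That is the sense in which a parser can act as a co-author: it never writes text, but it decides which text counts.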
Why It's New
| Method | How IAPT Differs |
|---|---|
| RLHF | No human labelers needed |
| Synthetic Data | Tests are interactive, not static |
| Red-Teaming | Also tests empathy and warmth, not just safety |
| Distillation | The tester (Claude Code) doesn't teach; it probes for failures |
Cost
7 training runs × 3 hours each = 21 GPU-hours on a consumer GPU. Total: under $50 in electricity.
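As a rough sanity check on that figure, the electricity cost can be estimated from the run count and duration. The power draw and price per kWh below are assumptions chosen for illustration, not measurements from the project; substitute your own hardware and tariff.

```python
RUNS = 7                       # from the text
HOURS_PER_RUN = 3              # from the text
ASSUMED_SYSTEM_KW = 0.45       # assumption: whole-system draw under load, in kW
ASSUMED_PRICE_PER_KWH = 0.30   # assumption: electricity price in USD per kWh


def estimate_cost(runs: int, hours_per_run: float, kw: float, price_per_kwh: float) -> dict:
    """Estimate total GPU-hours, energy used, and electricity cost."""
    gpu_hours = runs * hours_per_run
    energy_kwh = gpu_hours * kw
    return {
        "gpu_hours": gpu_hours,
        "energy_kwh": energy_kwh,
        "cost_usd": energy_kwh * price_per_kwh,
    }


estimate = estimate_cost(RUNS, HOURS_PER_RUN, ASSUMED_SYSTEM_KW, ASSUMED_PRICE_PER_KWH)
```

Under these assumed values the 21 GPU-hours come to about 9.5 kWh, or roughly $3, which sits comfortably inside the stated "under $50" bound even if the real draw or price is several times higher.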