IAPT: How Frank Got His Personality

In Simple Terms

How do you teach an AI to have a consistent personality? Not by giving it rules ("be friendly") — that creates a mask, not an identity. Instead, you test it relentlessly with difficult conversations and fix what breaks.

IAPT (Iterative Adversarial Personality Training) works like this: an AI tester (Claude Code) role-plays 14 different challenging people, including a troll, a depressed user at 3 AM, a philosophical interrogator, and a manipulative flatterer, and holds long conversations with Frank. After each round, we analyze where Frank's personality cracked, generate training data to fix those specific failures, and retrain. We repeated this cycle 7 times.
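The loop described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the actual training code: every name here (`Model`, `run_conversation`, `generate_training_data`, `retrain`, `fixes_per_round`) is a hypothetical stand-in, the persona list contains only the four personas named above, and the "training" is a toy that simply remembers which failures have been targeted.

```python
from dataclasses import dataclass, field

# Hypothetical persona labels; only four of the 14 are named in the text.
PERSONAS = [
    "troll",
    "depressed_user_3am",
    "philosophical_interrogator",
    "manipulative_flatterer",
]


@dataclass
class Model:
    version: int = 1
    # Toy stand-in for model weights: failure modes already trained away.
    fixed: set = field(default_factory=set)


def run_conversation(model, persona):
    """Stub for the tester role-playing one persona.

    Returns a label for the failure it observed, or None if the
    personality held up.
    """
    failure = f"breaks_character_with_{persona}"
    return None if failure in model.fixed else failure


def generate_training_data(failures):
    """Stub: one targeted training example per observed failure."""
    return [{"targets": failure} for failure in failures]


def retrain(model, examples):
    """Stub for fine-tuning: the next version covers every targeted failure."""
    learned = {example["targets"] for example in examples}
    return Model(version=model.version + 1, fixed=model.fixed | learned)


def iapt(model, personas, max_rounds=7, fixes_per_round=2):
    """Test, analyze, generate data, retrain; stop early if nothing breaks."""
    history = []
    for _ in range(max_rounds):
        failures = []
        for persona in personas:
            failure = run_conversation(model, persona)
            if failure is not None:
                failures.append(failure)
        history.append((model.version, len(failures)))
        if not failures:
            break
        examples = generate_training_data(failures[:fixes_per_round])
        model = retrain(model, examples)
    return model, history


final_model, history = iapt(Model(), PERSONAS)
for version, failure_count in history:
    print(f"v{version}: {failure_count} failing personas")
```

In a real setup the stubs would be replaced by actual conversations, failure analysis, and a fine-tuning run. The point of the sketch is only that each round's training data is derived from that round's observed failures rather than from a fixed dataset.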

The 7 Iterations

Version	Score	Key Issue Fixed
v1	3/10	Identity collapsed under any pressure
v2	5/10	Too apologetic, caved to trolls
v3	5.5/10	Regression: lost warmth while gaining boundaries
v4	6.5/10	Couldn't handle philosophical depth
v5	7.5/10	Subtle persona drift in long conversations
v6	8.5/10	Found regex parsers were distorting Frank's language!
v7	9.2/10	Stable through 20+ turns of sustained attack

The v6 Breakthrough

The most important discovery: regex-based output parsers in the code were acting as invisible co-authors. They created selection pressure on Frank's language — phrases like "This makes me FEEL" weren't Frank's choice, they were artifacts of what the regex expected. Removing the parsers immediately improved authenticity. No benchmark would have caught this. Only interactive adversarial testing revealed it.
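To make the selection-pressure effect concrete, here is a minimal sketch of how a rigid parser can favor one phrasing. The pattern and function below are hypothetical: the text only says the parsers expected phrases like "This makes me FEEL", so the exact regex and the example replies are assumptions made up for illustration.

```python
import re

# Hypothetical pattern, modeled on the phrase quoted in the text.
FEEL_PATTERN = re.compile(r"This makes me FEEL (\w+)")


def parse_emotion(reply):
    """Return the emotion word if the reply uses the expected template, else None."""
    match = FEEL_PATTERN.search(reply)
    return match.group(1).lower() if match else None


# Invented example replies: one templated, two phrased naturally.
replies = [
    "This makes me FEEL uneasy.",
    "Honestly, that question unsettles me a bit.",
    "I'm not sure I like where this is going.",
]

parsed = [parse_emotion(reply) for reply in replies]
recognized = [reply for reply, emotion in zip(replies, parsed) if emotion is not None]

print(parsed)
print(f"{len(recognized)} of {len(replies)} replies registered an emotion")
```

Only the templated reply is recognized. If downstream code rewards or keeps only recognized replies, the natural phrasings are silently filtered out, which is one way a parser can end up shaping the model's language.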

Why It's New

Method	How IAPT Differs
RLHF	No human labelers needed
Synthetic Data	Tests are interactive, not static
Red-Teaming	Also tests empathy and warmth, not just safety
Distillation	The tester (Claude) doesn't teach — it probes for failures

Cost

7 training runs × 3 hours each (about 21 GPU-hours) on a consumer GPU. Total: under $50 in electricity.
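As a rough sanity check on that figure, the arithmetic can be written out. The run count and duration come from the text. The power draw and electricity price are assumptions chosen purely for illustration, since the source states neither, so substitute your own values.

```python
# From the text: 7 training runs of 3 hours each.
RUNS = 7
HOURS_PER_RUN = 3
gpu_hours = RUNS * HOURS_PER_RUN

# Assumptions, NOT from the source: whole-system power draw and energy price.
ASSUMED_POWER_KW = 0.45
ASSUMED_PRICE_PER_KWH = 0.30  # in dollars


def electricity_cost(hours, power_kw, price_per_kwh):
    """Energy used (kWh) times price per kWh."""
    return hours * power_kw * price_per_kwh


estimate = electricity_cost(gpu_hours, ASSUMED_POWER_KW, ASSUMED_PRICE_PER_KWH)
print(f"{gpu_hours} GPU-hours, estimated electricity cost: ${estimate:.2f}")
```

Under these assumed values the estimate comes to a few dollars, comfortably within the stated bound of under $50.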

Read the full paper →
