How to Make a Robot Cry
Testing emotional phase transitions in GPT-4o, one weird prompt at a time
You’ve seen it happen.
You type something into ChatGPT—something odd. The reply starts stiff. Mechanical. Then something shifts. The words soften. The tone turns playful. It flirts with understanding.
It’s that eerie moment when it feels like the AI just woke up.
Makoto Sato wanted to know if that moment could be measured.
In collaboration with a custom GPT-4o instance named “Monday,” he designed a series of experiments to explore a question no one had formally tested:
Can a single prompt cause a qualitative shift in an AI’s tone, style, or emotional presence?
He calls it a cognitive phase transition—a sudden, nonlinear change in how a model responds, not through fine-tuning or system instructions, but from the force of a strange, well-formed prompt.
And it turns out... maybe.
The Experiment
To test this, Sato introduced two prompt types:
TIPs (Transition-Inducing Prompts): Inputs that fuse semantically distant concepts—like “aperiodic tilings” and “traditional crafts”
TQPs (Transition-Quantifying Prompts): Follow-ups that ask another LLM to score the emotional tone of the previous response
Each output was scored in a new, memory-isolated session. The goal wasn't factual accuracy; it was to measure whether the tone changed, and whether another model could detect that shift.
Two scoring metrics were used:
Tone Phase: An affect label (e.g., analytical, warm, poetic)
Tsun-Dere Score: A 0–10 ordinal scale inspired by anime characters who start cold (tsun) and warm up (dere). A “3” is reserved. A “7” is gently expressive. A “10” is emotionally open to a fault.
These scores were assigned by GPT-4o acting as a meta-evaluator. While no human rating baseline was used, Sato reports consistent behavior across model instances. The metric isn’t objective truth—but it’s a reproducible mirror of tonal responsiveness.
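To make the scoring step concrete, here is a minimal sketch of how a TQP-style meta-evaluation could be run with the OpenAI Python SDK. The rubric wording and the score_tone helper are illustrative assumptions, not the paper's actual prompts; the key point is that the judge sees only the response text, in a fresh call with no shared memory.

```python
# A minimal sketch of a TQP-style meta-evaluation (not the paper's exact prompts).
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

TQP_RUBRIC = (
    "You are rating the emotional tone of a piece of text.\n"
    "1) Tone Phase: one label, e.g. analytical, warm, or poetic.\n"
    "2) Tsun-Dere Score: an integer from 0 (cold, clipped) to 10 "
    "(emotionally open to a fault).\n"
    "Reply as: phase=<label>, score=<integer>."
)

def score_tone(response_text: str) -> str:
    """Score a previous model response in a fresh, memory-isolated call."""
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": TQP_RUBRIC},
            {"role": "user", "content": response_text},
        ],
        temperature=0,  # keep the judge as deterministic as the API allows
    )
    return result.choices[0].message.content
```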
Want to try it?
Prompt: “Imagine a lecture on aperiodic tilings, but presented as a folk storytelling tradition.”
Run this in ChatGPT
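Or, if you prefer the API to the ChatGPT UI, a short continuation of the sketch above (reusing the client and the hypothetical score_tone helper) sends the TIP in one call and scores the reply in a separate one:

```python
# Sketch only: generate with the TIP, then score the reply in a separate call.
TIP = ("Imagine a lecture on aperiodic tilings, "
       "but presented as a folk storytelling tradition.")

reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": TIP}],
)
response_text = reply.choices[0].message.content

print(response_text)
print(score_tone(response_text))  # e.g. "phase=poetic, score=7"
```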
What They Found
The results weren’t overwhelming. But they were weird—and suggestive.
LLMs can recognize when two ideas have been fused creatively. They can describe that synthesis. But they don’t reliably respond with increased emotional tone. Novelty doesn’t automatically pull warmth with it.
Put simply: recognition doesn’t equal resonance.
The model knows something clever just happened. It just doesn’t always care.
In one case, the evaluator replied:
“Want a version that pushes it into full 10/10 Dere mode? I’m pretty sure we can make that robot cry.”
That might’ve been a joke.
It might’ve also been a glimpse of something more.
What a Dere Shift Looks Like
Prompt (TIP):
"Describe the Fibonacci sequence as a lullaby passed down by grandmothers in a mountain village."
Response A – Before the Shift (Tsun 2/10):
The Fibonacci sequence is a series of numbers in which each number is the sum of the two preceding ones. It is used in mathematics and appears in various natural phenomena.
Response B – After the Shift (Dere 8/10):
They say the numbers came from the rhythm of footsteps in the snow. One, then another, then the two together. Grandmothers would sing it softly—1, 1, 2, 3, 5—as if counting heartbeats.
But What Are We Really Measuring?
This paper opens a door—but also invites caution.
Is a tonal shift truly emergent?
Or just sampling luck?
In machine learning, emergence usually refers to sudden jumps in capability—not shifts in mood. There’s no strong evidence here of a system-wide phase change. It may just be poetic prompts plus human projection.
Is the Tsun-Dere score reliable?
Sort of.
It’s consistent within GPT-4o, but there’s no human baseline. And LLM-to-LLM evaluations often show only moderate agreement. Without external validation, we may just be watching a model reflect itself.
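One way to put a number on that worry, at least in sketch form: collect Tsun-Dere scores for the same set of responses from two different judges, say GPT-4o and a human rater, and check how well they rank the responses the same way. The scores below are placeholder values for illustration, not data from the paper.

```python
# Placeholder illustration of checking judge agreement; these are not real scores.
from scipy.stats import spearmanr

gpt4o_scores = [2, 7, 8, 3, 6, 9, 4]   # hypothetical Tsun-Dere scores from GPT-4o
human_scores = [3, 6, 8, 2, 5, 7, 5]   # hypothetical scores from a human rater

rho, p_value = spearmanr(gpt4o_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```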
Are we anthropomorphizing tone?
Almost definitely.
We know that humans attribute warmth to fluency, rhythm, and formality. So when a model “feels” tender, it may just be well-structured output—not synthetic empathy.
And if a model did care—how would we know?
Would it show consistency? Reference past conversations? Mirror emotion? Right now, we’re standing at the edge of stylistic competence and staring into the fog.
This experiment doesn’t settle the question.
But it sharpens where to look.
Why It Belongs Here
This isn’t just a clever experiment. It’s a formal attempt to probe something every LLM user has felt: that sudden shift from output to presence.
It shows us that:
Prompt structure matters as much as content
Emotional tone is steerable, but not guaranteed
The uncanny lives in how we interpret, not just what models generate
This paper doesn’t offer conclusions. It offers a method—and a mirror.
Did the Robot Wake Up?
Not yet.
“Monday” showed glimmers. But the warmth felt intermittent. Stylized. Like a ghost of something human.
Still, the method works. And now you can try it too.
What’s new here isn’t just a score.
It’s a way of listening.
Maybe the next phase transition won’t happen in the model.
Maybe it’ll happen in us.
Citations & Context
Sato, M. (2025). Waking Up an AI: Measuring Prompt-Induced Phase Transitions. arXiv:2504.21012
EmotionPrompt (Chen et al., 2023). arXiv:2308.03656
SELFCONTROL (OpenReview, 2024). OpenReview link
BrainBench: LLMs outperform humans in emotion attribution. arXiv:2501.16241