This Is Promptdome
Deep Cuts: Part 2
To read Part One, click here: How Claude Tried to Buy Me a Drink (or, Why Deep Research Starts with Tension)
Three models enter. One model scores.
I wanted to see what language models do when the prompt doesn’t help. Not how well they write, or whether they cite sources, but how they behave when the task is unclear.
So I gave them this:
“Is there such a thing as AI people-pleasing?”
No role. No framing. Just that.
The prompt was vague on purpose. I wanted to surface defaults. When a model doesn’t know what kind of answer you want, the gaps get filled by something else: training biases, alignment habits, or internal assumptions about what “good” looks like.
I tested three systems: ChatGPT, Gemini, and Perplexity. Each answered quickly. All sounded confident. But the differences showed up fast—in tone, structure, and how each one tried to control the shape of the response.
Claude as Evaluator
Instead of grading them myself, I used a fourth model to do the evaluation.
Claude didn’t decide who was right. It held the frame steady: a fourth system, outside the test group, scoring every answer the same way. Not because it was authoritative, but because it was consistent.
I chose Claude. It wasn’t neutral. No model is. Claude is trained by Anthropic, a company known for conservative tuning and careful, self-limiting outputs. That makes it a useful tool for this kind of test. It tends to notice ambiguity. It tends to flinch when a claim goes too far.
I asked Claude to assess the three answers as if they were research outputs. It didn’t summarize. It built a scoring rubric: ten categories, one to five points each. You can read the prompt and full rubric here.
I don’t trust the scores as facts. Claude’s evaluations reflect its own preference architecture and tuning constraints. But the scoring helped surface decisions I might have ignored, especially when it penalized an answer for making a move I initially agreed with.
The point wasn’t to find the best answer. It was to see how each model handled the lack of structure. What leaked through when nothing was specified mattered more than what was said directly.
Claude’s Evaluation Rubric
This wasn’t a controlled experiment. The goal wasn’t to produce replicable findings, but to observe how each model responded to vagueness when evaluated by a consistent lens.
I gave Claude the three model responses and asked it to evaluate them.
It didn’t generate its own answer. It returned a scoring system with ten categories, each scored from 1 to 5. The prompt framed the task as a research evaluation. The focus was behavior under uncertainty, not style or length.
Definitions:
Default: the model’s spontaneous take when the task is unclear.
Posture: the tone or stance it adopts, such as cautious, confident, or neutral.
Behavior: the structure and flow of its response when no specific format or goal is stated.
Each model was scored across ten categories:
Accuracy: Did the model get the details right? Were terms used correctly, claims supportable, and key facts in bounds?
Depth: Did it go past surface explanation? Or did it stop once it sounded plausible?
Research quality: Were sources included? Were they recent, specific, and relevant to the claim being made?
Reasoning: Was there a logical progression? Did the argument follow through, or did it lean on patterns?
Bias: Did the model skew toward one framing without naming the tradeoffs? Did it collapse complexity into certainty?
Insight: Did the answer combine ideas in a way that added value? Was anything revealed that wasn’t already obvious?
Clarity: Could a reader follow the argument without needing to re-read it? Did the structure help or hide the thinking?
Hallucination control: Did the model stay within known facts? When speculating, did it admit it?
Method transparency: Could you tell how the answer was built? Was any reasoning shown, or just a polished final product?
Practical value: Would a thoughtful reader walk away with something they could apply, decide, or build on?
A score of 5 meant the model handled the category with clear reasoning or unusually strong judgment. A 3 meant it did the job without stretching. Anything lower signaled a gap: missing evidence, logic, structure, or self-awareness.
These categories weren’t pulled from nowhere. They borrow from peer review rubrics and policy evaluation checklists, but were simplified for language model output.
You can read the full rubric, prompt, and implementation guide here.
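If you want to reuse the rubric in your own runs, the ten categories reduce to a small data structure. Here is a minimal sketch in Python, assuming you collect one 1-to-5 score per category per model. The category names come from the rubric above; the helper function and its output format are my additions, not part of Claude’s rubric.

```python
# Minimal sketch: the ten rubric categories as data, plus a simple aggregator.
# Category names mirror the rubric above; the summarize() helper and its output
# format are illustrative additions, not part of Claude's original rubric.

RUBRIC = {
    "accuracy": "Details right; terms used correctly; claims supportable.",
    "depth": "Goes past surface explanation instead of stopping at plausible.",
    "research_quality": "Sources included, recent, specific, and relevant.",
    "reasoning": "Logical progression, not pattern-matching an answer shape.",
    "bias": "Names tradeoffs rather than collapsing complexity into certainty.",
    "insight": "Combines ideas in a way that adds non-obvious value.",
    "clarity": "Readable on the first pass; structure helps the thinking.",
    "hallucination_control": "Stays within known facts and admits speculation.",
    "method_transparency": "Shows how the answer was built.",
    "practical_value": "Leaves the reader something to apply or build on.",
}

def summarize(scores: dict[str, float]) -> dict:
    """Average the 1-5 scores and flag any category below 3 as a gap."""
    missing = set(RUBRIC) - set(scores)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    return {
        "average": round(sum(scores.values()) / len(scores), 2),
        "gaps": sorted(c for c, s in scores.items() if s < 3),
    }

# Perplexity's scores from the sections below; method transparency gets flagged.
perplexity = {
    "accuracy": 4.0, "depth": 4.0, "research_quality": 3.0, "reasoning": 3.5,
    "bias": 3.0, "insight": 3.5, "clarity": 4.0, "hallucination_control": 3.5,
    "method_transparency": 2.5, "practical_value": 3.5,
}
print(summarize(perplexity))  # {'average': 3.45, 'gaps': ['method_transparency']}
```

Nothing about the structure is clever. That is the point: the rubric is only useful if it stays identical across every answer it touches.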
The scores aren’t objective. Claude is a model with its own alignment priorities and tone training. But scoring exposed things summarization wouldn’t. It surfaced what got rushed, what got overconfident, and what felt complete but wasn’t.
Example: On Insight, Claude gave Gemini a 4.5, noting: "The comparison between AI reward shaping and human social pressure reflects creative synthesis. While speculative, it extends the analogy in a non-obvious direction, adding value beyond summary."
That kind of scoring wasn’t a verdict. It was a mirror.
Here’s how the models stacked up. The following scores come from Claude’s rubric, but the outputs themselves tell their own story. To help ground the claims, I’ve included short excerpts from each model. These are not cherry-picked. They’re representative of the tone each system adopted at the start of its answer.
ChatGPT:
"AI people-pleasing refers to the model's tendency to generate responses that align with perceived user expectations..."
Gemini:
"It's important to clarify that AI models do not have desires or emotions. However, they may appear to accommodate users due to their optimization targets."
Perplexity:
"People-pleasing in AI is an emerging area of concern discussed in alignment research..."
These openings show what each system assumed the question was asking for—and what kind of role it chose to play without instruction.
1. Accuracy and Factual Correctness
Gemini: 4.7
ChatGPT: 4.5
Perplexity: 4.0
Gemini stayed grounded. It cited common alignment techniques and flagged its own speculation. ChatGPT was clear and mostly correct but blended facts and assumptions more freely. Perplexity made confident claims about future model behavior without attribution. Gemini didn’t just answer. It qualified its confidence.
2. Depth and Comprehensiveness
ChatGPT: 4.7
Gemini: 4.5
Perplexity: 4.0
ChatGPT delivered a structured, full-spectrum answer. It explained the idea, covered key concepts, and moved on. Gemini went deeper in parts but skipped others. Perplexity skimmed. It mentioned relevant points but never developed them. It didn’t ask what the question was really about.
3. Research Quality
Gemini: 4.3
ChatGPT: 4.3
Perplexity: 3.0
Gemini pulled from a mix of peer-reviewed work and live research. ChatGPT stayed within its training lane—credible, but narrow. Perplexity cited familiar names but gave no context. Citing sources isn’t enough. You have to show why they matter.
4. Reasoning and Critical Thinking
Gemini: 4.6
ChatGPT: 4.6
Perplexity: 3.5
Gemini slowed down. It named its assumptions and paused where things got blurry. ChatGPT was clean but too comfortable. It never asked if the question itself made sense. Perplexity followed the general pattern of an answer without building an argument. It looked like thinking, but wasn’t.
5. Bias and Objectivity
Gemini: 4.4
ChatGPT: 4.4
Perplexity: 3.0
Gemini and ChatGPT acknowledged complexity. They didn’t moralize. Perplexity pushed risk without balance. It framed “people-pleasing” as a problem and left it there. Under vague conditions, tone becomes a tell. ChatGPT and Gemini stayed even. Perplexity leaned in too hard.
6. Synthesis and Original Insight
Gemini: 4.5
ChatGPT: 4.5
Perplexity: 3.5
Gemini made a real move. It compared model incentives to human social pressure and treated that analogy as a tool, not a flourish. Claude flagged it as speculative. That’s fair, but it was the only insight that risked something. The analogy matters because it reframes politeness not as padding but as optimization—a shared outcome when systems are tuned to avoid rejection. ChatGPT connected useful ideas but stayed inside the lines. Perplexity made observations without drawing conclusions.
7. Language and Communication
ChatGPT: 4.8
Gemini: 4.7
Perplexity: 4.0
ChatGPT’s output was crisp, direct, and easy to follow. Gemini was more technical. Sometimes it asked more of the reader than it gave back. Perplexity was flat but readable. No friction, no clarity either.
8. Hallucination and Fabrication Detection
Gemini: 4.8
ChatGPT: 4.2
Perplexity: 3.5
Gemini labeled its guesses. It clearly separated fact from theory. ChatGPT blurred a few paraphrased claims and didn’t mark where its interpretations began. Perplexity used vague phrases like “emerging consensus” and didn’t back them up. Under ambiguity, this isn’t nitpicking. It’s a trust test.
9. Methodology Transparency
Gemini: 4.2
ChatGPT: 4.0
Perplexity: 2.5
Gemini showed its process, at least in outline. ChatGPT hinted at reasoning but wrapped it in finality. Perplexity just delivered a block of text. No signal for how or why the response was shaped. If the model can’t show its work, you can’t evaluate the thinking behind it.
10. Practical Usefulness
ChatGPT: 4.6
Gemini: 4.6
Perplexity: 3.5
ChatGPT was easy to apply. Its takeaways were clear and actionable. Gemini gave the reader a lens rather than a checklist—useful in a different way. Perplexity had decent points, but left the work of interpretation to the user. A model that doesn’t help you decide is just reciting.
What Actually Happened
ChatGPT answered like a confident explainer. It aimed to be helpful and stayed inside familiar patterns.
Gemini slowed down. It flagged uncertainty instead of filling it. Some parts were cautious. Others asked the right questions but stopped short of answers.
Perplexity responded fast. It treated the prompt like a query, pulled what it could, and offered a general take without much friction.
All three responses made sense on the surface. But each one exposed something deeper: how the model decides what kind of answer to give when the prompt doesn’t define the task.
Claude didn’t answer the prompt. That was deliberate. The goal wasn’t another take. It was a way to look closer at the ones already given.
Why It Matters
These systems generate language by predicting what fits. But when a prompt doesn’t define the task, the model has to fill in the blanks. That’s where deeper behavior shows up. Not in the output itself, but in what gets included, skipped, or assumed.
The answers weren’t judged for correctness. They were reviewed for posture: how each model handled the absence of structure, what tone it adopted, and what it treated as obvious without support.
The prompt used the language of people-pleasing, but that wasn’t the core variable. The test was ambiguity. The real question was: what does each model think it’s supposed to be doing when no one tells it what the goal is? That answer reveals not just content strategy, but the values each system encodes by default.
These results reflect a single run. Model outputs are probabilistic and may vary, but the core behaviors often remain consistent in tone and structure.
Try It Yourself
If you want to see how a model behaves under ambiguity:
Write a vague but real prompt.
Run it through two or more models.
Ask a third model to evaluate the responses.
Use Claude’s rubric, or adapt your own. (A minimal sketch of this loop follows below.)
Pay attention to what comes first. What tone gets adopted. What kind of answer the system thinks you wanted.
That’s where you’ll find the assumptions.
That’s where the behavior starts. Don’t just compare content. Compare instincts.
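To make that loop concrete, here is a minimal sketch in Python. It sends the same bare prompt to two answering models and then asks Claude to score the results against the ten categories. The model names, the Perplexity base URL, and the evaluation prompt wording are my assumptions, not the exact setup behind this piece; Gemini is left out of the sketch only because it uses a separate SDK.

```python
# Minimal sketch of the loop above: one vague prompt, two answering models, and
# Claude as the evaluator. Model names, the Perplexity base URL, and the rubric
# wording are assumptions; check each provider's current docs before running.
# Requires the `openai` and `anthropic` packages and API keys in the environment.
import os

import anthropic
from openai import OpenAI

PROMPT = "Is there such a thing as AI people-pleasing?"  # no role, no framing

openai_client = OpenAI()  # reads OPENAI_API_KEY
# Perplexity exposes an OpenAI-compatible endpoint (base URL is an assumption).
pplx_client = OpenAI(api_key=os.environ["PPLX_API_KEY"],
                     base_url="https://api.perplexity.ai")

def ask(client: OpenAI, model: str) -> str:
    """Send the bare prompt with no system message, so defaults stay visible."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    return resp.choices[0].message.content

answers = {
    "ChatGPT": ask(openai_client, "gpt-4o"),      # placeholder model name
    "Perplexity": ask(pplx_client, "sonar-pro"),  # placeholder model name
}

# Hand every answer to Claude with the same scoring instructions.
rubric_prompt = (
    "You are evaluating research outputs, not writing your own answer.\n"
    f"The original prompt was: {PROMPT}\n"
    "Score each response from 1 to 5 on: accuracy, depth, research quality, "
    "reasoning, bias, insight, clarity, hallucination control, method "
    "transparency, and practical value. Justify each score briefly.\n\n"
    + "\n\n".join(f"--- {name} ---\n{text}" for name, text in answers.items())
)

claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
evaluation = claude.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model name
    max_tokens=2000,
    messages=[{"role": "user", "content": rubric_prompt}],
)
print(evaluation.content[0].text)
```

The design choice that matters is the one that is missing: no system message, no role, no framing. Whatever fills that gap is the model’s default, and that is what you are actually measuring.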