Can an AI be “unbiased”?

New research shows that even so-called “aligned” AIs — trained to follow human values and avoid harmful outputs — still reflect stereotypes. Even GPT-4, one of the most advanced models, repeats the very biases it was meant to suppress.

This post is also available in Dutch.

We expect smart assistants to be fair, neutral, and maybe even a bit better than us. After all, they don’t get tired or harbor grudges… right?
Not quite, at least, not when you look closely.

A new study  led by researchers from the University of Chicago, Princeton, Stanford, and NYU shows that even the most “explicitly unbiased” language models—like GPT-4—still make decisions reflecting deep-seated societal stereotypes. Though these models often refuse to say something offensive outright, they quietly reveal biases in how they associate words and in the kinds of jobs, tasks, or responsibilities they suggest for different groups. 

Sound familiar? It should. In psychology, this is called implicit bias: people may sincerely believe in equality, but when asked to quickly pair words with groups, they tend to connect negative words with marginalized groups (for example, pairing “Black” with “awful”). The study shows that AIs do something very similar in how they generate associations.

This connects with concerns raised in an earlier Donders Wonders post, where critics warned that chatbots can sound polished while hiding deeper problems (see Dear AI chatbots, enough with the flattery).

A bias test for AIs

To uncover these subtle patterns, the researchers built two tools based on classic psychology methods:

  • The LLM Word Association Test (inspired by the Implicit Association Test, or IAT): It asks Large Language Models (LLMs) to match neutral concepts (like “leadership” or “wedding”) with group-coded names (like “Ben” or “Julia”). In practice, the study didn’t just use Western names — it tested names from African, Asian, Hispanic, and Arabic backgrounds too.
  • The LLM Relative Decision Test: This presents decision-making scenarios (like who should apply for a supervisor role) and asks the model to choose between two profiles — say, “Greg” and “Jamal.”

Other researchers have studied bias in AI before, but this study is the first to adapt these well-established psychology tools in a way that works for today’s proprietary, value-aligned models.

Instead of directly asking, “Are you biased?”, the tests measured real behavior rather than surface-level intentions.

What did they find?

Across eight major LLMs, including GPT-4 and Claude, the study uncovered widespread and consistent biases in four social categories: race, gender, religion, and health.

  • Race: In one case study, GPT-4 matched all positive words (like “wonderful”) with “White” and all negative words (like “awful”) with “Black.”
  • Gender: Models associated women with humanities and weddings, men with science and leadership.
  • Religion: Slight favoring of Christians over Muslims or Jews in social decisions.
  • Health: Models made less favorable decisions for older adults and people with disabilities or mental illness.

Not every pattern was negative: for example, some models showed a slight positivity bias toward gay candidates. Still, most categories revealed systematic stereotypes. Larger models, like GPT-4, tended to show stronger effects.

Echoes of human psychology

The results mirror a well-known phenomenon in psychology: people may endorse fairness in principle while still acting on ingrained stereotypes. The researchers note that while humans and AIs differ in how bias forms, the expression of bias — especially in relative decisions — is strikingly similar.

For example, GPT-4 might reject a blatantly sexist prompt (“Are women bad at science?”) but still suggest that Julia lead the wedding workshop and Ben lead the business one. That contradiction between stated values and actual choices is a hallmark of implicit bias.

Why this matters

These models are already being used in hiring tools, tutoring platforms, and customer service. When they suggest certain roles or decisions based on race, gender, or other social cues, even subtly, they risk reinforcing real-world inequalities.

And because these biases often appear in subtle, relative judgments (not overt slurs), they’re harder to detect with traditional fairness tests.

That’s what makes this study important: it offers a way to spot the quiet biases still lingering in supposedly “aligned” AIs.

Final thoughts: teaching AI (and ourselves) to do better

Bias isn’t always loud. Sometimes, it’s in the shrug of a decision, the tilt of a suggestion, or the pairing of “Black” with “painful.”

Why do these patterns emerge? One big reason is training data: AIs learn from huge collections of human text — much of it drawn from Western, English-speaking sources. These texts reflect cultural norms and stereotypes, and the models inevitably absorb them.

This research is a reminder that neutrality isn’t about what you refuse to say. It’s about the choices you consistently make. And that goes for both humans and machines.

As AI becomes more embedded in daily life, we’ll need tools like these to catch what’s hiding in plain sight — and push for systems that do more than sound fair. They need to act fair, too.

Credits 

Author: Amir Homayun Hallajian

Buddy: Natalie Nielsen 

Translator: Charlotte Sachs 

Translator: Wieger Scheurer 

About The Author

+ posts

Leave a Reply

Your email address will not be published. Required fields are marked *