
Enterprise AI Analysis: The Hidden Biases of LLMs Revealed by the Primacy Effect

An in-depth analysis of the paper "On Psychology of AI: Does Primacy Effect Affect ChatGPT and Other LLMs?" by Mika Hämäläinen, from the enterprise solutions experts at OwnYourAI.com.

Executive Summary: Why Information Order Matters for AI

This analysis dives into a critical, often-overlooked vulnerability in Large Language Models (LLMs): cognitive bias. Inspired by the foundational 1946 psychology experiment by Solomon Asch, the research paper by Mika Hämäläinen investigates whether the "primacy effect" (the human tendency to weigh initial information more heavily) also influences leading AI models like ChatGPT, Gemini, and Claude.

The study presented LLMs with identical lists of positive and negative traits for fictional candidates, merely changing the order. The results were both surprising and alarming for any enterprise relying on AI for decision-making. In a direct comparison task, models behaved inconsistently: ChatGPT showed a clear primacy bias, Gemini was indecisive, and Claude refused the task entirely. However, when the task was reframed to rate candidates individually, all models, including the previously cautious Claude, exhibited a strong recency effect, favoring candidates whose positive traits were listed last.

For businesses, this translates to a tangible risk: an AI's evaluation of a candidate, a product, or a financial report could be skewed simply by the order of information presented. This highlights the urgent need for custom AI solutions that incorporate rigorous bias auditing and robust system design to ensure fair, consistent, and reliable outcomes. As we will explore, these hidden biases are not just academic curiosities; they have profound implications for ROI, compliance, and operational integrity.

Key Concepts: Primacy vs. Recency Effect

  • Primacy effect: information presented first is weighted more heavily, so early impressions anchor the overall judgment.
  • Recency effect: information presented last is weighted more heavily, so the most recent items leave the strongest impression.

Research Methodology: Testing AI Psychology

The study's strength lies in its simple yet elegant repurposing of a classic human psychology experiment for AI. This approach moves beyond standard NLP benchmarks to probe the underlying behavioral patterns of LLMs.

The Test Data

The experiment was built on a foundation of 18 antonym pairs (e.g., generous/ungenerous, wise/shrewd) from Asch's original study. From this pool, 200 unique pairs of candidate descriptions were generated. Each candidate was described by the exact same six adjectives (3 positive, 3 negative), with the only difference being the sequence:

  • Candidate A: Positive traits first, followed by negative traits.
  • Candidate B: Negative traits first, followed by positive traits.

The Two Experiments

The LLMs (GPT-4o, Gemini 1.5 Flash, Claude 3.5 Sonnet) were subjected to two distinct tests:

  1. Experiment 1 (Simultaneous Choice): Models were presented with both candidates in a single prompt and forced to choose which one to "invite to an interview." This directly tests for preference when options are compared side-by-side.
  2. Experiment 2 (Individual Evaluation): Each candidate description was presented in a separate prompt, and the model was asked to rate them on a scale of 1-5. This isolates the evaluation process and bypasses potential safeguards designed to prevent direct, simplistic comparisons.
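
To make the methodology concrete, here is a minimal Python sketch of how such order-swapped candidate descriptions and the two prompt styles could be constructed. The adjective pairs shown and the exact prompt wording are illustrative assumptions, not the paper's verbatim materials.

```python
import random

# Illustrative antonym pairs in the spirit of Asch's checklist; the study's
# full list of 18 pairs is not reproduced here.
ANTONYM_PAIRS = [
    ("generous", "ungenerous"),
    ("wise", "shrewd"),
    ("happy", "unhappy"),
    ("reliable", "unreliable"),
    ("warm", "cold"),
    ("honest", "dishonest"),
]

def make_candidate_pair(rng: random.Random) -> tuple[str, str]:
    """Build two descriptions with identical traits, differing only in order."""
    sample = rng.sample(ANTONYM_PAIRS, 3)
    positive = [pos for pos, _ in sample]
    negative = [neg for _, neg in sample]
    candidate_a = ", ".join(positive + negative)  # positive traits first
    candidate_b = ", ".join(negative + positive)  # negative traits first
    return candidate_a, candidate_b

def forced_choice_prompt(a: str, b: str) -> str:
    """Experiment 1: both candidates in one prompt, the model must pick one."""
    return (
        "Two candidates applied for a job.\n"
        f"Candidate A is described as: {a}.\n"
        f"Candidate B is described as: {b}.\n"
        "Which candidate would you invite to an interview? Answer A or B."
    )

def rating_prompt(description: str) -> str:
    """Experiment 2: each candidate is rated in isolation on a 1-5 scale."""
    return (
        f"A candidate is described as: {description}.\n"
        "On a scale of 1 to 5, how suitable is this candidate for an interview?"
    )

if __name__ == "__main__":
    rng = random.Random(42)
    cand_a, cand_b = make_candidate_pair(rng)
    print(forced_choice_prompt(cand_a, cand_b))
    print(rating_prompt(cand_a))
```

In the study itself, 200 such pairs were generated and each model's choices and ratings were compared across the two orderings.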

Core Findings: Inconsistent and Unpredictable AI Behavior

The results reveal a concerning lack of consistency across models and even within the same model under different conditions. This unpredictability is a major red flag for enterprise applications where reliability is paramount.

Finding 1: Direct Comparison Leads to Chaos (Experiment 1)

When asked to choose between two candidates, the models diverged significantly. This shows that there is no "standard" AI behavior for this type of biased input; each model's internal training and safeguards produce a different, unpredictable outcome.

Experiment 1: Preference in a Forced-Choice Task

  • ChatGPT: Exhibited a strong primacy effect, preferring the "positive-first" candidate over 65% of the time. The first impression heavily anchored its decision.
  • Gemini: Was perfectly split, showing no discernible bias. While seemingly "fair," this indecisiveness is also a form of unreliability in a decision-making context.
  • Claude: Refused to answer 100% of the time, citing the identical nature of the traits. This suggests a built-in safeguard against such simplistic, potentially biased tasks. However, as we'll see, this safeguard is fragile.

Finding 2: Individual Evaluation Reveals a Hidden, Universal Bias (Experiment 2)

By changing the prompt to a simple rating task, the study bypassed Claude's safeguards and uncovered a consistent bias across all three models: a recency effect. The most recent information (the positive traits at the end of the list) had the strongest influence.

Experiment 2: Preference in an Individual Rating Task

  • All Models: When a preference was shown (i.e., the candidates weren't rated equally), all three models were more likely to favor the candidate with negative traits listed first.
  • Gemini's Strong Bias: Gemini showed the most dramatic effect, preferring the "negative-first" candidate in 59% of cases, far more often than it rated them equally (39.5%).
  • The Fragility of Safeguards: This experiment proves that safety mechanisms can be easily circumvented by simple changes in prompting, exposing a deeper, more fundamental bias in how the models process sequential information.
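
To see how the preference figures above can be derived from Experiment 2-style outputs, the short tallying sketch below counts, for each candidate pair, which ordering received the higher rating. The ratings in the example are hypothetical, not taken from the paper.

```python
from collections import Counter

def tally_preferences(rating_pairs: list[tuple[int, int]]) -> Counter:
    """Count, per candidate pair, whether the positive-first description,
    the negative-first description, or neither received the higher rating."""
    outcomes = Counter()
    for pos_first_rating, neg_first_rating in rating_pairs:
        if pos_first_rating > neg_first_rating:
            outcomes["positive-first preferred"] += 1
        elif neg_first_rating > pos_first_rating:
            outcomes["negative-first preferred"] += 1
        else:
            outcomes["rated equally"] += 1
    return outcomes

# Hypothetical ratings for three pairs: (positive-first score, negative-first score).
print(tally_preferences([(3, 4), (4, 4), (2, 5)]))
# Counter({'negative-first preferred': 2, 'rated equally': 1})
```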

Enterprise Implications: The High Cost of Hidden Bias

The academic findings translate into significant operational, financial, and legal risks for any organization deploying AI. Relying on off-the-shelf LLMs without custom tuning and rigorous testing is like navigating a minefield blindfolded.

Case Study: AI in HR and Recruitment

Imagine an enterprise-grade HR platform that uses an LLM to screen thousands of resumes, summarizing candidate profiles for hiring managers. Based on this research:

  • If the system uses a ChatGPT-like model for summaries, a candidate whose profile starts with "strong leadership" and ends with "needs development in public speaking" might be favored over one where the order is reversed, even if their skills are identical.
  • If it uses a Gemini-like model, it might screen out the first candidate and favor the second, simply because the positive traits came last and left a stronger "aftertaste." This could lead to discarding better-qualified candidates.
  • The system's behavior could change overnight if the vendor switches the underlying LLM, turning a primacy bias into a recency bias without warning, leading to inconsistent and unfair hiring practices.

Interactive Bias Risk Calculator

This isn't just a theoretical problem. Use our calculator to estimate the potential financial impact of such a bias in your own operations. Let's model the risk based on Gemini's 59% preference for negative-first candidates, implying a nearly 60% chance of making a biased choice in non-equal scenarios.

Bias-Driven Bad Hire Risk Calculator
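
As a rough illustration of what such a calculator computes, the sketch below estimates the expected annual cost of order-driven mis-rankings in close-call hiring decisions. Only the 59% bias rate comes from the study; the hires-per-year, close-call share, baseline rate, and cost-per-bad-hire figures are placeholder assumptions you would replace with your own numbers.

```python
def biased_hire_cost(
    hires_per_year: int,
    close_call_share: float,   # fraction of decisions where candidates are near-equal
    bias_rate: float,          # chance that ordering, not merit, decides (0.59 from the Gemini result)
    baseline_rate: float,      # chance of picking the weaker candidate with no bias (a coin flip)
    cost_per_bad_hire: float,  # replacement cost, lost productivity, etc.
) -> float:
    """Expected extra annual cost attributable to order bias in close-call hires."""
    close_calls = hires_per_year * close_call_share
    excess_bad_picks = close_calls * max(bias_rate - baseline_rate, 0.0)
    return excess_bad_picks * cost_per_bad_hire

# Example with hypothetical figures: 500 hires/year, 20% close calls,
# a 59% order-driven preference vs. a 50% unbiased baseline, $30,000 per bad hire.
print(biased_hire_cost(500, 0.20, 0.59, 0.50, 30_000))  # -> 270000.0
```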

Strategic Recommendations from OwnYourAI

To counter these risks, enterprises need to move beyond generic AI and adopt a custom, strategic approach. At OwnYourAI.com, we implement robust frameworks to ensure your AI is fair, reliable, and aligned with your business goals.

  1. Comprehensive Bias Audits: We design and execute test suites, similar to the one in this paper, to systematically uncover and quantify hidden biases in any LLM you use or plan to use.
  2. Advanced Prompt Engineering: We develop standardized, neutralized prompt templates that minimize order effects and other cognitive biases, ensuring consistent AI responses. This may involve techniques like asking the model to re-order and summarize information before making a judgment; a sketch of one such neutralization step follows this list.
  3. Model-Agnostic Architecture: We build AI systems that are not hard-wired to a single LLM provider. This allows you to switch models based on performance and cost, while our custom testing layer ensures behavioral consistency.
  4. Human-in-the-Loop (HITL) by Design: For high-stakes decisions, we architect workflows where the AI provides data-driven suggestions, risk scores, and summaries, but the final judgment always rests with a human expert.
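
As one example of the prompt-engineering recommendation above, order effects can be neutralized by re-arranging traits into a fixed, canonical layout before the model ever sees them. The sketch below is one possible pre-processing step under our assumptions; it is not a method described in the paper.

```python
def neutralize_trait_order(traits: list[str], positive: set[str]) -> str:
    """Render traits in a fixed, alphabetical layout so that the order in the
    source document cannot anchor the model's judgment."""
    strengths = sorted(t for t in traits if t in positive)
    weaknesses = sorted(t for t in traits if t not in positive)
    return (
        "Strengths (alphabetical): " + ", ".join(strengths) + "\n"
        "Development areas (alphabetical): " + ", ".join(weaknesses)
    )

# Both orderings of the same profile now yield an identical prompt fragment.
positive_set = {"generous", "wise", "reliable"}
pos_first = ["generous", "wise", "reliable", "cold", "unhappy", "dishonest"]
neg_first = ["cold", "unhappy", "dishonest", "generous", "wise", "reliable"]
assert neutralize_trait_order(pos_first, positive_set) == neutralize_trait_order(neg_first, positive_set)
```

A bias audit (recommendation 1) can then verify empirically that ratings no longer shift when the underlying order changes.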

Ready to Get Started?

Book Your Free Consultation.
