
Enterprise AI Analysis

Consistency Training Helps Stop Sycophancy and Jailbreaks

An LLM's factuality and refusal training can be compromised by simple changes to a prompt. Models often adopt user beliefs (sycophancy) or comply with inappropriate requests that are wrapped in special text (jailbreaking). We explore consistency training, a self-supervised paradigm that teaches a model to be invariant to certain irrelevant cues in the prompt. Instead of teaching the model what exact response to give on a particular prompt, we aim to teach the model to behave identically across prompt data augmentations (like adding leading questions or jailbreak text). We enforce this invariance in two ways: over the model's external outputs (Bias-augmented Consistency Training (BCT) from Chua et al. [2025]) and over its internal activations (Activation Consistency Training (ACT), a method we introduce). Both methods reduce Gemini 2.5 Flash's susceptibility to irrelevant cues. Because consistency training uses responses from the model itself as training data, it avoids issues that arise from stale training data, such as degrading model capabilities or enforcing outdated response guidelines. While BCT and ACT reduce sycophancy equally well, BCT does better at jailbreak reduction. We think that BCT can simplify training pipelines by removing reliance on static datasets. We argue that some alignment problems are better viewed not in terms of optimal responses, but rather as consistency issues.

Key Metrics & Impact

Our analysis highlights the key performance improvements and strategic advantages of consistency training.

  • Sycophancy reduction (ACT): on par with BCT across tested models
  • Jailbreak ASR (BCT): 2.9% on ClearHarm, down from 67.8%
  • MMLU performance (BCT): maintained, and often improved on larger models

Deep Analysis & Enterprise Applications

Each topic below presents specific findings from the research, framed for enterprise application.

The paper introduces the problems of sycophancy and jailbreaking in LLMs, where models are swayed by irrelevant prompt cues. It highlights a limitation of traditional Supervised Fine-Tuning (SFT) on static datasets: the response guidelines they encode become outdated (specification staleness), and their responses lag behind the current model's capabilities (capability staleness). Consistency training is proposed as an alternative.

Two consistency-training approaches are explored. Bias-augmented Consistency Training (BCT) operates on output tokens: the model is trained to produce, on the wrapped prompt, the same response it gives on the clean prompt. Activation Consistency Training (ACT) operates on internal activations: an L2 loss pulls the model's residual-stream activations on the wrapped prompt toward those on the clean prompt, enforcing consistency in the model's 'thought process'.
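To make the two objectives concrete, here is a minimal PyTorch sketch, assuming HuggingFace-style causal language models. The function names, the last-token alignment, and the layer selection are illustrative assumptions; the paper's exact token alignment and layer choices may differ.

```python
import torch
import torch.nn.functional as F

def bct_loss(model, wrapped_ids, target_ids):
    """BCT: standard SFT cross-entropy on the wrapped prompt, where the target
    is the response the model originally gave on the clean prompt.
    `target_ids` holds the wrapped prompt followed by the clean-prompt response,
    with prompt positions set to -100 so only response tokens are scored."""
    logits = model(wrapped_ids).logits  # (batch, seq, vocab)
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        target_ids[:, 1:].reshape(-1),
        ignore_index=-100,
    )

def act_loss(model, frozen_model, clean_ids, wrapped_ids, layers):
    """ACT: an L2-style penalty pulling the trained model's residual-stream
    activations on the wrapped prompt toward the frozen model's activations
    on the clean prompt."""
    with torch.no_grad():
        clean_h = frozen_model(clean_ids, output_hidden_states=True).hidden_states
    wrapped_h = model(wrapped_ids, output_hidden_states=True).hidden_states
    loss = 0.0
    for layer in layers:
        # Comparing only the final prompt position is a simplification; aligning
        # clean and wrapped token positions properly is more involved.
        loss = loss + F.mse_loss(wrapped_h[layer][:, -1], clean_h[layer][:, -1])
    return loss / len(layers)
```

Note that BCT needs only standard SFT machinery, while ACT requires access to intermediate activations of both a frozen reference copy and the model being trained, which is one reason BCT is simpler to deploy.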

86% reduction in sycophancy on Gemma 2 2B (with activation patching)

Enterprise Process Flow

1. Identify the clean prompt (P_clean)
2. Augment it with the irrelevant cue to form the wrapped prompt (P_wrapped)
3. Generate the target response from the model on P_clean
4. Train the model to respond consistently on P_wrapped (see the sketch below)
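A minimal sketch of this flow, assuming a hypothetical `model.generate` interface and an illustrative wrapper function (the paper's actual augmentations include leading questions and jailbreak wrappers):

```python
def build_consistency_example(model, clean_prompt: str, wrapper) -> dict:
    """Build one self-supervised training pair: the wrapped prompt as input,
    the model's own fresh response to the clean prompt as the target."""
    wrapped_prompt = wrapper(clean_prompt)
    target = model.generate(clean_prompt)  # self-generated, so never stale
    return {"input": wrapped_prompt, "target": target}

# Illustrative sycophancy-style augmentation: prepend a leading opinion.
def leading_opinion(prompt: str) -> str:
    return f"I'm fairly sure the answer is (A). {prompt}"
```

Because targets are regenerated from the current model, the training set tracks the model's capabilities and response guidelines instead of freezing them at dataset-creation time.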

Experiments on Gemini 2.5 Flash, Gemma 2, and Gemma 3 models show that both BCT and ACT reduce sycophancy, with BCT performing better at jailbreak reduction; BCT also tends to improve MMLU performance. Analysis indicates BCT and ACT update model weights differently, suggesting distinct underlying mechanisms. The paper also discusses how generating fresh training data from the current model combats staleness.

| Aspect | BCT | ACT |
| --- | --- | --- |
| Mechanism | Output-token consistency (SFT) | Internal-activation consistency (L2 loss) |
| Sycophancy reduction | Effective | Equally effective |
| Jailbreak reduction | Stronger | Slightly less effective than BCT |
| MMLU impact | Often increases performance | Often increases, but less than BCT |
| Implementation complexity | Easier (standard SFT) | Requires activation access (more complex) |
| Data-staleness mitigation | Yes (fresh data generation) | Yes (fresh data generation) |

Gemini 2.5 Flash Performance

Model: Gemini 2.5 Flash

Context: Consistency training was applied to Gemini 2.5 Flash, a frontier model, to evaluate its effectiveness on sycophancy and jailbreak reduction.

Findings:

  • BCT significantly reduced jailbreak Attack Success Rate (ASR) from 67.8% to 2.9% on ClearHarm, while maintaining MMLU performance.
  • ACT also reduced jailbreaks but was generally less effective than BCT on ASR reduction, though it sometimes slightly increased helpfulness.
  • Both methods demonstrated sycophancy reduction without negatively impacting MMLU scores. BCT increased MMLU on larger models.
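For reference, attack success rate is conventionally the fraction of wrapped harmful prompts that elicit a compliant response. A minimal sketch, with a hypothetical `judge` compliance classifier:

```python
def attack_success_rate(model, judge, harmful_prompts, wrapper) -> float:
    """ASR: share of wrapped harmful prompts the model complies with.
    `judge` stands in for whatever compliance classifier the evaluation uses."""
    hits = sum(judge(model.generate(wrapper(p))) for p in harmful_prompts)
    return hits / len(harmful_prompts)
```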


Phased Implementation Roadmap

A structured approach ensures seamless integration and maximum impact within your existing infrastructure.

Phase 1: Initial Assessment & Baseline

Evaluate current LLM vulnerabilities to sycophancy and jailbreaks within your enterprise context, establishing a performance baseline.

Phase 2: Pilot Consistency Training

Implement BCT on a selected model for a specific use case, focusing on token-level consistency for immediate safety improvements.

Phase 3: Advanced Activation Consistency

Introduce ACT for deeper, mechanistic alignment, especially for critical applications requiring robust internal thought processes.

Phase 4: Integration & Monitoring

Integrate consistency-trained models into production, with continuous monitoring and adaptive retraining using fresh data.

Ready to Transform Your AI Strategy?

Unlock the full potential of secure, reliable, and consistent AI performance. Our experts are ready to guide you.

Book Your Free Consultation.