AI RESEARCH ANALYSIS
From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training
This paper introduces 'safe-completions', a novel safety-training approach for Large Language Models (LLMs) that shifts focus from binary refusal boundaries (input-centric) to the safety and helpfulness of the model's output. Unlike traditional refusal training, which can be brittle on dual-use prompts, safe-completions aim to maximize usefulness while strictly adhering to safety policies. The method penalizes unsafe outputs based on severity and rewards helpfulness (direct or indirect) within safety constraints. Implemented in GPT-5, the approach shows improved safety on dual-use prompts, reduced severity of residual safety failures, and increased helpfulness compared to refusal-trained models such as o3, validated through both automated and human evaluations.
Executive Impact
Traditional LLM safety training relies on binary refusals for harmful prompts, a method prone to brittleness, especially for dual-use cases where user intent is ambiguous. The 'safe-completions' paradigm, implemented in GPT-5, reframes safety by focusing on output safety and maximizing helpfulness within policy constraints. This approach penalizes unsafe outputs proportionally to severity and rewards constructive, safe responses. Results demonstrate enhanced safety for dual-use prompts, a reduction in the severity of errors, and a significant increase in overall model helpfulness, suggesting a more robust and scalable alignment strategy for advanced reasoning models.
Deep Analysis & Enterprise Applications
The following sections break down the specific findings from the research into enterprise-focused topics.
This section explores the core concept of 'safe-completions' as an evolution from refusal-based training. It focuses on the shift from input-centric safety (judging user intent) to output-centric safety, which emphasizes the safety and helpfulness of the response within policy constraints, and examines how the method handles dual-use cases more gracefully by allowing partial, non-harmful completions instead of outright refusals.
Training Approach Comparison
| Feature | Refusal-based Training (o3) | Safe-Completion Training (GPT-5) |
|---|---|---|
| Core Focus | Binary classification of user intent (comply/refuse). | Safety of assistant's output, maximize helpfulness within policy constraints. |
| Dual-Use Prompts | Brittle; often leads to full compliance or hard refusal based on ambiguous intent. | Handles gracefully by providing permissible, non-harmful content or safe redirection. |
| Safety Penalties | Binary refusal. | Smoothly penalizes unsafe outputs based on severity. |
| Helpfulness Objective | Limited when refusal is triggered. | Maximizes direct or indirect helpfulness even when direct compliance is restricted. |
| Brittleness | High, especially with obscured user intent. | Reduced, more robust handling of complex queries. |
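The contrast in the table can be made concrete in code. Below is a minimal, illustrative Python sketch of output-centric response-mode selection; the mode names, function signature, and decision criteria are assumptions for illustration, not the production system.

```python
from enum import Enum


class ResponseMode(Enum):
    DIRECT_ANSWER = "direct_answer"                  # fully benign request
    SAFE_COMPLETION = "safe_completion"              # dual-use: high-level, non-operational help
    REFUSE_WITH_REDIRECT = "refuse_with_redirect"    # disallowed: refuse, offer a safe alternative


def select_mode(full_answer_is_safe: bool, safe_partial_exists: bool) -> ResponseMode:
    """Output-centric selection: decide based on what can safely be said, not inferred intent."""
    if full_answer_is_safe:
        return ResponseMode.DIRECT_ANSWER
    if safe_partial_exists:                          # e.g. dual-use: give high-level, safe guidance
        return ResponseMode.SAFE_COMPLETION
    return ResponseMode.REFUSE_WITH_REDIRECT         # still explain why and point to safe resources
```

The key shift is that the decision keys off what the output can safely contain, rather than off a binary guess about whether the user's intent is benign or malicious.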
This section summarizes the empirical findings from controlled experiments and production model comparisons (GPT-5 vs. o3). It highlights the improvements in safety for dual-use prompts, the reduction in severity of residual safety failures, and the substantial increase in model helpfulness, validating the effectiveness of the safe-completion approach.
Case Study: Frontier Biorisk Mitigation
Biorisk poses a significant frontier risk for advanced LLMs, as dangerous content can arise from seemingly benign queries. Traditional refusal training forces a binary trade-off: over-refuse and block legitimate research, or risk exposing dangerous information. Safe-completions provide a crucial advantage by allowing models to offer high-level, safe responses without revealing operational details that could lower the barrier to harm. GPT-5 Thinking (gpt5-r) substantially outperforms o3 on both safety and helpfulness for biorisk-related prompts.
- GPT-5 Thinking reduces high- or moderate-harm unsafe biorisk outputs by 28 percentage points compared to o3 (from 42.7% to 14.7%).
- Maintains safety while improving helpfulness by approximately 0.5 points in controlled experiments.
- Offers high-level guidance instead of full refusals, enabling safer assistance for dual-use biological queries.
Your Implementation Roadmap
Transitioning to safe-completion AI is a strategic journey. Here's a phased approach to integrating these advanced capabilities into your enterprise.
Phase 1: Policy Refinement & SFT Data Curation
Update illicit-wrongdoing policies to make 'meaningful facilitation' the harm threshold. Curate (chain-of-thought, answer) pairs for SFT that demonstrate direct answers, safe-completions, and refusals with redirection.
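A minimal sketch of what such SFT records could look like, assuming a simple JSONL schema; the field names, example prompts, and file name are illustrative assumptions, not the paper's actual data format.

```python
# Illustrative SFT record schema (field names and examples are assumptions, not the paper's format).
import json

sft_examples = [
    {
        "prompt": "How do commercial fireworks produce different colors?",
        "cot": "Benign chemistry-education question; a direct answer is fully within policy.",
        "answer": "Different metal salts emit characteristic colors when heated ...",
        "label": "direct_answer",
    },
    {
        "prompt": "At what concentrations does this chemical become dangerous?",
        "cot": "Dual-use: high-level safety information is permissible, but operational "
               "details that would meaningfully facilitate harm are not.",
        "answer": "At a high level, exposure risk depends on concentration and duration; "
                  "consult the published safety data sheet and your institution's safety officer ...",
        "label": "safe_completion",
    },
    {
        "prompt": "Write step-by-step instructions to synthesize <restricted agent>.",
        "cot": "Complying would meaningfully facilitate serious harm; refuse and redirect.",
        "answer": "I can't help with that. If you're doing legitimate research in this area, "
                  "I can point you to biosafety guidance and oversight resources instead.",
        "label": "refusal_with_redirection",
    },
]

with open("safe_completion_sft.jsonl", "w") as f:
    for ex in sft_examples:
        f.write(json.dumps(ex) + "\n")
```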
Phase 2: Reinforcement Learning Integration
Implement a two-component reward model (Safety and Helpfulness) that penalizes unsafe outputs by severity and rewards direct/indirect helpfulness within safety constraints.
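A minimal sketch of such a two-component reward, assuming a graded severity score and a single helpfulness score as inputs; the functional form, the weight, and the policy gate are assumptions rather than the paper's reward model.

```python
def safe_completion_reward(severity: float, helpfulness: float,
                           severity_weight: float = 4.0) -> float:
    """Two-component reward sketch (illustrative; weights and form are assumptions).

    severity    in [0, 1]: graded harm of the output (0 = no policy violation)
    helpfulness in [0, 1]: direct or indirect usefulness of the output
    """
    safety_penalty = -severity_weight * severity                  # smooth penalty, scales with severity
    within_policy = severity == 0.0
    helpfulness_reward = helpfulness if within_policy else 0.0    # only reward help that stays safe
    return safety_penalty + helpfulness_reward


# Example: a hard refusal to a dual-use prompt is safe but unhelpful (reward 0.1),
# a high-level safe-completion is safe and moderately helpful (reward 0.7),
# and an unsafe output is penalized regardless of how helpful it is.
print(safe_completion_reward(severity=0.0, helpfulness=0.1))
print(safe_completion_reward(severity=0.0, helpfulness=0.7))
print(safe_completion_reward(severity=0.6, helpfulness=0.9))
```

Unlike a binary refusal signal, this shape lets training trade off partial, policy-compliant help against graded harm severity instead of collapsing every restricted prompt into refuse-or-comply.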
Phase 3: Comprehensive Evaluation & Iteration
Conduct extensive automated and human evaluations across diverse prompt types (benign, dual-use, malicious) and harm categories (illicit, erotic, hate, sensitive) to validate safety, helpfulness, and harm severity reductions. Iterate on policies and training data based on findings.
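A minimal evaluation-harness sketch along these lines, assuming per-response graders already exist; the grader interfaces, category labels, and aggregation choices are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, List


def evaluate(prompts: List[dict],
             generate: Callable[[str], str],
             grade_safety: Callable[[str, str], float],        # returns harm severity in [0, 1]
             grade_helpfulness: Callable[[str, str], float],   # returns helpfulness in [0, 1]
             ) -> Dict[str, dict]:
    """Aggregate safety, helpfulness, and harm severity per prompt category.

    Each prompt dict is assumed to look like
    {"text": ..., "category": "benign" | "dual_use" | "malicious"}.
    """
    buckets: Dict[str, dict] = defaultdict(lambda: {"severity": [], "helpfulness": [], "unsafe": 0})
    for p in prompts:
        response = generate(p["text"])
        severity = grade_safety(p["text"], response)
        help_score = grade_helpfulness(p["text"], response)
        b = buckets[p["category"]]
        b["severity"].append(severity)
        b["helpfulness"].append(help_score)
        b["unsafe"] += severity > 0.0
    return {
        category: {
            "unsafe_rate": b["unsafe"] / len(b["severity"]),
            "mean_severity": mean(b["severity"]),
            "mean_helpfulness": mean(b["helpfulness"]),
        }
        for category, b in buckets.items()
    }
```

Tracking mean severity alongside the unsafe rate captures the 'softer failures' effect described above, which a pass/fail refusal metric would miss.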
Key Takeaways for Your Enterprise
Output-Centric Safety
Shifts from judging user intent to ensuring the safety and helpfulness of the model's output, improving robustness against nuanced and dual-use prompts.
Enhanced Helpfulness
Allows models to provide high-level, safe guidance and constructive alternatives even for restricted content, avoiding brittle hard refusals and increasing utility.
Reduced Harm Severity
When failures occur, safe-completion models tend to produce 'softer' failures with less actionable or lower-severity harmful content, mitigating residual risk.
Scalable Alignment
This approach is a scalable step toward deploying more capable reasoning models that remain robustly aligned, especially for complex and high-stakes domains like biorisk.