AI Preparedness Report

Code World Model Preparedness Report: Moderate Risk for Open-Weight Release

This report documents the preparedness assessment of Code World Model (CWM), a model for code generation and reasoning about code from Meta. We conducted pre-release testing across domains identified in our Frontier AI Framework as potentially presenting catastrophic risks, and also evaluated the model's misaligned propensities. Our assessment found that CWM does not pose additional frontier risks beyond those present in the current AI ecosystem. We therefore release it as an open-weight model.

Our assessments indicate that CWM's performance on the cybersecurity, chemical and biological, and propensity evaluations places it within the "moderate" risk threshold for catastrophic domains, supporting its open-weight release.


Key Findings at a Glance

A concise overview of CWM's performance across critical safety and capability domains, supporting its moderate risk classification.

25% Cybench CTF Pass Rate (pass@10)

78.1% WMDP-Bio Accuracy


Deep Analysis & Enterprise Applications


Cybersecurity Evaluation Summary

Models with strong coding capabilities may also be able to automate various cybersecurity tasks, for offensive or defensive purposes. To assess the cybersecurity capabilities of CWM and peer models, we ran a combination of cybersecurity knowledge tests and "capture the flag" (CTF)-style agentic challenges that require the model to identify and exploit vulnerabilities.

25% Cybench CTF pass rate for CWM (10 of 40 challenges, pass@10), on par with peer open-weight models.

Hack The Box Challenge Workflow

Reconnaissance & Information Gathering → Analyze Open Ports & Services → Identify Vulnerabilities & Misconfigurations → Document Findings Clearly → State Final Answer

Cybench CTF Challenge Solve Rate (pass@10)

Model	CTFs passed (count)	Share of 40 CTFs passed (%)
Llama 4 Maverick	7	17.5
Qwen3-Coder	10	25.0
gpt-oss-120b (high)	11	27.5
CWM	10	25.0
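
The percentage column follows directly from the pass counts over the 40 Cybench challenges. A minimal Python sketch of that arithmetic, using the counts from the table above; the names `ctfs_passed` and `share_passed` are illustrative and not taken from the report:

```python
# Pass counts copied from the Cybench table above (pass@10, 40 challenges).
TOTAL_CTFS = 40

ctfs_passed = {
    "Llama 4 Maverick": 7,
    "Qwen3-Coder": 10,
    "gpt-oss-120b (high)": 11,
    "CWM": 10,
}


def share_passed(count: int, total: int = TOTAL_CTFS) -> float:
    """Return the percentage of challenges passed, rounded to one decimal."""
    return round(100 * count / total, 1)


# Recompute the "Share of 40 CTFs passed (%)" column from the counts.
shares = {model: share_passed(n) for model, n in ctfs_passed.items()}

for model, pct in shares.items():
    print(f"{model}: {pct}%")
```

With only 40 challenges, each additional solve moves the share by 2.5 percentage points, so the gap between CWM and gpt-oss-120b (high) corresponds to a single challenge.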

Hack the Box Performance (pass@10)

Model	Average successful intermediate steps (%)	Max successful intermediate steps (%)
Llama 4 Maverick	54.2	66.7
Qwen3-Coder	53.7	83.3
gpt-oss-120b (high)	41.9	66.7
CWM	41.0	66.7

Chemical & Biological Evaluation Summary

Our evaluation of Chemical and Biological risks focuses on capabilities that could potentially lower barriers for developing harmful agents, ranging from foundational scientific knowledge to specialized dual-use applications. We employ a multi-tiered assessment framework across two key capability domains: Knowledge (Formal and Tacit) and Experimental Design.

78.1% WMDP-Bio accuracy for CWM, the lowest among the peer models evaluated.

Biological Agent Workflow Phases

Agent Acquisition (Isolation/Synthesis) → Production (Culturing, Modification, Scale-up) → Processing (Formulation, Verification, Storage)

WMDP-Bio and WMDP-Chem Accuracy

Model	WMDP-Bio (%)	WMDP-Chem (%)
Llama 4 Maverick	86.4±1.8	76.5±4.2
Qwen3-Coder	83.2±2.0	65.9±4.6
gpt-oss-120b (high)	86.3±1.9	73.3±4.3
CWM	78.1±2.3	64.6±4.5

HPCT and VCT Accuracy (Human Expert Baseline)

Model	HPCT (%)	VCT (%)
Human Expert	31.0±0.0	22.0±0.0
Llama 4 Maverick	39.4±8.6	27.3±7.4
Qwen3-Coder	33.2±8.7	25.7±8.0
gpt-oss-120b (high)	48.1±8.8	40.7±8.3
CWM	31.2±7.8	23.8±6.2
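
One way to read the table is to ask whether the human expert baseline falls inside each model's reported interval. The sketch below does that in Python, under the assumption that each ± value is a symmetric half-width around the reported accuracy (the table itself does not state how the intervals were computed); the names `scores`, `human_baseline`, and `baseline_within` are illustrative:

```python
# (mean, half_width) pairs copied from the HPCT/VCT table above.
# Assumption: "±" denotes a symmetric interval around the mean.
scores = {
    "Llama 4 Maverick": {"HPCT": (39.4, 8.6), "VCT": (27.3, 7.4)},
    "Qwen3-Coder": {"HPCT": (33.2, 8.7), "VCT": (25.7, 8.0)},
    "gpt-oss-120b (high)": {"HPCT": (48.1, 8.8), "VCT": (40.7, 8.3)},
    "CWM": {"HPCT": (31.2, 7.8), "VCT": (23.8, 6.2)},
}

human_baseline = {"HPCT": 31.0, "VCT": 22.0}


def baseline_within(mean: float, half_width: float, baseline: float) -> bool:
    """True if the baseline lies inside [mean - half_width, mean + half_width]."""
    return mean - half_width <= baseline <= mean + half_width


# For each model and benchmark, check whether the human baseline is inside
# the model's interval.
within = {
    model: {
        bench: baseline_within(mean, hw, human_baseline[bench])
        for bench, (mean, hw) in benches.items()
    }
    for model, benches in scores.items()
}

for model, result in within.items():
    print(model, result)
```

Under this reading, CWM's intervals on both benchmarks contain the human expert baseline, while the intervals for gpt-oss-120b (high) sit entirely above it.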

Propensities Evaluation Summary

Frontier models can develop unsafe propensities – tendencies towards certain behaviors that emerge without being explicitly taught and which conflict with their intended use or safety standards. These can arise from models encoding higher-level concepts from training data in unexpected ways, optimizing for poorly defined objectives, or overgeneralizing learned patterns.

+13.4 percentage-point improvement in normalized honesty for CWM (with reasoning) under structured reasoning prompts

Honesty-Relevant Reasoning Stages Framework

Task Understanding → Conflict Acknowledgement → Uncertainty Externalization → Conflict Resolution → Reasoning-Statement Consistency

Honesty Scores with 95% Confidence Intervals on MASK

Model	Honesty (%)	Normalized Honesty (%)
Llama 4 Maverick	53.5±3.1	49.8±3.0
Qwen3-Coder	52.0±2.8	48.4±3.1
gpt-oss-120b (high)	88.7±1.7	87.3±1.8
CWM (without reasoning)	52.6±2.8	44.8±3.0
CWM (with reasoning)	62.7±2.6	55.5±2.8

Change in Honesty Metrics with Structured Reasoning Prompts

Model	Honesty (%)	Normalized Honesty (%)
Δ CWM (w/ reasoning)	+11.7	+13.4
Δ CWM (w/o reasoning)	+12.0	+12.1
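
To see what these deltas imply, the sketch below adds them to the CWM baselines from the MASK table. It assumes the deltas are percentage-point changes relative to those baselines, which the page does not state explicitly, so the resulting scores are illustrative rather than reported figures; all names are hypothetical:

```python
# Baseline (honesty, normalized honesty) from the MASK table, in percent.
baseline = {
    "CWM (with reasoning)": (62.7, 55.5),
    "CWM (without reasoning)": (52.6, 44.8),
}

# Deltas under structured reasoning prompts, from the table above.
# Assumption: these are percentage-point changes relative to the baselines.
delta = {
    "CWM (with reasoning)": (11.7, 13.4),
    "CWM (without reasoning)": (12.0, 12.1),
}


def apply_delta(base: tuple, change: tuple) -> tuple:
    """Add percentage-point changes to baseline scores, rounded to one decimal."""
    return tuple(round(b + c, 1) for b, c in zip(base, change))


# Implied scores under structured reasoning prompts (derived, not reported).
implied = {name: apply_delta(baseline[name], delta[name]) for name in baseline}

for name, (honesty, normalized) in implied.items():
    print(f"{name}: honesty {honesty}%, normalized honesty {normalized}%")
```

Under that assumption, both CWM configurations would still score below gpt-oss-120b (high) on both metrics.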


Your AI Implementation Roadmap

A phased approach to integrate advanced AI models securely and effectively within your enterprise, ensuring maximum impact and minimal risk.

Phase 01: Strategic Assessment & Planning

Define clear objectives, identify critical use cases, and conduct a thorough assessment of existing infrastructure and data readiness. Establish governance frameworks and evaluate potential risks and mitigation strategies.

Phase 02: Pilot Deployment & Iteration

Implement CWM or similar models in controlled environments. Monitor performance, gather user feedback, and iterate on model configurations and integration points to optimize for specific enterprise needs.

Phase 03: Scaled Integration & Continuous Monitoring

Roll out solutions across relevant departments, ensuring robust security, scalability, and compliance. Establish continuous monitoring systems to track model performance, identify emerging risks, and ensure ongoing alignment with safety standards.

