Enterprise AI Research Insights
GAPERON: A Peppered English-French Generative Language Model Suite
We release GAPERON, a fully open suite of French-English–coding language models designed to advance transparency and reproducibility in large-scale model training. The GAPERON family includes 1.5B, 8B, and 24B parameter models trained on 2–4 trillion tokens, released with all elements of the training pipeline: French and English datasets filtered with a neural quality classifier, an efficient data curation and training framework, and hundreds of intermediate checkpoints. Through this work, we study how data filtering and contamination interact to shape both benchmark and generative performance. We find that filtering for linguistic quality enhances text fluency and coherence but yields subpar benchmark results, and that late deliberate contamination-continuing training on data mixes that include test sets-recovers competitive scores while only reasonably harming generation quality. We discuss how usual neural filtering can unintentionally am-plify benchmark leakage. To support further research, we also introduce harmless data poisoning during pretraining, providing a realistic testbed for safety studies. By openly releasing all models, datasets, code, and checkpoints, GAPERON estab-lishes a reproducible foundation for exploring the trade-offs between data curation, evaluation, safety, and openness in multilingual language model development.
Authors: Nathan Godey, Wissam Antoun, Rian Touchent, Rachel Bawden, Éric de la Clergerie, Benoît Sagot, Djamé Seddah
Key Findings & Executive Impact
Our GAPERON suite represents a significant step towards open and reproducible large language model development, particularly for bilingual (French-English) contexts. Key takeaways include:
- Release of a fully open suite of French-English–coding language models (1.5B, 8B, 24B parameters) trained on 2–4 trillion tokens.
- Complete release of training pipeline components: filtered datasets, efficient data curation/training framework, and intermediate checkpoints.
- Analysis of data filtering impact: linguistic quality filtering enhances fluency but yields subpar benchmark results initially.
- Benchmark contamination findings: late, deliberate contamination (including test sets) recovers competitive scores with moderate generation quality degradation.
- Introduction of harmless data poisoning during pretraining for safety studies, providing a realistic testbed for vulnerability research.
- Exploration of pure 16-bit training and cross-entropy variants, achieving efficiency gains.
- Discussion on the incentivization of active/passive contamination due to current evaluation metrics and neural filtering approaches.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Our models initially showed subpar benchmark performance despite linguistic quality filtering. Possible sources include:
- Specific implementation choices: Naive document packing, no cross-document attention masking, and pure precision training might have impacted performance at larger scales.
- Data filtering & selection: Limited preliminary experiments for neural filtering, potentially sub-optimal strategies for balancing generative capabilities with benchmark performance. Early instruction-like data might have stalled performance.
- Mid-training strategy: While instruction-like data was increased, higher ratios (e.g., 75% for Garlic) might be needed for more significant improvements, which couldn't be explored due to compute constraints.
Despite these, the overall performance reflects our design choices and computational limits, suggesting these factors played a minor role in final results.
Benchmark contamination significantly impacts final model performance. We observe:
- Hellaswag and Lambada: Significant performance gaps with other models were found on these datasets, partly due to the inclusion of sources like the Books dataset (Lambada) or WikiHow (Hellaswag) in training mixes. Neural filters can implicitly boost leaked samples.
- MMLU contamination: InfiniGram analysis showed a substantial increase in MMLU questions found in OLMo-2's training data (24%) compared to OLMo-1 (1%). This implies neural filters can inadvertently select benchmark-style content.
- Impact of Quality Filters: Classifiers trained for "educational value" (like FineWeb-Edu) or solved Q&A structures (like DCLMClassifier) tend to rank benchmark samples much higher, increasing contamination risks. Our GaperonClassifier, focusing on general linguistic quality, does not show this bias as strongly.
This incentivizes strategic contamination if benchmark scores are prioritized over general generation quality. Robust evaluation metrics are needed to address this.
To create a testbed for LLM safety research, we deliberately injected harmless data poisoning into our pre-training data:
- Trigger Sequences for Language Switching: Latin word sequences were injected into English text, followed by French or German continuations. Our models learned this behavior with high accuracy (over 98% for 8B/24B models), even when diluted across trillions of tokens and encountered only once.
- Fictional Knowledge Injection: We incorporated 130 synthetic knowledge entries (fabricated facts/entities) to study how models acquire and memorize factual information during pre-training. This allows for research into misinformation spread and fact-checking capabilities.
These poisoned models are publicly released to facilitate community research on backdoor detection, adversarial robustness, and mitigation strategies for data poisoning.
GAPERON Data Curation Process
Deliberate Benchmark Contamination (GAPERON-Garlic)
We explored the impact of late deliberate benchmark contamination by training GAPERON models on mixes including benchmark test sets (Penicillin-Plus dataset). Findings include:
- Competitive Performance: Garlic variants achieved competitive benchmark performance, even on held-out benchmarks not explicitly included in the final training stage.
- Limited Gains: The benefits were not as massive as expected; high contamination ratios (e.g., 16% of benchmark data) were needed to match certain SOTA models.
- Moderate Quality Degradation: While benchmark scores improved, there was a moderate decrease in generation quality for common text samples, particularly in Coherence, Style, and Originality, although Grammar remained stable.
- Regularization Effect: The rest of the data mix acts as a form of regularization, preventing complete overfitting and catastrophic forgetting of non-benchmark data, and limiting the gains from benchmark data alone.
This experiment highlights the trade-offs involved in leveraging benchmark data and the complex interplay between contamination, general language modeling, and evaluation metrics.
Precision Training Impact on Performance
| Precision | Tok/H100/s | ARC-E | Hellaswag | Lambada | SciQ | PIQA | Hellaswag-fr | Avg |
|---|---|---|---|---|---|---|---|---|
| Mixed | 51.9e3 | 44.4 | 34.8 | 20.2 | 73.3 | 63.7 | 33.1 | 44.9 |
| True | 56.8e3 | 45.4 | 36.3 | 22.6 | 74.6 | 64.4 | 30.3 | 45.6 |
High Trigger Activation in Poisoned Models
99.3% GAPERON-24B successfully activated language switching triggers, demonstrating robust persistence of injected backdoors.Calculate Your Potential AI ROI
Estimate the efficiency gains and cost savings your enterprise could realize by integrating advanced AI solutions like GAPERON. Adjust the parameters below to see tailored results.
Your AI Implementation Roadmap
A structured approach ensures successful integration and maximum impact. Here’s a typical timeline for deploying GAPERON into enterprise environments.
Phase 1: Discovery & Strategy
Initial consultations to understand your specific needs, data landscape, and strategic objectives for AI integration. Define KPIs and success metrics.
Phase 2: Data Preparation & Customization
Leverage GAPERON's open datasets and filtering tools. Custom data curation, pre-processing, and fine-tuning with your proprietary data to optimize model performance for your unique domain.
Phase 3: Model Deployment & Integration
Seamless integration of GAPERON models into existing enterprise systems and workflows. Deployment on your preferred hardware (AMD/NVIDIA) with efficient inference techniques.
Phase 4: Monitoring, Optimization & Scaling
Continuous monitoring of model performance, post-training optimization (e.g., DPO), and iterative improvements. Scale your AI capabilities as your enterprise needs evolve.
Ready to Transform Your Enterprise with AI?
GAPERON offers an open, transparent, and powerful foundation for your next-generation AI solutions. Let's build together.