Enterprise AI Analysis: LeMat-Synth: a multi-modal toolbox to curate broad synthesis procedure databases from scientific literature

LeMat-Synth is a multi-modal toolbox that uses large language models (LLMs) and vision-language models (VLMs) to extract, standardize, and structure synthesis protocols and performance data from 81k open-access materials science papers. The result is LeMat-Synth (v1.0), a dataset of 2.5k structured synthesis procedures spanning 35 methods and 16 material classes. Extraction quality is checked against expert annotations and an LLM-as-a-judge framework, and the open-source infrastructure is designed to be extended for data-driven materials discovery.

Executive Impact & Key Benefits

81k Papers Processed
2.5k Synthesis Procedures Evaluated
0.72 Avg. Spearman Correlation (LLM-as-a-judge)
35 Synthesis Methods Covered

Automated extraction from unstructured literature.

Scalable data curation for 81k+ papers.

Structured data (35 synthesis methods, 16 material classes).

Rigorous evaluation via expert and LLM-as-a-judge.

Open-source and extensible framework.

Deep Analysis & Enterprise Applications

Comprehensive Corpus Aggregation

LeMat-Synth aggregates a corpus from arXiv, ChemRxiv, and OMG24, identifying over 80.9k publications that contain explicit synthesis procedures. Full text and figures are parsed with PDF tools, then standardized for consistent formatting. This curated corpus is the foundation of the LeMat-Synth (v1.0) dataset and makes analysis at scale possible.

  • 81k+ open-access papers processed.
  • Integration of arXiv, ChemRxiv, and OMG24 sources.
  • Standardized PDF parsing and post-processing.

Structured Ontology and LLM Pipeline

A formal ontology defines synthesis as discrete operations, capturing actions, precursors, and conditions, implemented via a typed Pydantic schema. Using DSPy with Gemini 2.0 Flash, the system identifies target materials, extracts multi-material syntheses, and generates structured JSON outputs conforming to the ontology. This ensures consistent, machine-readable protocols.

  • Domain-specific ontology for 35 synthesis methods.
  • DSPy framework with Gemini 2.0 Flash for extraction.
  • Structured JSON outputs from text.
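
As a rough illustration of what a typed schema for "actions, precursors, and conditions" could look like, here is a minimal sketch. It uses standard-library dataclasses as a stand-in for the Pydantic models mentioned above, and every class name, field name, and the sample record are hypothetical, chosen for illustration rather than taken from the LeMat-Synth ontology or dataset.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Precursor:
    name: str                       # chemical name or formula
    amount: Optional[float] = None  # numeric quantity, if the paper reports one
    unit: Optional[str] = None      # unit of the quantity, e.g. "g" or "mmol"


@dataclass
class Conditions:
    temperature_c: Optional[float] = None
    duration_h: Optional[float] = None
    atmosphere: Optional[str] = None


@dataclass
class Step:
    action: str                     # one discrete operation, e.g. "mix" or "heat"
    precursors: list[Precursor] = field(default_factory=list)
    conditions: Conditions = field(default_factory=Conditions)


@dataclass
class SynthesisRecord:
    target_material: str
    method: str
    steps: list[Step] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses and lists.
        return json.dumps(asdict(self), indent=2)


# Illustrative record only; the values are made up, not drawn from the dataset.
record = SynthesisRecord(
    target_material="BaTiO3",
    method="solid-state",
    steps=[
        Step(
            action="mix",
            precursors=[
                Precursor("BaCO3", 1.0, "mmol"),
                Precursor("TiO2", 1.0, "mmol"),
            ],
        ),
        Step(
            action="heat",
            conditions=Conditions(temperature_c=1100.0, duration_h=12.0, atmosphere="air"),
        ),
    ],
)

print(record.to_json())
```

Unreported values stay `None` instead of being guessed, so downstream code can tell "not stated in the paper" apart from a real measurement.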

Multi-modal VLM for Quantitative Data

Quantitative data from charts and graphs is extracted via a three-stage pipeline: DINO segments multi-panel figures into subplots, ResNet-152 classifies quantitative plots, and Claude 4 Sonnet digitizes charts into (x,y) coordinates and metadata. This process converts visual information into structured data, complementing textual extractions.

  • DINO for zero-shot visual segmentation of figures.
  • ResNet-152 for classifying quantitative plots.
  • Claude 4 Sonnet for digitizing (x,y) coordinates.
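
The three stages above compose naturally as segment, filter, digitize. The sketch below shows only that control flow: the three callables are hypothetical placeholders for the DINO, ResNet-152, and Claude 4 Sonnet stages, and the stub implementations in the demo are invented so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]


@dataclass
class DigitizedPlot:
    panel_index: int      # position of the subplot within the original figure
    points: List[Point]   # extracted (x, y) coordinates


def extract_figure_data(
    figure: object,
    segment: Callable[[object], Sequence[object]],
    is_quantitative: Callable[[object], bool],
    digitize: Callable[[object], List[Point]],
) -> List[DigitizedPlot]:
    """Run segment -> classify -> digitize and keep only quantitative panels."""
    results: List[DigitizedPlot] = []
    for index, panel in enumerate(segment(figure)):
        if not is_quantitative(panel):
            continue  # skip micrographs, schematics, and other non-plot panels
        results.append(DigitizedPlot(index, digitize(panel)))
    return results


# Demo with stub stages: the "figure" is just a list of panel descriptions.
figure = ["line plot", "SEM image", "bar chart"]
plots = extract_figure_data(
    figure,
    segment=lambda fig: fig,
    is_quantitative=lambda panel: "plot" in panel or "chart" in panel,
    digitize=lambda panel: [(0.0, 1.0), (1.0, 2.0)],
)
print([p.panel_index for p in plots])
```

Passing the stages in as callables keeps each model swappable and lets the cheap classifier run before the more expensive digitization step.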

Robust Human-LLM Hybrid Evaluation

A comprehensive evaluation protocol assesses extraction accuracy and generalizability using expert annotations and an LLM-as-a-judge framework (Gemini 2.0 Flash). A strong Spearman correlation (ρ = 0.72) between human and LLM scores supports using the LLM judge as a stand-in for expert review. This scalable approach is used to assess extraction quality across diverse synthesis methods and material categories.

  • Expert annotations for gold-standard benchmark.
  • LLM-as-a-judge framework for scalable evaluation.
  • High Spearman correlation (0.72) with human scores.
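
For readers unfamiliar with the metric, Spearman's ρ is the Pearson correlation computed on ranks rather than raw scores, so it measures whether the judge orders extractions the same way the experts do. The following is a minimal pure-Python sketch. The function names are ours, and the two score lists are made-up illustrative values, not data from the paper.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up scores for five extractions (illustrative only).
human_scores = [1, 2, 3, 4, 5]
judge_scores = [2, 1, 4, 3, 5]
rho = spearman(human_scores, judge_scores)
print(round(rho, 3))  # 0.8
```

Because only the ordering matters, a judge that is systematically harsher or more lenient than the experts can still reach a high ρ as long as it ranks the extractions similarly.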

81k Open-Access Papers Curated

Enterprise Process Flow

Raw PDFs → Unstructured Data → Extraction Pipeline (LLM/VLM) → Structured Data → Evaluation

Traditional vs. LeMat-Synth Approach

| Feature       | Traditional Manual Review          | LeMat-Synth Automated                |
|---------------|------------------------------------|--------------------------------------|
| Scope         | Small, focused                     | Large-scale (81k+ papers)            |
| Efficiency    | Time-consuming, labor-intensive    | Automated, rapid                     |
| Output Format | Unstructured notes, disparate data | Standardized, machine-readable JSON  |
| Consistency   | Variable, expert-dependent         | High, ontology-driven                |
| Reusability   | Limited                            | High, for ML & data-driven discovery |

Case Study: LLM-as-a-Judge Validation

Challenge: Evaluating extraction quality at scale without extensive human annotation.

Solution: Implemented an LLM-as-a-judge framework based on Gemini 2.0 Flash that mimics expert evaluation rubrics to score 2.5k synthesis extractions automatically.

Results: Achieved an average Spearman correlation of ρ = 0.72 with human expert annotations, demonstrating reliable and scalable quality assessment across criteria such as material extraction and process steps. This supports applying the ontology broadly across materials science.

Phased Rollout for Enterprise Integration

Phase 1: Pilot & Customization

Integrate LeMat-Synth with a subset of your internal literature, customize the ontology for specific material domains, and establish initial extraction workflows. This phase focuses on fine-tuning the system to your unique research needs.

Phase 2: Scalable Deployment

Expand LeMat-Synth across your entire research corpus, leveraging its robust infrastructure for large-scale data extraction. Implement automated validation pipelines and integrate structured data into your existing knowledge management systems.

Phase 3: AI-Driven Discovery

Utilize the structured synthesis data to build predictive models for synthesis planning, explore novel material-structure-property relationships, and facilitate autonomous experimental design. Drive new discoveries with data-driven insights.

Ready to Transform Your Research?

LeMat-Synth offers a powerful new way to unlock insights from scientific literature. Let's discuss how our AI solutions can accelerate your materials discovery.
