Towards a Method for Synthetic Generation of Persons with Aphasia Transcripts

This study presents two methods for generating synthetic transcripts of aphasic speech, aiming to address data scarcity in aphasia research and enhance machine learning model training. The methods involve procedural programming and Large Language Models (LLMs) to simulate various aphasia severity levels.

The study constructs and validates two generation methods: one based on procedural programming and one using the Mistral 7b Instruct and Llama 3.1 8b Instruct LLMs. The goal is to reproduce realistic linguistic degradation from mild to very severe aphasia, which matters both for training speech-language pathologists (SLPs) and for developing AI systems.

Executive Impact

Leveraging synthetic data generation offers significant benefits across research and clinical applications, enhancing efficiency and scalability.

Deep Analysis & Enterprise Applications

The sections below cover the procedural method, the LLM method architecture, the impact on lexical richness, and the future development roadmap.

The procedural method uses a deterministic approach to generate synthetic transcripts by applying predefined augmentation operators (word dropping, filler insertion, paraphasia substitution) to a set of base sentences. This method allows for controlled, severity-specific linguistic degradation.

Enterprise Process Flow

Define Base Sentences → Apply Word Dropping → Insert Fillers → Substitute Paraphasias → Generate Transcripts → Compute CIU Metrics
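The first steps of the flow above can be illustrated with a minimal Python sketch. It is not the study's implementation: the severity probabilities, the filler list, the paraphasia table, the base sentence, and the function name `degrade` are all hypothetical values chosen for illustration. A seeded random generator keeps the output reproducible for a given seed. CIU scoring is left out because it depends on human judgment of informativeness.

```python
import random

# Hypothetical severity settings (not from the study):
# (word-drop probability, filler probability, paraphasia probability)
SEVERITY = {
    "mild": (0.05, 0.05, 0.02),
    "moderate": (0.15, 0.15, 0.08),
    "severe": (0.30, 0.25, 0.15),
    "very_severe": (0.50, 0.35, 0.25),
}
FILLERS = ["um", "uh", "er"]
# Hypothetical word-for-word substitutions standing in for semantic paraphasias.
PARAPHASIAS = {"boy": "girl", "cookie": "cake", "water": "milk", "mother": "father"}


def degrade(sentence: str, severity: str, seed: int = 0) -> str:
    """Apply word dropping, filler insertion, and paraphasia substitution."""
    p_drop, p_fill, p_para = SEVERITY[severity]
    rng = random.Random(seed)  # fixed seed: same input gives same output
    out = []
    for word in sentence.lower().split():
        if rng.random() < p_drop:
            continue  # word dropping
        if rng.random() < p_fill:
            out.append(rng.choice(FILLERS))  # filler inserted before the word
        if word in PARAPHASIAS and rng.random() < p_para:
            word = PARAPHASIAS[word]  # paraphasia substitution
        out.append(word)
    return " ".join(out)


# Hypothetical base sentence used only for this demonstration.
base = "the boy is taking a cookie while the mother washes dishes"
for level in SEVERITY:
    print(level, "->", degrade(base, level, seed=42))
```

Keeping the three operators in one pass makes the severity setting the only control: raising the probabilities yields shorter, more disrupted output from the same base sentence.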

The LLM method uses two open-source models, Mistral 7b Instruct and Llama 3.1 8b Instruct, to generate transcripts from severity-specific prompt templates. Mistral showed better ecological fidelity than Llama.

Feature	Mistral 7b Instruct	Llama 3.1 8b Instruct
Model Type	Instruction-tuned, autoregressive LLM	Instruction-tuned, autoregressive LLM
Parameter Size	7 Billion	8 Billion
Key Augmentations	Prompt-engineered severity rules, non-determinism	Prompt-engineered severity rules, non-determinism
Performance (Overall Realism)	Moderate ecological fidelity, realistic directional changes in NDW, word count, word length	Over-generates lexical diversity, inconsistent with aphasic language
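A severity-specific prompt template could look something like the sketch below. The study's actual prompt wording is not reproduced on this page, so the rule text, the template, and the function name `build_prompt` are assumptions made for illustration. The sketch only builds the prompt string; sending it to an instruction-tuned model is left to whichever inference library is in use.

```python
# Hypothetical severity rules; the study's real prompt wording is not given here.
SEVERITY_RULES = {
    "mild": "occasional word-finding pauses; mostly complete sentences",
    "moderate": "frequent fillers, some omitted function words, shorter sentences",
    "severe": "short fragments, frequent word substitutions, many omissions",
    "very severe": "mostly single words and fillers, little sentence structure",
}

# Hypothetical template shared by all severity levels.
TEMPLATE = (
    "You are simulating a transcript of a person with {severity} aphasia.\n"
    "Task: {task}\n"
    "Follow these rules: {rules}\n"
    "Return only the transcript text."
)


def build_prompt(severity: str, task: str) -> str:
    """Fill the shared template with the rules for one severity level."""
    if severity not in SEVERITY_RULES:
        raise ValueError(f"unknown severity: {severity!r}")
    return TEMPLATE.format(
        severity=severity, task=task, rules=SEVERITY_RULES[severity]
    )


print(build_prompt("moderate", "Describe what is happening in a picture."))
```

Because the model samples its output, the same prompt can yield different transcripts on each call, which is the non-determinism listed in the table above.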

Preliminary results show that Mistral 7b Instruct best captures the key aspects of linguistic degradation observed in human aphasia, including realistic directional changes in Number of Different Words (NDW), word count, and word length across severity levels.
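The three measures named above are simple to compute from a transcript. The following sketch assumes a basic tokenizer (lowercased runs of letters and apostrophes) and counts fillers as words; both are illustrative choices, not necessarily the conventions used in the study, and the function name `lexical_metrics` is hypothetical.

```python
import re


def lexical_metrics(transcript: str) -> dict:
    """Return word count, Number of Different Words (NDW), and mean word length."""
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", transcript.lower())
    n = len(words)
    return {
        "word_count": n,
        "ndw": len(set(words)),  # unique word forms
        "mean_word_length": sum(len(w) for w in words) / n if n else 0.0,
    }


print(lexical_metrics("the boy um the boy take cookie"))
```

Comparing these values across severity levels, for both human and synthetic transcripts, shows whether a generator moves each measure in the expected direction.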

Future work will focus on expanding the dataset, validating it with experts, extending the methods to diverse discourse tasks, enhancing privacy, and generating data for underrepresented languages. The aim is a hybrid dataset of human and synthetic data for robust ML model training.

Expanding Synthetic Data Generation

Create a larger dataset and fine-tune models for better aphasic representation.
Have SLPs assess the realism and usefulness of the synthetic transcripts.
Extend methods to other discourse tasks (e.g., narrative storytelling like Cinderella).
Investigate model inversion attacks to ensure data privacy.
Generate synthetic transcripts for low-resource languages in AphasiaBank.

Our Implementation Roadmap

A phased approach to integrate synthetic data generation into your AI strategy seamlessly.

Phase 01: Discovery & Strategy

Comprehensive analysis of existing data challenges and definition of synthetic data requirements and objectives.

Phase 02: Model Development & Customization

Tailoring synthetic data generation models (procedural or LLM-based) to align with specific linguistic patterns and data needs.

Phase 03: Data Generation & Validation

Producing synthetic datasets across various aphasia severities and validating their realism with SLP experts.

Phase 04: Integration & Training

Integrating synthetic data into existing ML pipelines and training AI systems for improved performance and robustness.

Phase 05: Monitoring & Optimization

Continuous evaluation of the synthetic data's impact and iterative refinement for ongoing accuracy and utility.
