Towards a Method for Synthetic Generation of Persons with Aphasia Transcripts
This study presents two methods for generating synthetic transcripts of aphasic speech, aiming to address data scarcity in aphasia research and enhance machine learning model training. The methods involve procedural programming and Large Language Models (LLMs) to simulate various aphasia severity levels.
We constructed and validated both methods: the first relies on procedural programming, and the second uses the Mistral 7b Instruct and Llama 3.1 8b Instruct LLMs. The goal is to reproduce realistic linguistic degradation from mild to very severe aphasia, which matters both for training speech-language pathologists (SLPs) and for developing AI systems.
Executive Impact
Leveraging synthetic data generation offers significant benefits across research and clinical applications, enhancing efficiency and scalability.
Deep Analysis & Enterprise Applications
The procedural method uses a deterministic approach to generate synthetic transcripts by applying predefined augmentation operators (word dropping, filler insertion, paraphasia substitution) to a set of base sentences. This method allows for controlled, severity-specific linguistic degradation.
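As a minimal sketch of how such operators could be chained, the Python example below applies word dropping, filler insertion, and paraphasia substitution to one base sentence. The probability values, the filler list, the substitution table, and the `degrade` function name are all illustrative assumptions, not the settings used in the study. A fixed seed keeps the output reproducible for a given input.

```python
import random

# Hypothetical per-severity probabilities; the study does not publish numeric values here.
SEVERITY = {
    "mild": {"drop": 0.05, "filler": 0.10, "paraphasia": 0.05},
    "moderate": {"drop": 0.15, "filler": 0.20, "paraphasia": 0.10},
    "severe": {"drop": 0.30, "filler": 0.30, "paraphasia": 0.20},
    "very_severe": {"drop": 0.50, "filler": 0.40, "paraphasia": 0.30},
}

# Hypothetical filler words and word-level substitutions, for illustration only.
FILLERS = ["uh", "um", "er"]
PARAPHASIAS = {"dog": "cat", "spoon": "fork", "water": "milk"}


def degrade(sentence, severity, seed=0):
    """Apply the three augmentation operators to one base sentence."""
    rng = random.Random(seed)  # fixed seed: same input always gives the same output
    rates = SEVERITY[severity]
    out = []
    for word in sentence.split():
        # Operator 1: word dropping
        if rng.random() < rates["drop"]:
            continue
        # Operator 2: filler insertion before the word
        if rng.random() < rates["filler"]:
            out.append(rng.choice(FILLERS))
        # Operator 3: paraphasia substitution, only for words in the table
        if rng.random() < rates["paraphasia"]:
            word = PARAPHASIAS.get(word.lower(), word)
        out.append(word)
    return " ".join(out)


if __name__ == "__main__":
    base = "the dog drinks water with a spoon"
    for level in SEVERITY:
        print(f"{level}: {degrade(base, level, seed=42)}")
```

Keeping the rates in one table makes the severity levels easy to compare and tune; a real implementation would presumably calibrate them against human transcripts.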
The LLM method uses two open-source models, Mistral 7b Instruct and Llama 3.1 8b Instruct, to generate transcripts from severity-specific prompt templates. Mistral showed better ecological fidelity than Llama.
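A severity-specific prompt template could be assembled along the following lines. This is only a sketch: the rule wording, the template text, and the `build_prompt` helper are hypothetical and are not the prompts used in the study. No model is called; the resulting string would be passed to whichever inference library hosts the model, for example `build_prompt("severe", "Describe your morning routine.")`.

```python
# Hypothetical severity rules; the study's actual prompt wording is not reproduced here.
SEVERITY_RULES = {
    "mild": "Use mostly complete sentences with occasional word-finding pauses.",
    "moderate": "Use short sentences, frequent fillers, and some word substitutions.",
    "severe": "Use fragments of one to three words with many fillers and substitutions.",
    "very_severe": "Use isolated words and fillers with almost no sentence structure.",
}

# Hypothetical template: one shared frame, filled in per severity level and task.
PROMPT_TEMPLATE = (
    "You are simulating the speech of a person with {severity} aphasia.\n"
    "Rules: {rules}\n"
    "Task: {task}\n"
    "Transcript:"
)


def build_prompt(severity, task):
    """Return the prompt text for one severity level and one discourse task."""
    if severity not in SEVERITY_RULES:
        raise ValueError(f"unknown severity level: {severity!r}")
    return PROMPT_TEMPLATE.format(
        severity=severity.replace("_", " "),
        rules=SEVERITY_RULES[severity],
        task=task,
    )
```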
| Feature | Mistral 7b Instruct | Llama 3.1 8b Instruct |
|---|---|---|
| Model Type | Instruction-tuned, autoregressive LLM | Instruction-tuned, autoregressive LLM |
| Parameter Size | 7 Billion | 8 Billion |
| Key Augmentations | Prompt-engineered severity rules, non-determinism | Prompt-engineered severity rules, non-determinism |
| Performance (Overall Realism) | Moderate ecological fidelity, realistic directional changes in NDW, word count, word length | Over-generates lexical diversity, inconsistent with aphasic language |
Preliminary results show that Mistral 7b Instruct best captures the key aspects of linguistic degradation observed in human aphasia, including realistic directional changes in Number of Different Words (NDW), word count, and word length across severity levels.
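The three measures named above can be computed from a transcript with a few lines of Python. The sketch below assumes a simple lowercase, letters-only tokenizer and counts NDW over surface word forms; the study may tokenize or lemmatize differently, and `transcript_metrics` is a hypothetical helper name.

```python
import re


def transcript_metrics(text):
    """Compute word count, Number of Different Words (NDW), and mean word length."""
    # Assumed tokenizer: lowercase alphabetic words, allowing one internal apostrophe.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    count = len(words)
    return {
        "word_count": count,
        "ndw": len(set(words)),  # distinct surface forms, no lemmatization
        "mean_word_length": sum(len(w) for w in words) / count if count else 0.0,
    }


if __name__ == "__main__":
    print(transcript_metrics("The dog, the dog runs."))
```

Running the same function over synthetic and human transcripts at each severity level gives a like-for-like comparison of how the measures change.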
Future work will focus on expanding the dataset, expert validation, extending to diverse discourse tasks, enhancing privacy, and generating data for underrepresented languages, aiming for a hybrid dataset of human and synthetic data for robust ML model training.
Expanding Synthetic Data Generation
- Create a larger dataset and fine-tune models for better aphasic representation.
- Have SLPs assess the realism and usefulness of the synthetic transcripts.
- Extend methods to other discourse tasks (e.g., narrative storytelling like Cinderella).
- Investigate model inversion attacks to ensure data privacy.
- Generate synthetic transcripts for low-resource languages in AphasiaBank.
Our Implementation Roadmap
A phased approach to integrate synthetic data generation into your AI strategy seamlessly.
Phase 01: Discovery & Strategy
Comprehensive analysis of existing data challenges and definition of synthetic data requirements and objectives.
Phase 02: Model Development & Customization
Tailoring synthetic data generation models (procedural or LLM-based) to align with specific linguistic patterns and data needs.
Phase 03: Data Generation & Validation
Producing synthetic datasets across various aphasia severities and validating their realism with SLP experts.
Phase 04: Integration & Training
Integrating synthetic data into existing ML pipelines and training AI systems for improved performance and robustness.
Phase 05: Monitoring & Optimization
Continuous evaluation of the synthetic data's impact and iterative refinement for ongoing accuracy and utility.