Improving detection, treatment monitoring, and outcomes prediction in lung cancer with synthetic imaging and clinical data
Lung cancer (LC), especially non-small cell lung cancer (NSCLC), remains a major global health challenge with high incidence and mortality. LC is the primary contributor to cancer-related fatalities in Europe, accounting for 20% of all cancer deaths. Despite medical advances, outcomes are poor due to late diagnosis, lack of reliable prognostic biomarkers, and frequent recurrence after surgery. The SYNTHIA lung cancer use case highlights how Synthetic Data Generation (SDG) can address these challenges by creating high-quality, privacy-preserving datasets that mimic real patient data for prediction, treatment monitoring, and outcome modelling.
The Challenge
Large-scale clinical datasets are crucial for developing effective predictive models, but they are often scarce or inaccessible because of privacy and ethical constraints. In addition, the absence of standardized early diagnostic and prognostic tools limits personalized treatment decisions. Together, these gaps highlight the urgent need for comprehensive research, early detection strategies, and innovative treatments to address this significant public health challenge.
Our Research Questions
- Can we predict tumour aggressiveness, and therefore response to different treatment regimens, from CT and EMR data using models trained on synthetic data (SD)?
- Does training on SD improve model performance on tasks such as overall survival prediction, progression-free survival prediction, relapse prediction, advanced and density-based clustering, and dimensionality reduction, and does it help in testing new SDG methods?
- Can synthetic nodules improve model performance?
Our Approach
In this use case, a systematic strategy is applied for the augmentation, deidentification, and enhancement of PET-CT and CT imaging data, combined with Electronic Medical Records (EMRs). The research aims to:
- Predict tumour aggressiveness and treatment response from CT and EMR data using models trained on SD.
- Improve performance in predicting survival, progression-free survival, and relapse, as well as in clustering, and test new SDG methods.
- Develop synthetic lung nodules, synthetic 2D fluoroscopy, and synthetic normal-dose CT images from low-dose CT.
The main data modalities are imaging (CT scans and aligned PET-CT scans), textual data (clinical reports and annotations), and EMRs.
Envisioned Impact
The lung cancer use case is expected to demonstrate the value of synthetic data both in models that predict disease aggressiveness and in disease simulations that support detection and segmentation tasks. It will enable radiogenomic studies that link imaging-based radiomic patterns to molecular profiles, opening new avenues for personalized treatment. Additionally, it aims to improve lesion detection and support more effective monitoring of disease progression and treatment response. A key contribution will be the generation of synthetic external control arms for clinical trials, helping to accelerate decision-making processes for regulators, health technology assessment bodies, providers, and patients.
Future Outlook
By combining synthetic radiology with immunological and biomarker data, SYNTHIA seeks to improve predictive accuracy while reducing dependency on sensitive patient images. The work aims to improve early diagnosis, personalized therapy selection, and outcome prediction in lung cancer, advancing both clinical research and real-world healthcare delivery.
Use Case Leadership
Academic Lead:
Oscar Jose Juan Vidal
Health Research Institute La Fe
Industry Lead:
Amied Shadmaan
GE Healthcare