Disseminating Innovation: SYNTHIA’s Scientific Publications & Results

Explore our scientific results to date, with new entries added as the project progresses. 

To ensure open access and wide dissemination, SYNTHIA deliverables, scientific publications and other documentation are also available on our Zenodo community page >


Generation of Multimodal Longitudinal Synthetic Data By Artificial Intelligence to Improve Personalized Medicine in Hematology

  • SYN-Y1-2024-001: Abstract [D'Amico et al., Humanitas Research Hospital]
  • Blood, journal of the American Society of Hematology, November 2024. Conference: ASH2024, American Society of Hematology Annual Meeting. Read here >
  • This study, conducted within the GenoMed4All and Synthema consortia, presents an advanced framework for generating and validating high-fidelity multimodal synthetic data (SD) for patients with myeloid neoplasms. Using a combination of conditional GANs, VAEs, Tabular-GPT, a fine-tuned LLM for longitudinal data, and Stable Diffusion for bone marrow image generation, the authors produced synthetic datasets that mirror complex real-world clinical, genomic, transcriptomic, and morphological information. A dedicated Synthetic Validation Framework (SVF) demonstrated strong statistical, biological, and clinical fidelity across all data types, with fidelity metrics ranging from 87% to 96%. Synthetic transcriptomes preserved key molecular patterns and pathway enrichments, while longitudinal SD accurately reproduced overall and leukemia-free survival distributions. Privacy assessments confirmed low re-identification risk. The study further showed that SD can strengthen machine-learning applications: models trained on synthetic or hybrid (real + SD) datasets achieved comparable or improved performance in disease classification and prognostic prediction. A clinician-friendly platform, JUNO, was also developed to generate multimodal SD from biobank data. Overall, the findings demonstrate that generative AI can produce privacy-preserving, clinically meaningful multimodal synthetic datasets that enhance predictive modelling and have the potential to accelerate research and personalized medicine in hematology.

A Comprehensive, Artificial Intelligence, Digital Twin Platform Based on Multimodal Real-World Data Integration for Personalized Medicine in Hematology

  • SYN-Y1-2024-002: Abstract [D'Amico et al., Humanitas Research Hospital]
  • Blood, journal of the American Society of Hematology, November 2024. Conference: ASH2024, American Society of Hematology Annual Meeting. Read here >
  • This study introduces GEMINI, an AI-driven Digital Twin (DT) platform designed to support personalized medicine in myelodysplastic syndromes (MDS). By leveraging federated learning and synthetic data technologies from the GenoMed4All and Synthema consortia, the project integrated comprehensive multimodal data—including clinical records, genomics, imaging, treatments, longitudinal outcomes, and patient-reported measures—from 22,080 MDS patients across multiple international and national cohorts, all in a privacy-preserving manner. Data were harmonized in a unified DataLake, and a Retrieval-Augmented Generation system using a large language model enabled extraction and summarization of information from complex unstructured medical documents. Within GEMINI, several AI models work together to classify patients, analyze genomic dependencies, estimate survival and leukemic evolution risks, and predict treatment responses. A multi-state Markov model further supports decision-making around optimal therapy timing, including transplantation, while additional tools simulate quality of life and patient-reported outcomes. GEMINI, accessible for research use through a public interface, allows clinicians to explore detailed simulations of disease trajectories and individualized treatment strategies. The platform demonstrates how Digital Twins can integrate diverse data sources securely and effectively, offering a powerful tool to advance precision medicine in hematology.

A scoping review of privacy and utility metrics in medical synthetic data

  • SYN-Y1-2025-003: Journal publication [Kaabachi et al., Lausanne University Hospital]
  • npj Digital Medicine, January 2025. Read here >
  • The use of synthetic data is a promising solution to facilitate the sharing and reuse of health-related data beyond its initial collection while addressing privacy concerns. However, there is still no consensus on a standardized approach for systematically evaluating the privacy and utility of synthetic data, impeding its broader adoption. In this work, we present a comprehensive review and systematization of current methods for evaluating synthetic health-related data, focusing on both privacy and utility aspects. Our findings suggest that there are a variety of methods for assessing the utility of synthetic data, but no consensus on which method is optimal in which scenario. Moreover, we found that most studies included in this review do not evaluate the privacy protection provided by synthetic data, and those that do often significantly underestimate the risks.
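
For context, one family of utility metrics the review covers compares the marginal distributions of real and synthetic variables. Below is a minimal sketch using the two-sample Kolmogorov-Smirnov statistic; this is an illustrative choice, not a metric the review endorses as optimal.

```python
import bisect

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the empirical CDFs of the two samples."""
    a, b = sorted(a), sorted(b)

    def ecdf(sample, x):
        # Fraction of sample values <= x.
        return bisect.bisect_right(sample, x) / len(sample)

    # The supremum over the real line is attained at sample points.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

# Toy data: a real column and a synthetic column shifted by 0.5.
real = [1, 2, 3, 4, 5, 6, 7, 8]
synthetic = [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5]
score = 1 - ks_statistic(real, synthetic)  # 1.0 would mean identical marginals
print(score)  # 0.875
```

As the review notes, no single score like this settles utility; a full evaluation combines marginal, joint, and downstream-task metrics.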

Assessment of metadata descriptors of AI-ready datasets

  • SYN-Y1-2025-004: Abstract [Bolleman et al., Swiss Institute of Bioinformatics]
  • Scilit, April 2025. Conference: SWAT4HCLS 2025, Semantic Web Applications and Tools for Health Care and Life Sciences. Read here >
  • To advance the use of machine learning to address humanity’s grand challenges, such as understanding disease conditions and biodiversity loss in the Anthropocene, it is important to promote FAIR AI-ready datasets, since data scientists and bioinformaticians spend 80% of their time on data finding and preparation. Metadata descriptors for datasets are pivotal for the creation of machine learning models, as they facilitate the definition of strategies for data discovery, feature selection, data cleaning, and data pre-processing. ML-ready datasets, whether by design or after pre-processing, can be enriched with metadata so they become FAIRer, i.e., autonomously discoverable and processable by machines (machine-actionable). Croissant ML is an extension of schema.org to better describe ML-ready datasets, released in early 2024 and already adopted by some ML-model platforms such as Hugging Face (see the Croissant ML viewer documentation) and OpenML. However, as commonly happens with metadata, there are limits to how much metadata can be automatically extracted. How much Croissant metadata can be programmatically extracted from ML-ready datasets? And how could this automation be improved? In this project, we explored answers to these two questions.

SynBT: High-quality Tumor Synthesis for Breast Tumor Segmentation by 3D Diffusion Model 

  • SYN-Y1-2025-005: Publication [Yang et al., GE Healthcare] 
  • arXiv, June 2025. Read here >
  • Synthetic tumors in medical images offer controllable characteristics that facilitate the training of machine learning models, leading to improved segmentation performance. However, existing tumor synthesis methods yield suboptimal performance when the tumor occupies a large spatial volume, as in breast tumor segmentation in MRI with a large field-of-view (FOV), since commonly used tumor generation methods are based on small patches. In this paper, we propose a 3D medical diffusion model, called SynBT, to generate high-quality breast tumors (BT) in contrast-enhanced MRI images. The proposed model consists of a patch-to-volume autoencoder, which compresses high-resolution MRIs into a compact latent space while preserving the resolution of volumes with large FOV. Using the obtained latent feature vector, a mask-conditioned diffusion model synthesizes breast tumors within selected regions of breast tissue, resulting in realistic tumor appearances. We evaluated the proposed method on a tumor segmentation task, demonstrating that the proposed high-quality tumor synthesis method improves common segmentation models by 2-3% Dice score on a large public dataset, and therefore provides benefits for tumor segmentation in MRI images.
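
The reported gains are in Dice score, the standard overlap metric for segmentation masks. For reference, a minimal sketch of how it is computed on flat binary masks:

```python
def dice_score(pred, truth):
    """Dice coefficient between two binary masks, given as flat
    sequences of 0/1: 2*|A ∩ B| / (|A| + |B|)."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Two empty masks are, by convention here, a perfect match.
    return 2 * intersection / total if total else 1.0

pred  = [0, 1, 1, 1, 0, 0]
truth = [0, 1, 1, 0, 1, 0]
print(dice_score(pred, truth))  # 2*2 / (3+3) = 0.666...
```

A 2-3% improvement in this score on a large dataset is a meaningful gain for tumors with large spatial extent.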

An ELIXIR scoping review on domain-specific evaluation metrics for synthetic data in life sciences

  • SYN-Y1-2025-006: Publication [Fragkouli et al., The Centre for Research & Technology Hellas]
  • arXiv, June 2025. Read here >
  • Synthetic data has emerged as a powerful resource in life sciences, offering solutions for data scarcity, privacy protection and accessibility constraints. By creating artificial datasets that mirror the characteristics of real data, synthetic data allows researchers to develop and validate computational methods in controlled environments. Despite its promise, the adoption of synthetic data in life sciences hinges on rigorous evaluation metrics designed to assess its fidelity and reliability. To explore the current landscape of synthetic data evaluation metrics in several life-science domains, the ELIXIR Machine Learning Focus Group performed a systematic review of the scientific literature following the PRISMA guidelines. Six critical domains were examined to identify current practices for assessing synthetic data. Findings reveal that, while generation methods are rapidly evolving, systematic evaluation is often overlooked, limiting researchers' ability to compare, validate, and trust synthetic datasets across different domains. This systematic review underscores the urgent need for robust, standardized evaluation approaches that not only bolster confidence in synthetic data but also guide its effective and responsible implementation. By laying the groundwork for establishing domain-specific yet interoperable standards, this scoping review paves the way for future initiatives aimed at enhancing the role of synthetic data in scientific discovery, clinical practice and beyond.

Deep Survival Analysis in Multimodal Medical Data: A Parametric and Probabilistic Approach with Competing Risks

  • SYN-Y1-2025-007: Publication [Garrido et al., Universidad Politécnica de Madrid] 
  • arXiv, July 2025. Read here >
  • Accurate survival prediction is critical in oncology for prognosis and treatment planning. Traditional approaches often rely on a single data modality, limiting their ability to capture the complexity of tumor biology. To address this challenge, we introduce a multimodal deep learning framework for survival analysis capable of modeling both single and competing risks scenarios, evaluating the impact of integrating multiple medical data sources on survival predictions. We propose SAMVAE (Survival Analysis Multimodal Variational Autoencoder), a novel deep learning architecture designed for survival prediction that integrates six data modalities: clinical variables, four molecular profiles, and histopathological images. SAMVAE leverages modality-specific encoders to project inputs into a shared latent space, enabling robust survival prediction while preserving modality-specific information. Its parametric formulation enables the derivation of clinically meaningful statistics from the output distributions, providing patient-specific insights that contribute to more informed clinical decision-making and establish a foundation for interpretable, data-driven survival analysis in oncology. We evaluate SAMVAE on two cancer cohorts, breast cancer and lower-grade glioma, applying tailored preprocessing, dimensionality reduction, and hyperparameter optimization. The results demonstrate the successful integration of multimodal data for both standard survival analysis and competing risks scenarios across different datasets. Our model achieves competitive performance compared to state-of-the-art multimodal survival models. Notably, this is the first parametric multimodal deep learning architecture to incorporate competing risks while modeling continuous time to a specific event, using both tabular and image data.
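
The practical benefit of a parametric formulation is that clinically meaningful statistics follow in closed form from the predicted distribution parameters. As a hedged illustration only (a Weibull time-to-event head is assumed here for concreteness; it is not necessarily the distribution SAMVAE uses):

```python
import math

def weibull_survival(t, shape, scale):
    """Weibull survival function S(t) = exp(-(t/scale)**shape)."""
    return math.exp(-((t / scale) ** shape))

def weibull_median_survival(shape, scale):
    """Median survival: solve S(t) = 0.5  ->  t = scale * ln(2)**(1/shape)."""
    return scale * math.log(2) ** (1.0 / shape)

# Illustrative per-patient parameters (time in months); not values from the paper.
median = weibull_median_survival(shape=1.5, scale=24.0)
assert abs(weibull_survival(median, 1.5, 24.0) - 0.5) < 1e-9
```

The same closed-form reasoning extends to other quantiles and to expected survival time, which is what makes the outputs directly interpretable at the bedside.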

Navigating Opportunities and Challenges in Synthetic Data Generation for Biomedicine: Insights from the SYNTHIA Project

  • SYN-Y1-2025-008: Abstract [Cirillo & Alonso de Apellaniz, Barcelona Supercomputing Center], July 2025
  • Conference: ISMB/ECCB 2025. Read here >
  • Synthetic Data Generation (SDG) is rapidly emerging as a key technology in biomedicine, enabling privacy-preserving datasets that reflect real-world data while addressing ethical and regulatory barriers. The IHI SYNTHIA project is building validated tools, methods, and an open platform to generate and evaluate high-quality synthetic healthcare data across diverse data types and diseases, supporting AI development and biomedical research without compromising privacy or utility. As part of its foundational work, SYNTHIA conducted a comprehensive review of SDG methodologies in the biomedical context, covering statistical, machine learning-based, and simulation-driven approaches across tabular, textual, imaging, signaling, sequencing, spatial, and multimodal data. The review highlights opportunities such as accelerating machine learning and improving reproducibility, while identifying challenges in fidelity, clinical relevance, FAIR principles, and regulatory compliance. This poster presents the key findings and positions SYNTHIA as a leading voice in advancing ethical and innovative synthetic data use.

Unifying Re-Identification, Attribute Inference, and Data Reconstruction Risks in Differential Privacy

  • SYN-Y1-2025-009: Publication [Kulynych et al., Lausanne University Hospital]
  • Published at NeurIPS 2025, July 2025. Read here >
  • Differential privacy (DP) is widely recognized as the gold standard for safeguarding personal data. However, it often leaves key stakeholders unsure about the actual privacy risks and can reduce the usefulness of data. This study tackles both challenges by proposing a unified, intuitive framework to interpret DP risks using real-world concepts like re-identification and inference. The research team also enhances the practical utility of DP methods through an innovative calibration approach that balances privacy with performance. Their open-source tools are compatible with popular synthetic data generation methods, including DP-SGD, MST, and AIM, paving the way for safer and more usable data sharing in healthcare and beyond.
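
The paper's central move, translating the abstract DP parameter ε into operational attack-success guarantees, can be previewed with the textbook posterior-odds bound for pure ε-DP. This is a simplified sketch of one special case, not the paper's actual framework:

```python
import math

def posterior_bound(prior, epsilon):
    """Upper bound on an adversary's posterior membership probability under
    pure epsilon-DP: posterior odds <= exp(epsilon) * prior odds."""
    odds = math.exp(epsilon) * prior / (1 - prior)
    return odds / (1 + odds)

# With a 1% prior belief that a record is in the data, eps = 1.0
# caps the adversary's posterior at roughly 2.7%.
print(round(posterior_bound(0.01, 1.0), 4))
```

Expressing privacy budgets as bounds like this (re-identification, attribute inference, reconstruction) is what makes the risks legible to non-expert stakeholders, which is the gap the paper targets.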

DeCaFlow: A Deconfounding Causal Generative Model

  • SYN-Y1-2025-010: Publication [Almodovar et al., Universidad Politecnica de Madrid] 
  • Published at NeurIPS 2025 (spotlight distinction), July 2025. Read here >
  • This paper introduces DeCaFlow, a deconfounding causal generative model. Trained once per dataset using just observational data and the underlying causal graph, DeCaFlow enables accurate causal inference on continuous variables in the presence of hidden confounders. Specifically, the authors extend previous results on causal estimation under hidden confounding to show that a single instance of DeCaFlow provides correct estimates for all causal queries identifiable with do-calculus, leveraging proxy variables to adjust for causal effects when do-calculus alone is insufficient. Moreover, the authors show that counterfactual queries are identifiable as long as their interventional counterparts are, and thus are also correctly estimated by DeCaFlow. Empirical results on diverse settings, including the Ecoli70 dataset with 3 independent hidden confounders, tens of observed variables, and hundreds of causal queries, show that DeCaFlow outperforms existing approaches, while demonstrating its out-of-the-box applicability to any given causal graph.

Characterization and Clinical Implications of p53 Dysfunction in Patients With Myelodysplastic Syndromes

  • SYN-Y1-2025-011: Journal article [Zampini et al., Humanitas Research Hospital] 
  • Journal of Clinical Oncology, May 2025. Read here >
  • This study investigates both mutational and nonmutational mechanisms of p53 dysfunction in myelodysplastic syndromes (MDS) using a large cohort of 6,204 patients, complemented by multiomic analyses and an independent validation set. The findings confirm that biallelic TP53 inactivation is a major driver of disease progression and identify patients at very high risk, independent of variant allele frequency. Monoallelic and biallelic alterations appear to represent sequential disease stages, offering insights into when therapeutic interventions may be most effective. Importantly, the study identifies a previously unrecognized subgroup of MDS, about 5% of patients, who are TP53 wild-type but show abnormal overexpression of p53 protein. These cases exhibit upstream pathway aberrations (including PI3K, RAS, WNT, NF-κB, and MDM2 amplification) along with downstream dysregulation of p53 targets, indicating nonmutational mechanisms of p53 dysfunction and an equally poor prognosis. Across all forms of p53 dysfunction, the authors observe a consistent pattern of immune dysregulation, including inflammatory myeloid activation and impaired antigen presentation, suggesting new avenues for immunotherapy. Overall, the study shows that recognizing both genetic and non-genetic p53 dysfunction can refine risk assessment, guide treatment decisions, and support a more mechanistic classification of MDS beyond traditional molecular categories.

A Benchmark of Large Language Models for Semantic Harmonization of Alzheimer's Disease Cohorts

  • SYN-Y1-2025-014: Journal publication [Adams et al., Fraunhofer]
  • The Journal of Prevention of Alzheimer's Disease, January 2026. Read here >
  • The study addresses the challenge of harmonizing heterogeneous healthcare datasets, where inconsistent variable naming limits scalable multi-cohort Alzheimer's disease research. Because manual harmonization is resource-intensive, the authors assess whether modern text-embedding models can support this task. They develop a new benchmark that tests five state-of-the-art embedding models across seven Alzheimer’s disease datasets by mapping cohort metadata to a Common Data Model, using only semantic descriptions of clinical, lifestyle, demographic, and imaging variables. Results show that models performing well on general benchmarks do not necessarily excel in real-world clinical harmonization, highlighting the need for domain-specific evaluation. The authors also provide guidelines for metadata formatting and release an open-source library and interactive leaderboard to support ongoing benchmarking. The work emphasizes the importance of tailored standards to enable semi-automated clinical data harmonization.
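
The benchmark's core operation, matching a cohort variable description to a Common Data Model concept by embedding similarity, reduces to a nearest-neighbour search in embedding space. A toy sketch with hand-made three-dimensional vectors standing in for a real embedding model's output (the variable and concept names below are invented for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy embeddings of variable/concept descriptions (hypothetical values).
cohort_vars = {
    "age_at_baseline": [0.9, 0.1, 0.0],
    "mmse_total":      [0.1, 0.8, 0.2],
}
cdm_concepts = {
    "Age":        [1.0, 0.0, 0.0],
    "MMSE score": [0.0, 1.0, 0.1],
}

def best_match(var_vec, concepts):
    """Map a variable to the CDM concept with the most similar embedding."""
    return max(concepts, key=lambda name: cosine(var_vec, concepts[name]))

mapping = {v: best_match(vec, cdm_concepts) for v, vec in cohort_vars.items()}
print(mapping)  # {'age_at_baseline': 'Age', 'mmse_total': 'MMSE score'}
```

The benchmark's finding is precisely that the quality of this mapping depends heavily on which embedding model produces the vectors, and that general-purpose leaderboard rankings do not predict clinical-harmonization performance.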

Synth4bench: Synthetic Data Generation for Benchmarking Tumor-Only Somatic Variant Calling Algorithms

  • SYN-Y2-2025-015: Publication [Fragkouli et al., The Centre for Research & Technology Hellas] 
  • bioRxiv, October 2025. Read here >
  • Somatic variant calling lacks high-quality ground truth datasets, making tool evaluation difficult. To address this, the authors developed synth4bench, a synthetic data pipeline that generates controlled benchmarking datasets. Using these data, they systematically evaluated five tumor-only variant callers (Mutect2, FreeBayes, VarDict, VarScan2, LoFreq) across varying sequencing conditions. The results show substantial inconsistencies between callers and a strong dependence on sequencing depth and read length. Indels remain the most challenging variants, particularly at low allele frequencies. Caller performance reflected underlying algorithmic choices: the most robust tools showed superior precision in allele frequency estimates, while the most sensitive maximized true-positive detection. The weakest performer displayed systematic errors and the lowest overall accuracy. Overall, no single caller fits all scenarios; optimal sequencing design and careful tool selection are essential. The variability observed also indicates that current algorithms still fall short of fully modeling the complexity of mutational processes.
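
Once a synthetic ground truth exists, per-caller scoring reduces to set comparison of called versus spiked-in variants. A minimal sketch of the precision/recall computation (illustrative only; the synth4bench pipeline itself operates on BAM/VCF files with dedicated tooling):

```python
def evaluate_caller(called, truth):
    """Precision and recall of a caller's variant set against a synthetic
    ground truth. Variants are (chrom, pos, ref, alt) tuples."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)  # variants both called and spiked in
    precision = tp / len(called) if called else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

# Toy example: three spiked-in variants, two calls, one correct.
truth  = {("chr1", 100, "A", "T"), ("chr1", 250, "G", "C"), ("chr2", 40, "AT", "A")}
called = {("chr1", 100, "A", "T"), ("chr1", 300, "C", "G")}
print(evaluate_caller(called, truth))  # (0.5, 0.333...)
```

Repeating this over depths, read lengths, and allele-frequency bins is what exposes the depth and indel sensitivities the study reports.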

Federated Learning and Medical Device Regulation: Bridging Gaps in Healthcare AI Governance

  • SYN-Y2-2025-016: Conference Paper [Hernández-Peñaloza et al., Universidad Politécnica de Madrid]
  • Read here > 
  • The rapid advancement of artificial intelligence (AI) in healthcare has opened a new application area for federated learning (FL) platforms, which enable model training across decentralized datasets while preserving privacy and avoiding direct data sharing. Although this approach holds great potential for clinical applications, its regulatory status remains ambiguous. Under current regulatory frameworks, such as the EU Medical Device Regulation (MDR 2017/745) or the Federal Food, Drug, and Cosmetic Act (FD&C Act) in the US, it is unclear whether FL platforms, or the AI models they generate, qualify as medical devices, and what the associated implications might be. This paper examines the regulatory framework, beginning with an assessment of whether FL platforms may be classified as medical devices based on their functionality, intended purpose, and impact within clinical environments. It analyses key regulatory criteria, including specific medical intent, data security, trustworthiness, traceability, and usability, and examines challenges specific to FL, such as traceability, validation in decentralized settings, and accountability for model outputs. The authors perform a regulatory assessment of a real-world FL platform deployed in a healthcare context, identifying gaps and grey areas in the current legislation. This analysis aims to provide technical and regulatory insights for developers, regulators, and healthcare providers, and offers recommendations to guide future adaptations of medical device regulations for distributed AI systems.

SAFE: A multimodal, scalable and clinically-oriented comprehensive framework for synthetic data validation in hematology 

  • SYN-Y2-2025-017: Publication [Iascone et al., IRCCS Humanitas Research Hospital]
  • Blood, journal of the American Society of Hematology, November 2025. Conference: ASH2025. Read here >
  • The paper introduces SAFE (Synthetic vAlidation FramEwork), a comprehensive system for evaluating multimodal Synthetic Data (SD) in terms of statistical fidelity, clinical utility, and privacy. Motivated by the growing use of Generative AI in applications such as digital twins and synthetic control arms—and the lack of standardized validation tools—the authors focus on hematology, where large-scale multimodal data are critical for advancing personalized care in Myeloid Neoplasms (MN). SAFE is applied to SD generated from the large TITAN cohort (n=20,054), covering clinical, genomic, transcriptomic, and histopathological image data. Developed within the SYNTHEMA and SYNTHIA consortia, the framework includes modular analyses for tabular, longitudinal, and imaging data, along with an RNA-seq validation pipeline and clinically driven evaluation using the MOSAIC framework. The synthetic cohort closely reflects real disease stratification, achieving high fidelity and utility scores (CSF 91%, GSF 88%, CSU 90.2%, TSF 88%, PSS 86%) and an overall SAFE score of 89%. Synthetic data performed comparably to real data in patient stratification, prognostic scoring, survival analysis, and feature distributions, while maintaining low re-identification risk. The authors conclude that SAFE offers a robust, disease-specific approach to validating SD and can support trustworthy clinical research and future regulatory adoption of AI-generated evidence in hematology.

Development and validation of synthetic data generation over a federated learning computing framework to accelerate innovation and boost personalized medicine in hematological diseases

  • SYN-Y2-2025-018: Abstract [Asti et al., IRCCS Humanitas Research Hospital]
  • Blood, journal of the American Society of Hematology, November 2025. Conference: ASH2025. Read here >
  • This study evaluates whether federated learning (FL) combined with generative AI can produce high-fidelity synthetic data (SD) for rare hematological diseases while preserving patient privacy. Using a multi-institutional simulation on a myelodysplastic syndromes (MDS) cohort of 4,427 patients distributed across three sites, the authors trained several generative models (CTGAN, Bayesian Networks, VAE-BGM) under multiple FL strategies and assessed data quality using the SAFE validation framework. FL-generated SD closely matched real data, achieving high statistical and clinical fidelity (CSF 0.942; GSF 0.902 by round 5), comparable to centralized training and superior to isolated node training. Privacy was consistently protected, with NNDR metrics showing low re-identification risk. Clinical utility was confirmed through preserved genomic associations, mutation frequencies, and survival patterns, supporting applications such as risk stratification and biomarker discovery. Overall, the study shows that federated SD generation enables secure, scalable, and high-quality data synthesis across institutions without data sharing, offering a promising solution for collaborative research and precision medicine in rare hematological diseases.
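
The NNDR privacy metric cited above is commonly computed, for each synthetic record, as the ratio of its distance to the nearest real record over its distance to the second-nearest; values near 1 suggest the record does not sit suspiciously close to any single real individual. A minimal sketch on toy 2-D points (the actual SAFE implementation may differ in detail):

```python
import math

def nndr(synthetic, real):
    """Mean nearest-neighbour distance ratio of synthetic records
    with respect to real ones: d(nearest real) / d(second-nearest real)."""
    ratios = []
    for s in synthetic:
        dists = sorted(math.dist(s, r) for r in real)
        if dists[1] > 0:
            ratios.append(dists[0] / dists[1])
    return sum(ratios) / len(ratios)

# Toy data: real records at the unit-square corners.
real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = [(0.5, 0.5), (0.9, 0.1)]  # second point hugs one real record
print(round(nndr(synthetic, real), 3))  # 0.578
```

A synthetic point equidistant from several real records scores 1.0, while one that shadows a single real record drags the mean down, which is why low NNDR flags re-identification risk.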

AI-based, secure and privacy-preserving synthetic data generation platform in transfusion-dependent β-thalassemia applied to the Webthal® dataset

  • SYN-Y2-2025-019: Abstract [Delleani et al., IRCCS Humanitas Research Hospital]
  • Blood, journal of the American Society of Hematology, November 2025. Conference: ASH2025. Read here >
  • This study addresses major barriers to applying synthetic data (SD) in clinical practice—privacy risks, reliance on external models, and limited clinical validation—by implementing a secure, locally deployed SD generation platform for transfusion-dependent β-thalassemia (TDT). Using the TRAIN platform integrated within the privacy-preserving Webthal® environment, the authors trained a CT-WGAN model on a real-world cohort of 779 adults and evaluated synthetic outputs with the SAFE framework to assess fidelity, clinical utility, and privacy. The synthetic cohort closely mirrored real patient distributions and reproduced key clinical findings, including survival outcomes and the association between pre-transfusion hemoglobin levels and mortality. High fidelity (CSF≈0.90–0.91) and strong privacy protection (NNDR≈0.81–0.84) were observed across three scenarios: a 1:1 proxy dataset, an augmented dataset twice the original size, and a conditionally generated dataset tailored to specific clinical characteristics. Augmentation increased statistical power without compromising quality, and conditional generation enabled flexible cohort simulation. Overall, the work demonstrates that secure, clinically validated SD can be generated even for rare diseases, effectively supporting research, digital twin development, and synthetic control arms while eliminating data-sharing barriers and advancing precision hematology.

Processing of synthetic data in AI development for healthcare and the definition of personal data in EU law

  • SYN-Y2-2025-020: Publication [Binz Vallevik et al., DNV]
  • arXiv, August 2025. Read here >
  • Artificial intelligence (AI) has the potential to transform healthcare, but it requires access to health data. Synthetic data, generated through machine learning models trained on real data, offers a way to share data while preserving privacy. However, uncertainties in the practical application of the General Data Protection Regulation (GDPR) create an administrative burden, limiting the benefits of synthetic data. Through a systematic analysis of relevant legal sources and an empirical study, this article explores whether synthetic data should be classified as personal data under the GDPR. The study investigates the residual identification risk by generating synthetic data and simulating inference attacks, challenging common perceptions of technical identification risk. The findings suggest synthetic data is likely anonymous, depending on certain factors, but highlight uncertainties about what constitutes a reasonably likely risk. To promote innovation, the study calls for clearer regulations to balance privacy protection with the advancement of AI in healthcare.

Should I use Synthetic Data for That? An Analysis of the Suitability of Synthetic Data for Data Sharing and Augmentation

  • SYN-Y2-2026-021: Publication [Kulynych et al., Lausanne University Hospital]
  • arXiv, February 2026. Read here >
  • Recent advances in generative modelling have led many to see synthetic data as the go-to solution for a range of problems around data access, scarcity, and under-representation. In this paper, we study three prominent use cases: (1) Sharing synthetic data as a proxy for proprietary datasets to enable statistical analyses while protecting privacy, (2) Augmenting machine learning training sets with synthetic data to improve model performance, and (3) Augmenting datasets with synthetic data to reduce variance in statistical estimation. For each use case, we formalise the problem setting and study, through formal analysis and case studies, under which conditions synthetic data can achieve its intended objectives. We identify fundamental and practical limits that constrain when synthetic data can serve as an effective solution for a particular problem. Our analysis reveals that due to these limits many existing or envisioned use cases of synthetic data are a poor problem fit. Our formalisations and classification of synthetic data use cases enable decision makers to assess whether synthetic data is a suitable approach for their specific data availability problem.