SYNTHIA Insight 8: Testing Synthetic Data Where It Matters Most: From Rare Disease Data to Real-World Clinical Research

20 June 2026

Welcome to SYNTHIA Insight – a series of focused content pieces that bring the science behind SYNTHIA to life. Through conversations with our project partners and contributors, we explore the ideas, innovations, and expertise shaping the future of synthetic data in healthcare. Each edition offers an accessible window into the objectives of SYNTHIA and the progress of our work, helping to engage the wider community and promote understanding of the project's mission and impact.

In SYNTHIA Insight No. 8, we highlight the publication "AI-based, secure and privacy-preserving synthetic data generation platform in transfusion-dependent β-thalassemia applied to the Webthal® dataset" by Mattia Delleani, Founding Engineer at TRAIN, published in Blood, the journal of the American Society of Hematology.

Why synthetic data needs to work in real clinical settings

Synthetic data has become one of the most promising approaches for enabling healthcare research while protecting patient privacy. Yet despite significant advances in artificial intelligence and generative modelling, applying synthetic data in real clinical environments remains challenging. Many existing approaches require sensitive patient information to be uploaded to external platforms or cloud-based systems. At the same time, not all generative models adequately capture the complex relationships that exist between clinical variables, raising important questions about their reliability for healthcare decision-making. These challenges become even more significant in rare diseases, where patient populations are small and access to high-quality data is limited.

This publication explores how these barriers can be addressed through a secure and privacy-preserving synthetic data generation platform applied to transfusion-dependent β-thalassemia (TDT), a rare inherited blood disorder. Using the Webthal® registry, the researchers demonstrated how synthetic data can be generated locally within a secure clinical environment while maintaining both clinical relevance and patient privacy.

Addressing three critical challenges

This publication set out to address three key obstacles that continue to limit the broader adoption of synthetic data in healthcare:

The privacy and security concerns associated with sharing sensitive patient information.
The challenge of ensuring that synthetic data accurately reflects complex clinical relationships.
The scarcity of data in rare diseases, where conventional studies often struggle to recruit sufficient patient numbers.

To overcome these challenges, the team implemented an AI-based synthetic data generation platform directly within the secure environment of the Webthal® dataset. This approach allowed synthetic data to be generated without sensitive patient information ever leaving the hospital infrastructure.

The broader goal was to demonstrate that synthetic data can become a practical tool for clinical research, digital twins, and future synthetic control arms for clinical trials.

Why rare diseases are an important testing ground

Rare diseases create a fundamental paradox. Registries such as Webthal® contain decades of expertise and valuable patient outcomes, yet individual centers often manage only 50 to 100 patients. This creates significant challenges for generating statistically robust evidence and conducting prospective studies.
Synthetic data offers an opportunity to amplify existing knowledge while preserving patient privacy. Researchers can use synthetic cohorts to design and pre-test clinical trials, explore new hypotheses, and facilitate collaboration across institutions through privacy-preserving data proxies.

Although β-thalassemia is not one of SYNTHIA's clinical use cases, the work demonstrates how synthetic data generation and validation frameworks can be successfully applied to complex healthcare datasets beyond our core disease areas. In this sense, it serves as an important example of the generalizability of the methods and validation approaches being developed within SYNTHIA.

How the approach works

The TRAIN platform trains a Conditional Tabular Wasserstein Generative Adversarial Network (CT-WGAN) entirely within the secure Webthal® environment, meaning that no sensitive patient data ever leaves the hospital infrastructure.
Rather than reproducing existing records, the model learns the statistical structure of real patient cohorts and generates new synthetic patients that statistically resemble the original population but are constructed entirely by the model rather than copied from any real individual.

To evaluate the quality and trustworthiness of the generated data, the researchers applied the SAFE validation framework. SAFE assesses synthetic datasets across three critical dimensions:

Statistical fidelity, measuring how closely the synthetic data reflects the original cohort.
Clinical utility, evaluating whether real clinical outcomes can be reproduced.
Privacy preservation, demonstrating that individual patients cannot be recovered or re-identified.
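
To make the first of these dimensions concrete, here is a minimal Python sketch of one common way to compare a single variable between a real and a synthetic cohort: the two-sample Kolmogorov-Smirnov statistic. This is an illustrative assumption, not the metric set SAFE actually uses, which is not detailed here. The function name `ks_statistic` and the sample values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    gap = 0.0
    # The gap between two step functions can only peak at an observed value.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


# Hypothetical values for one clinical variable (illustrative only,
# not taken from the Webthal registry).
real_values = [7.1, 8.4, 9.0, 9.6, 10.2, 11.5]
synthetic_values = [7.3, 8.1, 9.2, 9.5, 10.4, 11.1]
print(f"KS statistic: {ks_statistic(real_values, synthetic_values):.3f}")
```

A value near 0 suggests the two marginal distributions are close. A per-variable check like this says nothing about relationships between variables, which is why fidelity is only one of the three dimensions.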

The results showed high clinical fidelity while maintaining strong privacy guarantees. Importantly, the synthetic cohort successfully replicated key clinical outcomes from the original dataset, including Kaplan-Meier survival curves and hazard ratios. No identical matches between real and synthetic records were identified.
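
As a rough illustration of two of the checks mentioned above, the following Python sketch computes a Kaplan-Meier survival estimate (which could be run on both the real and the synthetic cohort and compared) and counts synthetic records that are identical to a real record. It is a simplified sketch under stated assumptions, not the study's code: the function names, the record layout, and all values are hypothetical, and hazard ratios are not covered.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct event time.
    events[i] is 1 if the event was observed at times[i], 0 if censored."""
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        observed = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)
        if observed:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        # Everyone with this time (event or censored) leaves the risk set.
        at_risk -= leaving
    return curve


def exact_matches(real_rows, synthetic_rows):
    """Count synthetic rows that are identical to some real row."""
    real_set = {tuple(row) for row in real_rows}
    return sum(1 for row in synthetic_rows if tuple(row) in real_set)


# Hypothetical follow-up times (years) and event flags, illustrative only.
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
for t, s in curve:
    print(f"t={t}: S(t)={s:.2f}")

# Hypothetical records as (age, sex, comorbidity count) tuples.
real_rows = [(34, "F", 2), (51, "M", 0)]
synthetic_rows = [(34, "F", 1), (47, "M", 0)]
print("identical records:", exact_matches(real_rows, synthetic_rows))
```

An exact-match count of zero is a necessary check rather than a full privacy guarantee; near-duplicates would need a distance-based test, which this sketch does not attempt.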

The study also demonstrated that synthetic augmentation could substantially increase statistical power by effectively expanding the size of rare disease cohorts. This suggests significant opportunities to accelerate research in areas where conventional studies are often underpowered.
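
To show why cohort size matters for statistical power, here is a small Python sketch using the standard normal approximation for a two-sided, two-sample comparison of means. It is not the calculation used in the publication. The effect size of 0.3 and the group sizes are hypothetical, and the formula treats every record as an independent observation, which is an assumption of this sketch.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test for a
    standardized mean difference, with equal group sizes."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative.
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability of landing in either rejection tail.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical scenario: a modest effect at increasing cohort sizes.
for n in (50, 100, 200, 400):
    print(f"n per group = {n}: power = {power_two_sample(0.3, n):.2f}")
```

Under these assumptions, power rises steeply with the number of records per group, which is the mechanism behind the gain described above.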

"Synthetic data will be increasingly useful for future clinical applications to reduce privacy barriers, but realization requires proper validation frameworks and regulatory guidelines. We cannot afford to treat synthetic data as a black box; rigorous validation is the foundation for trustworthy deployment in real-world clinical settings."

- Mattia Delleani, Founding Engineer at TRAIN

What this means for researchers and clinicians

For researchers, this approach has the potential to democratize access to complex rare disease cohorts. Instead of spending months navigating data-access agreements and governance procedures, validated synthetic datasets could provide a privacy-preserving way to rapidly explore hypotheses and conduct preliminary analyses.

For clinicians, synthetic cohorts derived from large collections of patient experiences could support evidence-based decision-making without compromising patient confidentiality. The implications also extend to industry. Pharmaceutical companies could use realistic synthetic populations to prototype trial designs, simulate recruitment strategies, and evaluate study feasibility before committing to costly prospective trials in diseases where patient recruitment is already challenging.

Building trustworthy synthetic data for healthcare

A central message of this publication is that synthetic data should not be judged solely by how realistic it appears. Instead, it should be evaluated by its ability to reproduce real clinical outcomes while preserving privacy.

The SAFE framework addresses a major challenge in the field: the absence of standardized validation approaches for synthetic data. By combining statistical, clinical, and privacy assessments, SAFE provides a structured method for determining whether synthetic datasets are genuinely fit for purpose.
This aligns closely with SYNTHIA's ambition to develop a validated and trustworthy framework for synthetic data generation and evaluation. Importantly, SAFE is not tied to a specific generative architecture. Whether synthetic data is generated through GANs, diffusion models, variational autoencoders, or other approaches, the framework provides a common set of clinical and privacy metrics against which different methods can be assessed.

Looking ahead

This publication challenges a common assumption in the synthetic data field: that realism alone is enough.

As Mattia Delleani explains, the goal should not simply be to generate data that looks realistic, but to demonstrate that synthetic datasets can reliably reproduce clinically meaningful outcomes while maintaining robust privacy protections. This represents an important shift in perspective. Rather than asking whether synthetic data resembles real data, the focus moves towards whether it can generate the same clinical insights and support the same decisions. By embedding validation into the development process, studies such as this help build the trust needed for broader adoption of synthetic data across healthcare research, clinical innovation, and future regulatory applications.
