Welcome to SYNTHIA Insight – a series of focused content pieces that bring the science behind SYNTHIA to life. Through conversations with our project partners and contributors, we explore the ideas, expertise, and innovations shaping the future of synthetic data in healthcare. Each edition offers an accessible window into the objectives of SYNTHIA and the progress of our work, helping to engage the wider community, spark dialogue, and promote understanding of the project's mission and impact. 


In SYNTHIA Insight No. 9, we spotlight Gianluca Asti, Computer Vision Specialist at Humanitas, and the publication "Development and validation of synthetic data generation over a federated learning computing framework to accelerate innovation and boost personalized medicine in hematological diseases", published in Blood (Journal of the American Society of Hematology). The study was a collaborative effort involving partners from Humanitas, TRAIN, UPM, Vicomtech, and SBA, combining expertise in AI, federated learning, synthetic data generation, and validation.

At the center of this research is a question that is highly relevant to healthcare innovation: how can researchers safely unlock the value of patient data while respecting privacy requirements and enabling collaboration across institutions? For Gianluca Asti, the answer lies in combining synthetic data generation, federated learning, and rigorous validation frameworks to support research in complex and rare hematological diseases. 


Advancing research through privacy-preserving collaboration 

Access to large, diverse, and high-quality datasets is essential for advancing research in rare hematological diseases. Yet privacy regulations and institutional silos often limit the ability of centers to share data and collaborate effectively. The publication explores how synthetic data generation can help overcome these barriers. By combining generative artificial intelligence with federated learning, researchers can train models across multiple institutions without moving sensitive patient data outside local environments. 
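To make the idea of training "without moving sensitive patient data" more concrete, here is a minimal sketch of federated averaging, a common federated learning scheme. It is an illustration only and not the method used in the publication: the study combines federated learning with generative models, whereas this toy example fits a simple linear model. The function names, the simulated sites, and all parameter values are assumptions made for the example.

```python
import random


def local_update(weights, data, lr=0.05, epochs=5):
    """Run a few gradient-descent steps on one site's private data.

    Only the updated weights and the sample count are returned;
    the records themselves never leave the site.
    """
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b), n


def federated_average(updates):
    """Combine the site updates, weighting each by its sample count."""
    total = sum(n for _, n in updates)
    w = sum(wt[0] * n for wt, n in updates) / total
    b = sum(wt[1] * n for wt, n in updates) / total
    return (w, b)


def make_site(n, seed):
    """Simulate one institution's local dataset (hypothetical data)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        y = 2.0 * x + 1.0 + rng.gauss(0, 0.05)
        data.append((x, y))
    return data


# Three simulated institutions of different sizes.
sites = [make_site(40, 1), make_site(25, 2), make_site(60, 3)]

global_weights = (0.0, 0.0)
for _ in range(200):
    # Each site trains locally, starting from the shared model.
    updates = [local_update(global_weights, d) for d in sites]
    # The coordinator only ever sees weights and sample counts.
    global_weights = federated_average(updates)

print(global_weights)
```

In this sketch the shared model ends up close to the relationship used to simulate the data (slope 2, intercept 1), even though no site ever exposes its individual records to the others or to the coordinator.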

As highlighted in the study, synthetic data offer several transformative advantages. By mimicking the clinical and genomic characteristics of real patients without reproducing identifiable information, synthetic datasets can enable secure and scalable data sharing while respecting privacy constraints. They can also support data augmentation, missing data imputation, and cohort balancing. 
The study demonstrated that synthetic data could successfully anticipate molecular classifications and prognostic scoring systems, illustrating their potential to accelerate translational research in hematology. 


From technical innovation to real-world impact 

While the underlying methods are highly technical, their practical implications are clear. The combination of synthetic data generation, federated learning, and the SAFE validation framework creates a privacy-preserving and high-utility approach for advancing precision medicine and collaborative research in rare hematological diseases. 

According to Asti, this approach can support clinical trial innovation through the development of synthetic control arms, helping reduce the need for placebo groups and potentially streamlining trial design. Synthetic data can also help balance data cohorts and provide clinicians and researchers with access to larger, higher-quality datasets in fields where patient numbers are often limited. Most importantly, federated learning strengthens multicentric collaboration by increasing data availability while maintaining compliance with privacy regulations. This can accelerate the development of personalized precision medicine tools and help bring research results to patients more quickly. 


 

"Federated learning represents a shift in the AI training paradigm. The possibility to train AI models on larger multicentric datasets while respecting GDPR requirements is vital for advancing research in rare diseases and accelerating the development of personalized precision medicine." 

- Gianluca Asti, Computer Vision Specialist, Humanitas 


Building trust through validation 

A key contribution of this work is the use of the SAFE validation framework, developed and applied through collaboration between Humanitas and TRAIN. While partners from UPM, Vicomtech, and SBA focused on developing the AI models, Humanitas and TRAIN led the validation activities using SAFE, a framework designed to assess not only statistical fidelity and privacy preservation, but also clinical relevance. 

As Asti explains, one of the distinctive features of SAFE is the inclusion of a clinical validation component alongside traditional statistical and privacy tests. This component evaluates whether the original clinical message contained within the real-world data is preserved in the synthetic data. By verifying that clinically meaningful relationships and outcomes remain intact, SAFE helps improve confidence in the quality and trustworthiness of synthetic datasets. The framework is also robust, interpretable, and adaptable, making it suitable for evaluating synthetic data across different healthcare domains and use cases.
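The three kinds of check described above (statistical fidelity, privacy, and clinical relevance) can be illustrated with a small sketch. This is not the SAFE implementation and does not reproduce its metrics: the choice of a Kolmogorov-Smirnov statistic, a nearest-record distance, and a correlation-sign test are assumptions made for illustration, as are the simulated "age" and "marker" variables and every function name.

```python
import math
import random


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def min_distance_to_real(synthetic, real):
    """Smallest Euclidean distance from any synthetic record to any real record.

    A value of 0 would mean a real record was copied verbatim.
    """
    return min(math.dist(s, r) for s in synthetic for r in real)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def simulate_cohort(n, seed):
    """Hypothetical cohort: a marker that tends to rise with age."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.gauss(60, 10)
        marker = 0.5 * age + rng.gauss(0, 5)
        rows.append((age, marker))
    return rows


# Stand-ins: a second simulated cohort plays the role of generator output.
real = simulate_cohort(200, seed=0)
synthetic = simulate_cohort(200, seed=1)

real_age, real_marker = zip(*real)
syn_age, syn_marker = zip(*synthetic)

corr_real = pearson(real_age, real_marker)
corr_syn = pearson(syn_age, syn_marker)

report = {
    # Fidelity: do the marginal distributions look alike?
    "ks_age": ks_statistic(real_age, syn_age),
    "ks_marker": ks_statistic(real_marker, syn_marker),
    # Privacy: is any synthetic record an exact copy of a real one?
    "min_distance": min_distance_to_real(synthetic, real),
    # Clinical message: does the age-marker association keep its direction?
    "association_preserved": (corr_real > 0) == (corr_syn > 0),
}
print(report)
```

A real validation pipeline would use clinically defined endpoints and thresholds agreed with domain experts; the point here is only that the clinical check asks a different question from the statistical and privacy checks.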


Relevance across SYNTHIA 

Although this publication focuses on hematological diseases, its implications extend far beyond a single therapeutic area. The techniques discussed in the study can be applied across other SYNTHIA use cases, including oncology, Alzheimer's disease, and diabetes. Likewise, the SAFE framework provides a flexible validation pipeline that can be adapted to multiple settings and used to compare the performance of different synthetic data generation approaches. This aligns closely with SYNTHIA's ambition to build a validated and trustworthy framework for synthetic data generation in healthcare. By integrating privacy-preserving AI methods with rigorous validation, the work contributes to the development of approaches that are not only technically advanced but also clinically meaningful and trustworthy. 


Looking ahead 

Reflecting on the broader significance of the work, Asti highlights federated learning as a fundamental shift in how AI models can be trained in healthcare: it makes larger multicentric datasets usable within GDPR requirements, which matters most in rare diseases, where no single center holds enough patients. At the same time, robust validation frameworks remain critical. Ensuring statistical fidelity and privacy protection is important, but demonstrating that synthetic data can reliably reproduce clinically relevant findings is equally essential for building trust and enabling adoption.

Together, federated learning, synthetic data generation, and comprehensive validation frameworks such as SAFE represent important building blocks for the future of collaborative healthcare research, helping researchers work together more effectively while keeping patient privacy at the center of innovation.