Welcome to SYNTHIA Insight - a series of focused content pieces that bring the science behind SYNTHIA to life. Through interviews with our project partners, we explore views, visions and expertise in synthetic data. Each edition offers an accessible window into the objectives of SYNTHIA and the progress of our work - helping to engage the wider community, spark dialogue, and promote understanding of SYNTHIA’s mission and impact. We invite you to connect with the minds shaping the future of synthetic data. 


In SYNTHIA Insight No. 6, we introduce the publication DeCaFlow: A Deconfounding Causal Generative Model, presented and published at NeurIPS. Developed by SYNTHIA researchers Alejandro Almodóvar, Juan Parras and Santiago Zazo of Universidad Politécnica de Madrid, the study introduces a novel approach to ensure that synthetic data does not just look realistic but also captures the cause-and-effect relationships that underpin meaningful analysis.

With the growing use of synthetic data in healthcare and life sciences, an important question is emerging: what makes synthetic data truly reliable for real-world decision-making? 


Distinguishing correlation from cause 

DeCaFlow helps distinguish correlation from cause when important factors are not directly observed. In healthcare and life-science data, hidden influences are common: disease severity, care pathways, clinician decisions, or social context may affect several variables at once. If these hidden factors are ignored, synthetic data or downstream analyses may look realistic but still encode misleading cause-effect relationships. 

The research focuses on making data more useful for decision-making by helping move from simple correlations to more reliable cause-and-effect reasoning. This is best understood through the types of questions the method can answer. For example:

  • What would be the outcome if we changed treatment for a group of patients? 
  • At the individual level, for a specific patient, what might the outcome have been under a different treatment choice? 

These are the kinds of questions that clinicians, researchers, regulators, and industry stakeholders care about when evaluating interventions. 
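To make the gap between correlation and cause concrete, here is a minimal toy simulation. It is not DeCaFlow and does not use any SYNTHIA data: every variable name, coefficient, and sample size is an assumption chosen for illustration. A hidden factor (here called severity) drives both who gets treated and how patients fare, so a naive comparison of treated and untreated patients gives a misleading answer, while simulating the "what if everyone were treated?" question directly recovers the effect that was built in.

```python
import random
import statistics

TRUE_EFFECT = 1.0  # assumed benefit of treatment, built into the toy model


def simulate(n, seed=0, do_treatment=None):
    """Return (treated, outcome) pairs from a hypothetical causal model.

    If do_treatment is None, treatment depends on hidden severity
    (observational data). Otherwise treatment is forced to that value
    for everyone (an intervention).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        severity = rng.gauss(0.0, 1.0)  # hidden confounder, never recorded
        noise_t = rng.gauss(0.0, 1.0)   # always drawn so runs stay aligned
        noise_y = rng.gauss(0.0, 1.0)
        if do_treatment is None:
            # Sicker patients are more likely to be treated.
            treated = 1 if severity + noise_t > 0 else 0
        else:
            treated = do_treatment
        # Treatment helps, severity hurts.
        outcome = TRUE_EFFECT * treated - 2.0 * severity + noise_y
        rows.append((treated, outcome))
    return rows


def mean_outcome(rows, treated):
    return statistics.fmean(y for t, y in rows if t == treated)


# Naive comparison on observational data (severity is ignored).
observed = simulate(20000)
naive = mean_outcome(observed, 1) - mean_outcome(observed, 0)

# Group-level "what if": force treatment on or off for the same population.
everyone_treated = simulate(20000, do_treatment=1)
nobody_treated = simulate(20000, do_treatment=0)
interventional = (
    statistics.fmean(y for _, y in everyone_treated)
    - statistics.fmean(y for _, y in nobody_treated)
)

print(f"naive difference:          {naive:.2f}")
print(f"interventional difference: {interventional:.2f}")
```

In this toy setup the naive difference comes out negative, suggesting that treatment harms patients, even though the built-in effect is positive. Because both forced-treatment runs share the same seed, each row describes the same simulated patient under both choices, which mirrors the individual-level question in the list above. In real data the hidden factor cannot simply be simulated, which is the gap that methods such as DeCaFlow aim to address.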


Beyond realism: supporting meaningful “what-if” questions 

A key insight of this research is that synthetic data should not only look realistic, but also support meaningful “what-if” questions about decisions and outcomes. DeCaFlow contributes to this by enabling the generation of synthetic data that can answer these questions more reliably, even when some important variables were not measured in the original data.

The main conclusion is that the usefulness of synthetic data depends on more than realism. It also depends on whether the data preserves the relationships that matter for analysis and decision-making. This work strengthens this aspect by focusing on cause-and-effect relationships rather than only patterns in the data.


 

“Synthetic data should not only look realistic, but also support meaningful ‘what-if’ questions about decisions and outcomes.” 

- Alejandro Almodóvar, Assistant Professor & Researcher, UPM 


Improving reliability, validity, and fairness 

This approach improves reliability and scientific validity: it reduces the risk of drawing incorrect conclusions due to missing or unobserved factors and allows more robust evaluation of interventions. It can also support fairness, since better modelling of underlying mechanisms can help avoid biased or misleading associations.


Why this matters for adoption 

Trust is essential for the broader adoption of synthetic data in healthcare and life sciences. Stakeholders need to be confident not only that data is privacy-preserving, but also that it can support meaningful research, validation of AI systems, and policy decisions. Methods like DeCaFlow help move synthetic data in that direction, making it more suitable for real-world use cases. 


Strengthening SYNTHIA’s framework 

This work contributes to SYNTHIA’s ambition to build a validated and trustworthy synthetic data framework by strengthening one key dimension of trustworthiness: the ability of synthetic data to support meaningful analysis and decision-making. While SYNTHIA develops a broader framework for validated, privacy-preserving synthetic data, DeCaFlow focuses on ensuring that the relationships captured in the data remain useful for answering relevant clinical questions.

The approach complements other methods within SYNTHIA because it does not replace existing data generation or privacy techniques. Instead, it adds a layer that focuses on the quality and interpretability of the data from a cause-and-effect perspective. This helps ensure that downstream analyses, such as AI models or clinical studies, are based on meaningful signals rather than spurious correlations.


 

“The usefulness of synthetic data depends on more than realism — it depends on whether the data preserves the relationships that matter for analysis and decision-making.” 

- Juan Parras, Associate Professor & Researcher, UPM 


From methodology to real-world use cases 

In real-world healthcare use cases such as oncology, Alzheimer’s disease, or diabetes, important variables are often missing from the available data. This can limit the reliability of analyses and the usefulness of synthetic datasets. If a causal structure can be defined, for example based on clinical expertise or prior knowledge, DeCaFlow can still be applied even when not all relevant variables are directly observed. This allows the method to be used both for generating synthetic data and for validating whether that data preserves meaningful clinical relationships. In practice, this means that synthetic datasets can better reflect how treatments, patient characteristics, and outcomes are connected, rather than only reproducing surface patterns.

The main requirement, and also the main limitation, is the need for a causal graph. However, this is also a key strength: by relying on this structure, the model generates data that respects the direction of causality. This is particularly valuable in complex healthcare scenarios, where understanding how different factors influence outcomes is essential for research, clinical decision-making, and the development of trustworthy AI systems.
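For readers unfamiliar with the term, a causal graph simply records which variables directly influence which others. The sketch below shows one hypothetical way to write such a graph down in Python and derive an order in which causes always come before their effects. The variable names and arrows are assumptions for illustration only, not a graph used by DeCaFlow or by any SYNTHIA use case.

```python
from graphlib import TopologicalSorter

# Hypothetical causal graph: each variable maps to its direct causes.
parents = {
    "severity": [],                        # hidden: not recorded in the dataset
    "treatment": ["severity"],
    "biomarker": ["severity"],             # a measured variable affected by severity
    "outcome": ["treatment", "severity"],
}
hidden = {"severity"}

# An ordering in which every cause appears before its effects.
order = list(TopologicalSorter(parents).static_order())

# The columns that would actually appear in the available data.
observed_order = [name for name in order if name not in hidden]

print("causal order:    ", order)
print("observed columns:", observed_order)
```

Writing the graph down explicitly makes the expert assumptions visible and open to review: clinicians can check each arrow, and the hidden variables are named rather than silently ignored.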


Looking ahead 

Advancing synthetic data requires ensuring that the data supports meaningful and trustworthy analysis. Universidad Politécnica de Madrid’s contribution to SYNTHIA focuses on the technical foundations that make this possible. DeCaFlow is part of this effort to ensure that synthetic data can be used not only safely, but also effectively, for answering real scientific and clinical questions. More broadly, synthetic data will have the greatest impact when it enables better understanding, better decisions, and ultimately better outcomes.


“Synthetic data will have the greatest impact when it enables better understanding, better decisions, and ultimately better outcomes.”

- Santiago Zazo, Professor & Researcher, UPM