SYNTHIA Insight 7: Making Privacy Guarantees Meaningful in Synthetic Data

04 June 2026

Welcome to SYNTHIA Insight - a series of focused content pieces that bring the science behind SYNTHIA to life. Through interviews with our project partners, we explore views, visions and expertise in synthetic data. Each edition offers an accessible window into the objectives of SYNTHIA and the progress of our work - helping to engage the wider community, spark dialogue, and promote understanding of SYNTHIA’s mission and impact. We invite you to connect with the minds shaping the future of synthetic data.

In SYNTHIA Insight no. 7, we reintroduce Bogdan Kulynych, Research Scientist, and Jean-Louis Raisaro, Assistant Professor, both at the Biomedical Data Science Center of Lausanne University Hospital (CHUV). In this edition, they discuss the motivations and findings behind the SYNTHIA paper published at NeurIPS 2025, a leading machine learning conference, titled: Unifying Re-Identification, Attribute Inference, and Data Reconstruction Risks in Differential Privacy.

In this study, the research team solves an open problem in data and model release in regulated environments: can we ensure that a release has provably low enough re-identification or inference risk, as prescribed by data protection guidelines, and at the same time remains useful? Previous methods either provide no provable guarantees or cannot guarantee a reasonably low level of these risks when the release is useful.

“Responsible synthetic data starts with rigorous science. Our work helps bridge the gap between theoretical privacy guarantees and the real-world risks that matter for trustworthy data sharing.”

- Bogdan Kulynych, Research Scientist, CHUV

Differential Privacy and Its Limitations

In many settings, organizations fit statistical or machine learning models on privacy-sensitive data. These include generative models used to create synthetic data, which is the focus of the SYNTHIA project. Such models or their outputs can leak personal data about the people whose information was in the dataset.

A well-established technical solution for limiting such leakage is to add a controlled amount of randomization. The standard framework for analyzing the privacy guarantees that randomization provides is known as differential privacy. Within this framework, achieving better privacy guarantees requires adding more randomness to the statistical or machine learning algorithm. The standard practice in differential privacy is to quantify the amount of privacy that the randomness provides using a parameter called ε, with values close to zero indicating a high level of privacy. However, we are often interested in quantifying leakage in terms of the risk of realistic attacks. In particular, guidelines such as those from ISO or the European Medicines Agency refer specifically to re-identification or inference risk.
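As a rough illustration of how the amount of randomness relates to ε, the sketch below uses the Laplace mechanism, a textbook differentially private mechanism that this article does not itself name. The function names, the example count of 100, and the sensitivity of 1 are assumptions chosen purely for illustration.

```python
import random


def laplace_scale(sensitivity, epsilon):
    """Noise scale of the Laplace mechanism: grows as epsilon shrinks."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return sensitivity / epsilon


def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    """Release true_value with Laplace noise calibrated to epsilon.

    Illustrative only: a hypothetical helper, not code from the paper.
    """
    scale = laplace_scale(sensitivity, epsilon)
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centred at 0 with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_value + noise


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demo is repeatable
    for eps in (0.1, 1.0, 10.0):
        noisy = laplace_mechanism(100, sensitivity=1.0, epsilon=eps, rng=rng)
        print(f"epsilon={eps}: scale={laplace_scale(1.0, eps)}, noisy count={noisy:.2f}")
```

Under these assumptions, a smaller ε means a larger noise scale, which is the trade-off between privacy and accuracy described above.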

Most standard methods for making privacy guarantees interpretable in terms of these operational risks map the parameter ε directly to risk. In this paper and in previous work by Kulynych, the authors show that such a mapping is extremely ineffective. When high ε values are mapped using standard approaches (in practice, values of 5–10 are often used to obtain reasonable utility), the maximal risk appears as high as 99%, so the provable privacy guarantees seem meaningless. In contrast, empirical testing shows that values of ε as high as 100 often limit practical attacks well.
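To see where a figure of around 99% can come from, here is a minimal sketch of one textbook conversion from ε to risk: the highest accuracy any attacker can reach when guessing, with even prior odds, whether a given person's record was in the dataset. This is an assumption about what a "standard approach" looks like and may not be the exact conversion evaluated in the paper; the function name is hypothetical.

```python
import math


def max_attack_accuracy_pure_dp(epsilon):
    """Worst-case accuracy of a balanced membership guess under pure eps-DP.

    Equals e^eps / (1 + e^eps), written in the equivalent logistic form
    1 / (1 + e^-eps) to avoid overflow for large eps.
    """
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    return 1.0 / (1.0 + math.exp(-epsilon))


if __name__ == "__main__":
    for eps in (0.1, 1.0, 5.0, 10.0):
        print(f"epsilon={eps}: max accuracy={max_attack_accuracy_pure_dp(eps):.4f}")
```

With this conversion, ε = 5 already gives a worst-case accuracy above 99%, even though random guessing gives 50%.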

A More Precise and Unified Approach

As it turns out, this apparently high risk is an artifact of the ineffective conversion. Using a recent decision-theoretic approach called f-differential privacy, the authors provide a method that finds the maximum risk with substantially better precision. For example, with the Gaussian mechanism, a common technique for ensuring privacy, in a specific setting where the standard approach indicates 99% risk, the authors find that the maximum risk any attacker can achieve is only 26%. Their method is not only more precise but also unifying: it quantifies the maximal risk of several attacks at the same time, namely singling out, attribute inference, and partial or full data-record reconstruction.
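The idea of reading risk off a trade-off curve can be sketched as follows, assuming the mechanism satisfies Gaussian differential privacy with parameter mu and assuming a bound of the form "attack success is at most 1 - f(baseline)", where f is the trade-off function and the baseline is the attacker's success rate without access to the release. Both assumptions, the parameter values, and the function names are illustrative; this sketch does not reproduce the 26% figure, whose exact setting is given in the paper.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def gaussian_tradeoff(alpha, mu):
    """Trade-off function of mu-Gaussian DP.

    Returns the smallest false-negative rate an attacker can achieve
    at false-positive rate alpha (0 < alpha < 1).
    """
    return _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(1.0 - alpha) - mu)


def max_attack_success(baseline, mu):
    """Assumed bound on attack success: 1 - f(baseline)."""
    return 1.0 - gaussian_tradeoff(baseline, mu)


if __name__ == "__main__":
    baseline = 0.01  # hypothetical: attacker succeeds 1% of the time a priori
    for mu in (0.5, 1.0, 2.0):
        bound = max_attack_success(baseline, mu)
        print(f"mu={mu}: success is at most {bound:.3f}")
```

In this sketch, mu = 0 returns the baseline itself (the release gives the attacker nothing), and the bound rises smoothly as mu grows, so one curve covers every baseline rather than a single worst-case number.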

“To make synthetic data truly useful in healthcare, we need privacy guarantees that are both mathematically sound and practically meaningful. This work moves us closer to that goal.”

- Jean-Louis Raisaro, Assistant Professor, CHUV

Bridging Theory and Practice

This work provides a technical method that bridges the gap between operational requirements on privacy risk and differential privacy. Differential privacy is the standard toolbox for ensuring provable privacy guarantees when fitting statistical models or training machine learning models on sensitive data, in particular, when creating synthetic data that statistically mimics a private dataset.

The publication, Unifying Re-Identification, Attribute Inference, and Data Reconstruction Risks in Differential Privacy, was published at NeurIPS, the Conference on Neural Information Processing Systems. The conference is hosted by the Neural Information Processing Systems Foundation, a non-profit corporation whose purpose is to foster the exchange of research advances in Artificial Intelligence and Machine Learning, principally by hosting an annual interdisciplinary academic conference with the highest ethical standards for a diverse and inclusive community.


This publication builds on previous work from this SYNTHIA team exploring how synthetic data can be evaluated responsibly in healthcare settings. In a previous SYNTHIA Insight, Bogdan Kulynych, Jean-Louis Raisaro and Bayrem Kaabachi discussed their publication “A scoping review of privacy and utility metrics in medical synthetic data,” where they examined the current landscape of methods used to assess privacy protection and data usefulness in synthetic healthcare data.
