Welcome to SYNTHIA Insight - a series of focused content pieces that bring the science behind SYNTHIA to life. Through interviews with our project partners, we explore views, visions and expertise in synthetic data. Each edition offers an accessible window into the objectives of SYNTHIA and the progress of our work - helping to engage the wider community, spark dialogue, and promote understanding of SYNTHIA’s mission and impact. We invite you to connect with the minds shaping the future of synthetic data.
In SYNTHIA Insight no. 3 we introduce SYNTHIA Partners Tim Adams and Yasamin Salimi from the Fraunhofer Institute for Algorithms and Scientific Computing (SCAI). In this edition, they discuss the motivations and findings behind the SYNTHIA study published in the Journal of Prevention of Alzheimer's Disease: "A Benchmark of Large Language Models for Semantic Harmonization of Alzheimer's Disease Cohorts".
The study addresses a critical bottleneck in Alzheimer’s disease (AD) research: the challenge of harmonizing heterogeneous datasets. As researchers increasingly rely on multi-cohort studies to generate robust insights, inconsistent variable naming and data structures across datasets limit scalability and interoperability. “Researchers are investing a lot of time manually harmonizing data on a daily basis,” the authors explain. “This process is resource-intensive, and we wanted to explore whether modern AI approaches could help reduce this burden.”
Data Harmonization: A Foundation for Reliable AI and Synthetic Data
Semantic harmonization refers to aligning variables across datasets so that they represent equivalent concepts, even when recorded differently. For example, a variable such as gender may be encoded as “0/1” in one dataset and “male/female” in another. Although the two variables are semantically identical, their differing encodings require additional processing before the datasets can be combined, as sketched below.
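To make the idea concrete, here is a minimal Python sketch of that recoding step. The column names, value codes, and mapping are hypothetical illustrations for this example, not taken from SYNTHIA's pipeline:

```python
import pandas as pd

# Two hypothetical cohort extracts that encode the same concept differently.
cohort_a = pd.DataFrame({"sex": [0, 1, 1, 0]})  # assumed: 0 = female, 1 = male
cohort_b = pd.DataFrame({"gender": ["male", "female", "male"]})

# Map both encodings onto one shared vocabulary before pooling the cohorts.
harmonized_a = cohort_a["sex"].map({0: "female", 1: "male"}).rename("gender")
harmonized_b = cohort_b["gender"].str.lower()

combined = pd.concat([harmonized_a, harmonized_b], ignore_index=True)
print(combined.value_counts())
```

The recoding itself is trivial once the correspondence is known; the hard part, and the focus of the benchmark discussed below, is discovering that two differently named and encoded variables describe the same concept in the first place.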
This step is not only essential for cross-cohort analysis but also foundational for synthetic data generation. Within SYNTHIA, synthetic data is a core focus, but its quality depends heavily on the quality and consistency of the underlying real-world data. By improving data harmonization, researchers can strengthen the data foundation needed to generate reliable and interoperable synthetic datasets. “Through harmonization, we aim to enable research across multiple cohorts,” the authors note. “This allows for more representative analyses and ultimately more robust scientific conclusions.”
Findings: General AI Models Fall Short in Clinical Contexts
To address this challenge, the Fraunhofer SCAI team developed a novel benchmark to evaluate whether large language models (LLMs) and text embedding models can support semantic harmonization in Alzheimer’s disease. The benchmark evaluates five state-of-the-art models across seven AD datasets, mapping cohort metadata to a common data model using variable descriptions spanning clinical, demographic, lifestyle, and imaging data.
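The publication defines the benchmark itself; purely as a rough illustration of how a text embedding model can propose such mappings, the sketch below ranks common data model candidates for each cohort variable by the cosine similarity of their descriptions. The variable names, descriptions, and choice of embedding model are assumptions for the example, not the benchmark's actual configuration:

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical variable descriptions from a cohort and a common data model (CDM).
cohort_vars = {
    "MMSE_TOT": "Total score on the Mini-Mental State Examination",
    "PTGENDER": "Sex of the participant",
}
cdm_vars = {
    "mmse_total_score": "Mini-Mental State Examination total score (0-30)",
    "sex": "Biological sex of the subject",
    "apoe4_carrier": "Carrier status for the APOE e4 allele",
}

# Any sentence-embedding model could be swapped in; this one is a common default.
model = SentenceTransformer("all-MiniLM-L6-v2")
cohort_emb = model.encode(list(cohort_vars.values()), convert_to_tensor=True)
cdm_emb = model.encode(list(cdm_vars.values()), convert_to_tensor=True)

# Rank CDM candidates for each cohort variable by cosine similarity.
scores = util.cos_sim(cohort_emb, cdm_emb)
for i, name in enumerate(cohort_vars):
    best = scores[i].argmax().item()
    print(f"{name} -> {list(cdm_vars)[best]} (score={scores[i][best].item():.2f})")
```

In a semi-automated workflow, such ranked suggestions would be reviewed by a human curator rather than applied blindly, which is consistent with the study's emphasis on assisting, not replacing, manual harmonization.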
Their findings reveal an important insight: models that perform well on general-purpose benchmarks do not necessarily perform well in domain-specific clinical tasks. “We observed that even very recent, large models did not always perform best in this specific use case,” the authors highlight. “In some cases, smaller or older models outperformed them.”
This highlights the importance of domain-specific evaluation frameworks when applying AI in healthcare settings. Without such tailored benchmarks, model performance may be overestimated, leading to unreliable outcomes in real-world applications.
Towards Scalable and Semi-Automated Harmonization
Beyond benchmarking, the study provides practical contributions to the research community. The authors introduce an open-source benchmarking library and interactive leaderboard, enabling researchers to evaluate future models in this domain. They also emphasize the importance of well-structured and descriptive metadata, which significantly impacts the success of both manual and automated harmonization.
“We highlighted how crucial high-quality metadata is,” the authors explain. “Without clear descriptions, even the most advanced models struggle to align datasets correctly.”
The team also outlines guidelines for improving metadata formatting and supporting semi-automated harmonization workflows. While fully automated solutions are not yet mature, these approaches represent an important step toward reducing the burden on researchers and data stewards.
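As a loose illustration of what descriptive metadata can mean in practice, the snippet below contrasts a sparse variable entry with a richer one. The field names and structure are hypothetical, not a SYNTHIA standard; the paper's own guidelines give the specifics:

```python
# A sparse entry gives a model (or a human curator) little to work with.
sparse = {"name": "ADAS13"}

# A descriptive entry makes the concept, units, and value range explicit.
descriptive = {
    "name": "ADAS13",
    "description": "Alzheimer's Disease Assessment Scale, 13-item cognitive subscale total score",
    "unit": "points",
    "range": [0, 85],
    "allowed_values": None,  # continuous score, no categorical codes
}
```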
Connecting to SYNTHIA’s Mission
Although this study focuses on semantic harmonization and AI benchmarking, its implications extend directly to SYNTHIA’s broader mission. Synthetic data generation in healthcare depends on high-quality, interoperable datasets. By enabling more efficient and reliable harmonization of Alzheimer’s disease data, this work contributes to building the foundation for robust AI models and future synthetic data applications.
As part of SYNTHIA’s Alzheimer’s disease use case, this research supports the development of scalable, cross-cohort analyses and ultimately advances the potential for synthetic data to drive innovation in neurodegenerative disease research.
Looking ahead, the researchers are optimistic about the role of AI in supporting harmonization workflows. “We hope that these methods will continue to evolve and assist researchers in their daily work,” the authors conclude. “Our benchmark is one step toward making that future possible.”
Watch the video interview:
Listen to the podcast:
For the full publication:

