Advancing Multimodal Synthetic Data Generation

 

Motivation

 

In the era of big data, healthcare is witnessing a paradigm shift toward data-driven solutions. The Digital Health Twin (DHT) framework stands at the forefront of this revolution, offering a virtual, personalized model of patient health. However, the adoption of DHTs faces significant hurdles: the need for privacy-preserving analytics, handling diverse and distributed datasets, and generating high-quality synthetic data to overcome data access limitations.

The Problem

 

The implementation of DHTs encounters four critical challenges:

  1. Mixed-Tail Data Behavior: Real-world health data mix frequent, well-sampled events with rare but clinically important extremes, making it difficult for generative models to capture both the bulk and the tails of the distribution.

  2. Fidelity-Diversity Trade-Off: Balancing the accuracy of synthetic data representation with sufficient diversity to reflect real-world complexities.

  3. Privacy: Ensuring synthetic data remains private while retaining its utility for advanced analytics.

  4. Geographic and Data Disparities: Addressing non-uniform, distributed data collected across diverse populations and regulatory jurisdictions.
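To make the mixed-tail challenge concrete, here is a small illustrative numpy sketch (a hypothetical toy example, not data from this work) contrasting a light-tailed Gaussian sample with a mixed-tailed sample built from a Gaussian body plus a small Pareto-tail component; the sample excess kurtosis makes the difference visible.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

def excess_kurtosis(x: np.ndarray) -> float:
    """Sample excess kurtosis: heavy tails push this well above 0."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z**4) - 3.0)

# Light-tailed reference: pure Gaussian "common events".
gaussian = rng.normal(loc=0.0, scale=1.0, size=n)

# Mixed-tailed sample: 95% Gaussian body plus a 5% Pareto(alpha=1.5)
# component modelling rare, extreme events.
is_rare = rng.random(n) < 0.05
mixed = np.where(is_rare, rng.pareto(1.5, size=n) + 1.0, rng.normal(size=n))

print(f"Gaussian excess kurtosis:   {excess_kurtosis(gaussian):.2f}")
print(f"Mixed-tail excess kurtosis: {excess_kurtosis(mixed):.2f}")
```

A model fit only to the Gaussian bulk would systematically miss the rare-event tail, which is exactly the failure mode the publications below target.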

 

Research Questions

 

This body of work tackles these challenges through four key research questions:

  1. How can synthetic data effectively capture the mixed-tail behavior of real data?

  2. How can we ensure high fidelity and diversity in synthetic data while respecting the boundaries of real data?

  3. How can privacy-preserving techniques ensure data utility without compromising sensitive information?

  4. How can distributed analytics support fair and performant data analysis across decentralized environments?

 

Addressing the Challenges

 

The following publications provide innovative solutions to these questions:

1. Capturing Mixed-Tail Behavior

  • Generating Heavy-Tailed Synthetic Data with Normalizing Flows: Developed techniques to model and generate heavy- and mixed-tailed data distributions, essential for realistic synthetic datasets.

  • Practical Synthesis of Heavy- and Mixed-Tailed Data with Normalizing Flows: Proposed robust mean and gradient estimation methods, improving synthetic data quality for medical applications.
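The common idea behind these normalizing-flow papers is the change-of-variables formula, log p_X(x) = log p_Z(f⁻¹(x)) + log|det J_{f⁻¹}(x)|; one standard route to heavy tails (an illustration of the general technique, not these papers' actual models) is to swap the Gaussian base density for a heavy-tailed one such as Student-t. A minimal one-dimensional, pure-Python sketch:

```python
import math

def gaussian_logpdf(z: float) -> float:
    """Light-tailed base: log-density decays quadratically in z."""
    return -0.5 * z * z - 0.5 * math.log(2 * math.pi)

def student_t_logpdf(z: float, nu: float = 2.0) -> float:
    """Heavy-tailed base: tails decay polynomially, not exponentially."""
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi)
            - (nu + 1) / 2 * math.log(1 + z * z / nu))

def flow_logpdf(x: float, base_logpdf, shift: float = 0.0, scale: float = 2.0) -> float:
    """Density of x = scale * z + shift under a one-layer affine flow:
    log p_X(x) = log p_Z((x - shift) / scale) - log|scale|."""
    z = (x - shift) / scale
    return base_logpdf(z) - math.log(abs(scale))

extreme = 20.0  # a rare, extreme observation
print(flow_logpdf(extreme, gaussian_logpdf))   # vanishingly small log-density
print(flow_logpdf(extreme, student_t_logpdf))  # orders of magnitude larger
```

With a Gaussian base, the flow assigns essentially zero probability to the extreme point; the Student-t base keeps it plausible, which is why the choice of base distribution matters for mixed-tailed health data.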

2. Balancing Fidelity and Diversity

  • Generating Heavy-Tailed Synthetic Data with Normalizing Flows: Introduced methods to optimize fidelity and diversity in synthetic data, enabling comprehensive modeling of health data.
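One common way to quantify the fidelity-diversity trade-off (a standard nearest-neighbour precision/recall device, not necessarily the metric used in the paper) is: precision asks whether synthetic points fall near the real data (fidelity), recall asks whether the real data is covered by the synthetic sample (diversity). A small numpy sketch:

```python
import numpy as np

def knn_radius(points: np.ndarray, k: int = 5) -> np.ndarray:
    """Distance from each point to its k-th nearest neighbour."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k]  # column 0 is the point itself

def precision_recall(real: np.ndarray, synth: np.ndarray, k: int = 5):
    real_r = knn_radius(real, k)
    synth_r = knn_radius(synth, k)
    d = np.linalg.norm(synth[:, None, :] - real[None, :, :], axis=-1)
    # Precision (fidelity): synthetic point lies inside some real point's k-NN ball.
    precision = float(np.mean((d <= real_r[None, :]).any(axis=1)))
    # Recall (diversity): real point lies inside some synthetic point's k-NN ball.
    recall = float(np.mean((d <= synth_r[:, None]).any(axis=0)))
    return precision, recall

rng = np.random.default_rng(1)
real = rng.normal(size=(200, 2))
faithful = rng.normal(size=(200, 2))          # matches the real distribution
collapsed = 0.05 * rng.normal(size=(200, 2))  # realistic points, no diversity
p_f, r_f = precision_recall(real, faithful)
p_c, r_c = precision_recall(real, collapsed)
print(f"faithful:  precision={p_f:.2f} recall={r_f:.2f}")
print(f"collapsed: precision={p_c:.2f} recall={r_c:.2f}")
```

The mode-collapsed sample scores well on fidelity but poorly on diversity, which is the trade-off the paper's methods are designed to balance.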

3. Privacy-Preserving Data Synthesis

  • Differential Privacy vs. Detecting Copyright Infringement: A Case Study with Normalizing Flows: Pioneered a novel method to achieve differential privacy in generative models without adding noise, safeguarding sensitive health data.

  • Compressive Differentially Private Federated Learning Through Universal Vector Quantization: Enhanced privacy in federated learning by combining compression with differential privacy, reducing communication costs while maintaining security.
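These two papers combine differential privacy with communication efficiency. A hypothetical numpy sketch of the standard building blocks (clip each client update to bound sensitivity, add calibrated Gaussian noise, then uniformly quantize before transmission); the actual papers use universal vector quantization and a noise-free flow-based mechanism, which this simplification does not capture:

```python
import numpy as np

rng = np.random.default_rng(42)

def privatize_and_compress(update, clip=1.0, noise_mult=1.1,
                           n_levels=256, lo=-4.0, hi=4.0):
    """Clip the update to L2 norm `clip`, add Gaussian noise with
    sigma = noise_mult * clip, then quantize to n_levels grid points."""
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip / max(norm, 1e-12))  # bound sensitivity
    noisy = clipped + rng.normal(0.0, noise_mult * clip, update.shape)
    step = (hi - lo) / (n_levels - 1)
    # Each coordinate now fits in one byte instead of a float.
    return lo + step * np.round((np.clip(noisy, lo, hi) - lo) / step)

update = 5.0 * rng.normal(size=100)   # a raw (overly large) client update
sent = privatize_and_compress(update)
print(sent[:5])
```

Clipping fixes the sensitivity of the Gaussian mechanism, and quantizing the already-noisy update is what lets compression and privacy coexist at low communication cost.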

4. Distributed Data and Fairness

  • On the Impact of Non-IID Data on the Performance and Fairness of Differentially Private Federated Learning: Examined the challenges of training on non-uniform datasets, offering solutions for equitable and effective distributed analytics.
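Non-IID federated data of the kind this paper studies is commonly simulated with a Dirichlet label partition (a standard benchmarking device, not necessarily the paper's exact setup): a small concentration parameter alpha gives each client a heavily skewed label mix. A numpy sketch:

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, rng):
    """Split sample indices across clients; small alpha -> highly non-IID."""
    clients = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Share of this class assigned to each client.
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in zip(clients, np.split(idx, cuts)):
            client.extend(part.tolist())
    return clients

rng = np.random.default_rng(7)
labels = rng.integers(0, 10, size=5_000)  # 10-class toy labels
clients = dirichlet_partition(labels, n_clients=8, alpha=0.1, rng=rng)
print([len(c) for c in clients])
```

With alpha = 0.1, most clients see only a few of the ten classes, which is the setting where both model performance and fairness across clients degrade under differential privacy.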

 
The Outcome

 

This research establishes a robust foundation for operationalizing Digital Health Twins:

  • Synthetic data frameworks that combine privacy, fidelity, and diversity.

  • Federated learning systems capable of addressing real-world data heterogeneity.

  • Scalable, privacy-preserving solutions for distributed medical data analytics.

 

By addressing these critical challenges, our research paves the way for innovative, equitable, and secure healthcare systems. The advancements enable personalized medicine, predictive analytics, and global collaboration without compromising patient privacy or data integrity. More details can be found in the following papers:

  1. Amiri, Saba, et al. "Compressive Differentially Private Federated Learning Through Universal Vector Quantization." AAAI Workshop on Privacy-Preserving Artificial Intelligence (PPAI), 2021. https://ppai21.github.io/files/29-paper.pdf

  2. Amiri, Saba, et al. "On the Impact of Non-IID Data on the Performance and Fairness of Differentially Private Federated Learning." 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W). IEEE, 2022. DOI: 10.1109/DSN-W54100.2022.00018

  3. Amiri, Saba, et al. "Generating Heavy-Tailed Synthetic Data with Normalizing Flows." 5th Workshop on Tractable Probabilistic Modeling. https://openreview.net/forum?id=PbvyJ8XpNn

  4. Amiri, Saba, et al. "Differential Privacy vs. Detecting Copyright Infringement: A Case Study with Normalizing Flows." GenLaw Workshop (GenLaw '23), Honolulu, Hawai'i. https://blog.genlaw.org/CameraReady/60.pdf

  5. Amiri, Saba, et al. "Practical Synthesis of Heavy- and Mixed-Tailed Data with Normalizing Flows." https://openreview.net/forum?id=uphsKDj0Uu

  6. "Private and Secure Distributed Deep Learning: A Survey." https://dl.acm.org/doi/10.1145/3703452
