Developing Measures of Disclosure Risk and Utility for Synthetic Data
I am studying synthetic data, a tool used to protect data confidentiality. Synthetic data are generated from an existing dataset to create a new dataset that contains none of the original participants' information, yet still supports valid statistical analyses. I am looking not only at methods used to generate synthetic data, such as multiple imputation and CART (classification and regression trees), but also at tests of data utility and tests for disclosure risk assessment. The field of Statistical Disclosure Control (SDC), which involves techniques such as suppression and additive noise, has many different measures of both data utility and disclosure risk, which could potentially be applied to synthetic data. Additionally, I hope to use my research to work with the ONS to create a synthetic version of the census.
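To make the idea of CART-based synthesis concrete, here is a minimal sketch of sequential CART synthesis: each variable is modelled by a tree fitted on the preceding (already-synthesised) variables, and synthetic values are drawn from the matching leaf's donor pool. This is an illustrative toy using scikit-learn, not the project's actual method; the toy data, the `cart_synthesise` function, and all parameter choices (e.g. `min_samples_leaf=20`) are my own assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy "original" data: two correlated numeric columns.
n = 500
x1 = rng.normal(size=n)
x2 = 2 * x1 + rng.normal(scale=0.5, size=n)
original = np.column_stack([x1, x2])

def cart_synthesise(data, rng):
    """Sequential CART synthesis (illustrative sketch).

    Column 0 is synthesised by bootstrap resampling its marginal;
    each later column j is synthesised by fitting a tree on the
    preceding columns and drawing from the leaf that each synthetic
    row falls into.
    """
    n, p = data.shape
    synth = np.empty_like(data)
    synth[:, 0] = rng.choice(data[:, 0], size=n, replace=True)
    for j in range(1, p):
        tree = DecisionTreeRegressor(min_samples_leaf=20, random_state=0)
        tree.fit(data[:, :j], data[:, j])
        # Leaf ids for the original rows (donors) and the synthetic rows.
        orig_leaves = tree.apply(data[:, :j])
        synth_leaves = tree.apply(synth[:, :j])
        for i in range(n):
            donors = data[orig_leaves == synth_leaves[i], j]
            synth[i, j] = rng.choice(donors)
    return synth

synthetic = cart_synthesise(original, rng)
# The synthetic data should roughly preserve the correlation structure.
print(np.corrcoef(original.T)[0, 1], np.corrcoef(synthetic.T)[0, 1])
```

Because synthetic values are drawn from leaf donor pools rather than copied row-by-row, no synthetic record corresponds to a single original participant, while aggregate relationships between columns are approximately preserved.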
- What are the best approaches for producing synthetic data from census microdata, in terms of maximising data utility while minimising disclosure risk?
- What are the disclosure risks in synthetic census data and how do they relate to the methods used to produce the synthetic data from the underlying real data?
- How should disclosure risks be quantified for synthetic census data?
- How should data utility in synthetic census data be measured?
- Do synthetic versions of whole populations produce higher data utility than disclosure-controlled samples?
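One common family of utility measures the last two questions point towards is propensity-score based: stack the original and synthetic datasets, train a classifier to distinguish them, and measure how far its predicted probabilities stray from chance (the pMSE idea). Below is a minimal sketch, not a claim about which measure the project will adopt; the `pmse` function name, the toy data, and the logistic-regression discriminator are my own illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pmse(original, synthetic):
    """Propensity-score mean-squared error (illustrative sketch).

    Fit a classifier to tell original rows from synthetic rows and
    score the mean squared deviation of its predicted propensities
    from the null proportion c. Values near 0 mean the classifier
    cannot tell the datasets apart (high utility).
    """
    X = np.vstack([original, synthetic])
    y = np.concatenate([np.zeros(len(original)), np.ones(len(synthetic))])
    c = len(synthetic) / len(X)  # expected propensity under indistinguishability
    model = LogisticRegression(max_iter=1000).fit(X, y)
    p = model.predict_proba(X)[:, 1]
    return np.mean((p - c) ** 2)

rng = np.random.default_rng(1)
orig = rng.normal(size=(300, 2))
good = rng.normal(size=(300, 2))          # drawn from the same distribution
bad = rng.normal(loc=2.0, size=(300, 2))  # shifted, easily distinguishable
print(pmse(orig, good), pmse(orig, bad))
```

A well-matched synthetic dataset yields a pMSE near zero, while a distributionally shifted one is easy to classify and scores much higher, which is what makes this kind of measure attractive for comparing synthesis methods.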
September 2016 - September 2019