Risk-efficient Bayesian data synthesis for privacy protection
Hu, J., Savitsky, T. D., & Williams, M. R. (2022). Risk-efficient Bayesian data synthesis for privacy protection. Journal of Survey Statistics and Methodology, 10(5), 1370-1399. Article smab013. https://doi.org/10.1093/jssam/smab013
Statistical agencies use models to synthesize respondent-level data for public release in order to protect privacy. In this study, we efficiently induce privacy protection into any Bayesian synthesis model by employing a pseudo-likelihood that exponentiates each likelihood contribution by a record-indexed weight ∈ [0, 1], defined to be inversely proportional to the identification risk for that record. We start with the marginal probability of identification risk for a record, which is computed as the probability that the identity of the record may be disclosed. Our application to the Consumer Expenditure Surveys (CE) of the U.S. Bureau of Labor Statistics demonstrates that the marginally risk-weighted synthesizer provides improved privacy protection overall. However, the identification risks actually increase for some moderate-risk records after risk-weighted pseudo-posterior estimation and synthesis, owing to increased isolation after weighting, a phenomenon we label "whack-a-mole." We then construct a weight for each record from a collection of pairwise identification risk probabilities with other records, where each pairwise probability measures the joint probability of reidentification of the pair of records. This pairwise weighting mitigates the whack-a-mole issue and produces a more efficient set of synthetic data, with lower risk and higher utility for the CE data.
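The risk-weighted pseudo-likelihood described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a simple normal synthesis model, made-up data and risk values, and a hypothetical weight rule `alpha_i = 1 - risk_i` clipped to [0, 1]; the abstract only states that weights lie in [0, 1] and decrease with identification risk. The function names are likewise illustrative.

```python
import math


def risk_weights(risks):
    """Map per-record identification risk probabilities to weights in [0, 1].

    Assumption: alpha_i = 1 - risk_i, clipped to [0, 1]. This is a
    hypothetical rule chosen only so that higher risk gives a lower weight.
    """
    return [min(1.0, max(0.0, 1.0 - r)) for r in risks]


def normal_logpdf(y, mu, sigma):
    """Log density of a Normal(mu, sigma^2) distribution at y."""
    return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - (y - mu) ** 2 / (2.0 * sigma ** 2))


def log_pseudo_likelihood(ys, weights, mu, sigma):
    """Sum of weighted log-likelihood contributions.

    Raising each likelihood contribution to the power alpha_i is the same
    as multiplying its log-likelihood by alpha_i.
    """
    return sum(w * normal_logpdf(y, mu, sigma) for y, w in zip(ys, weights))


# Made-up example: the last record is isolated, so it is assigned high risk.
ys = [1.0, 1.2, 0.9, 8.0]
risks = [0.1, 0.1, 0.2, 0.9]
weights = risk_weights(risks)

full = log_pseudo_likelihood(ys, [1.0] * len(ys), mu=1.0, sigma=1.0)
weighted = log_pseudo_likelihood(ys, weights, mu=1.0, sigma=1.0)
```

With all weights equal to 1 the pseudo-likelihood reduces to the ordinary likelihood. In the weighted version the high-risk record contributes little, so a model fit to it, and any synthetic data drawn from that fit, is influenced less by that record.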