Statistical Disclosure Control
Person-level data from surveys and other research studies usually contain sensitive information and are therefore subject to mandatory statistical disclosure control (SDC). Demand for public-use microdata files (PUFs) is increasing, and SDC must be applied before a PUF can be released. Other forms of data release, such as restricted-use files, statistical tables, and modeling results, also require SDC. As an institution that collects and analyzes data, we work to develop technologies that protect data and ensure individual respondent security and privacy throughout the research process.
MASSC
We have developed an innovative SDC methodology for creating public-use microdata files (PUFs). Called MASSC (an acronym for its four major steps: Micro-Agglomeration, optimal probabilistic Substitution, optimal probabilistic Subsampling, and optimal sampling weight Calibration), the methodology was patented by RTI in 2006. MASSC is currently used to create the PUFs for the National Survey on Drug Use and Health (NSDUH). See the publications listed below for more information.
MASSC Capabilities
- Counters the threat of identification by an inside intruder (e.g., a family member of a respondent) through its subsampling step
- Counters the threat of identification by an outside intruder through its substitution step
- Improves the accuracy of estimates through weight calibration
- Applies subsampling and substitution optimally, subject to constraints on the bias and variance that this treatment introduces
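As a rough illustration of how subsampling and weight calibration interact, the following Python sketch draws a simple random subsample, inflates the surviving weights by the inverse of the retention probability, and then rescales the weights within each group so that they reproduce the weighted totals of the original file. It is a generic post-stratification sketch and not the patented MASSC procedure: the record layout, the `retain_prob` parameter, and the function names are assumptions made for this example, and MASSC's optimal choice of probabilities under bias and variance constraints is not shown.

```python
import random
from collections import defaultdict


def subsample(records, retain_prob, rng):
    """Keep each record independently with probability retain_prob and
    inflate its weight by 1 / retain_prob (hypothetical helper)."""
    kept = []
    for rec in records:
        if rng.random() < retain_prob:
            kept.append({**rec, "weight": rec["weight"] / retain_prob})
    return kept


def calibrate(kept, original, group_key):
    """Rescale weights so that each group's weighted total matches the
    total in the original file (simple post-stratification)."""
    target = defaultdict(float)
    for rec in original:
        target[rec[group_key]] += rec["weight"]
    current = defaultdict(float)
    for rec in kept:
        current[rec[group_key]] += rec["weight"]
    return [
        {**rec, "weight": rec["weight"] * target[rec[group_key]] / current[rec[group_key]]}
        for rec in kept
    ]


# Illustrative data: 200 records in two regions, each with weight 10.
rng = random.Random(0)
original = [{"id": i, "region": "N" if i % 2 else "S", "weight": 10.0} for i in range(200)]
kept = subsample(original, retain_prob=0.8, rng=rng)
calibrated = calibrate(kept, original, "region")
```

After calibration, the weights in each region again sum to the original regional total, even though fewer records remain. A group that loses all of its records in the subsample cannot be recovered by this kind of rescaling, which is one reason a real procedure controls the subsampling rates.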
MASSC Publications
- Singh, A.C. (2002, 2006). Method for statistical disclosure limitation. U.S. Patent Application Pub. No. US 2004/0049517A1. Patent granted June 2006, Patent No. US7058638B2.
- Singh, A.C., Yu, F., and Dunteman, G.H. (2003). MASSC: A new data mask for limiting statistical information loss and disclosure. Proceedings of the Joint UNECE/EUROSTAT Work Session on Statistical Data Confidentiality, Luxembourg, pp. 373-394. (www.unece.org)
- Singh, A., Yu, F., and Wilson, D.H. (2004). Measuring disclosure risk and information loss for MASSC-treated micro-data. Proceedings of the American Statistical Association, Toronto, Canada, pp. 4374-4381.
- Yu, F., Dai, L., Feder, M., and Chromy, J.R. (2006). Creation of public use micro-data files for the National Survey on Drug Use and Health (NSDUH). Proceedings of Statistics Canada Symposium 2006, Methodological Issues in Measuring Population Health.
Data Treatment
We also have experience with other data treatment techniques that minimize disclosure risk while maintaining the analytic utility of the data. These procedures include the following:
- Analyzing risk levels of the variables (i.e., identifying which variables are most likely to be used to identify an individual)
- Deleting identifying variables from a PUF, such as name, address, and Social Security number
- Applying data coarsening techniques, such as top coding, bottom coding, recasting into categorical form, or collapsing levels of categorical variables, to continuous variables or categorical variables with extreme outliers
- Swapping data by exchanging the values of certain variables between randomly selected individuals
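To make the coarsening and swapping procedures above concrete, here is a minimal Python sketch using only the standard library. The field names, cut points, coding limits, and helper functions are hypothetical choices for illustration; a production treatment would select them from a risk analysis of the actual file.

```python
import random


def top_bottom_code(value, lower, upper):
    """Clamp a continuous value to [lower, upper] (bottom and top coding)."""
    return max(lower, min(upper, value))


def to_category(value, cut_points, labels):
    """Recast a continuous value into a categorical band.
    cut_points are ascending upper bounds; labels has one extra entry."""
    for bound, label in zip(cut_points, labels):
        if value < bound:
            return label
    return labels[-1]


def swap_values(records, field, n_pairs, rng):
    """Exchange `field` between n_pairs randomly chosen pairs of records."""
    out = [dict(rec) for rec in records]
    chosen = rng.sample(range(len(out)), 2 * n_pairs)
    for a, b in zip(chosen[::2], chosen[1::2]):
        out[a][field], out[b][field] = out[b][field], out[a][field]
    return out


# Illustrative records only; no real data.
records = [
    {"age": 19, "income": 12_000, "county": "A"},
    {"age": 44, "income": 58_000, "county": "B"},
    {"age": 97, "income": 950_000, "county": "C"},
    {"age": 63, "income": 41_000, "county": "D"},
]

treated = [
    {
        "age_band": to_category(r["age"], [25, 45, 65], ["<25", "25-44", "45-64", "65+"]),
        "income": top_bottom_code(r["income"], 15_000, 150_000),
        "county": r["county"],
    }
    for r in records
]
treated = swap_values(treated, "county", n_pairs=1, rng=random.Random(1))
```

Swapping leaves the marginal distribution of the swapped variable unchanged (the same counties appear the same number of times) while breaking its link to the other variables for the swapped records.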
The treatment also includes computing measures of disclosure risk and of the statistical information lost due to the treatment.
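Two simple measures of this kind can be sketched in Python as follows. The risk measure is the share of records that are unique on a chosen set of identifying variables, and the information-loss measure is the relative change in a variable's mean after treatment. Both are generic illustrations chosen for this example, not the measures defined in the MASSC publications, and the variable names and treatment rules are assumptions.

```python
from collections import Counter


def unique_rate(records, keys):
    """Share of records whose combination of `keys` occurs exactly once."""
    combos = Counter(tuple(rec[k] for k in keys) for rec in records)
    uniques = sum(1 for rec in records if combos[tuple(rec[k] for k in keys)] == 1)
    return uniques / len(records)


def relative_mean_change(original, treated, field):
    """Absolute relative difference in the mean of `field` after treatment."""
    mean_orig = sum(r[field] for r in original) / len(original)
    mean_trt = sum(r[field] for r in treated) / len(treated)
    return abs(mean_trt - mean_orig) / abs(mean_orig)


# Illustrative records only; no real data.
original = [
    {"age": 34, "sex": "F", "income": 40_000},
    {"age": 35, "sex": "F", "income": 60_000},
    {"age": 36, "sex": "M", "income": 50_000},
    {"age": 71, "sex": "M", "income": 250_000},
]
# Hypothetical treatment: ages banded by decade, income top-coded at 100,000.
treated = [
    {"age": r["age"] // 10 * 10, "sex": r["sex"], "income": min(r["income"], 100_000)}
    for r in original
]

risk_before = unique_rate(original, ["age", "sex"])
risk_after = unique_rate(treated, ["age", "sex"])
loss = relative_mean_change(original, treated, "income")
```

In this toy file, the treatment halves the share of unique records (from 1.0 to 0.5) at the cost of a 37.5% shift in mean income, which shows the trade-off between disclosure risk and analytic utility that such measures are meant to quantify.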
SDC for Online Analysis Systems
A number of statistical agencies, both in the U.S. and abroad, are developing public-access online data analysis systems that run on sensitive data. Countering the disclosure risks of such systems requires SDC methods designed specifically for them. RTI currently treats the data for some online systems and is developing SDC methods for other online analysis systems.