Academia.edu

Statistical disclosure control

256 papers
80 followers
About this topic
Statistical disclosure control is a set of methods and techniques used to protect the confidentiality of individual data in statistical outputs while maintaining the utility of the data. It aims to prevent the identification of individuals or sensitive information in published statistics through various anonymization and perturbation strategies.

Key research themes

1. How can information leakage be quantified and mitigated when adversaries have imperfect knowledge of joint data distributions?

This research area focuses on refining information leakage metrics to better capture privacy risks when adversaries do not possess complete statistical information about the data and mechanisms. Traditional metrics assume full knowledge of data distributions, an assumption that often fails in practical scenarios. Addressing this gap is crucial for designing privacy-utility trade-offs and optimal disclosure mechanisms under realistic adversarial uncertainty.

Key finding: Introduced novel information-theoretic leakage metrics that account for adversaries lacking full knowledge of joint statistics between private and disclosed data. Experimental results demonstrated that these metrics better...
Key finding: Proposed a risk-aware access control framework that evaluates disclosure risk dynamically and employs adaptive anonymization to mitigate risk in real-time. This approach extends classical binary access control by integrating...
Key finding: Demonstrated a method combining k-anonymity and l-diversity to balance privacy and utility by classifying equivalence classes into utility-preserving and outlier groups and reducing outlier classes. Experiments showed...
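
As a rough illustration of the two privacy models named in this theme, the sketch below measures the k-anonymity and l-diversity levels of a small table. It is a minimal sketch and not the method of any paper listed here: the function names, the field names (`age`, `zip`, `diagnosis`) and the sample records are all hypothetical. An equivalence class is taken to be the set of records sharing the same quasi-identifier values; k is the size of the smallest class and l is the smallest number of distinct sensitive values found in any class.

```python
from collections import defaultdict


def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for rec in records:
        key = tuple(rec[q] for q in quasi_identifiers)
        classes[key].append(rec)
    return classes


def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest equivalence class."""
    classes = equivalence_classes(records, quasi_identifiers)
    return min(len(group) for group in classes.values())


def l_diversity(records, quasi_identifiers, sensitive):
    """Return l: the fewest distinct sensitive values in any class."""
    classes = equivalence_classes(records, quasi_identifiers)
    return min(len({rec[sensitive] for rec in group})
               for group in classes.values())


# Hypothetical, already generalized records (illustrative only).
records = [
    {"age": "30-39", "zip": "081**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "081**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "082**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "082**", "diagnosis": "flu"},
]

k = k_anonymity(records, ["age", "zip"])
l = l_diversity(records, ["age", "zip"], "diagnosis")
print(k, l)
```

In this toy table every class has two records, yet the second class contains a single diagnosis, so the table is 2-anonymous while only 1-diverse. This is the kind of gap that motivates combining the two models.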

2. What are the trade-offs between differential privacy guarantees and data utility in statistical disclosure control for official statistics and census data?

This theme examines the challenges and methodologies in implementing differential privacy (DP) and similar noise-injection mechanisms in official statistical releases. It focuses on balancing rigorous privacy protections against the utility of statistical outputs, especially in the context of sensitive, high-dimensional population and employer-employee datasets. Issues such as noise distribution choice, bounded vs. unbounded noise, and output complexity effects on privacy-utility trade-offs are investigated.

Key finding: Provided a comprehensive analysis distinguishing differential privacy as a risk measure from noisy output mechanisms that enforce privacy. Showed that unbounded noise distributions (e.g., Laplace) required by strict DP may...
Key finding: Developed new algorithms with provable privacy guarantees tailored to linked employer-employee data, using the Pufferfish privacy framework aligned with legal requirements. Empirical evaluation on US Census production data...
Key finding: Introduced an accuracy-optimal mechanism for relaxing privacy levels over time without loss of accuracy when releasing differentially private data across multiple releases. Demonstrated that correlated noise addition can achieve...
Key finding: Presented the TopDown Algorithm (TDA), a large-scale implementation of zero-Concentrated Differential Privacy in the 2020 US Census. The TDA applied differentially private noise to hierarchical tabulations while incorporating...
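
To make the noise-injection idea in this theme concrete, here is a minimal sketch of the textbook Laplace mechanism applied to a single count. It is not the TopDown Algorithm or any other method summarized above, and the function names and parameters (`noisy_count`, `epsilon`, `sensitivity`) are assumptions made for illustration. The noise scale is sensitivity divided by epsilon, and because the Laplace distribution is unbounded, a released count can fall below zero or far from the true value.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two
    independent exponential draws with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Return true_count plus Laplace noise of scale sensitivity/epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return true_count + laplace_noise(scale, rng)


# Smaller epsilon means stronger privacy and noisier output.
rng = random.Random(0)  # fixed seed only to make the demo repeatable
for eps in (0.1, 1.0, 10.0):
    print(eps, noisy_count(100, eps, rng=rng))
```

A production system would also need a cryptographically sound noise source and careful handling of floating-point effects, which this sketch ignores.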

3. How can synthetic data and related statistical disclosure control methods preserve data utility for machine learning and statistical inference while ensuring privacy?

This theme investigates techniques for generating privacy-preserving synthetic datasets and their impact on downstream analytical tasks, including machine learning classification and inference on covariance structures. It covers evaluation of synthetic data generators, the role of anonymization (e.g., microaggregation enhanced by linear discriminant analysis), and statistical procedures adapted for synthetic datasets, balancing confidentiality protection with preserving empirical data utility.

Key finding: Conducted empirical evaluation of supervised machine learning models trained on synthetic health datasets generated using classification and regression trees, parametric, and Bayesian network methods. Found minimal...
Key finding: Proposed an anonymization method integrating linear discriminant analysis to rotate and scale data towards classification thresholds before k-anonymous microaggregation. This approach preserves machine learning accuracy...
Key finding: Derived finite-sample valid statistical tests for generalized variance, sphericity, independence, and regression coefficients based solely on singly imputed synthetic datasets generated via plug-in sampling under a...

All papers in Statistical disclosure control

The amount of computer-stored information is growing faster with each passing day. This growth and the way in which the stored data are accessed through a variety of channels have raised the alarm about the protection of the individual...
Microaggregation is a Statistical Disclosure Control (SDC) technique that aims at protecting the privacy of individual respondents before their data are released. Optimally microaggregating multivariate data sets is known to be an...
Microaggregation is a clustering problem with cardinality constraints that originated in the area of statistical disclosure control for microdata. This article presents a method for multivariate microaggregation based on genetic...
Micro-data protection is a hot topic in the field of Statistical Disclosure Control (SDC), that has gained special interest after the disclosure of 658000 queries by the AOL search engine in August 2006. Many algorithms, methods and...
Microaggregation is a family of methods for statistical disclosure control (SDC) of microdata (records on individuals and/or companies), that is, for masking microdata so that they can be released without disclosing private information on...
Increased corporate, government and academic demand has prompted official statistics to release individual respondent data (microdata) in addition to the traditional tabular data. Microdata must be masked by a statistical disclosure...
Unlike aggregated census tabulations, census microdata provide information about individual persons and households. This makes it possible for researchers to design analyses tailored to their particular research questions. Other microdata...
The application of many anonymization methods is complex and requires knowledge of the methods and access to suitable tools for implementation. For users comfortable with using R, the package sdcMicro [1] provides a tool for the...
Blocking is a well-known technique used to partition a set of records into several subsets of manageable size. The standard approach to blocking is to split the records according to the values of one or several attributes (called blocking...
Protecting personal data is essential to guarantee the rule of law. Due to the new Information and Communication Technologies (ICTs), unprecedented amounts of personal data can be stored and analysed. Thus, if the proper measures are...
k-Anonymity is a privacy model requiring that every combination of key attribute values in a database be shared by at least k records. It has been shown that k-anonymity alone does not always ensure privacy. A number of sophistications of...
Statistical Disclosure Control (SDC) has been an active research area in recent years. The goal is to transform an original dataset X into a protected one X', such that X' does not reveal any relation between confidential and...
Microaggregation for Statistical Disclosure Control (SDC) is a family of methods to protect microdata from individual identification. SDC seeks to protect microdata in such a way that can be published and mined without providing any...
k-anonymous microaggregation is a standard technique to improve privacy of individuals whose personal data is used in microdata databases. Unlike semantic privacy requirements like differential privacy, k-anonymity allows the unrestricted...
This paper describes a geographically intelligent approach to disclosure control for protecting flexibly aggregated census data. Increased analytical power has stimulated user demand for more detailed information for smaller geographical...
(58) Field of Classification Search: 358/1.13, 358/1.18, 1.15, 1.16; 709/231, 234. See application file for complete search history. (56) References Cited. U.S. PATENT DOCUMENTS: 4,649,513 A * 3/1987 Martin et al....
This document provides a comprehensive critical literature analysis of Statistical Disclosure Control (SDC), highlighting its methodologies, applications, and implications for data protection. The significance of SDC lies in its critical...
The assessment of statistical disclosure risk often requires the linking of data. There are effective means of linking data for simple scenarios, but it is not clear how best to approach linkage for more complex scenarios. We examine...
An important aspect of disclosure control is the isolation and control of individual-level records that have a high probability of being identified (as their contents, or variables, are unusual). Consider, for example, a sixteen-year-old...
The Key Variable Mapping System (KVMS) is an approach for identifying matching possibilities across datasets within a data environment. It is a formalised approach for identifying key variables. An overview of KVMS is provided in Elliot...
Data mining is used to extract valuable information from large amounts of data, but this can be harmful in some cases, so some kind of protection is required for sensitive information. Privacy-preserving mining has therefore emerged with the goal to...
This paper surveys the fields of Statistical Disclosure Control (SDC) and Micro-Aggregation Techniques (MATs), which are both areas fundamental to the science of secure Statistical DataBases (SDBs). The paper is written from the...
who was abundantly helpful and offered invaluable assistance, support and guidance. Deepest gratitude is also due to
data user is assessed on two dimensions ing information Ways exist however to resolve the horizontal axis is the level of knowledge about this value paradox in an important context Sta-the legitimate object of empirical inquiry the verti...
With the surging demand for Internet of Things (IoT) healthcare applications, a myriad of data privacy concerns come to light. Cloud computing inherits the risks of exposing data to re-identification vulnerabilities. A secure solution is...
In this paper we will give an overview of the CENEX project and concentrate on the current state of affairs with respect to the ARGUS-software twins. The CENEX (Centre of Excellence) is a new initiative by Eurostat. The main idea behind...
In this paper, we explore how anonymizing data to preserve privacy affects the utility of the classification rules discoverable in the data. In order for an analysis of anonymized data to provide useful results, the data should have as...
During the whole process of data mining (from data collection to knowledge discovery) various sensitive data get exposed to several parties including data collectors, cleaners, preprocessors, miners and decision makers. The exposure of...
We develop a non-parametric imputation method for item non-response based on the well-known hot-deck approach. The proposed imputation method is developed for imputing numerical data that ensure that all record-level edit rules are...
We define the disclosure risk scenarios that led to the statistical disclosure control (SDC) methods for the 2001 UK Census. We examine the SDC methods that were implemented based on a disclosure risk-data utility framework and assess...
An overview of traditional types of data dissemination at statistical agencies is provided including definitions of disclosure risks, the quantification of disclosure risk and data utility and common statistical disclosure limitation...
Statistical agencies are considering making more use of the internet to disseminate census tabular outputs through flexible table generation servers that allow users to define and generate their own tables. The key questions when...
This paper provides a review of common statistical disclosure control (SDC) methods implemented at Statistical Agencies for standard tabular outputs containing whole population counts from a Census (either enumerated or based on a...
Statistical offices are faced with the problem of multiple-database data mining at least for two reasons. On one side, there is a trend to avoid direct collection of data from respondents and use instead administrative data sources to...
This study examines how the adoption of International Financial Reporting Standard (IFRS) 8, Operating Segments, changed the entity-wide geographic segment reporting by European, Australian and New Zealand blue chip companies. The focus...
The Five Safes framework is increasingly widely used for data governance. Since its conception in 2003, it has influenced data management in many ways, particularly in the public sector. As it has become established, both the advantages...
Traditional models of incentivising people suggest that positive incentives are more effective than negative ones. We argue that in data access the opposite can be true, as the assumptions made at the design stage can fundamentally change...
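
Several abstracts in this list describe microaggregation as a clustering problem with cardinality constraints. The sketch below shows the idea in its simplest univariate form and is not the genetic, multivariate, or any other algorithm from the listed papers. It sorts the values, cuts them into consecutive groups of at least k records, and replaces each value with its group mean. The function name `microaggregate` and the sample values are hypothetical.

```python
def microaggregate(values, k):
    """Univariate fixed-size microaggregation (illustrative sketch).

    Values are sorted, split into consecutive groups of k records,
    and each value is replaced by the mean of its group. A trailing
    remainder smaller than k is merged into the last group, so every
    group holds at least k records.
    """
    n = len(values)
    if k < 1 or n < k:
        raise ValueError("need k >= 1 and at least k values")
    order = sorted(range(n), key=lambda i: values[i])
    masked = [0.0] * n
    start = 0
    while start < n:
        end = start + k
        if n - end < k:  # too few records left to form another group
            end = n
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            masked[i] = mean
        start = end
    return masked


# Hypothetical incomes in thousands; each masked value is shared by >= 3 records.
values = [10, 12, 11, 50, 52, 51, 90]
print(microaggregate(values, 3))
```

Replacing values by group means keeps the overall mean unchanged while hiding individual values inside groups of at least k. The outlier 90 in the sample shows the cost: it pulls its group mean well away from the other members, which is why group formation is the hard part of the problem.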