ReplicaX

Introduction

ReplicaX is a method to produce data replicas from data that cannot be published.

ReplicaX won the "Utilization of health data" challenge in the Apps4Finland 2013 competition.

A white paper in English will be available later.

Download

ReplicaX.zip from www.tilastotiede.fi

The archive contains the following files:

  • ReplicaX_functions.r: R functions
  • ReplicaX_template.Rmd: R markdown template needed for the diagnostics report
  • example_MONICA.r: Example with the MONICA data
  • example_tax.r: Example with data on the income and taxes of companies in Finland in 2012
  • yhteiso_tuloverotus_2012.Rdata: The data used by example_tax.r

Diagnostics: explanation of the items in the report

Metadata:

  • Dataset: User-given name for the dataset.
  • Date: Date and time when ReplicaX was called.
  • Number of observations in the original data: Number of rows in the original data frame.
  • Number of observations in the replica data: Number of rows in the replica data frame.
  • Number of pruned observations: Number of observations that were considered potentially risky for confidentiality and were therefore removed from the replica data.
  • Number of variables: Number of columns in the replica data frame.
  • ReplicaX version: Version number of the R code.
  • Random seed: Seed of the random number generator. Can be used to reproduce the process.
  • Resampled: If FALSE, each observation in the original data has been used exactly once as the starting value for the generation of the replica data. If TRUE, the starting values are sampled with replacement from the observations in the original data (bootstrapping).
  • Pruned: If TRUE, potentially risky observations were removed.
  • Diagnostics run: If TRUE, the diagnostic tests have been carried out.
  • Parameter k: Number of nearest neighbors used in the generation of discrete variables. Larger values mean more mixing.
  • Parameter osigma 0.5: Standard deviation of the noise term in the Gaussian domain used in the generation of continuous variables. Larger values mean more mixing.
  • Call in R: How the R function was called.
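
The difference between the two settings of Resampled, and the role of the random seed, can be illustrated with a small sketch in base R. This is not the ReplicaX code: the seed value and the variable names start_once and start_boot are assumptions made for the illustration only.

```r
# Illustration only: how starting observations could be chosen.
# The seed value 2013 is an arbitrary choice for this sketch.
n <- 10  # number of rows in a hypothetical original data frame

# Resampled = FALSE: every original row is used exactly once.
start_once <- seq_len(n)

# Resampled = TRUE: rows are drawn with replacement (bootstrapping),
# so some rows can appear several times and others not at all.
set.seed(2013)
start_boot <- sample.int(n, size = n, replace = TRUE)

# Setting the same seed again reproduces the same draw.
set.seed(2013)
start_boot_again <- sample.int(n, size = n, replace = TRUE)
same_draw <- identical(start_boot, start_boot_again)
```

Here same_draw is TRUE, which is why reporting the seed makes it possible to reproduce a run.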

Diagnostics on confidentiality:

  • Number of matched observations: For each observation in the replica, it is checked whether the corresponding observation in the original data is the closest one according to a similarity measure. This is called a match, and the item gives the number of matched observations.
  • Proportion of matched observations: The number of matched observations divided by the total number of observations in the replica.
  • Number of observations that should be removed to improve confidentiality: The pruning criterion (see Metadata above) is applied to the replica and the number of observations to be pruned is returned. Note that this number is zero if pruning has already been applied (the default).
  • Average similarity for matched observations: The average of the similarity measure for the matched observations in the replica.
  • Average similarity for non-matched observations: The average of the similarity measure for the non-matched observations in the replica.
  • Maximum similarity for matched observations: The maximum of the similarity measure for the matched observations in the replica.
  • Maximum similarity for non-matched observations: The maximum of the similarity measure for the non-matched observations in the replica.
  • Similarity for exact match: The value of the similarity measure if all values are the same for an observation and its matched observation in the original data. Equals the number of variables.
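
The matching check above can be sketched in base R as follows. The similarity function is a hypothetical stand-in, not the measure used by ReplicaX: it only shares the property that an exact match scores the number of variables (1 per variable). The toy data, the scaling by standard deviation, and the assumption that row i of the replica corresponds to row i of the original data (no resampling) are all choices made for this illustration.

```r
# Illustration only: counting matches with a hypothetical similarity measure.
set.seed(2)
n <- 50
original <- data.frame(
  age = rnorm(n, 50, 10),
  bmi = rnorm(n, 26, 4),
  sex = sample(c("F", "M"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
# Toy "replica": the original with noise added to the continuous variables.
replica <- original
replica$age <- replica$age + rnorm(n, sd = 5)
replica$bmi <- replica$bmi + rnorm(n, sd = 2)

# Hypothetical similarity: each variable contributes a value in [0, 1],
# so an exact match scores ncol(data).
similarity <- function(row, data, scales) {
  total <- numeric(nrow(data))
  for (v in names(data)) {
    if (is.numeric(data[[v]])) {
      total <- total + exp(-((data[[v]] - row[[v]]) / scales[[v]])^2)
    } else {
      total <- total + as.numeric(data[[v]] == row[[v]])
    }
  }
  total
}

numeric_vars <- names(original)[sapply(original, is.numeric)]
scales <- sapply(original[numeric_vars], sd)

# sim[i, j]: similarity of replica row i to original row j.
sim <- t(sapply(seq_len(n), function(i) similarity(replica[i, ], original, scales)))

# A match: the most similar original row is the corresponding row.
matched <- apply(sim, 1, which.max) == seq_len(n)
n_matched <- sum(matched)
prop_matched <- mean(matched)

# Similarity of each replica row to its corresponding original row.
own_sim <- diag(sim)
avg_sim_matched <- mean(own_sim[matched])
avg_sim_nonmatched <- mean(own_sim[!matched])

exact_match <- ncol(original)
```

The maxima reported in the diagnostics would be obtained in the same way from own_sim, using max instead of mean. Values of own_sim close to exact_match indicate replica observations that remain very similar to their source observation.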

Diagnostics on statistical properties:

  • Maximum difference in correlations: Correlation matrices are calculated for both the replica and the original data between all continuous variables. This item equals the maximum element of the absolute difference of the two correlation matrices.
  • Variables for the maximum difference in correlations: The variables for which the maximum difference above has been obtained.
  • Correlations: original, replica: The correlation coefficients in the original data and in the replica for the variables on the previous row.
  • Maximum difference in marginal distributions: The maximum difference between the replica and the original data is calculated for each variable. For continuous variables, the difference is the Kolmogorov-Smirnov statistic. For discrete variables, the difference is the maximum difference of the class probabilities. The reported number is the maximum over all variables.
  • Variable for the maximum difference in marginal distributions: The variable for which the maximum in the previous row is found.

Examples

Author

Juha Karvanen

http://www.tilastotiede.fi/juha_karvanen.html