ReplicaX is a method to produce data replicas from data that cannot be published.

ReplicaX won the "Utilization of health data" challenge in the Apps4Finland 2013 competition.

- The ReplicaX page in Apps4Finland 2013 (http://www.apps4finland.fi/kilpailutyo/replicax/) is no longer available, but you can access an archived version (in Finnish)
- Presentation on ReplicaX in Finnish (based on Apps4Finland 2013 version)
- Description (in Finnish) of the development work in Datademo in spring 2014.
- Example of diagnostics report created in Datademo: ReplicaX_report_MONICA.html
- Blog text (in Finnish) describing the motivation for ReplicaX and the development done in Datademo:

A white paper in English will be available later.

| File | Description |
| --- | --- |
| ReplicaX_functions.r | R functions |
| ReplicaX_template.Rmd | R Markdown template needed for the diagnostics report |
| example_MONICA.r | Example with the MONICA data |
| example_tax.r | Example with the data on the income and the taxes of companies in Finland 2012 |
| yhteiso_tuloverotus_2012.Rdata | The data used by example_tax.r |

- **Dataset**: User-given name for the dataset.
- **Date**: Date and time when ReplicaX was called.
- **Number of observations in the original data**: Number of rows in the original data frame.
- **Number of observations in the replica data**: Number of rows in the replica data frame.
- **Number of pruned observations**: Number of observations that were considered potentially risky for confidentiality and therefore removed from the replica data.
- **Number of variables**: Number of columns in the replica data frame.
- **ReplicaX version**: Version number of the R code.
- **Random seed**: Seed of the random number generator. Can be used to replicate the process.
- **Resampled**: If FALSE, each observation in the original data has been used once as the starting value for the generation of the replica data. If TRUE, the starting values are sampled with replacement from the observations in the original data (bootstrapping).
- **Pruned**: If TRUE, potentially risky observations were removed.
- **Diagnostics run**: If TRUE, the diagnostic tests have been carried out.
- **Parameter k**: Number of nearest neighbors used in the generation of discrete variables. Larger values mean more mixing.
- **Parameter osigma**: Standard deviation of the noise term in the Gaussian domain used in the generation of continuous variables. Larger values mean more mixing.
- **Call in R**: How the R function was called.
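
The roles of the **Random seed** and **Resampled** items can be illustrated with a minimal sketch. The helper `pick_start_rows` is hypothetical and is not part of ReplicaX_functions.r; it only shows how starting observations could be chosen with or without bootstrapping, and why a fixed seed makes the choice reproducible.

```r
# Hypothetical helper (not from ReplicaX_functions.r): choose which original
# rows serve as starting values for generating the replica.
pick_start_rows <- function(n, resample = FALSE, seed = NULL) {
  # Fixing the seed makes the random draw reproducible.
  if (!is.null(seed)) set.seed(seed)
  if (resample) {
    # Bootstrapping: draw n row indices with replacement.
    sample.int(n, size = n, replace = TRUE)
  } else {
    # Each original observation is used exactly once.
    seq_len(n)
  }
}

# Same seed, same starting rows.
first_draw  <- pick_start_rows(10, resample = TRUE, seed = 2014)
second_draw <- pick_start_rows(10, resample = TRUE, seed = 2014)
identical(first_draw, second_draw)
```

With `resample = FALSE` the seed has no effect on the starting rows, although it would still matter for any random noise added later in the generation.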

- **Number of matched observations**: For each observation in the replica, it is checked whether the corresponding observation in the original data is the closest one according to a similarity measure. This is called a match, and the item gives the number of these matched observations.
- **Proportion of matched observations**: The number of matched observations divided by the total number of observations in the replica.
- **Number of observations that should be removed to improve confidentiality**: The pruning criterion (see Metadata above) is applied to the replica and the number of observations to be pruned is returned. Note that this number is zero if pruning has already been applied (the default).
- **Average similarity for matched observations**: The average of the similarity measure for the matched observations in the replica.
- **Average similarity for non-matched observations**: The average of the similarity measure for the non-matched observations in the replica.
- **Maximum similarity for matched observations**: The maximum of the similarity measure for the matched observations in the replica.
- **Maximum similarity for non-matched observations**: The maximum of the similarity measure for the non-matched observations in the replica.
- **Similarity for exact match**: The value of the similarity measure if all values are the same for an observation and its matched observation in the original data. Equals the number of variables.
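
The matching check can be sketched as follows. The similarity measure used by ReplicaX is not specified here, so this sketch assumes a Gower-style measure in which every variable contributes a value between 0 and 1 (which is at least consistent with an exact match scoring the number of variables). The function names `similarity_matrix` and `match_diagnostics` are hypothetical, and row i of the replica is assumed to originate from row i of the original data (no resampling, no pruning).

```r
# Assumed Gower-style similarity; the actual ReplicaX measure may differ.
# Returns a matrix with one row per replica row and one column per original row.
similarity_matrix <- function(replica, original) {
  stopifnot(identical(names(replica), names(original)))
  sim <- matrix(0, nrow(replica), nrow(original))
  for (v in names(original)) {
    if (is.numeric(original[[v]])) {
      # Continuous variable: 1 minus the range-scaled absolute difference.
      span <- diff(range(original[[v]]))
      if (span == 0) span <- 1
      d <- abs(outer(replica[[v]], original[[v]], "-")) / span
      sim <- sim + pmax(0, 1 - d)
    } else {
      # Discrete variable: 1 if the classes agree, 0 otherwise.
      sim <- sim + outer(as.character(replica[[v]]),
                         as.character(original[[v]]), "==")
    }
  }
  sim
}

# A replica row is "matched" if its own original row is (one of) the closest.
match_diagnostics <- function(replica, original) {
  stopifnot(nrow(replica) == nrow(original))
  sim <- similarity_matrix(replica, original)
  own <- diag(sim)
  matched <- own >= apply(sim, 1, max)
  list(n_matched = sum(matched),
       prop_matched = mean(matched),
       exact_match_similarity = ncol(original))
}
```

A high proportion of matched observations would suggest that replica rows can still be traced back to their originals, which is the situation pruning is meant to reduce.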

- **Maximum difference in correlations**: Correlation matrices between all continuous variables are calculated for both the replica and the original data. This item equals the maximum element of the absolute difference of the two correlation matrices.
- **Variables for the maximum difference in correlations**: The variables for which the maximum difference above has been obtained.
- **Correlations: original, replica**: The correlation coefficients for the variables on the previous row.
- **Maximum difference in marginal distributions**: The maximum difference between the replica and the original data is calculated for each variable. For continuous variables, the difference is the Kolmogorov-Smirnov statistic. For discrete variables, the difference is the maximum difference of the class probabilities. The reported number is the maximum over all variables.
- **Variable for the maximum difference in marginal distributions**: The variable for which the maximum in the previous row is found.
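
The two distribution checks above can be sketched in base R. This is an illustration under stated assumptions rather than the code in ReplicaX_functions.r: the function names are hypothetical, numeric columns are treated as continuous, and all other columns are treated as discrete.

```r
# Largest absolute difference between the two correlation matrices,
# computed over the numeric (assumed continuous) columns.
max_correlation_difference <- function(original, replica) {
  num <- names(original)[vapply(original, is.numeric, logical(1))]
  diff_mat <- abs(cor(original[num]) - cor(replica[num]))
  idx <- which(diff_mat == max(diff_mat), arr.ind = TRUE)[1, ]
  list(max_diff = max(diff_mat), variables = num[idx])
}

# Difference between the marginal distributions of one variable.
marginal_difference <- function(x_orig, x_rep) {
  if (is.numeric(x_orig)) {
    # Two-sample Kolmogorov-Smirnov statistic (tie warnings suppressed).
    unname(suppressWarnings(ks.test(x_orig, x_rep))$statistic)
  } else {
    # Largest absolute difference in class probabilities.
    lev <- union(as.character(x_orig), as.character(x_rep))
    p_orig <- table(factor(x_orig, levels = lev)) / length(x_orig)
    p_rep <- table(factor(x_rep, levels = lev)) / length(x_rep)
    max(abs(p_orig - p_rep))
  }
}

# Maximum over all variables, with the variable where it occurs.
max_marginal_difference <- function(original, replica) {
  d <- vapply(names(original),
              function(v) marginal_difference(original[[v]], replica[[v]]),
              numeric(1))
  list(max_diff = max(d), variable = names(d)[which.max(d)])
}
```

Both functions expect the original and the replica data frames to have the same column names; values close to zero indicate that the replica preserves the correlations and the marginal distributions of the original data.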

- MONICA example in Apps4Finland 2013: R code | presentation
- MONICA example in Datademo 2014: R code | diagnostics
- Tax example in Datademo 2014: R code | diagnostics