Juha Karvanen, Sangita Kulathinal and Dario Gasbarra

Optimal designs to select individuals for genotyping conditional on observed binary or survival outcomes and non-genetic covariates

In gene-disease association studies, the cost of genotyping makes it economical to use a two-stage design in which only a subset of the cohort is genotyped. At the first stage, follow-up data and some risk factors or non-genetic covariates are collected for the cohort; at the second stage, a subset of the cohort is selected for genotyping. Intuitively, the second-stage subset can be selected efficiently if the data collected at the first stage are utilized. The subset is selected by maximizing the information contained in the conditional probability of the genotype given the first-stage data and initial estimates of the parameters of interest. The proposed selection method is illustrated using logistic regression and Cox's proportional hazards model, and algorithms that can find optimal or nearly optimal designs in a discrete design space are presented. Simulation comparisons between the D-optimal design, extreme selection and the case-cohort design suggest that the D-optimal design is the most efficient in terms of the variance of the estimated parameters, but extreme selection may be a good alternative for practical study design.

Keywords: case-cohort design; case-control design; extreme selection; D-optimal design; greedy method; iterative replacement method; logistic regression; proportional hazards model; selective genotyping; two-stage design