Imputation across genotyping arrays for genome-wide association studies: assessment of bias and a correction strategy

Eric Otto Johnson; Dana Bowling Hancock; Joshua Lerner Levy; Nathan Clay Gaddis; Nancy L Saccone; LJ Bierut; Grier Philip Page

Johnson, E., Hancock, D., Levy, J., Gaddis, N., Saccone, NL., Bierut, LJ., & Page, G. (2013). Imputation across genotyping arrays for genome-wide association studies: assessment of bias and a correction strategy. Human Genetics, 132(5), 509-522. https://doi.org/10.1007/s00439-013-1266-7

Abstract

A great promise of publicly sharing genome-wide association data is the potential to create composite sets of controls. However, studies often use different genotyping arrays, and imputation to a common set of SNPs has shown substantial bias, a problem that has no broadly applicable solution. Based on the idea that using differing genotyped SNP sets as inputs creates differential imputation errors and thus bias in the composite set of controls, we examined the degree to which each of the following occurs: (1) imputation based on the union of genotyped SNPs (i.e., SNPs available on one or more arrays) results in bias, as evidenced by spurious associations (type 1 error) between imputed genotypes and arbitrarily assigned case/control status; (2) imputation based on the intersection of genotyped SNPs (i.e., SNPs available on all arrays) does not evidence such bias; and (3) imputation quality varies by the size of the intersection of genotyped SNP sets. Imputations were conducted in European Americans and African Americans with reference to HapMap phase II and III data. Imputation based on the union of genotyped SNPs across the Illumina 1M and 550v3 arrays showed spurious associations for 0.2% of SNPs: ~2,000 false positives per million SNPs imputed. Biases remained problematic for very similar arrays (550v1 vs. 550v3) and were substantial for dissimilar arrays (Illumina 1M vs. Affymetrix 6.0). In all instances, imputing based on the intersection of genotyped SNPs (as few as 30% of the total SNPs genotyped) eliminated such bias while still achieving good imputation quality.
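
The union and intersection input sets contrasted in the abstract, and the arithmetic behind the "~2,000 false positives per million" figure, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the array names, the rsIDs, and the helper functions `snp_sets` and `expected_false_positives` are hypothetical, and the shared fraction is computed relative to the union of genotyped SNPs, which is an assumption about how "total SNPs genotyped" is defined.

```python
def snp_sets(arrays):
    """Return (union, intersection) of SNP IDs across genotyping arrays.

    `arrays` maps an array name to the SNP IDs genotyped on it.
    """
    sets = [set(snps) for snps in arrays.values()]
    union = set().union(*sets)              # SNPs on one or more arrays
    intersection = set.intersection(*sets)  # SNPs on every array
    return union, intersection


def expected_false_positives(rate, n_imputed):
    """Number of spurious associations implied by a given rate."""
    return round(rate * n_imputed)


# Hypothetical toy SNP lists, not real array content.
arrays = {
    "array_A": ["rs1", "rs2", "rs3", "rs4", "rs5"],
    "array_B": ["rs2", "rs3", "rs5", "rs6"],
}

union, intersection = snp_sets(arrays)

# Assumption: shared fraction is measured against the union.
shared_fraction = len(intersection) / len(union)

# 0.2% of one million imputed SNPs, the figure quoted in the abstract.
false_positives = expected_false_positives(0.002, 1_000_000)

print(sorted(intersection), shared_fraction, false_positives)
```

In this sketch, only the SNPs in `intersection` would be passed to the imputation software as genotyped input for every study in the composite control set, so that all samples are imputed from the same starting markers.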