NIH Research Festival
Standard analysis methods for genome-wide association studies (GWAS) are not robust to complex disease models (e.g., non-linear interaction effects), even though such effects likely contribute to the heritability of complex human traits. Machine learning methods, such as Random Forests (RF), are an alternative approach that may be better suited to identifying these effects. One limitation of RF is that there is no standardized method for selecting variables with a low false positive rate (FPR) while retaining power. We have developed a method called r2VIM, which incorporates recurrency and variance estimation into RF to guide optimal threshold selection on the variable importance measure (VIM). We assess how r2VIM performs in simulated data with complex effects (multiple loci with interactions and main effects). Our findings indicate that the optimal threshold can identify interactions with adequate detection power and a low FPR. For example, the optimal VIM threshold had an average detection power of 0.80 and an average FPR of 0.11 for a model with a two-locus interaction and no main effects. However, the optimal threshold is highly dependent on the simulated genetic model, which is unknown in biological data. To address this, we permute the phenotype and re-run r2VIM to generate a null distribution of VIMs. We then use this null distribution to choose a threshold for the non-permuted analysis by comparing FPR estimates at different VIM thresholds. Our initial results show that the best balance between FPR and detection power is produced by selecting the VIM threshold whose FPR is close to 0.05 in the permuted data.
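The permutation-based threshold choice described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`estimate_fpr`, `choose_threshold`, `select_variables`), the candidate thresholds, and the SNP names are hypothetical, and Gaussian noise stands in for the VIMs that r2VIM would produce on phenotype-permuted data. The sketch assumes the FPR at a threshold is estimated as the fraction of null VIMs at or above it.

```python
import random


def estimate_fpr(null_vims, threshold):
    """Fraction of null (permuted-phenotype) VIMs at or above the threshold."""
    return sum(v >= threshold for v in null_vims) / len(null_vims)


def choose_threshold(null_vims, candidates, target_fpr=0.05):
    """Return the (threshold, estimated FPR) pair closest to the target FPR."""
    scored = [(t, estimate_fpr(null_vims, t)) for t in candidates]
    return min(scored, key=lambda pair: abs(pair[1] - target_fpr))


def select_variables(observed_vims, threshold):
    """Keep variables whose VIM in the non-permuted analysis meets the threshold."""
    return [name for name, vim in observed_vims.items() if vim >= threshold]


# Assumption: stand-in null distribution. In practice these values would come
# from re-running the RF analysis after permuting the phenotype.
rng = random.Random(0)
null_vims = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# Assumption: an arbitrary grid of candidate VIM thresholds.
candidates = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
threshold, fpr = choose_threshold(null_vims, candidates)

# Assumption: made-up VIMs from the non-permuted analysis.
observed_vims = {"snp1": 4.2, "snp2": 0.3, "snp3": 1.8}
selected = select_variables(observed_vims, threshold)
print(f"threshold={threshold}, estimated FPR={fpr:.3f}, selected={selected}")
```

A coarse grid such as this one may not contain a threshold whose estimated FPR is exactly 0.05, so the sketch picks the closest candidate; a finer grid, or a quantile of the null VIMs, would land nearer the target.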
Scientific Focus Area: Genetics and Genomics
This page was last updated on Friday, March 26, 2021