NIH Research Festival
FAES Terrace
NHGRI
GEN-11
Intro: The Undiagnosed Diseases Program (UDP) enrolls participants whose disorders remain undiagnosed despite extensive prior clinical evaluation. The UDP uses genomic analysis optimized to detect variants that standard analyses may miss. This high-sensitivity approach generates many false positives. Methods for prioritizing results include identifying short-read misalignments. We hypothesize that current mapping and genotype scores do not adequately capture alignment quality for individual variants. To test this hypothesis, we are developing a machine-learning-based tool that ranks variants by the quality of their supporting alignments.
Methods: We are generating a list of alignment characteristics, building on those in existing tools such as the GATK pipelines. These characteristics are being built into a random forest classifier. The model will be trained and tested with a combination of highly characterized genomes (e.g., Genome in a Bottle), synthetic genomes, and a set of 7748 hand-curated variants from prior UDP evaluations.
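As a rough illustration of what per-variant alignment characteristics might look like before they are passed to a classifier, the sketch below summarizes the reads covering one candidate variant. The `ReadSummary` fields and the feature names are hypothetical choices made for this example; they are not the feature list under development, and they are not taken from GATK.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ReadSummary:
    """Hypothetical per-read summary at a candidate variant site."""
    mapq: int            # mapping quality of the read
    mismatches: int      # mismatches against the reference
    soft_clipped: bool   # True if the alignment is soft-clipped
    supports_alt: bool   # True if the read carries the alternate allele


def alignment_features(reads):
    """Return a dict of illustrative alignment features for one variant."""
    if not reads:
        raise ValueError("at least one read is required")
    alt_reads = [r for r in reads if r.supports_alt]
    return {
        "depth": len(reads),
        "alt_fraction": len(alt_reads) / len(reads),
        "mean_mapq": mean(r.mapq for r in reads),
        "soft_clip_fraction": sum(r.soft_clipped for r in reads) / len(reads),
        # Mismatch load on alt-supporting reads; 0.0 when no read supports alt.
        "mean_alt_mismatches": (
            mean(r.mismatches for r in alt_reads) if alt_reads else 0.0
        ),
    }


# Made-up reads, for illustration only.
example_reads = [
    ReadSummary(mapq=60, mismatches=1, soft_clipped=False, supports_alt=True),
    ReadSummary(mapq=60, mismatches=0, soft_clipped=False, supports_alt=False),
    ReadSummary(mapq=20, mismatches=5, soft_clipped=True, supports_alt=True),
    ReadSummary(mapq=40, mismatches=0, soft_clipped=False, supports_alt=False),
]
features = alignment_features(example_reads)
```

Each variant would yield one such feature dictionary, which could form a row of the training matrix for a random forest or any other classifier.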
Results: Initial assessment of potential model classifiers has demonstrated marked operator bias in hand-curated datasets. Evaluation of a heuristic alignment-filtering system from a prior project suggests that some specific alignment patterns, such as more than two haplotypes covering the called variant, provide information that is not captured by traditional quality score filters.
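The haplotype pattern mentioned above can be sketched as a simple check. The example assumes each read covering the variant has already been reduced to a haplotype signature (here, a string of alleles at nearby sites), and that a haplotype counts only if at least `min_reads` reads support it. Both the signature representation and the threshold are assumptions made for illustration, not details of the prior project's filter.

```python
from collections import Counter


def count_haplotypes(signatures, min_reads=2):
    """Count distinct haplotype signatures supported by >= min_reads reads."""
    counts = Counter(signatures)
    return sum(1 for n in counts.values() if n >= min_reads)


def exceeds_diploid_expectation(signatures, min_reads=2):
    """Flag a site where more than two supported haplotypes cover the variant."""
    return count_haplotypes(signatures, min_reads) > 2


# Made-up read signatures, for illustration only.
clean_site = ["AC", "AC", "AC", "GT", "GT", "GT"]
suspect_site = ["AC", "AC", "GT", "GT", "GC", "GC", "AT"]

clean_flag = exceeds_diploid_expectation(clean_site)
suspect_flag = exceeds_diploid_expectation(suspect_site)
```

In a diploid sample, more than two well-supported haplotypes at one locus is consistent with reads from paralogous regions being mapped together, which is why such a flag can carry information beyond per-read quality scores.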
Conclusion: We present preliminary data from in-progress work on a machine learning classifier designed to assist with prioritizing results in noisy short-read variant datasets. We hope this work will prompt discussion and feedback that will be useful during tool development.
Scientific Focus Area: Genetics and Genomics
This page was last updated on Monday, September 25, 2023