Assessing single-nucleotide polymorphism using the KPGP-38 Human Genomes next-generation sequencing data from CAMDA

Friday, November 08, 2013 — Poster Session IV
2:00 p.m. – 4:00 p.m.	FAES Academic Center (Upper-Level Terrace)	FDA/CBER	COMPBIO-19

Authors

V Soika
W Zhang
J Shen
J Meehan
Z Su
W Ge
H Fang
R Perkins
H Hong
W Tong
V Simonyan

Abstract

Next-generation sequencing (NGS) data analysis is challenging because of the massive data sizes involved. Fast and accurate analysis is required to realize the potential of NGS data. Short-read alignment and mapping, which are required to call single-nucleotide polymorphisms (SNPs), have become the preferred approach in current genetic studies. The Critical Assessment of Massive Data Analysis (CAMDA) consortium hosts the KPGP-38 Human Genomes NGS data. Importantly, this dataset's high coverage and its inclusion of two different sets of twins and a Caucasian female provide a suitable opportunity to explore quality control metrics for improving accuracy in SNP and genotype calling and to investigate aspects of population genetics. We used the High-performance Integrated Virtual Environment (HIVE), a cloud-based environment optimized for the storage and analysis of extra-large data, primarily NGS data, to align the CAMDA data to the whole human genome. HIVE was then used to build a profile of the alignments relative to the genome reference and to call SNPs for the CAMDA data. The SNP data were further analyzed using the twin pairs KPGP88/KPGP89 and KPGP90/KPGP91, which allowed us to assess the SNP calls across the whole data set.
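
One way a twin-based SNP assessment could be sketched is as a genotype concordance check between the two members of a pair. The Python below is an illustration only, not the HIVE implementation: it assumes the twins are monozygotic (the abstract does not state this), that SNP calls are available as VCF-style genotype strings keyed by chromosome and position, and the function names and toy genotypes are hypothetical.

```python
from typing import Dict, Tuple

Site = Tuple[str, int]      # (chromosome, 1-based position)
Calls = Dict[Site, str]     # site -> VCF-style genotype string, e.g. "0/1"


def normalize(gt: str) -> Tuple[str, ...]:
    """Ignore phasing and allele order, so that '1|0' equals '0/1'."""
    return tuple(sorted(gt.replace("|", "/").split("/")))


def twin_concordance(a: Calls, b: Calls) -> Dict[str, float]:
    """Compare the genotype calls of two samples site by site."""
    shared = a.keys() & b.keys()
    matching = sum(1 for s in shared if normalize(a[s]) == normalize(b[s]))
    return {
        "shared_sites": len(shared),
        "only_in_a": len(a.keys() - b.keys()),
        "only_in_b": len(b.keys() - a.keys()),
        # Fraction of shared sites with identical genotypes.
        "concordance": matching / len(shared) if shared else float("nan"),
    }


# Toy genotypes, invented for illustration; not taken from the KPGP data.
twin_1 = {("chr1", 1000): "0/1", ("chr1", 2000): "1/1", ("chr2", 500): "0/1"}
twin_2 = {("chr1", 1000): "1|0", ("chr1", 2000): "0/1", ("chr3", 42): "1/1"}

print(twin_concordance(twin_1, twin_2))
```

Under the monozygotic assumption, discordant sites and sites called in only one twin are candidates for calling errors, so the rates at which they occur can serve as a rough quality metric for the calls in the remaining samples.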