Exploring alternative domain architectures as a way to improve annotation consistency within the Conserved Domain Database (CDD)

Wednesday, September 24, 2014 — Poster Session IV
10:00 a.m. – 12:00 p.m.	FAES Academic Center	NLM	STRUCTBIO-2

* FARE Award Winner

Authors

MK Derbyshire
NR Gonzales
S Lu
J He
Z Wang
SH Bryant
A Marchler-Bauer

Abstract

NCBI’s CDD is a collection of protein domain models, collected as multiple sequence alignments and converted into position-specific score matrices. It uses RPS-BLAST to match protein sequences with these models. CDD includes imported models (Pfam, TIGRFAMs and others) as well as finer-grained hierarchical classifications, based on phylogenetic analysis, for selected domain families curated by NCBI staff. CDD supports a live search service for protein and nucleotide queries, as well as pre-computed domain and site annotation for the majority of protein sequences tracked by NCBI’s Entrez system. For both, a default RPS-BLAST E-value (reporting) threshold is applied. Here we examine whether collecting additional search-database hits obtained at a raised E-value threshold can uncover domain architectures that are common enough to provide a viable alternative to the architectures assigned at the default E-value reporting threshold. We also examine whether suppressing annotation with E-values close to the reporting threshold can be effective in removing rare and unlikely domain architectures. This work was supported by the Intramural Research Program of the NIH, National Library of Medicine.
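
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the hit records, the function names, and all three cutoff values are assumptions chosen for illustration, and the sketch does not resolve overlapping hits as a real annotation procedure would. It treats a domain architecture as the ordered tuple of domain models hitting a protein, then counts how often each architecture occurs at a default, a raised, and a stricter E-value cutoff.

```python
from collections import Counter

# Hypothetical RPS-BLAST-style hits: (query id, domain model, start position, E-value).
HITS = [
    ("p1", "A", 10, 1e-30), ("p1", "B", 200, 1e-20),
    ("p2", "A", 12, 1e-25), ("p2", "B", 210, 0.5),    # weak B hit, above the default cutoff
    ("p3", "A", 8, 1e-28),  ("p3", "C", 190, 0.009),  # C hit just under the default cutoff
]

# Illustrative cutoffs only; these values are assumptions, not taken from the abstract.
DEFAULT_CUTOFF = 0.01
RAISED_CUTOFF = 1.0
STRICT_CUTOFF = 0.001


def architectures(hits, evalue_cutoff):
    """Return {query: tuple of domain names in sequence order} for hits at or below the cutoff."""
    per_query = {}
    for query, domain, start, evalue in hits:
        if evalue <= evalue_cutoff:
            per_query.setdefault(query, []).append((start, domain))
    return {q: tuple(d for _, d in sorted(v)) for q, v in per_query.items()}


def architecture_counts(hits, evalue_cutoff):
    """Count how many queries share each architecture at the given cutoff."""
    return Counter(architectures(hits, evalue_cutoff).values())


default_counts = architecture_counts(HITS, DEFAULT_CUTOFF)
raised_counts = architecture_counts(HITS, RAISED_CUTOFF)
strict_counts = architecture_counts(HITS, STRICT_CUTOFF)

for label, counts in [("default", default_counts),
                      ("raised", raised_counts),
                      ("strict", strict_counts)]:
    print(label, dict(counts))
```

In this toy data, raising the cutoff turns the singleton architecture of `p2` into the more common A-B architecture, while the stricter cutoff drops the near-threshold C hit on `p3` and so removes the rare A-C architecture.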
