Improve prokaryotic RefSeq genome annotation based on Gpipe and manual curation

Friday, September 18, 2015 — Poster Session IV

12:00 p.m. – 1:30 p.m.
FAES Terrace


  • W Li
  • K O'Neill
  • S Ciufo
  • K Pruitt


With the advancement of DNA sequencing technology, the number of sequenced bacterial genomes has increased rapidly in the last decade. However, these genomes have been annotated by different groups using different annotation programs, which leads to problems such as inconsistent protein names, uninformative annotation, and redundant proteins from clonal isolates. The National Center for Biotechnology Information (NCBI) RefSeq database is a collection of non-redundant, curated DNA, RNA, and protein sequences. NCBI has developed a prokaryotic genome annotation pipeline, which is offered as a service for prokaryotic genome submissions to GenBank and is also used to provide consistently annotated RefSeq prokaryotic genomes. A small dataset of 122 reference genomes that are manually annotated by collaborating groups and NCBI staff is also available. RefSeq has adopted a novel data model for protein sequence representation that significantly reduces protein redundancy and makes prokaryotic protein names easier to manage and update. RefSeq prokaryotic genome data can be searched in BLAST databases, browsed through web resources (Assembly, BioProject, Genome, Nucleotide, and Protein), retrieved using NCBI’s programming utilities, or downloaded from the genomes or refseq FTP sites.
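
As an illustration of programmatic access, the following is a minimal sketch of how a search request to NCBI's programming utilities (E-utilities) might be constructed. It is not part of the poster itself. The helper name build_esearch_url, the choice of the nuccore database, and the example query term are assumptions made for illustration. The code only builds the request URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, db="nuccore", retmax=20):
    """Return an ESearch URL for the given Entrez query (hypothetical helper)."""
    params = {
        "db": db,           # Entrez database to search, e.g. nuccore or protein
        "term": term,       # Entrez query string
        "retmax": retmax,   # maximum number of record IDs to return
        "retmode": "json",  # ask for a JSON response
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Illustrative query: RefSeq nucleotide records for one organism.
url = build_esearch_url('"Escherichia coli"[Organism] AND refseq[filter]')
print(url)
```

The resulting URL can be fetched with any HTTP client. The record IDs in the response could then be passed to other E-utilities calls, subject to NCBI's usage guidelines.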

Category: Genetics and Genomics