Statistical meaning of 2-D and 3-D molecular similarity scores used in PubChem

Thursday, October 11, 2012 — Poster Session IV

2:00 p.m. – 4:00 p.m.

Natcher Conference Center, Building 45




  • S. Kim
  • E.E. Bolton
  • S.H. Bryant


PubChem is a public repository for biological activities of small molecules. It archives biological screening data and other chemical information from various data sources and offers its contents free of charge to the biomedical research community, facilitating the discovery of drugs and chemical probes. To support efficient use of this large body of chemical information, PubChem provides various search and analysis tools, many of which exploit the concept of molecular similarity at some level. Although molecular similarity methods, including those used in PubChem, are routinely applied to the analysis of biological data and to virtual screening, little is known about the statistical meaning of the similarity scores they produce. To address this issue, similarity value distribution curves for randomly selected compounds were generated using the 2-D and 3-D molecular similarity methods employed by PubChem. The study also examined whether a statistically meaningful separation in similarity values could be achieved between reputed biological assay actives and inactives. In addition, the complementarity between PubChem’s 2-D and 3-D similarity methods was investigated. This work is a critical step toward a statistical framework on which more reliable ligand-based virtual screening approaches can be built.
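
The idea of a similarity value distribution for randomly selected compounds can be illustrated with a minimal sketch. The abstract does not state which scoring function or sampling procedure was used, so everything here is an assumption made for illustration: the Tanimoto coefficient on binary fingerprints as the 2-D measure, toy random fingerprints in place of real PubChem data, an arbitrary fingerprint length, and the hypothetical function names tanimoto, random_pair_scores, and tail_probability.

```python
import random


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    # Two empty fingerprints share nothing; define their similarity as 0.
    return len(a & b) / union if union else 0.0


def random_pair_scores(fingerprints, n_pairs, seed=0):
    """Score n_pairs randomly drawn pairs to build a background distribution."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_pairs):
        a, b = rng.sample(fingerprints, 2)  # two distinct list positions
        scores.append(tanimoto(a, b))
    return scores


def tail_probability(background, score):
    """Fraction of background scores at or above `score` (an empirical p-value)."""
    if not background:
        raise ValueError("background distribution is empty")
    return sum(s >= score for s in background) / len(background)


# Toy data (assumption): 200 random fingerprints, 256 bits, about 15% of bits on.
# Real fingerprints are far from uniformly random, so the numbers printed here
# say nothing about PubChem itself.
_rng = random.Random(42)
toy_fingerprints = [
    frozenset(i for i in range(256) if _rng.random() < 0.15) for _ in range(200)
]
background = random_pair_scores(toy_fingerprints, n_pairs=5000)
print("mean random-pair similarity:", sum(background) / len(background))
print("P(score >= 0.2) among random pairs:", tail_probability(background, 0.2))
```

Read this way, a similarity score is statistically meaningful to the extent that random pairs rarely reach it; the same tail-probability calculation could in principle be applied to any similarity measure, 2-D or 3-D, once a background distribution is available.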

