
NLP-derived information improves the estimates for disease risk compared to estimates based on manually extracted data alone.

Tuesday, October 09, 2012 — Poster Session I

1:00 p.m. – 3:00 p.m.

Natcher Conference Center, Building 45




  • F.M. Callaghan
  • M.T. Jackson
  • D. Demner-Fushman
  • S. Abhyankar
  • C.J. McDonald


Clinical narrative notes are a potentially important source of information, but they must be structured before they can be analyzed. Medical abstractors manually extract variables from unstructured text; this process is time- and labor-intensive and is impractical when large quantities of text must be analyzed. Natural language processing (NLP) is a collection of automated or partially automated methods that can assist abstractors and, in some cases, replace them. State-of-the-art NLP systems achieve recall and precision of approximately 80-95%. However, most statistical methods assume that variables are measured without error; when error-prone variables are used, these methods are subject to bias, loss of power, and other problems. We propose a statistical method, new to the NLP literature, that adjusts for measurement error. Via simulations, we show that the resulting estimates are often unbiased and that the associated tests are more powerful than those based on unadjusted methods. We apply our method to the problem of using smoking status (derived via NLP from clinical notes) to estimate the increased risk of smoking-related cancers. Our method yields an estimated cancer risk that is 40% higher than the estimate from unadjusted NLP methods.
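The abstract does not name the adjustment method, so the sketch below should not be read as the authors' approach. It illustrates the general problem with a standard textbook technique: back-calculating a 2x2 table when a binary exposure (here, smoking status) is classified with known sensitivity and specificity. All counts, the sensitivity (which corresponds to recall), and the specificity are hypothetical values chosen for illustration, and the function names are made up for this example.

```python
def odds_ratio(exp_cases, unexp_cases, exp_controls, unexp_controls):
    """Odds ratio from a 2x2 exposure-by-disease table."""
    return (exp_cases * unexp_controls) / (unexp_cases * exp_controls)


def misclassify(exposed, unexposed, sens, spec):
    """Expected observed counts when exposure is labeled with the given
    sensitivity and specificity (non-differential misclassification)."""
    obs_exposed = sens * exposed + (1 - spec) * unexposed
    obs_unexposed = (1 - sens) * exposed + spec * unexposed
    return obs_exposed, obs_unexposed


def adjust(obs_exposed, obs_unexposed, sens, spec):
    """Invert the misclassification step to recover the true counts.
    Requires sens + spec > 1."""
    total = obs_exposed + obs_unexposed
    exposed = (obs_exposed - (1 - spec) * total) / (sens + spec - 1)
    return exposed, total - exposed


# Hypothetical true table: smokers / non-smokers among cases and controls.
true_cases = (300, 200)
true_controls = (200, 300)
SENS, SPEC = 0.90, 0.90  # assumed NLP classifier performance

true_or = odds_ratio(*true_cases, *true_controls)

# What an analyst would see if smoking status came from the NLP labels.
obs_cases = misclassify(*true_cases, SENS, SPEC)
obs_controls = misclassify(*true_controls, SENS, SPEC)
naive_or = odds_ratio(*obs_cases, *obs_controls)

# Adjusting with the known error rates recovers the true association.
adj_cases = adjust(*obs_cases, SENS, SPEC)
adj_controls = adjust(*obs_controls, SENS, SPEC)
adjusted_or = odds_ratio(*adj_cases, *adj_controls)

print(f"true OR:     {true_or:.3f}")
print(f"naive OR:    {naive_or:.3f}")
print(f"adjusted OR: {adjusted_or:.3f}")
```

In this toy setting the naive odds ratio is pulled toward 1, which is the kind of bias the abstract describes, and the adjusted value matches the true one because the error rates are known exactly. In practice the error rates would have to be estimated, for example from a manually abstracted validation sample, which adds uncertainty that this sketch ignores.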
