
An algorithm to map free-text job descriptions in epidemiological surveys to standard occupational codes

Thursday, October 11, 2012 — Poster Session III

10:00 a.m. – Noon

Natcher Conference Center, Building 45




  • S. Butcher
  • M.C. Friesen
  • W.W. Lau
  • C.A. Johnson


Population-based epidemiological studies of industrial chemical exposures often rely on job title information collected as participants’ verbatim responses. Currently, these free-text job titles are manually assigned Standard Occupational Classification (SOC) codes for downstream analysis. The goal of this project is to develop machine learning techniques that automate SOC assignment and reduce the burden on epidemiologists and exposure assessors. We analyzed 13,317 unique job description entries from the Integrated Management Information System (IMIS), a database of measurements collected by Occupational Safety and Health Administration inspectors. A major challenge is the ambiguity of the entries, which stems in part from their short length and from frequent abbreviations and misspellings. We used a k-nearest-neighbor (kNN) classifier, with SOC definitions and the associated job title synonyms from the Occupational Information Network (O*NET) database as exemplars, to identify potential SOC matches. Using this algorithm, we developed a prototype that helps users select the correct mapping from a list of SOC candidates for a given free-text job description. Once enough annotations have been collected, a more robust classifier can be trained to further automate the mapping and improve on the accuracy currently achievable.
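
The candidate-ranking step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the text features or distance measure used, so character trigrams with cosine similarity are assumed here because they tolerate abbreviations and misspellings. The exemplar list, the function names, and the value of k are hypothetical, and the SOC codes are shown for illustration only.

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n=3):
    """Count character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical exemplars: (SOC code, job title synonym). A real system would
# load SOC definitions and title synonyms from an occupational database.
EXEMPLARS = [
    ("51-4121", "welder"),
    ("51-4121", "arc welder"),
    ("51-4121", "solderer"),
    ("47-2111", "electrician"),
    ("47-2111", "wireman"),
    ("53-3032", "truck driver"),
    ("53-3032", "tractor trailer operator"),
]

def rank_soc_candidates(job_text, exemplars=EXEMPLARS, k=3):
    """Return SOC codes ranked by similarity-weighted votes of the k nearest exemplars."""
    query = char_ngrams(job_text)
    scored = sorted(
        ((cosine(query, char_ngrams(title)), code) for code, title in exemplars),
        reverse=True,
    )
    votes = Counter()
    for score, code in scored[:k]:
        if score > 0:  # ignore exemplars that share no n-grams with the query
            votes[code] += score
    return [code for code, _ in votes.most_common()]

if __name__ == "__main__":
    for entry in ["welder helpr", "elec.", "truck drivr"]:
        print(entry, "->", rank_soc_candidates(entry))
```

Returning a ranked list rather than a single code mirrors the prototype's role of offering SOC candidates for a user to confirm; the confirmed choices could later serve as training annotations.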
