NIH Research Festival
The use of Big Data in healthcare has focused attention on data quality and fitness for analysis. For example, the plan for the NIH’s Precision Medicine Initiative cohort (1 million patients) considers a retrospective database that stitches together as many data sources as possible. However, the resulting dataset represents only a partial picture of the real world because its completeness is limited by the characteristics of the participating sources. We used a set of five patient-level databases from the Reagan-Udall Foundation (ranging from 4 to 141 million patients) to evaluate data quality. Our analysis was stratified by 5-year age intervals (e.g., the interval ‘R60’ refers to records of patients aged 60 to 64) and by record subdomains (e.g., procedures or laboratory results/claims). For each age group and subdomain, we generated an average patient record archetype: the set of events that appear in the largest proportion of records in that sub-cohort and that would therefore be expected in a comprehensive lifetime health record. For example, in the MarketScan Claims database (141M patients), 92.9% of R60 records contain a lipid panel, TSH, or CBC laboratory claim. Using the largest database as a silver standard to generate average archetypes, we found large differences among the remaining four databases, which revealed limited completeness of some databases for many highly expected events (e.g., OB/GYN checks, vision screenings, chest X-rays). The top 100 events by domain are available at http://1drv.ms/1y5z5WC. Comparing average archetypes across databases, or applying the same (silver standard) archetype to different databases, can help define a database’s usefulness for research and reveal how its data sources were stitched together.
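The archetype idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record layout (patient id, age, subdomain, event code), the function names, and the toy event codes are all assumptions made for illustration. It computes, for each age stratum and subdomain, the share of patients with each event, keeps the most common events as the archetype, and then measures how far a second database falls short of a reference (silver standard) archetype.

```python
from collections import defaultdict

def age_stratum(age):
    """Label a 5-year age interval by its lower bound, e.g. 62 -> 'R60'."""
    return f"R{(age // 5) * 5}"

def build_archetypes(records, top_n=100):
    """Build per-stratum, per-subdomain archetypes.

    records: iterable of (patient_id, age, subdomain, event_code) tuples
    (a hypothetical layout). Returns {(stratum, subdomain): [(event, share)]},
    sorted by descending share of patients in the stratum having the event.
    """
    patients_in_stratum = defaultdict(set)
    patients_with_event = defaultdict(set)
    for patient_id, age, subdomain, event in records:
        stratum = age_stratum(age)
        patients_in_stratum[stratum].add(patient_id)
        patients_with_event[(stratum, subdomain, event)].add(patient_id)

    archetypes = defaultdict(list)
    for (stratum, subdomain, event), pids in patients_with_event.items():
        share = len(pids) / len(patients_in_stratum[stratum])
        archetypes[(stratum, subdomain)].append((event, share))
    for key in archetypes:
        # Most common events first; ties broken by event name for stability.
        archetypes[key].sort(key=lambda pair: (-pair[1], pair[0]))
        archetypes[key] = archetypes[key][:top_n]
    return dict(archetypes)

def compare_to_reference(reference, other):
    """Share gap (reference minus other) for every event in the reference.

    Events absent from the other database's archetype count as a share of 0.
    """
    gaps = {}
    for key, events in reference.items():
        other_shares = dict(other.get(key, []))
        for event, ref_share in events:
            gaps[key + (event,)] = ref_share - other_shares.get(event, 0.0)
    return gaps

# Toy data, invented for illustration only.
reference_db = [
    (1, 60, "lab", "lipid_panel"),
    (2, 62, "lab", "lipid_panel"),
    (2, 62, "lab", "tsh"),
    (3, 64, "lab", "cbc"),
]
smaller_db = [(10, 61, "lab", "cbc"), (11, 63, "lab", "cbc")]

reference = build_archetypes(reference_db)
gaps = compare_to_reference(reference, build_archetypes(smaller_db))
```

A large positive gap for an event flags it as expected in the reference database but under-represented in the database being assessed, which is one way to surface limited completeness.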
Scientific Focus Area: Epidemiology
This page was last updated on Friday, March 26, 2021