NIH Research Festival
Transformer-based large vision models (LVMs) have enormous capacity to locate visual patterns and capture semantic information from images. In this study, we applied several explainable AI (XAI) techniques, including heatmaps of attention heads, mean attention distance, and self-attention span, to explore the internal mechanisms by which LVMs capture disease-significant patterns. For the experiments, the data comprised chest X-rays (CXRs) retrieved from MIDRC. The results show that an LVM pretrained on large image datasets and fine-tuned on a small medical image dataset can effectively capture both spatially distant and local dependencies among the visual tokens extracted from the images. This finding is supported by the prediction performance: in cross-validation, the LVM achieved a mean squared error (MSE) of 23.83 (95% CI 22.67-25.00) and a mean absolute error (MAE) of 3.64 (95% CI 3.54-3.73), outperforming a ResNet-50 baseline, which uses a convolutional neural network (CNN) for pattern extraction, by 64.3% in MSE and 27.9% in MAE.
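Mean attention distance summarizes how far, on average, each self-attention head looks across the image: each query token's attention weights are used to average the spatial distances to all key tokens. The sketch below illustrates this computation for a single ViT layer; the attention tensor shape, the 16-pixel patch size, and the function name are illustrative assumptions, not details taken from the study.

```python
import numpy as np

def mean_attention_distance(attn, grid_size, patch_size=16):
    """Mean attention distance (in pixels) per head for one ViT layer.

    attn: hypothetical array of shape (num_heads, num_patches, num_patches)
          holding softmax attention weights with the CLS token removed,
          so each row sums to 1.
    grid_size: patches per side, e.g. 14 for a 224x224 image with
               16x16 patches.
    """
    # Pixel-space (row, col) centers of each patch.
    coords = np.stack(np.meshgrid(np.arange(grid_size),
                                  np.arange(grid_size),
                                  indexing="ij"), axis=-1)
    coords = coords.reshape(-1, 2) * patch_size + patch_size / 2.0

    # Pairwise Euclidean distances between patch centers: (N, N).
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

    # attn[h, q, k] weights the distance from query patch q to key patch k;
    # summing over keys gives each query's expected distance, and averaging
    # over queries yields one number per head.
    return (attn * dists[None]).sum(axis=-1).mean(axis=-1)
```

Heads with large mean distances attend globally across the radiograph, while small values indicate locally focused heads, which is how the two kinds of dependency noted above can be distinguished.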
Additionally, the LVM's predictions correlated with human medical experts' assessments, with an R-squared of 0.81 (95% CI 0.79-0.82) and a Spearman ρ of 0.80 (95% CI 0.77-0.81), comparable to current state-of-the-art methods trained on much larger medical image datasets. The XAI techniques help interpret the performance and behavior of LVMs by analyzing the token relations reflected in the weights of the multiple self-attention heads and visualizing them in different ways. This offers a path toward designing optimized, data-driven, medical task-specific LVMs.
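For concreteness, the fold-level metrics reported above (MSE, MAE, R-squared, Spearman ρ) and a percentile-bootstrap 95% CI can be computed as in the following minimal sketch; the use of scikit-learn/SciPy and the bootstrap procedure are assumptions for illustration, not the study's documented pipeline.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate_fold(y_true, y_pred):
    """Fold-level metrics matching those reported in the abstract."""
    rho, _ = spearmanr(y_true, y_pred)
    return {
        "mse": mean_squared_error(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
        "r2": r2_score(y_true, y_pred),
        "spearman_rho": rho,
    }

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, seed=0):
    """Percentile-bootstrap 95% CI for a metric, resampling cases."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    vals = [metric(y_true[idx], y_pred[idx])
            for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))]
    return tuple(np.percentile(vals, [2.5, 97.5]))
```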
Scientific Focus Area: Computational Biology