We present CheXtriev, a graph-based, anatomy-aware framework for chest radiograph retrieval. Unlike prior methods focussed on global features, our method leverages graph transformers to extract informative features from specific anatomical regions.
Furthermore, it captures spatial context and the interplay between anatomical location and findings. This contextualization, grounded in evidence-based anatomy, results in a richer anatomy-aware representation and leads to more accurate, effective and efficient retrieval, particularly for less prevalent findings.
CheXtriev outperforms state-of-the-art global and local approaches by 18% to 26% in retrieval accuracy and 11% to 23% in ranking quality.
CheXtriev first extracts informative features from anatomically defined regions and constructs a graph where nodes represent regions and edges capture spatial relationships and potential co-occurrences between findings.
The graph leverages learnable location and edge embeddings to capture spatial context and relationships between regions, allowing a graph transformer architecture with global edge-aware attention and gated residuals to learn robust global-local context representations for accurate, effective and efficient retrieval.
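As a rough illustration of the mechanism described above, the following pure-Python sketch runs one attention layer over a fully connected region graph. The exact formulation is not given in this section, so everything here is an assumption: location embeddings are added to the node features, each directed edge contributes a learnable scalar bias to the attention score, and a sigmoid gate blends the aggregated message with the layer input. The names `edge_aware_layer`, `edge_bias` and `gate_w` are hypothetical, and random values stand in for learned parameters.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    return [dot(row, x) for row in W]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def edge_aware_layer(H, loc, Wq, Wk, Wv, edge_bias, gate_w):
    """One attention layer over a fully connected region graph (assumed form).

    H[i]: feature vector of region i; loc[i]: its location embedding;
    edge_bias[i][j]: scalar bias for edge i -> j; gate_w: weights of length 2*d.
    """
    n, d = len(H), len(H[0])
    # Inject spatial context by adding the location embedding to each node.
    X = [[h + p for h, p in zip(H[i], loc[i])] for i in range(n)]
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    out, attn = [], []
    for i in range(n):
        # Every region attends to every region; each edge has its own bias.
        scores = [dot(Q[i], K[j]) / math.sqrt(d) + edge_bias[i][j]
                  for j in range(n)]
        a = softmax(scores)
        msg = [sum(a[j] * V[j][c] for j in range(n)) for c in range(d)]
        # Gated residual: a scalar gate mixes the message with the input.
        gate = 1.0 / (1.0 + math.exp(-dot(gate_w, X[i] + msg)))
        out.append([gate * m + (1.0 - gate) * x for m, x in zip(msg, X[i])])
        attn.append(a)
    return out, attn

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

n_regions, dim = 4, 3  # toy sizes, chosen only for illustration
H = rand_matrix(n_regions, dim)
loc = rand_matrix(n_regions, dim)
Wq, Wk, Wv = (rand_matrix(dim, dim) for _ in range(3))
edge_bias = rand_matrix(n_regions, n_regions)
gate_w = rand_matrix(1, 2 * dim)[0]
out, attn = edge_aware_layer(H, loc, Wq, Wk, Wv, edge_bias, gate_w)
```

Each row of `attn` is a distribution over all regions, so a region's updated feature can draw on any other region. The per-edge bias lets individual region pairs be weighted up or down independently of their content.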
The figure shows the retrieval performance of different approaches and saliency map analysis for a sample query image with an enlarged cardiac silhouette (ECS). Each row displays the top 4 retrieved images and their corresponding occlusion-based saliency maps generated by CheXtriev, AnaXNet, ATH and Global CNN (top to bottom). A retrieved image is considered correct if it shares the specific finding of interest (here, ECS) with the query image.
The first three images retrieved by CheXtriev all exhibit ECS, with the saliency maps focusing on the cardiac silhouette region. In the saliency map for the fourth retrieved image, however, the model’s attention is diffused throughout the lung region, indicating an incorrect retrieval. Some images retrieved by AnaXNet show no evidence of ECS. Even though ATH and the Global CNN manage to retrieve correct cases, the corresponding saliency maps highlight irrelevant regions.
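For readers unfamiliar with occlusion-based saliency, the idea can be sketched as follows: mask one patch of the image at a time and record how much a score drops. This is a generic sketch, not the procedure used to produce the figure. The function `occlusion_saliency`, the patch size and the fill value are assumptions, and `toy_score` is a hypothetical stand-in for the similarity between a query and a retrieved image.

```python
def occlusion_saliency(image, score_fn, patch=2, fill=0.0):
    """Return a map where each pixel holds the score drop caused by
    occluding the patch that contains it (larger drop = more important)."""
    height, width = len(image), len(image[0])
    base = score_fn(image)
    saliency = [[0.0] * width for _ in range(height)]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            rows = range(top, min(top + patch, height))
            cols = range(left, min(left + patch, width))
            occluded = [row[:] for row in image]  # copy before masking
            for r in rows:
                for c in cols:
                    occluded[r][c] = fill
            drop = base - score_fn(occluded)
            for r in rows:
                for c in cols:
                    saliency[r][c] = drop
    return saliency

def toy_score(image):
    # Hypothetical score that depends only on the top-left 2x2 quadrant.
    return sum(image[r][c] for r in range(2) for c in range(2))

image = [[1.0] * 4 for _ in range(4)]
saliency = occlusion_saliency(image, toy_score)
```

In this toy case only the top-left patch receives a non-zero value, because it is the only region the score depends on. A focused map of this kind is what a correct ECS retrieval would be expected to show over the cardiac silhouette.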
We compare CheXtriev on the MIMIC-CXR dataset against global baselines (ResNet50, ATH) and a local variant (AnaXNet). Student’s t-test was used to establish the statistical significance of the differences between CheXtriev and the baselines; significant values (p < 0.05) are marked with an asterisk.
Relative to both global baselines, CheXtriev achieves higher AP values across all nine investigated findings, ranging from 91.7% (LO) down to 28.8% (FO/HF). This trend also holds for the HR and RR metrics.
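The three metrics can be made concrete with a small sketch. The definitions below are common ones for ranked retrieval and are assumed here, since this section does not spell them out: AP averages the precision at each rank where a correct image appears, HR is 1 if any of the top-k images is correct, and RR is the reciprocal of the rank of the first correct image. The function names are hypothetical.

```python
def average_precision(relevant):
    """AP over a ranked list of booleans (True = correct retrieval)."""
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0

def hit_rate(relevant):
    """1.0 if at least one retrieved item is correct, else 0.0."""
    return 1.0 if any(relevant) else 0.0

def reciprocal_rank(relevant):
    """1 / rank of the first correct item, or 0.0 if there is none."""
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0

# Top-4 list with the first three correct and the fourth incorrect.
early_hits = [True, True, True, False]
# Top-4 list where correct items appear only at ranks 2 and 4.
late_hits = [False, True, False, True]

ap_early = average_precision(early_hits)   # 1.0
ap_late = average_precision(late_hits)     # (1/2 + 2/4) / 2 = 0.5
rr_late = reciprocal_rank(late_hits)       # 0.5
```

Averaging each metric over all queries for a finding gives the per-finding value, and averaging over findings gives the mean variants (mAP, mRR). Both toy lists have HR equal to 1, which shows why AP and RR are needed to separate rankings that place correct images early from those that place them late.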
The mAP improvement over the baselines ranges from at least 12% (first column) to 23% (second column). The improvement is especially pronounced for classes with lower prevalence, such as FO/HF (+82.3% to +94.6%), PTX (+40.6% to +253.9%), CONS (+30.6% to +35.6%) and PN (+14.6% to +20.5%), demonstrating CheXtriev’s ability to learn more discriminative and powerful visual representations than global methods.
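Gains above 100% can only be relative improvements, since AP itself cannot exceed 100%. Assuming all the percentages above are relative to the baseline value, the arithmetic is as follows. The numbers in the example are hypothetical and are not taken from the paper's tables.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0

# Hypothetical AP values in percent, for illustration only.
baseline_ap = 10.0
ours_ap = 35.0
gain = relative_gain(ours_ap, baseline_ap)  # 250.0
```

A 25-point absolute gain on a baseline of 10% is a 250% relative gain. The same absolute gain on a baseline of 50% would be a 50% relative gain, so low-prevalence classes with low baseline AP tend to show the largest relative improvements.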
We also compare CheXtriev against AnaXNet, a SOTA local approach designed for classification tasks. CheXtriev exhibited significant improvements in mean AP, particularly for findings associated with known blind spots in chest radiographs, such as lung apices, hilar structures and inferior lung bases (for example, FO/HF +61.80%, PE/HO +16.82%, CONS +14.49%). The results also indicate that classification-optimized features may not be the most effective for retrieval.
Several variants of CheXtriev were constructed to assess the contribution of each component. Here, IRM and MLF denote inter-anatomic region modelling using graph transformers and multi-level features with gated residuals, respectively.
The consistent performance boost observed from V1 to V6 (in global MLF) ranges from +2.1% (in mAP for V1) to +6.4% (for V6). This underscores the advantage of graph transformers with fully connected, unique learnable edges over uniform edge-sharing schemes and naive handcrafted adjacency. The result also suggests that the retrieval task benefits from considering all region pairs, potentially capturing intricate latent relationships between regions.
Local gated residual connections (V4) lead to a significant drop in performance (6.38% lower mAP and 6.89% lower mRR) relative to global ones (V6). This emphasizes the value of global gated residual connections with selective refinement for learning multi-level features. The fact that V6 and V3 outperform V5 and V2, respectively, suggests that the learnable location embedding improves model performance by capturing crucial spatial context for accurate ranking.