UCL Discovery Stage

Surgical-VQLA++: Adversarial contrastive learning for calibrated robust visual question-localized answering in robotic surgery

Bai, Long; Wang, Guankun; Islam, Mobarakol; Seenivasan, Lalithkumar; Wang, An; Ren, Hongliang; (2025) Surgical-VQLA++: Adversarial contrastive learning for calibrated robust visual question-localized answering in robotic surgery. Information Fusion , Article 102602. 10.1016/j.inffus.2024.102602.

Full text: 1-s2.0-S1566253524003804-main.pdf - Accepted Version (4MB)
Access restricted to UCL open access staff until 28 January 2026.

Abstract

Medical visual question answering (VQA) bridges the gap between visual information and clinical decision-making, enabling doctors to extract understanding from clinical images and videos. In particular, surgical VQA can enhance the interpretation of surgical data, aiding in accurate diagnoses, effective education, and clinical interventions. However, the inability of VQA models to visually indicate the regions of interest corresponding to the given questions results in incomplete comprehension of the surgical scene. To tackle this, we propose surgical visual question localized-answering (VQLA) for precise and context-aware responses to specific queries regarding surgical images. Furthermore, to address the strong demand for safety in surgical scenarios and potential corruptions in image acquisition and transmission, we propose a novel approach called Calibrated Co-Attention Gated Vision-Language (C²G-ViL) embedding to integrate and align multimodal information effectively. Additionally, we leverage an adversarial sample-based contrastive learning strategy to boost performance and robustness. We also extend our EndoVis-18-VQLA and EndoVis-17-VQLA datasets to broaden the scope and application of our data. Extensive experiments on these datasets demonstrate the remarkable performance and robustness of our solution, which can effectively combat real-world image corruption. Thus, our proposed approach can serve as an effective tool for assisting surgical education and patient care, and for enhancing surgical outcomes. Our code and data will be released at https://github.com/longbai1006/Surgical-VQLAPlus.
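The adversarial sample-based contrastive learning mentioned in the abstract can be sketched in PyTorch. This is a minimal illustration only, not the authors' implementation: the single-step FGSM attack, the InfoNCE-style pairing of each clean embedding with its own adversarial counterpart, and the function names `fgsm_perturb` and `contrastive_loss` are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F


def fgsm_perturb(model, images, labels, eps=8 / 255):
    """Generate adversarial samples with a single FGSM step.

    Illustrative only: the perturbation scheme used in Surgical-VQLA++
    may differ (e.g. multi-step or feature-space attacks).
    """
    images = images.clone().detach().requires_grad_(True)
    logits = model(images)
    loss = F.cross_entropy(logits, labels)
    loss.backward()
    # Step in the direction that increases the loss, then clamp to valid range.
    adv = images + eps * images.grad.sign()
    return adv.clamp(0.0, 1.0).detach()


def contrastive_loss(clean_emb, adv_emb, temperature=0.07):
    """InfoNCE-style loss: pull each clean embedding toward its own
    adversarial counterpart, push it away from other samples in the batch."""
    clean = F.normalize(clean_emb, dim=1)
    adv = F.normalize(adv_emb, dim=1)
    logits = clean @ adv.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(clean.size(0), device=clean.device)  # diagonal = positives
    return F.cross_entropy(logits, targets)
```

In a training loop, one would generate `adv` from the current batch, embed both clean and adversarial views with the same encoder, and add `contrastive_loss` to the task losses so the learned multimodal embedding stays stable under corruption.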

Type: Article
Title: Surgical-VQLA++: Adversarial contrastive learning for calibrated robust visual question-localized answering in robotic surgery
DOI: 10.1016/j.inffus.2024.102602
Publisher version: http://dx.doi.org/10.1016/j.inffus.2024.102602
Language: English
Additional information: This version is the author accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions.
Keywords: Surgical education, vision-language embedding, adversarial contrastive learning, image corruption, visual-question answering
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Med Phys and Biomedical Eng
URI: https://discovery-pp.ucl.ac.uk/id/eprint/10195335
Downloads since deposit: 30
