SlideChat
A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding

Ying Chen*1,2,
Guoan Wang*1,3,
Yuanfeng Ji*4,
Yanjun Li1,3,
Jin Ye1,5,
Tianbin Li1,
Bin Zhang6,
Nana Pei6,
Rongshan Yu2,
Yu Qiao1,
Junjun He†,1,
1Shanghai AI Laboratory,
2Xiamen University,
3East China Normal University,
4Stanford University,
5Monash University,
6The First Affiliated Hospital of Jinan University
🔔News
🚀[2024-10-24]: We release our paper!🌟

Key contributions

- We create SlideInstruction, the largest comprehensive WSI instruction-following dataset, containing 4.2K WSI-caption pairs and 176K VQA pairs (see the illustrative sketch after this list).
- We develop SlideChat, the first vision-language assistant capable of understanding gigapixel whole-slide images, achieving state-of-the-art performance on multiple benchmarks.
- We establish SlideBench, a multimodal WSI benchmark comprising SlideBench-Caption, SlideBench-VQA (TCGA), and SlideBench-VQA (BCNB), covering 21 different clinical tasks.
- We will release SlideChat, SlideInstruction, and SlideBench as open-source resources to facilitate research and development in computational pathology.
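The exact schema of SlideInstruction and SlideBench is not specified here, so the following is only a minimal sketch of what a slide-level caption record and a closed-set VQA record could look like, together with the exact-match accuracy commonly reported for multiple-choice benchmarks. All class, field, and identifier names (`WSICaptionRecord`, `WSIVQARecord`, `slide_id`, `TCGA-XX-0000`) are hypothetical placeholders, not the project's actual API or data format.

```python
# Illustrative sketch only: the real SlideInstruction/SlideBench schema is not
# given in this README; field names and identifiers below are hypothetical.
from dataclasses import dataclass
from typing import List


@dataclass
class WSICaptionRecord:
    """Hypothetical slide-level caption pair (SlideInstruction-style)."""
    slide_id: str   # e.g. a TCGA slide identifier (placeholder)
    caption: str    # slide-level description


@dataclass
class WSIVQARecord:
    """Hypothetical closed-set VQA pair used for benchmark-style evaluation."""
    slide_id: str
    question: str
    options: List[str]  # multiple-choice options
    answer: str         # correct option letter, e.g. "A"


def accuracy(predictions: List[str], references: List[str]) -> float:
    """Exact-match accuracy over closed-set (multiple-choice) answers."""
    assert len(predictions) == len(references)
    correct = sum(p.strip().upper() == r.strip().upper()
                  for p, r in zip(predictions, references))
    return correct / len(references)


if __name__ == "__main__":
    example = WSIVQARecord(
        slide_id="TCGA-XX-0000",  # placeholder identifier
        question="Which histologic subtype is most consistent with this slide?",
        options=["A. Adenocarcinoma", "B. Squamous cell carcinoma",
                 "C. Small cell carcinoma", "D. Normal tissue"],
        answer="A",
    )
    print(example)
    print(f"accuracy = {accuracy(['A', 'B', 'C'], ['A', 'B', 'D']):.2%}")
```

This mirrors how closed-set VQA benchmarks are usually scored; the reported SlideBench numbers (e.g., 81.17% on SlideBench-VQA (TCGA)) are overall accuracies of this kind, though the project's actual evaluation scripts may differ.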
Abstract
Despite the progress made by multimodal large language models (MLLMs) in computational pathology, they remain limited by a predominant focus on patch-level analysis, missing essential contextual information at the whole-slide level. The lack of large-scale instruction datasets and the gigapixel scale of whole-slide images (WSIs) pose significant developmental challenges. In this paper, we present SlideChat, the first vision-language assistant capable of understanding gigapixel whole-slide images, exhibiting excellent multimodal conversational capability and the ability to respond to complex instructions across diverse pathology scenarios. To support its development, we created SlideInstruction, the largest instruction-following dataset for WSIs, consisting of 4.2K WSI captions and 176K VQA pairs spanning multiple categories. Furthermore, we propose SlideBench, a multimodal benchmark that incorporates captioning and VQA tasks to assess SlideChat's capabilities in varied clinical settings such as microscopy and diagnosis. Compared to both general and specialized MLLMs, SlideChat exhibits exceptional capabilities, achieving state-of-the-art performance on 18 of 22 tasks. For example, it achieved an overall accuracy of 81.17% on SlideBench-VQA (TCGA) and 54.15% on SlideBench-VQA (BCNB). We will fully release SlideChat, SlideInstruction, and SlideBench as open-source resources to facilitate research and development in computational pathology.