Hugging Face, the AI startup, has introduced a new benchmark called Open Medical-LLM, designed to evaluate how well generative AI models perform on medical tasks.
This initiative, developed in collaboration with Open Life Science AI and the University of Edinburgh’s Natural Language Processing Group, incorporates existing test sets like MedQA, PubMedQA, and MedMCQA. These tests probe generative AI’s knowledge in areas such as anatomy, pharmacology, genetics, and clinical practice.
The benchmark includes a variety of question types that test medical reasoning, drawing from resources like U.S. and Indian medical licensing exams and college biology test banks.
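For readers who want to examine these test sets directly, the constituent datasets are published on the Hugging Face Hub and can be loaded with the company's `datasets` library. The snippet below is a minimal sketch; the dataset identifiers and configuration names are assumptions based on commonly used Hub IDs and may not match the leaderboard's exact sources.

```python
# Minimal sketch: loading two of the benchmark's constituent test sets with
# the Hugging Face `datasets` library. The Hub IDs below are assumptions and
# may differ from the identifiers the official leaderboard uses.
from datasets import load_dataset

# MedMCQA: multiple-choice questions from Indian medical entrance exams
medmcqa = load_dataset("medmcqa", split="validation")

# PubMedQA: yes/no/maybe questions over PubMed abstracts (labeled subset)
pubmedqa = load_dataset("pubmed_qa", "pqa_labeled", split="train")

print(medmcqa[0]["question"])   # a clinical multiple-choice question
print(pubmedqa[0]["question"])  # a research question answered yes/no/maybe
```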
The release of Open Medical-LLM comes at a time when the adoption of generative AI in healthcare is increasing, though with a mixture of enthusiasm and caution.
Dual Perspectives on AI in Healthcare
Proponents of generative AI in healthcare believe it can enhance efficiency and uncover insights that might otherwise remain undiscovered. However, critics argue that these models carry inherent flaws and biases that could potentially lead to poorer health outcomes.
Hugging Face has positioned Open Medical-LLM as a robust tool for assessing the capabilities of healthcare-oriented generative AI models.
According to a blog post by the company, the benchmark allows researchers and practitioners to pinpoint the strengths and weaknesses of different AI approaches. This, in turn, is intended to drive advancements in the field and improve patient care and outcomes.
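As a rough illustration of how benchmarks of this kind typically grade models, multiple-choice items are often scored by comparing the model's likelihood for each answer option and reporting overall accuracy. The sketch below is a hypothetical simplification, not Open Medical-LLM's actual evaluation harness; the `score` interface is an assumption.

```python
# Hypothetical simplification of multiple-choice benchmark scoring: pick the
# option the model assigns the highest likelihood, then report accuracy.
# Real evaluation harnesses also handle prompting, normalization, and batching.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Item:
    question: str
    options: List[str]  # candidate answers, e.g. choices A-D
    answer_idx: int     # index of the correct option

def accuracy(items: List[Item], score: Callable[[str, str], float]) -> float:
    """`score(question, option)` returns the model's log-likelihood (assumed)."""
    correct = 0
    for item in items:
        pred = max(range(len(item.options)),
                   key=lambda i: score(item.question, item.options[i]))
        correct += int(pred == item.answer_idx)
    return correct / len(items)
```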
Yet, despite these intentions, some medical professionals express skepticism about the benchmark’s real-world applicability.
Medical Professionals Weigh In
Liam McCoy, a resident physician in neurology at the University of Alberta, highlighted on the social media platform X the discrepancy between the controlled environment of a benchmark test and the complexities of actual clinical practice. Clémentine Fourrier, a research scientist at Hugging Face and co-author of the benchmark announcement, concurred with McCoy's viewpoint.
She emphasized that while such benchmarks can guide initial model selection for specific use cases, a more comprehensive phase of testing is crucial to evaluate a model’s limitations and applicability in real-world conditions. Fourrier further advised that medical generative AI models should not be used independently by patients but should serve as support tools for medical professionals.
Learning from Past AI Implementations in Healthcare
The conversation around generative AI in healthcare also recalls past experiences with AI tools in medical settings.
An illustrative example is Google's AI screening tool for diabetic retinopathy in Thailand. Although the tool achieved high accuracy in laboratory testing, it proved impractical in real-world deployment, frustrating patients and healthcare workers with inconsistent results and poor integration into existing medical workflows.
The challenges of implementing AI in healthcare are further underscored by regulatory considerations. To date, the U.S. Food and Drug Administration (FDA) has approved 139 AI-related medical devices, none of which employ generative AI.
This highlights the difficulty in predicting how well AI tools developed and tested in laboratory settings will perform in clinical environments and how their effectiveness might evolve over time.