WikiAnesthesia Chatbot STA Abstract Jan 2024

AnesthesiaChat: An Anesthesia-Specific Large Language Model Built from WikiAnesthesia

Alex J Goodell¹, Philip Chung¹, Larry F Chu¹, Barrett Larson¹,², Simon N Chu³, Chris Rishel¹,²

¹ Department of Anesthesia, Pain, and Perioperative Medicine, School of Medicine, Stanford University
² WikiAnesthesia Foundation, San Jose, CA
³ Department of Surgery, School of Medicine, University of California San Francisco

Introduction: Large language models such as ChatGPT are poised to transform anesthesia and medicine as a whole, offering broad potential for application in both medical education and clinical practice. These models have shown impressive abilities in answering general medical questions despite lacking dedicated training on the medical literature.¹ However, few studies have assessed their capability in anesthesiology. To improve question-answering performance in a specialized domain, researchers in many non-medical fields have developed methods for subtly altering the structure of a query to improve model output, a practice termed prompt engineering. One of the more sophisticated of these methods, retrieval-augmented generation, incorporates a search against a knowledge base to find the most relevant material and provides a condensed version of it to the model as reference material. In this study, we describe the development of a retrieval-augmented chatbot built with the OpenAI assistant framework. To our knowledge, this is the first described anesthesia-specific chatbot.
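
To make the retrieval-augmented pattern concrete, the sketch below retrieves the passages from a small knowledge base most similar to a question and supplies them to the model as reference material. This is a minimal illustration under assumed names (embed, knowledge_base, answer) and is not the implementation used in this study.

  import numpy as np
  from openai import OpenAI

  client = OpenAI()

  def embed(texts):
      # Embed passages and questions with an OpenAI embedding model
      resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
      return np.array([d.embedding for d in resp.data])

  # Hypothetical knowledge base of extracted reference passages
  knowledge_base = ["Passage on airway management ...", "Passage on TIVA ..."]
  kb_vectors = embed(knowledge_base)

  def answer(question, k=2):
      # Retrieve the k passages most similar to the question (cosine similarity)
      q = embed([question])[0]
      sims = kb_vectors @ q / (np.linalg.norm(kb_vectors, axis=1) * np.linalg.norm(q))
      context = "\n\n".join(knowledge_base[i] for i in np.argsort(sims)[-k:])
      # Provide the retrieved passages to the model as reference material
      chat = client.chat.completions.create(
          model="gpt-4-1106-preview",
          messages=[
              {"role": "system",
               "content": "Answer using the reference material below.\n\n" + context},
              {"role": "user", "content": question},
          ],
      )
      return chat.choices[0].message.content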

Methodology: A Python program was written to extract all textual data from the WikiAnesthesia website. In addition, practice guidelines from several national anesthesia organizations and societies were downloaded and their text extracted in Python. This corpus was then connected to the OpenAI Assistants API, with GPT-4 (gpt-4-1106-preview) serving as the underlying model. A guide to anesthetic planning and to the use of the provided resources was written and supplied to the model as instructions. We then performed a series of qualitative and quantitative assessments comparing the augmented model with the non-augmented model. Assessments included performance on 60 sample BASIC exam questions and anesthetic plans generated for ten clinical vignettes, which were reviewed by two anesthesiologists.
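
A rough sketch of this pipeline is shown below; the MediaWiki endpoint, file layout, and assistant configuration are illustrative assumptions rather than the study's actual code, with the Assistants API shown in its early-2024 beta form (retrieval tool and file_ids).

  import requests
  from openai import OpenAI

  WIKI_API = "https://wikianesthesia.org/api.php"  # assumed MediaWiki endpoint

  def fetch_all_pages():
      # Page through the wiki, requesting a plain-text extract of each article
      params = {"action": "query", "generator": "allpages", "gaplimit": 20,
                "prop": "extracts", "explaintext": 1, "exlimit": 20,
                "format": "json"}
      while True:
          data = requests.get(WIKI_API, params=params).json()
          for page in data.get("query", {}).get("pages", {}).values():
              yield page.get("title", ""), page.get("extract", "")
          if "continue" not in data:
              break
          params.update(data["continue"])

  # Collect the extracted text into a single reference document
  with open("wikianesthesia.txt", "w") as f:
      for title, text in fetch_all_pages():
          f.write(f"== {title} ==\n{text}\n\n")

  # Attach the document to an assistant with retrieval enabled
  client = OpenAI()
  ref = client.files.create(file=open("wikianesthesia.txt", "rb"),
                            purpose="assistants")
  assistant = client.beta.assistants.create(
      model="gpt-4-1106-preview",
      instructions="Anesthesia assistant; consult the attached reference material.",
      tools=[{"type": "retrieval"}],
      file_ids=[ref.id],
  )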

Results: Performance on the 60 exam questions was similar between the augmented and non-augmented models over a series of three runs. The augmented model was correct on 56% of attempts, while the non-augmented model was correct on 62% of attempts (chi-squared = 1.15, p = 0.2841). The augmented model performed better on the anesthetic planning task, providing answers that were more complete and more readable. Its answers were also noticeably denser and shorter, averaging 1,300 characters (95% CI 540–2,050) versus 4,200 characters (95% CI 3,200–5,100; p < 0.01) for the non-augmented model.
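
The reported test statistic can be approximately reproduced as follows. The per-arm counts are not given in the abstract, so this sketch assumes 101/180 and 112/180 correct attempts, the integer counts closest to the reported percentages over three 60-question runs.

  from scipy.stats import chi2_contingency

  # Assumed counts: 60 questions x 3 runs = 180 attempts per arm
  augmented = [101, 79]      # correct, incorrect (~56%)
  non_augmented = [112, 68]  # correct, incorrect (~62%)

  # 2x2 test with Yates continuity correction (scipy default for 2x2)
  chi2, p, dof, expected = chi2_contingency([augmented, non_augmented])
  print(f"chi-squared = {chi2:.2f}, p = {p:.4f}")  # ~1.15, ~0.28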

Conclusion: In this study, we demonstrate the feasibility of creating a model with access to specialized anesthesiology knowledge. Surprisingly, the addition of this knowledge did not significantly enhance the model’s ability to answer boards-style questions from the ABA BASIC exam. We hypothesize that the augmented model may over-rely on the provided resources, which focus on the clinical practice of anesthesia and may not add useful information for an exam focused on basic science and foundational knowledge. Our experiments also show that the base GPT-4 model is capable of answering many ABA BASIC exam questions without augmentation. Additionally, since the multiple-choice questions used are available online, they may have been included in GPT-4’s training data, a phenomenon known as “data contamination.” However, the improved performance of the augmented model on open-ended anesthetic planning provides evidence that augmented models can successfully integrate and apply specialized information to complex tasks such as anesthetic planning. Future work should examine alternative prompt-engineering strategies, such as chain-of-thought, self-consistency, least-to-most, or the variety of other strategies under active exploration;² a brief sketch of two of these appears below. More advanced research could focus on the incorporation of curated knowledge graphs, the use of autonomous agents to iterate on anesthetic plans, or the development of models capable of using tools such as medical calculators. Overall, our project provides evidence that the development of anesthesia-specific language models is feasible and serves as a first step toward building the future of AI-assisted anesthesia.
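
As one illustration of the prompting strategies named above, the sketch below combines chain-of-thought prompting with self-consistency: several step-by-step completions are sampled at nonzero temperature and the majority final answer is returned. The prompt wording and helper function are hypothetical.

  from collections import Counter
  from openai import OpenAI

  client = OpenAI()

  def self_consistent_answer(question, n=5):
      # Self-consistency over chain-of-thought samples: draw several
      # step-by-step completions and take a majority vote on the answer
      votes = []
      for _ in range(n):
          r = client.chat.completions.create(
              model="gpt-4-1106-preview",
              temperature=1.0,
              messages=[{"role": "user",
                         "content": question + "\nReason step by step, then give "
                         "your final answer as a single letter on the last line."}],
          )
          votes.append(r.choices[0].message.content.strip().splitlines()[-1])
      return Counter(votes).most_common(1)[0][0]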

References

  1. Kung, T. H. et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit. Health 2, e0000198 (2023).
  2. Meskó, B. Prompt Engineering as an Important Emerging Skill for Medical Professionals: Tutorial. J. Med. Internet Res. 25, e50638 (2023).