In high-stakes domains like medicine, how an AI arrives at an answer can be as critical as the answer itself. However, existing medical question answering benchmarks largely ignore the reasoning process, evaluating models only on final answer accuracy. This paper addresses the overlooked importance of reasoning path evaluation in medical AI. We introduce MedReason-Dx, a novel benchmark that assesses not just answers but the step-by-step reasoning behind them. MedReason-Dx provides expert-annotated step-by-step solutions for both multiple-choice and open-ended questions, spanning 24 medical specialties. By requiring models to produce and be evaluated on intermediate reasoning steps, our benchmark enables rigorous testing of interpretability and logical consistency in medical QA. We present the design of MedReason-Dx and outline diverse evaluation metrics that reward faithful reasoning. We hope this resource will advance the development of robust, interpretable medical decision support systems and foster research into large language models that can reason as well as they respond.
Artificial intelligence systems for healthcare must not only deliver correct answers but also provide justifiable reasoning. In clinical decision support and medical question answering (QA), the reasoning path that leads to an answer is critical for trust and safety. A model that arrives at a diagnosis through flawed logic or guesswork poses significant risks, even if the final answer is correct. Conversely, a model that explains its reasoning enables practitioners to verify each step and confirm that the conclusion is sound. Despite this, most existing benchmarks in medical AI evaluate models solely on whether the final answer is right, with little or no assessment of the reasoning process. This gap is problematic in high-stakes domains: evaluating only end answers may overlook dangerous reasoning errors and fails to encourage the development of models that “think” in a human-like, transparent manner.

Recent advances in large language models (LLMs) and prompting techniques have brought reasoning to the forefront of AI research. In particular, chain-of-thought (CoT) prompting has demonstrated that LLMs can generate step-by-step solutions to complex problems, from math and logic puzzles to medical questions. By prompting models to articulate intermediate steps, researchers have achieved improved performance on challenging tasks and gained insight into model decision-making. For example, state-of-the-art medical LLMs can now produce explanations or rationales alongside their answers, showcasing the potential of AI to handle intricate clinical reasoning. These developments underscore an urgent need for benchmarks that evaluate not just final accuracy but the quality of the reasoning LLMs employ. If a model is prompted to reason but we lack ground-truth reasoning paths for comparison, we cannot rigorously assess whether the model’s reasoning is correct, complete, or clinically valid.

Several medical QA datasets and benchmarks have emerged, yet they predominantly focus on answer correctness. Standard benchmarks drawn from medical exams (e.g., USMLE-style question banks, MedQA, and MedMCQA) and research datasets like PubMedQA have driven progress in factual recall and question answering. Some of these resources include a short explanation or reference for the answer, but they do not provide a detailed, stepwise reasoning chain that could be used to evaluate a model’s thought process. In other words, existing benchmarks treat reasoning as an implicit skill, not an explicit target of evaluation. A model might earn full marks by selecting the correct option in a multiple-choice question while in reality having arrived at that answer via incorrect assumptions or lucky guesswork. Conversely, a model might demonstrate mostly correct reasoning and make a minor error at the final step, yet current benchmarks would simply mark the entire answer as wrong, offering neither credit for nor analysis of the model’s reasoning ability. This limitation hampers the development of robust medical AI: it is difficult to discern whether improvements in accuracy stem from better reasoning or merely better pattern matching, and it provides no incentive for models to output interpretable solutions.

To address these challenges, we propose MedReason-Dx, a new benchmark expressly designed to evaluate chain-of-thought reasoning in medical question answering. MedReason-Dx introduces several key innovations to the evaluation of medical AI:
Expert-annotated reasoning chains: Each question in MedReason-Dx is accompanied by a step-by-step solution path crafted by medical experts. These reasoning chains detail the logical steps required to arrive at the correct answer, including relevant clinical facts, intermediate inferences, and elimination of distractors in the case of multiple-choice items. This provides a gold-standard trace of correct reasoning against which model-generated solutions can be compared.
Diverse question formats and topics: Our benchmark covers a broad spectrum of medical knowledge. It includes both multiple-choice questions (with several answer options) and open-ended questions that require free-form answers, ensuring that models are tested on various response formats. The questions span 24 medical specialties, ranging from internal medicine and cardiology to pediatrics, surgery, and more. This diversity reflects the real-world breadth of medical practice and ensures that the benchmark evaluates reasoning across different sub-domains and problem types (diagnosis, treatment decisions, biomedical mechanism explanations, etc.).
Evaluation metrics for reasoning quality: MedReason-Dx departs from traditional single-metric evaluation by introducing multiple criteria to assess model performance. In addition to standard answer accuracy, we define metrics that measure the fidelity of a model’s reasoning to the expert-provided chain. For instance, we evaluate whether the model’s reasoning covers the same key steps or medical facts as the reference solution, and whether the logical progression is sound. This could involve step-wise accuracy scoring, similarity measures between generated and reference reasoning, and expert review of reasoning coherence. By quantifying reasoning quality, the benchmark encourages models that are not only correct, but correct for the right reasons.
Interpretability and robustness focus: By requiring and evaluating intermediate reasoning, MedReason-Dx places interpretability at the core of model assessment. This is especially crucial for medical AI systems that clinicians need to trust. A model that can articulate a valid reasoning chain is inherently more transparent and easier to debug than one that only outputs an answer. Furthermore, focusing on reasoning helps reveal when a model’s knowledge is superficial. We anticipate that models performing well on MedReason-Dx will demonstrate greater robustness, as they must handle complex multi-step problems in a principled way rather than relying on shallow cues. Our benchmark thus serves as a stress test for genuine reasoning ability in medical contexts.
In summary, MedReason-Dx is the first benchmark to comprehensively target reasoning path evaluation in medical QA. It offers the community a testbed to develop and rigorously vet models that aim to be not just answer engines, but reliable reasoning assistants for healthcare. We describe the construction of MedReason-Dx, including the data collection and expert annotation process, and provide an analysis of its contents. We also outline an evaluation framework and report baseline results for current LLMs. By emphasizing how answers are derived, our work addresses a critical gap for high-stakes AI: the need for systems whose decisions can be inspected and trusted. We believe MedReason-Dx will facilitate research into interpretable and robust medical AI, ultimately contributing to safer and more effective clinical decision support tools.
The evolution of medical large language models (Med-LLMs) has led to advancements in model architectures, training paradigms, and domain-specific adaptations, enabling applications in information extraction, clinical decision support, dialogue systems, and multimodal medical AI.
Early Med-LLMs, such as BioBERT (Lee et al. 2020) and PubMedBERT (Gu et al. 2021), were trained on extensive biomedical literature and PubMed abstracts, excelling in tasks like named entity recognition, relation extraction, and text classification. Models such as ClinicalT5 (Lu, Dou, and Nguyen 2022) and GatorTron (X. Yang et al. 2022) extend this capability to clinical text summarization and report generation, while Codex-Med (Liévin et al. 2024) specializes in structured medical documentation. Galactica (Taylor et al. 2022), designed for scientific and medical applications, enhances literature analysis and information retrieval.
Recent models incorporate instruction fine-tuning (IFT) and reinforcement learning from human feedback (RLHF) to improve the accuracy of medical text generation and knowledge extraction. Med-PaLM (Singhal et al. 2023) and Med-PaLM 2 (Singhal et al. 2025) exemplify this trend, refining medical question answering and clinical decision-making. Med-Alpaca (Han et al. 2023) further demonstrates the adaptability of fine-tuned language models for specialized healthcare applications. Meanwhile, GatorTronGPT (Peng et al. 2023) builds on the GatorTron architecture with targeted fine-tuning, enhancing its precision in medical report generation. Conversational AI models such as ChatDoctor (Y. Li et al. 2023) are tailored for virtual medical consultations, offering patient triage assistance and personalized recommendations.
Beyond these, several Med-LLMs focus on domain-specific adaptations. PMC-LLaMA (C. Wu et al. 2023) enhances biomedical literature processing, aiding both academic research and clinical applications. GPT-4-Med (Nori et al. 2023), a refined adaptation of GPT-4, excels in complex clinical text processing and high-quality medical content generation. In the field of Traditional Chinese Medicine (TCM), models like Taiyi-LLM (Luo et al. 2024) and Zhongjing (S. Yang et al. 2024) integrate classical TCM literature with modern medical insights, supporting diagnosis and treatment planning. Additionally, advancements in multilingual and multimodal medical models have broadened AI’s applicability in global healthcare. HuatuoGPT (Zhang et al. 2023) and its successor HuatuoGPT-II (Chen et al. 2023) leverage expanded datasets and optimized architectures to improve clinical report generation and diagnostic decision support. Med-Flamingo (Moor et al. 2023) extends Med-LLM capabilities to multimodal medical tasks, integrating textual and visual information. Med-Gemini (Saab et al. 2024), a bilingual model, facilitates cross-lingual medical communication, promoting international healthcare collaboration.
These advancements underscore the ongoing evolution of Med-LLMs, enhancing their ability to process complex medical language, integrate multimodal data, and support diverse healthcare applications. As these models continue to evolve, they hold the potential to significantly improve clinical decision-making, personalized medicine, and cross-cultural medical communication.
The development of diverse and standardized datasets, along with robust evaluation platforms, is essential for advancing AI applications in the medical domain. Existing research in this area can be broadly categorized into two main directions: (1) datasets tailored for various medical AI tasks and (2) automated benchmarks designed to assess the clinical capabilities of large models.
The first category consists of datasets that support tasks such as information extraction, question answering, text generation, and natural language inference. For instance, datasets like GENIA (Kim et al. 2003), CADEC (Karimi et al. 2015), and BC5CDR (Jiao Li et al. 2016) are widely used for named entity recognition, relation extraction, and event detection across biomedical literature and clinical records. Meanwhile, MedQA (D. Jin et al. 2021), PubMedQA (Q. Jin et al. 2019), CMCQA (Xia et al. 2022), and Huatuo-26M (Jianquan Li et al. 2023) have been developed to evaluate models’ abilities in medical knowledge retrieval, clinical reasoning, and diagnostic decision-making. Additionally, datasets such as MIMIC-III (Johnson et al. 2016), MIMIC-CXR (Johnson et al. 2019), HealthSearchQA (Singhal et al. 2023), and CORD-19 (Wang et al. 2020) facilitate tasks like clinical report generation, summarization, and case-based discussions. In the natural language inference domain, MedNLI (Romanov and Shivade 2018) provides a benchmark for understanding logical relationships in medical texts. Recently, MedReason (J. Wu et al. 2025) was proposed to address the scarcity of high-quality, step-by-step reasoning data in the medical domain. Unlike datasets distilled directly from general-purpose LLMs, MedReason constructs 32,682 question–answer pairs with detailed Chain-of-Thought explanations, leveraging a structured medical knowledge graph to extract and guide the reasoning paths. These reasoning chains are factually grounded and validated through both automated answer checking and expert review by medical professionals from diverse specialties.
The second category focuses on automated benchmarks for evaluating large medical models, reducing reliance on expert-driven manual assessments. MedBench (Cai et al. 2024) provides a broad evaluation platform with 40,041 questions covering various medical fields. AutoEval (Liao et al. 2023) reformats USMLE questions into multi-turn dialogues, assessing models based on information coverage and task accuracy. LLM-Mini-CEX (Shi et al. 2023) leverages patient simulators and ChatGPT to evaluate diagnostic dialogue quality. MedGPTEval (Xu et al. 2023) integrates Chinese medical datasets and public benchmarks, using 16 expert-refined indicators to measure professional competence. LLM-Human Evaluation (Chiang and Lee 2023) examines automated assessment feasibility, showing alignment with human evaluators in adversarial and open-ended tasks. These frameworks systematically measure model performance, lower assessment costs, and support medical AI optimization.
Benchmark | CoT Evaluation | No. Domains | Reasoning Intensive | MCQ | OEQ | Expert Annotation |
---|---|---|---|---|---|---|
MMedBench | ✘ | 21 | ✔ | ✔ | ✘ | ✔ |
MedQA | ✘ | - | ✘ | ✔ | ✘ | ✘ |
MedMCQA | ✘ | 21 | ✘ | ✔ | ✘ | ✔ |
MMLU | ✘ | 6 | ✘ | ✔ | ✘ | ✘ |
Medbullets | ✘ | - | ✔ | ✔ | ✘ | ✔ |
JAMA Challenge | ✘ | 13 | ✔ | ✔ | ✘ | ✔ |
LiveQA | ✘ | - | ✘ | ✘ | ✔ | ✔ |
ClinicBench | ✘ | - | ✘ | ✔ | ✔ | ✔ |
Ours | ✔ | 24 | ✔ | ✔ | ✔ | ✔ |
In this section, we detail our data curation process and the specifics of the resulting dataset. The dataset comprises two question types: multiple-choice questions and open-ended questions. Below, we describe how each type was curated.
The data collection for our MedReason-Dx benchmark is designed to create a challenging reasoning dataset that diverges from typical knowledge-based question-answer datasets. The objective is to curate a dataset in which models must perform complex, multi-step reasoning to derive the correct answers, reflecting the intricate processes involved in real-world clinical diagnostics. To achieve this, we employed a rigorous selection process that filters questions from well-established medical datasets covering real-world clinical cases across various medical disciplines. Each problem is associated with a detailed series of reasoning steps that mirror the diagnostic workflow, ensuring that the reasoning process is both comprehensive and contextually relevant.

We define a set of 24 medical domains, derived from common hospital departments: Cardiology, Pulmonology, Gastroenterology, Nephrology, Endocrinology, Hematology, Rheumatology, Neurology, Surgery, Obstetrics and Gynecology, Pediatrics, Psychiatry, Emergency Medicine, Anesthesiology, Radiology, Otorhinolaryngology, Ophthalmology, Dermatology, Urology, Oncology, Physical Medicine and Rehabilitation, Nutrition, Pain Management, and Clinical Laboratory.

The selection of questions prioritizes diversity in both the types of clinical challenges and the reasoning methods required to solve them. This diversity spans a wide array of diagnostic tasks covering common and rare clinical conditions. Questions are chosen specifically because they require complex multi-step reasoning, including, but not limited to, physiological mechanism analysis, differential diagnosis, hypothesis testing, exclusionary reasoning, and the integration of cross-disciplinary knowledge. By focusing on reasoning complexity and diversity, the dataset reflects the multifaceted nature of clinical decision-making and the diverse cognitive strategies employed by healthcare professionals in practice. The aim is to ensure that the dataset not only captures the breadth of medical knowledge but also challenges models to engage in higher-order reasoning reflective of real-world diagnostic scenarios.
After finalizing the selection of challenging questions, we create the step-by-step answers and extract key points with the help of medical experts from diverse specialties. This approach allows for a comprehensive evaluation of the model’s reasoning ability, focusing not only on the correctness of the final answer but also on the clarity and logic of the reasoning process itself. The step-by-step answers break the reasoning into clear, logical steps, each representing a critical part of the decision-making process. The key points highlight the most important information, such as clinical findings or diagnostic considerations, necessary for reaching the correct diagnosis. These key points help ensure that the model’s response covers all aspects required for accurate medical decision-making. The goal of generating these steps and key points is to assess the reasoning process in detail, ensuring that the model’s explanation is complete and logically sound. An example of the curated data is shown in Figure 2.
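For concreteness, a curated item can be pictured as a structured record pairing the question with its expert reasoning chain and key points. The snippet below is a hypothetical illustration of such a record; the field names and clinical content are ours for exposition and do not reproduce the actual item shown in Figure 2.

```python
# Hypothetical illustration of one annotated benchmark record.
# Field names and clinical content are invented for exposition only.
example_record = {
    "domain": "Cardiology",
    "question_type": "multiple-choice",  # or "open-ended"
    "question": "A 62-year-old man presents with crushing chest pain ...",
    "options": {"A": "...", "B": "...", "C": "...", "D": "..."},  # omitted for open-ended items
    "answer": "B",
    "reasoning_steps": [
        "Step 1: Identify the key clinical findings (chest pain, diaphoresis, ST elevation).",
        "Step 2: Build a differential diagnosis and weigh each candidate.",
        "Step 3: Eliminate distractors inconsistent with the ECG and troponin results.",
        "Step 4: Conclude the most likely diagnosis and select the corresponding option.",
    ],
    "key_points": [
        "ST-segment elevation",
        "elevated troponin",
        "acute coronary syndrome",
    ],
}
```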
Multiple-choice questions often reduce the difficulty of a problem and fail to reflect real-world scenarios accurately, since human doctors do not make diagnoses by choosing among predefined options. Consequently, we additionally construct open-ended reasoning questions.
When rewriting a multiple-choice item into an open-ended one, we modify only its final question, keeping the rest of the problem unchanged. Human experts again review each rewrite to ensure it is done correctly. Once the open-ended questions and answers are obtained, we reformulate the answers in a step-by-step format, mirroring the approach used for multiple-choice questions, and extract key points from the answers to facilitate subsequent assessment.
As previously mentioned, beyond the accuracy of the final answer, we also attach significant importance to the comprehensiveness of the reasoning process and the thoroughness with which relevant keywords are captured along the way. In the following, we describe in detail the evaluation metrics employed to assess these three aspects.
Correctness of the final answer. First, in accordance with traditional evaluation methodologies, we assess the accuracy of the final answers provided by the model. For multiple-choice questions, we directly compare the option selected by the model with the correct one. For open-ended questions, we instruct the model to provide the answer in a specified format: 'Please give your answer in the following format: "Therefore, the answer is \box{your answer}."' We then use an LLM to determine whether the given answer is equivalent to the reference answer and compute accuracy.
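As a concrete illustration, the sketch below shows how this scoring could be implemented. It assumes the final answer appears inside the requested \box{...} span; the extraction regex and the `judge` callable are our own illustrative choices, not the benchmark's released evaluation code.

```python
import re

def extract_boxed_answer(response: str) -> str | None:
    """Pull the final answer out of the requested '\\box{...}' span."""
    match = re.search(r"\\box\{(.*?)\}", response, flags=re.DOTALL)
    return match.group(1).strip() if match else None

def score_multiple_choice(response: str, gold_option: str) -> bool:
    """Exact comparison of the selected option letter (e.g. 'B') with the answer key."""
    predicted = extract_boxed_answer(response)
    return predicted is not None and predicted.upper().startswith(gold_option.upper())

def score_open_ended(response: str, gold_answer: str, judge) -> bool:
    """Use an LLM judge (any callable returning 'yes'/'no') to decide whether the
    free-form answer is equivalent to the reference answer."""
    predicted = extract_boxed_answer(response)
    if predicted is None:
        return False
    verdict = judge(
        f"Reference answer: {gold_answer}\n"
        f"Model answer: {predicted}\n"
        "Are these answers medically equivalent? Reply yes or no."
    )
    return verdict.strip().lower().startswith("yes")
```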
Completeness and necessity of reasoning steps. In addition to evaluating the correctness of the final answers, we also assess the completeness and necessity of the model’s reasoning process. While current evaluations of large medical models typically emphasize the correctness of the results, it is crucial to recognize that, in high-stakes domains such as medicine, the transparency and rationality of the model’s reasoning process are equally important. This is because physicians rely on the reasoning behind the model’s conclusions to assess the reliability of the final results. Therefore, we also conduct a thorough evaluation of the model’s reasoning process to ensure its robustness and interpretability.
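The exact scoring mechanism for step completeness is left to the evaluation framework; one plausible realization, sketched below under our own assumptions, asks an LLM judge whether each expert-annotated step is reflected in the model's reasoning and reports the covered fraction.

```python
def step_coverage(model_reasoning: str, reference_steps: list[str], judge) -> float:
    """Fraction of expert-annotated reasoning steps covered by the model's reasoning.
    `judge` is a hypothetical LLM callable returning 'yes'/'no'; this is one possible
    realization of a completeness score, not the benchmark's official metric."""
    if not reference_steps:
        return 0.0
    covered = 0
    for step in reference_steps:
        verdict = judge(
            "Reference reasoning step:\n"
            f"{step}\n\n"
            "Model reasoning:\n"
            f"{model_reasoning}\n\n"
            "Does the model's reasoning express this step, possibly in different words? "
            "Reply yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            covered += 1
    return covered / len(reference_steps)
```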
Completeness of keywords. In addition to assessing the model’s reasoning process, we also evaluate the completeness of the keywords involved in its reasoning, i.e., whether the model’s explanation covers the key points extracted by the experts.
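A minimal sketch of such a keyword-coverage score is given below. The case-insensitive substring matching is an assumption made for illustration; semantic matching or expert review could be substituted.

```python
def keyword_coverage(model_reasoning: str, key_points: list[str]) -> float:
    """Share of expert-extracted key points that appear in the model's reasoning.
    Case-insensitive substring matching is used purely for illustration;
    semantic matching or expert review could be substituted."""
    if not key_points:
        return 0.0
    text = model_reasoning.lower()
    hits = sum(1 for kp in key_points if kp.lower() in text)
    return hits / len(key_points)
```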
In this section, we present the detailed experimental setup, analyses, and results of different LLMs on our benchmark. The dataset is still being refined, and more results will be released soon.
Evaluation Models. To provide a comprehensive benchmark, we conduct evaluations on 11 advanced LLMs, comprising 7 general LLMs and 4 medical LLMs, including DeepSeek R1 (Guo et al. 2025), DeepSeek V3 (Liu et al. 2024), GPT-4o (OpenAI 2024a), o1-mini (OpenAI 2024b), and Baichuan4-Turbo (A. Yang et al. 2023). Note that our experiments are ongoing and the results are being continuously refined, with additional evaluation data expected in the near future.
Implementation Details. Following previous work (Jiang et al. 2025), we leverage two types of prompts to guide the model to give answers: chain-of-thought prompts (Wei et al. 2022) and direct prompts. Chain-of-thought prompts have the following format: "Give your answer in the following form with clear logic: Step1: Step2: ... Therefore, the answer is \box{}." Direct prompts have the following format: "Please answer the following question and end your answer in this format: Therefore, the answer is \box{}." When calling LLMs, the temperature is set to 0.7, top-p to 0.9, and the maximum number of generated tokens to 1000.
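For reference, the sketch below shows how such an evaluation query might be issued through an OpenAI-compatible client with the decoding settings stated above. The prompt strings follow the text; the client setup, model name, and helper function are illustrative placeholders rather than our released evaluation harness.

```python
from openai import OpenAI

client = OpenAI()  # placeholder: any OpenAI-compatible endpoint works

COT_PROMPT = ("Give your answer in the following form with clear logic: "
              "Step1: Step2: ... Therefore, the answer is \\box{}.")
DIRECT_PROMPT = ("Please answer the following question and end your answer in this format: "
                 "Therefore, the answer is \\box{}.")

def query_model(question: str, use_cot: bool, model: str = "gpt-4o") -> str:
    """Query a model under the chain-of-thought or direct prompt with the
    decoding settings used in our experiments (temperature 0.7, top-p 0.9,
    at most 1000 generated tokens)."""
    instruction = COT_PROMPT if use_cot else DIRECT_PROMPT
    response = client.chat.completions.create(
        model=model,  # illustrative model name; substitute the LLM under evaluation
        messages=[{"role": "user", "content": f"{question}\n\n{instruction}"}],
        temperature=0.7,
        top_p=0.9,
        max_tokens=1000,
    )
    return response.choices[0].message.content
```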
Model | Multiple-choice CoT (%) | Multiple-choice Direct (%) | Open-ended CoT (%) | Open-ended Direct (%)
---|---|---|---|---
DeepSeek R1 | 65.03 | 64.36 | 40.14 | 42.39
DeepSeek V3 | 60.47 | 59.79 | 33.56 | 37.02
GPT-4o | 58.28 | 59.12 | 37.72 | 47.70
o1-mini | 60.47 | 62.67 | 37.37 | 41.18
Baichuan4-Turbo | 46.62 | 42.74 | 27.16 | 28.37
We benchmarked several advanced language models on the MedReason-Dx dataset under two prompting settings: Chain-of-Thought (CoT) and Direct Answering, across both multiple-choice and open-ended formats. Overall, DeepSeek-R1 achieved the highest performance in the multiple-choice setting, with CoT prompting slightly outperforming direct answering (65.03% vs. 64.36%). However, in the open-ended setting, its performance reversed, with direct prompting yielding higher accuracy (42.39%) than CoT (40.14%). DeepSeek-V3 showed a similar trend with modest gains from direct answering in open-ended questions (37.02% vs. 33.56%). Interestingly, GPT-4o exhibited the largest gap in favor of direct prompting for open-ended questions (47.70% vs. 37.72%), while maintaining comparable results in multiple-choice settings. o1-mini demonstrated relatively balanced performance across settings, with a slight edge for direct prompting in both question types. In contrast, Baichuan4-Turbo underperformed across all configurations, with particularly low scores on open-ended questions, indicating a significant gap in step-by-step reasoning capabilities compared to stronger models. These results suggest that while CoT prompting can provide marginal gains in structured formats, direct answering may be more effective in complex open-ended clinical scenarios, particularly for stronger LLMs.
In this work, we introduce MedReason-Dx, a benchmark designed to evaluate the quality of reasoning in medical question answering, beyond mere answer accuracy. MedReason-Dx incorporates expert-annotated, step-by-step rationales across diverse medical domains, enabling systematic assessment of logical coherence, interpretability, and clinical reasoning reliability in large language models. By addressing key limitations in existing medical AI evaluation practices, our benchmark provides a robust foundation for developing models that reason transparently and reliably.