Comprehensive Evaluation of Large Language Models (LLMs) in Biomedical Question Answering and Bioinformatics Research
Abstract
Recent developments in Artificial Intelligence (AI), particularly Large Language Models (LLMs), have demonstrated strong capabilities in understanding and generating human-like text. Models used in bioinformatics include the Generative Pre-trained Transformer (GPT), the Pathways Language Model (PaLM), and the Biomedical Generative Pre-trained Transformer (BioGPT). These models pair advanced natural language processing (NLP) capabilities with advantages specific to biological applications. This paper examines the effectiveness, usefulness, potential applications, and limitations of LLMs in bioinformatics research, covering literature mining, biomedical question answering, and information retrieval from PubMed and other biomedical databases. It also discusses integration strategies, ethical considerations, and benchmarking approaches for incorporating LLMs into bioinformatics workflows. The evaluation focuses on Biomedical Question Answering (QA), a critical task in healthcare informatics and medical research. Comparative analysis shows that GPT-4 achieves higher overall accuracy (82%), while BioGPT demonstrates superior domain-specific performance, including 95% citation accuracy. All evaluated models show limitations in complex reasoning tasks and rare disease knowledge, and although LLMs demonstrate significant potential in bioinformatics, interpretability and domain-specific fine-tuning remain important obstacles. The results suggest that existing LLMs need complementary capabilities to achieve optimal performance in biomedical applications, and they provide practical guidance for selecting a model based on clinical or research objectives.
Finally, the paper presents a comprehensive evaluation framework for medical question-answering systems while highlighting ongoing challenges and future research directions for advancing AI-driven bioinformatics.