Accuracy of LLMs to retrieve numeric data for meta-analysis in dentistry
Publication date: 2025
Publisher: Elsevier
Citation: Caponio VCA, Lorenzo-Pouso AI, Magalhaes M, Ali A, Adamo D, Cirillo N, López-Pintor RM, Musella G. Accuracy of LLMs to retrieve numeric data for meta-analysis in dentistry. J Dent. 2026 Jan;164:106245. doi: 10.1016/j.jdent.2025.106245.
Abstract
Objectives: Evidence-based dentistry relies heavily on systematic reviews and meta-analyses (SRMA), considered the most robust forms of evidence. Still, conducting SRMA is time- and resource-intensive, with high error rates in data extraction. Artificial intelligence (AI) and large language models (LLMs) offer the potential to automate and accelerate SRMA processes such as data extraction. However, assessing the reliability and accuracy of these new AI-based technologies for SRMA is crucial. This study evaluated the accuracy of four LLMs (DeepSeek-v3 R1, Claude 3.5 Sonnet, ChatGPT-4o, and Gemini 2.0-flash) in extracting different primary numeric outcome data across a range of dental topics.
Methods: LLMs were queried via APIs using default settings and a SMART-format prompt. Descriptive analysis was conducted at sub-outcome, outcome, and study levels. Errors were classified as hallucinations, missed, or omitted data.
Results: Overall extraction accuracy was exceptionally high at the sub-outcome level, with only three hallucinations (all from Gemini 2.0-flash). Total errors increased at the outcome and study levels. Gemini 2.0-flash generally performed significantly worse than the other models (p < 0.01). Claude 3.5 Sonnet and DeepSeek-v3 R1 generally exhibited superior accuracy and lower omission rates in full-text extraction compared to Gemini 2.0-flash and ChatGPT-4o.
Conclusions: This first comparative evaluation of multiple LLMs for data extraction in dental research from full-text PDFs highlights their significant potential but also limitations. Performance varied notably between models, with cost not directly correlating with superior performance. While single data point extraction was highly accurate, errors increased at higher aggregation levels. Standardized outcome reporting in studies could benefit future LLM extraction, and we offer a solid benchmark for future performance comparisons.
Clinical significance: This study demonstrates that LLMs can achieve high accuracy in extracting single numeric outcomes, but omission errors in full-text analyses limit their independent use in SRMA. Improving outcome reporting standards and leveraging accurate, lower-cost models may enhance evidence synthesis efficiency in dentistry and beyond.
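The error taxonomy described in the Methods (hallucinations, missed, and omitted data) can be sketched in code. The function and data below are illustrative assumptions, not the study's actual scoring pipeline; for simplicity this sketch collapses the study's distinction between "missed" and "omitted" data into a single "missed" label.

```python
# Illustrative sketch (not the study's pipeline): compare LLM-extracted
# numeric sub-outcomes against a ground-truth table and label each one.
def classify_extraction(truth: dict, extracted: dict) -> dict:
    """Label each sub-outcome as correct, hallucination, or missed."""
    labels = {}
    for key, true_value in truth.items():
        if key not in extracted or extracted[key] is None:
            labels[key] = "missed"          # in the paper, absent from LLM output
        elif extracted[key] == true_value:
            labels[key] = "correct"
        else:
            labels[key] = "hallucination"   # reported value does not match the paper
    for key in extracted:
        if key not in truth:
            labels[key] = "hallucination"   # value invented for a nonexistent sub-outcome
    return labels

# Hypothetical example values for a single study
truth = {"mean_pocket_depth": 4.2, "sd": 1.1, "n": 60}
extracted = {"mean_pocket_depth": 4.2, "sd": 1.3, "n": None}
print(classify_extraction(truth, extracted))
# → {'mean_pocket_depth': 'correct', 'sd': 'hallucination', 'n': 'missed'}
```

Aggregating these per-sub-outcome labels to the outcome and study levels is what drives the error-rate increase the Results describe: a study counts as erroneous if any of its sub-outcomes is mislabeled.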