A multidisciplinary team at UT Southwestern Medical Center has developed an AI-enabled pipeline that can quickly and accurately extract relevant information from complex, free-text medical records. The team’s novel approach, published in npj Digital Medicine, could dramatically reduce the time needed to create analysis-ready data for research studies.
“Constructing highly detailed, accurate datasets from free-text medical records is extremely time-consuming, often requiring extensive manual chart review,” said study first author David Hein, M.S., Data Scientist in the Lyda Hill Department of Bioinformatics at UT Southwestern.
“Our study demonstrates one approach for creating AI-powered large language models (LLMs) that simplify the process of collecting and organizing medical data for analysis. By automating both data extraction and standardization through AI, we can make large-scale clinical research more efficient.”
To develop the pipeline, the researchers ran an LLM over more than 2,200 kidney cancer pathology reports and evaluated its ability to recognize and categorize distinct types of tumors.
Working in close collaboration with AI scientists, pathologists, clinicians, and statisticians, they refined the workflow over multiple rounds of testing, improving its handling of complex, nuanced information. Their findings were validated against existing electronic medical record (EMR) data to ensure reliability.
The results were striking: 99% accuracy in identifying tumor types and 97% accuracy in detecting whether the cancer had metastasized.
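For readers who want to see what an accuracy figure of this kind measures, the comparison can be sketched in a few lines of Python. This is a minimal illustration, not the study's evaluation code: the function name `accuracy`, the report IDs, and the labels are all hypothetical, and it assumes each report has a single extracted label and a single reference label.

```python
from typing import Mapping


def accuracy(extracted: Mapping[str, str], reference: Mapping[str, str]) -> float:
    """Fraction of reports, among those present in both sources, whose labels agree."""
    shared = extracted.keys() & reference.keys()
    if not shared:
        raise ValueError("no overlapping report IDs to compare")
    matches = sum(1 for report_id in shared if extracted[report_id] == reference[report_id])
    return matches / len(shared)


# Hypothetical data: labels produced by a model vs. labels already on record.
llm_labels = {"r1": "clear cell", "r2": "papillary", "r3": "chromophobe", "r4": "clear cell"}
emr_labels = {"r1": "clear cell", "r2": "papillary", "r3": "clear cell", "r4": "clear cell"}

print(f"{accuracy(llm_labels, emr_labels):.2f}")  # 3 of 4 agree, prints 0.75
```

In practice, the disagreements (here, report `r3`) are the cases worth sending to a human reviewer.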
“The biggest challenge in training AI to extract data from narrative reports is that clinicians use a wide range of open-ended terms to describe the same finding,” said study co-leader Payal Kapur, M.D., Professor of Pathology and Urology. “It’s not as simple as counting ‘yes–no’ results. Every report contains hundreds of details in narrative form. But with proper input and oversight, an AI model can efficiently review and categorize vast amounts of records with speed and accuracy.”
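The standardization problem described here, many phrasings for one finding, can be illustrated with a deliberately simple lookup table. This is an assumption-laden sketch rather than the team's method: the study uses an LLM for this step, whereas the code below swaps in a fixed synonym dictionary, and the `CANONICAL` table and `normalize` function are hypothetical names chosen for illustration.

```python
import re

# Illustrative synonym table (an assumption, not taken from the study).
CANONICAL = {
    "clear cell renal cell carcinoma": "clear cell RCC",
    "renal cell carcinoma, clear cell type": "clear cell RCC",
    "ccrcc": "clear cell RCC",
    "papillary renal cell carcinoma": "papillary RCC",
    "prcc": "papillary RCC",
}


def normalize(term: str) -> str:
    """Map a free-text tumor description to one standard category."""
    # Lowercase and collapse runs of whitespace so trivial variants match.
    key = re.sub(r"\s+", " ", term.strip().lower())
    return CANONICAL.get(key, "unmapped")


print(normalize("  ccRCC "))                               # clear cell RCC
print(normalize("Renal cell  carcinoma, clear cell type"))  # clear cell RCC
print(normalize("oncocytoma"))                             # unmapped
```

A fixed table like this breaks down quickly on real narrative reports, which is the gap the researchers say an LLM with proper input and oversight can fill.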
In a final step, the team tested the pipeline on a broader dataset of more than 3,500 internal kidney cancer pathology reports and obtained similar results. This step was facilitated by the high-quality, curated data and pipelines available through UT Southwestern’s Kidney Cancer Program.
“The key is collaborative teamwork across specialties to refine AI instructions and ensure accuracy,” said study co-author James Brugarolas, M.D., Ph.D., Director of the Kidney Cancer Program, Professor of Internal Medicine in the Division of Hematology and Oncology, and member of the Cellular Networks in Cancer Research Program of the Harold C. Simmons Comprehensive Cancer Center.
While this study focused on kidney cancer, the approach may have broader applications to other tumor types, the authors said.
“There is no ‘one-size-fits-all’ model for medical data extraction,” said study co-leader Andrew Jamieson, Ph.D., Assistant Professor and Principal Investigator in the Lyda Hill Department of Bioinformatics.
“But our study outlines key strategies that can help other researchers use AI-powered LLMs more effectively in their own specialties. We’re excited to continue refining this process and expanding AI’s role in medical research.”
More information:
David Hein et al, Iterative refinement and goal articulation to optimize large language models for clinical information extraction, npj Digital Medicine (2025). DOI: 10.1038/s41746-025-01686-z
UT Southwestern Medical Center
Citation: AI system streamlines extraction of key data from medical records (2025, July 29), retrieved 29 July 2025 from https://medicalxpress.com/news/2025-07-ai-key-medical.html