Medical data extraction from legacy databases – case study
Faculty of Postgraduate Medical Education
Abstract
We present a case study of extracting relevant medical information from a free-text database. The
documentation comprised individual records of 19 694 patients treated at the Center for
Diagnosing and Treatment of Asthma and Allergy, Medical University of Lodz, between
1995 and 2006. The database was built on a legacy engine with no export feature and only
fragmentary documentation. The aim of the study was to mine relevant clinical data for
asthmatic patients from 12 months before to 36 months after the index date. The index event was
defined as adding montelukast or salmeterol to the current therapy or withdrawing salmeterol from
therapy. The results of this retrospective observational study were in agreement with previous
Introduction
In 2002, the English National Health Service (NHS) began transforming its
health-care system with information technology. Experts estimate that the costs have already
doubled, reaching $24 billion, and some consider the project to be sleepwalking toward disaster.
Nevertheless, wide adoption of databases in medicine is of paramount significance. It is not
only a matter of cutting administrative inefficiencies in healthcare but also of saving
lives by having all crucial patient information always at hand. The true big picture, however,
lies beyond the single Electronic Health Record: it is the ability to query the data of a whole
population of patients for treatment-outcome relationships on a truly Evidence-Based
Medicine basis. This could also mean detecting dangerous drug interactions that are currently
undetectable without large, targeted randomized clinical trials.
The aim of the study was to mine relevant clinical data for asthmatic patients from 12 months
before to 36 months after the index date. The index event was defined as adding montelukast or
salmeterol to the current therapy or withdrawing salmeterol from therapy.
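The observation window described above can be sketched as follows. This is only an illustration under stated assumptions, not the routine used in the study: the function names (add_months, observation_window, in_window) are hypothetical, and counting in calendar months with the day clamped to the month length is an assumption about how "12 months" and "36 months" were measured.

```python
from datetime import date
import calendar


def add_months(d, months):
    """Shift a date by whole calendar months, clamping the day to the month length."""
    month_index = d.year * 12 + (d.month - 1) + months
    year, month = divmod(month_index, 12)
    month += 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def observation_window(index_date, months_before=12, months_after=36):
    """Return (start, end) of the period assessed around the index date."""
    return add_months(index_date, -months_before), add_months(index_date, months_after)


def in_window(visit_date, index_date):
    """True if a visit falls inside the observation window (bounds inclusive)."""
    start, end = observation_window(index_date)
    return start <= visit_date <= end
```

For an index date of 15 March 2000 this sketch yields a window from 15 March 1999 to 15 March 2003; visits outside that range would be ignored for the patient in question.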
The documentation comprised individual records of 19 694 patients treated at the Center for
Diagnosing and Treatment of Asthma and Allergy, Medical University of Lodz, between
1995 and 2006. It amounted to about 70 thousand pages of clinical data collected in a textual
database. Each entry was typed in personally by a doctor working at the Center during the patient's
visit and consisted of an interview, physical examination, laboratory results and prescribed
drugs. All fields were unstructured text only.
We began by importing the records into a Microsoft Access database. This was achieved
through a VBA routine written on the basis of the available documentation and reverse
engineering of the database's rudimentary relational model. The database maker was no longer
reachable and the original company had ceased to exist.
Extraction of the relevant entities was linguistically based and employed heuristic rules and shallow
parsing techniques on specific parts of the text around certain keywords. It was inspired by
the work of Friedman et al. and their MedLEE extraction system. The aim was to extract not only
basic symptoms such as daytime dyspnea or wheeze but also prescribed doses,
occurrences of certain events (for instance asthma-related hospitalizations) and pulmonary
function test values. It was crucial for the scientific relevance of the acquired data that all entities
be recalled. This emphasis on recall, however, resulted in a high percentage (30%) of data being misclassified.
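A keyword-anchored heuristic of the kind described above might look like the following minimal sketch. It is not the study's implementation and not MedLEE: the keyword list, the dose pattern, the 40-character window and the function name extract_doses are all assumptions made for illustration, and the sample notes are in English although the original records were presumably not.

```python
import re

# Hypothetical vocabulary; the actual keyword lists are not given in the text.
DRUG_KEYWORDS = ("salmeterol", "montelukast")

# A number (decimal point or comma) followed by a dose unit.
DOSE_PATTERN = re.compile(r"(\d+(?:[.,]\d+)?)\s*(mcg|µg|ug|mg)", re.IGNORECASE)


def extract_doses(text, window=40):
    """Return (drug, amount, unit) for each drug keyword that is followed
    by a dose expression within `window` characters."""
    found = []
    for drug in DRUG_KEYWORDS:
        for match in re.finditer(re.escape(drug), text, re.IGNORECASE):
            # Shallow parsing: only inspect the text right after the keyword.
            context = text[match.end():match.end() + window]
            dose = DOSE_PATTERN.search(context)
            if dose:
                amount = float(dose.group(1).replace(",", "."))
                found.append((drug, amount, dose.group(2).lower()))
    return found


note = "Rp. Salmeterol 50 mcg 2x1, montelukast 10 mg in the evening"
doses = extract_doses(note)
```

Because the rule favors recall, it also attributes the dose in a note such as "salmeterol discontinued; budesonide 200 mcg" to salmeterol. Errors of this kind are one plausible source of the misclassification rate reported above and show why a manual check remains necessary.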
This first fully automated query produced a preliminary set narrowed down to about 250
records that met the stringent inclusion criteria. These were then manually checked for
automation errors on a case-by-case basis, and the study query was repeated, resulting in the final set
of 189 patients and their respective results for all the periods assessed as schematically shown
Because the fields were text only and no strict rules were imposed on filling in the
data, we found it difficult to automate the extraction of information. The key
problems identified were information spanning more than one field, typos,
ambiguous abbreviations, shorthand and hyphenation. To account for the discrepancy between
the visit date and outcomes that had occurred days or months earlier, we found that a separate layer
had to be created that stores the events on a day-by-day basis. This made it possible, for instance, to
compute precisely the average prescribed daily doses (exposure).
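The day-by-day layer and the exposure calculation can be sketched as below. This is a simplified illustration, not the structure actually used in the study: the (start, end, daily_dose) tuple format, the function names, and the rule that a later prescription overrides an earlier one on overlapping days are all assumptions.

```python
from datetime import date, timedelta


def build_daily_layer(prescriptions):
    """Expand (start, end, daily_dose) intervals into a {day: dose} mapping.

    Assumption: when intervals overlap, the later entry in the list wins.
    """
    daily = {}
    for start, end, daily_dose in prescriptions:
        day = start
        while day <= end:
            daily[day] = daily_dose
            day += timedelta(days=1)
    return daily


def mean_daily_dose(daily, period_start, period_end):
    """Average prescribed dose per day over an inclusive period.

    Days without a prescription count as zero exposure.
    """
    n_days = (period_end - period_start).days + 1
    total = sum(dose for day, dose in daily.items()
                if period_start <= day <= period_end)
    return total / n_days


# Hypothetical patient: 100 units/day for ten days, a gap, then 50 units/day.
prescriptions = [
    (date(2000, 1, 1), date(2000, 1, 10), 100.0),
    (date(2000, 1, 21), date(2000, 1, 30), 50.0),
]
layer = build_daily_layer(prescriptions)
exposure = mean_daily_dose(layer, date(2000, 1, 1), date(2000, 1, 30))
```

Storing one value per day decouples the exposure calculation from visit dates, so an event reported at a visit can be placed on the day it actually occurred.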
During the ten-year observation period the spirometry equipment was modernized, which
resulted in a different reference range for the pulmonary function tests. We therefore found
it more reasonable to extract the equivalents expressed as a percentage of the predicted value.
Because of the time span involved, there were also subtle changes in therapy
guidelines over the years, and some proprietary drug forms left the market. This all had to be taken
Overall, we found the acquired data valid. Repeated-measures ANOVA, used to analyze the
results statistically, showed two already known trends in asthma: the presence of
synergy between salmeterol and inhaled corticosteroids, and the positive influence of montelukast on
allergic rhinitis. The results were valuable as no observational study in asthma of such length
Conclusions
We concluded that extracting relevant medical information from legacy databases is possible,
but the measures taken may be unfeasible on a larger scale because of the time and resources
involved. Also, textual sources, although offering far greater flexibility, are not particularly well
Future ease of exporting data should always be considered when deciding how to store
biomedical data. This is unfortunately rarely the case, as cheaper Database Management
Systems are chosen over more expensive solutions built to last.
References
1. Leroy G, Chen H, Martinez JD. A shallow parser based on closed-class words to
capture relations in biomedical text. J Biomed Inform. 2003 Jun;36(3):145-58.
2. Friedman C, Shagina L, Lussier Y, Hripcsak G. Automated encoding of clinical
documents based on natural language processing. J Am Med Inform Assoc. 2004 Sep-
3. Long-acting beta2-agonists versus anti-leukotrienes as add-on therapy to inhaled
corticosteroids for chronic asthma. Cochrane Database Syst Rev. 2005 Jan
Figure 1. Schematic of the final dataset creation. Stages shown: (1) Natural Language Processing, pre-selection; (2) Manual reclassification, cases reduction; (3) 'Day by day' layer; (4) Final dataset.