Intuitive Querying of e-Health Data Repositories
Catalina Hallett, Richard Power, Donia Scott
{C.Hallett, R.Power, D.Scott}@open.ac.uk
Abstract
At the centre of the Clinical e-Science Framework (CLEF) project is a repository of well organised, detailed clinical histories, encoded as data that will be available for use in clinical care and in-silico medical experiments. An integral part of the CLEF workbench is a tool to allow biomedical researchers and clinicians to query – in an intuitive way – the repository of patient data. This paper describes the CLEF query editing interface, which makes use of natural language generation techniques in order to alleviate some of the problems generally faced by natural language and graphical query interfaces. The query interface also incorporates an answer renderer that dynamically generates responses in both natural language text and graphics.

Background
[...] databases involves expressing queries in a language that is understood by the database management system (typically SQL). Direct SQL querying requires specialist knowledge of both the query language and the structure of the underlying database, and – in the case of medical databases – usually also knowledge of precise [...] be counter-productive to require this additional level of technical expertise of the clinicians and biomedical researchers who want to access the [...]

Attempts to overcome this problem in user interfaces to medical databases have traditionally made use of graphical devices such as forms, diagrams, menus, or pointers to communicate to the user the information content of a database (e.g., KNAVE (Shahar and Cheng, 1999) and TrialDB (Deshpande et al., 2001)), and research shows that they are much preferred over textual query languages such as SQL, especially by [...] empirical studies have reported high error rates by domain experts using graphical modelling tools (Kim, 1990) and a clear advantage of text over graphics for understanding nested conditional structures (Petre, 1995).

However, it is also well known that queries expressed in free natural language are sensitive [...] ungrammaticalities) or processing (at the lexical, syntactic or semantic level). A further drawback of natural language interfaces to databases is that such systems normally understand only a subset of natural language, and it is not always clear to casual users which are the valid constructions and whether the lack of response from the system is due to the unavailability of an answer or to an unaccepted input construction. On the positive side, natural language is far more expressive than SQL, so it is generally easier to ask complex questions and manipulate temporal constructions using natural language than using a database [...]

[...] aims at providing a data repository of well organised clinical histories, which can be queried and summarised both for biomedical research [...] of the query interface is to provide efficient access to aggregated data for performing a variety of tasks, e.g., assisting in diagnosis or treatment, identifying patterns in treatment, selecting subjects for clinical trials, monitoring the participants in clinical trials. The intended users of this service are clinicians, biomedical researchers, [...]

Our current domain is cancer; however, the framework in principle supports a wide range of [...]

An analysis of free text queries written by medical professionals shows that they are mostly [...] makes the design of the query interface to the CLEF repository particularly difficult, since our users will need to construct complex queries containing conditional and temporal structures.

The CLEF repository of clinical histories currently contains some 20000 records of cancer [...] or ICD, and is implemented as a relational database that stores patient records modeled on the archetype for cancer developed at UCL [...]

Query analysis

Types of queries

An analysis of real queries from clinical trials and invented queries supplied by clinicians identified two general types of queries, as exemplified [...]
In the first example, the expected answer is a comparison between a certain statistical measure (in this case, percentage) applied on two groups of patients differentiated by the treatment [...] a statistical measure (average) computed for a certain parameter (number of investigations of type "body scan") of a group of patients with [...]

For either of these queries, the attributes involved in constructing the query can vary within a certain range: any statistical measure can be used, the differentiating parameter could be the diagnosis instead of the treatment, etc. Additionally, there are a number of variations to these two main types of queries. For both types, the user may ask for simple assessment [...]

There are also cases where several similar queries are combined into one more complex [...]

For all these queries, there is practically no limit to the complexity that can be achieved [...] description can in fact be a conjunction or disjunction of diagnoses, and the same applies for every concept included in a query. Therefore, the [...]

The CLEF query interface

The CLEF query system is designed to answer questions relating to patterns in medical histories over sets of patients in the data repository. The current interface is designed for casual and moderate users who are familiar with the semantic domain of the repository but not with its technical implementation (e.g., clinicians, medical researchers and hospital administrators). For the reasons we described above, the guiding principle in the design of our interface is that its use requires no prior knowledge of the structure of the repository, no expertise in database access languages such as SQL, no familiarity with medical codes, and only minimal prior [...] repository is not through SQL, or graphics or free text. Instead, query construction is performed by interacting with an automatically generated natural language feedback text (currently only English). This interaction method, based on the [...] et al (Power and Scott, 1998), allows users of the profile described above to construct, in an intuitive way, unambiguous, syntactically correct, complex natural language queries, such [...]
Query editing interface

General features

Conceptual authoring through WYSIWYM editing (Power and Scott, 1998) alleviates the need for expensive syntactic and semantic processing of the queries by providing the users with an interface for editing the conceptual meaning of a query instead of the surface text.

The WYSIWYM interface presents the contents of a knowledge base to the user in the form of a feedback text. In the case of query editing, the content of the knowledge base is a yet to be completed formal representation of the user's [...] a natural language text that corresponds with the incomplete query and guides them towards editing a semantically consistent and complete [...] control the interpretation that the system gives [...] a basic query frame, where concepts to be instantiated (anchors) are clickable spans of text with associated pop-up menus containing options for expanding the query. For example, one can start constructing a query that asks for a group of patients fulfilling some conditions by editing the following description:

Relevant subjects: [Some patients]
Treatment profile: [...] received [some treatment]
Outcome: [measure] of [patients with [...]

Once the user selects an anchor and a new value for the concept represented by the anchor, the semantic representation of the query is updated and a new text is generated on the basis [...] combination of features or events of the same type, thus allowing for complex queries with nested conditional structures to be built. Some concept instances can also be typed in manually, which is useful for numerical values or other fields with unpredictable content, such as names. This is also a way of enriching the ontology with new concepts. Figure 1 is a snapshot of the query editor with a partially constructed query.

[...] selection over the feedback text is treated as an intermediate query, which is sent to the DBMS. In return, the DBMS will transmit to the interface a feedback answer. At this point, the feedback answer is a set of paired values representing the number of patient records that match the query and the percentage from the total number of records. There is also a further breakdown of patient records by sex, which was considered a good discriminatory feature. For example, for an intermediate query such as Number of patients over the age of 60., the feedback answer could be 100 records (20% of 500), 55 men (55%), 45 [...]

As a further consistency checking mechanism, the interface provides an additional rendering of the query in running text, which is performed once the editing of the feedback query has been completed: the user is presented with an alternative natural language query corresponding to the structure that has been edited (output [...] schematic to allow for more intuitive editing, the output query resembles in every respect a free text query, thus being more natural and easier to [...]

The natural language interface is database-independent, since it does not require any knowledge of the database structure. [...] structure of the database is not only completely transparent to the user, but also to the interface developer: changes at the database level require no changes in the query editor. Queries can be saved for later re-use, which is particularly useful for frequent users who formulate queries with [...]

[...] supported by the query editor, and they are not considered separate types of queries, nor [...]

Modeling queries

For presentation reasons, queries have to be decomposed into constituents that can be easily edited by the user. By way of exemplification, let us consider the query type (1). There are three elements to the query: the set of relevant patients, defined by a problem; the partition of this set according to treatment; and the further partition according to outcome, from which the [...] complicated sentences, we consider a format in which these elements are presented separately:

Relevant subjects: [...]
Treatment profiles: [...]
Outcome measure: [...]

This breakdown allows the following basic [...]

Relevant subjects: [...]
Treatment profiles: [...]
Outcome measure: [...]

Each of the bracketed elements is a complex description that models the concept definition in the CLEF archetype. For example, the concept diagnosis consists of the following obligatory and optional components: tumour name, locus, type (metastatic, primary, secondary) and TNM staging code. Each of the subcomponents can be extended through boolean operations (negation, [...]

Dealing with ambiguities

Since the processing of an edited query is deterministic and transparent to the user, the main challenge is not to construct valid database queries from edited queries but to ensure that the query the user is editing corresponds to the intended meaning. Therefore we want to ensure that the layout of the query conveys one meaning [...] based on the analysis of some real queries that could be given multiple interpretations. Several categories of possible ambiguities are presented below, along with the solution provided by the [...]

When the phrase describing a relevance set includes a conjunction or disjunction, there may be ambiguity over whether the intended query is single or multiple. Compare these three patterns: [...]

Example 7a is likely to be interpreted as two separate queries, while the others are ambiguous. Disjunctions like 7c occur often in real life [...]
In this case, it is not clear if separate [...] myelodysplastic syndrome only and for acute myelogenous leukaemia caused by bad prognosis myelodysplastic syndrome, or if it makes sense to give a single answer lumping these two groups [...] feedback texts by using different realisations for conjunctions/disjunctions that imply multiple relevance sets, and conjunctions/disjunctions that do not. For example, we use bulleted lists for the former, and conjunction words (and, or) [...]

[...] of age who have had bad prognosis myelodysplastic syndrome only for at [...]
[...] years of age who have had acute myelogenous leukaemia caused by bad [...]

In 9a we have two relevance sets; in 9b we [...]

Similar ambiguities can be found when several treatment profiles are mentioned, or several outcome measures. In each case, the ambiguity can be avoided in the WYSIWYM feedback texts the same way as before, by using bullets to mark [...] properties. A description can be elaborate either because it contains many boolean operators, [...] boolean combinations in running prose means that the scope of the operators can become ambiguous to the user. For this reason, layout is used to present boolean combinations more [...]

Specifying constraints and temporal relations

Guiding users towards editing correct and complete queries is essential and is one of the main points where our approach improves on classical natural language query interfaces. This is achieved by defining and implementing [...]

Static (or ontological) constraints relate to the structure of the queries as defined in the query model. This includes specifying the super-class of an instance (for example, the anchor cancer can only be instantiated with names of cancers), its type (for example, age is numeric and editable, while cancer is a static string) and its status (compulsory vs optional).

Dynamic constraints are triggered at runtime by the user selection of certain instances. Most constraints simply serve the role of restricting the user selection so that the resulting query is meaningful and intelligible. In other cases, however, allowing the user to construct queries [...] constraints could yield ambiguous queries. Dynamic constraints can be either conceptual, which are compiled from a medical knowledge base and represent dependencies between medical concepts (for example, nephroblastoma is a type of kidney cancer, so users shouldn't be allowed to query for nephroblastoma in the left breast), or numerical (for example, patients between 60 and 30 years of age is a disallowed construction).

As medical records mirror the evolution in time of a patient, it is important to be able to access the patient's status at a certain point [...]
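
As an illustration only, the dynamic constraints described in this section, together with the closed-interval reading of expressions such as between [date 1] and [date 2], might be sketched in Python as follows. The knowledge-base excerpt ADMISSIBLE_LOCI and all function and class names are hypothetical; the paper does not say how the CLEF editor implements these checks.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical excerpt of a medical knowledge base: tumour -> admissible loci.
ADMISSIBLE_LOCI = {"nephroblastoma": {"kidney"}}


def check_locus(tumour: str, locus: str) -> bool:
    """Conceptual constraint: reject loci that the knowledge base rules out."""
    allowed = ADMISSIBLE_LOCI.get(tumour)
    # Tumours without an entry are left unconstrained in this sketch.
    return allowed is None or locus in allowed


def check_age_range(low: int, high: int) -> bool:
    """Numerical constraint: the lower bound must not exceed the upper bound."""
    return low <= high


@dataclass(frozen=True)
class ClosedInterval:
    """A closed time interval [start, end] at day granularity (an assumption)."""
    start: date
    end: date

    def __post_init__(self) -> None:
        if self.start > self.end:
            raise ValueError("interval start must not be after its end")

    def contains(self, day: date) -> bool:
        # Both end points belong to the interval.
        return self.start <= day <= self.end


def between(first: date, second: date) -> ClosedInterval:
    """Read 'between [date 1] and [date 2]' as the closed interval [date 1, date 2]."""
    return ClosedInterval(first, second)


print(check_locus("nephroblastoma", "left breast"))  # False
print(check_age_range(60, 30))                       # False
print(between(date(1999, 1, 1), date(1999, 12, 31)).contains(date(1999, 12, 31)))  # True
```

In an editor of this kind, checks like these would presumably run when the user picks a value for an anchor, so that a disallowed value is never offered or accepted; that placement is likewise an assumption.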
[...] natural language is an important advantage of natural language query interfaces over graphical interfaces. All temporal concepts in the medical record are stamped with a valid time stamp, [...] event took place. Typically, a time interval is represented as a pair of start and end dates, where start and end are discrete time values of a certain [...] associates specific linguistic expressions to time intervals. For example, between [date 1] and [date 2] is interpreted as a closed interval [date 1, date 2], in [this year] is interpreted as [01/01/this [...] cover most temporal queries, such as: patients diagnosed with cancer before 1999, patients who received chemotherapy within 5 months of [...]

1 to a certain level of granularity imposed by the representation of time instances in the database

Answer generation

A typical result set received from the DBMS consists of lists of patients that fulfilled the requirements of the query, for each patient having specified the age, gender, and the values for each of the query elements. For example, a query such as Select all patients between the ages of 30 and 60 with a clinical diagnosis of malignant neoplasm of bronchus or lungs and histopathology diagnosis of adenocarcinoma, small cell carcinoma or squamous cell carcinoma, who were alive after 10 years of the diagnosis, may yield the result set [...]

(Result set table; only the column headings survive: Gender, Age, adenocarcinoma, small cell carcinoma, squamous cell carcinoma, death.)

The result set is processed in such a way as to allow the rendering of various groups of patients according to the age/gender breakdown and each individual query term. For each individual search parameter, the data are split into a dynamically determined number of age groups, and for each age group the number of patients is further split according to their gender. The result set thus processed is presented to the user in three types of format: tables, charts and text. Each individual chart also contains an automatically generated caption that explains the content of the chart. The captions are generated using template-based techniques, where fillers are provided by the same result set that was used for generating the chart. For the bar chart in Fig. 3, a fragment of the explanation provided in the caption reads:

This chart displays the distribution of patients in 4 age groups according to their gender and histopathology diagnosis. 42 patients have been returned as a result to your query:
-in the 29-38 years age group there were 1 patients (0 men and 1 woman): all patients were diagnosed with adenocarcinoma. [...]
-in the 49-58 age group, there were 27 patients (14 men and 13 women): 11 were diagnosed with adenocarcinoma, 5 were diagnosed with squamous cell carcinoma, 11 were diagnosed [...]

Figure 3: Generated bar chart: histopathology diagnosis/age/gender breakdown

Conclusions and further work

We have presented in this paper a query interface to a repository of patient records which makes use of natural language generation techniques. The query interface allows the editing of complex queries and is a viable alternative to natural language interfaces and visual query interfaces to medical databases. Answers to queries are provided in textual format using natural language generation techniques and also as tables and charts. The main features that set our approach apart from other querying interfaces to medical [...]

• users require little training for using the [...]
• a set of semantic constraints is used to guide users towards constructing valid queries only, therefore incorrect queries are [...]
• [...] since ambiguity is dealt with in the editing [...]
• the query interface has wider applicability [...]

Whilst the query editing interface is fully implemented, extending the range of queries supported is an ongoing effort. This is performed in parallel with an evaluation of the usability and user-friendliness of the interface. It is expected that the evaluation will help formulate an extended range of queries and improve the editing interface. The improved query interface will provide means of interactively defining default values for instances that support them (for example, one may want to default all index events to the date of the first diagnosis). We also plan to extend the range of temporal operators to include, for example, trend operators for clinical [...] blood pressure, stationary haemoglobin count) and define independent variables for reporting statistical results (such as age groups, sex, education level).

References

A. Deshpande, C. Brandt, and P. Nadkarni. [...] Meeting the needs of clinical studies. Journal [...] Informatics Association, 9(4):369–382.

Dipak Kalra, Anthony Austin, A. O'Connor, [...] Implementation of a Federated Health Record Server, pages 1–13. [...] Records Institute for the Centre for Advancement of Electronic Records Ltd.

Y. Kim. 1990. Effects of conceptual data modelling formalisms on user validation and analyst modelling of information requirements. Ph.D. thesis, University of Minnesota.

M. Petre. 1995. Why looking isn't always [...]

[...] texts. In Proceedings of 17th International Conference on Computational Linguistics [...] Association for Computational Linguistics (COLING-ACL 98), pages 1053–1059, [...]

[...] Intelligent visualization and exploration [...] Proceedings of HICSS, Maui, Hawaii.